Feature Extraction in Data Mining

Author: 羊村懒王 | 2024-06-15 05:43:54

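Looking back on last year's painful mathematical modeling contest, I realized I still had plenty of gaps in data mining. So I went back through my blog posts from last year, studied them again, and learned something new. My earlier feature engineering was far too crude, so this time I looked closely at the feature extraction part, drew on several blog posts found online, and consolidated and improved the material. Let's get started!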

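Existing feature extraction methods (strictly speaking, these are feature selection methods) fall into three broad families:

Filter: scores each feature by its variance or by its correlation with the target, then selects features by setting a score threshold or the number of features to keep.
Wrapper: uses an objective function (usually a predictive performance score) to add or remove several features at a time.
Embedded: first trains a machine learning model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. It resembles the Filter approach, except that feature quality is determined through training.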

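I. Filter methods are usually applied as the preprocessing step at the very start of a data mining workflow. Some filter criteria only make sense for certain variable types: the chi-squared test used below, for example, expects non-negative discrete features such as counts or booleans, so continuous variables generally have to be discretized first. Variance thresholding itself works on any numeric feature.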

1. Removing features with low variance

The example below uses VarianceThreshold with a threshold to filter out the features whose variance is too small.


from sklearn.feature_selection import VarianceThreshold
# 4 samples, 3 features
x = [[0,1,0],[0,0,1],[1,1,0],[0,0,2]]
# drop every feature whose variance is not above 0.2
model = VarianceThreshold(threshold=0.2)
x_filter = model.fit_transform(x)
print(x_filter)

Output:


[[1 0]
 [0 1]
 [1 0]
 [0 2]]

VarianceThreshold dropped the first column. That column is 0 in three of the four samples, so its variance is only 0.1875, which is below the 0.2 threshold.
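To check the filter numerically, here is a small sketch that computes the per-column variance by hand with NumPy and compares it with what VarianceThreshold stores in `variances_`. It reuses the toy matrix from the example; nothing else is assumed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

x = np.array([[0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 2]])

# Population variance (ddof=0) of each column
variances = x.var(axis=0)
print(variances)

# Keep only the columns whose variance is strictly above the threshold
keep = variances > 0.2
filtered = x[:, keep]
print(filtered)

# The selector exposes the same per-feature variances after fitting
selector = VarianceThreshold(threshold=0.2).fit(x)
print(selector.variances_)
```

Only the first column falls under the threshold, which matches the output printed above.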

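2. Univariate feature selection

For classification problems (discrete y), the following can be used:
    chi-squared test (chi2), f_classif, mutual information (mutual_info_classif)
For regression problems (continuous y), the following can be used:
    Pearson correlation coefficient, f_regression, mutual information (mutual_info_regression), maximal information coefficient (MIC)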
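For the regression case, a minimal sketch might look like the following. The data is synthetic and made up purely for illustration (y depends on columns 0 and 3 only); it uses f_regression with SelectKBest and prints the Pearson correlation of each column with y for comparison.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only columns 0 and 3 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Score every feature independently with an F-test and keep the best two
selector = SelectKBest(f_regression, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())
print("selected columns:", selected)

# Pearson correlation of each column with y
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print("pearson r:", np.round(pearson, 2))

# The two columns with the largest |r| are the ones f_regression keeps
top_by_pearson = np.sort(np.argsort(-np.abs(pearson))[:2])
print("top by |r|:", top_by_pearson)
```

The F-score used by f_regression is derived from the correlation between each feature and the target, so ranking by F-score and ranking by |r| pick the same columns here.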

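Chi-squared test (chi2):

The classic chi-squared test measures the dependence between a categorical feature and a categorical target. scikit-learn's chi2 only requires the feature values to be non-negative, which the iris measurements satisfy. For example, we can run a chi2 test on the samples to select the two best features: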


from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.datasets import load_iris
iris = load_iris()
x,y = iris.data,iris.target
print(x.shape)
x_filter = SelectKBest(chi2,k=2).fit_transform(x,y)
print(x_filter.shape)

Output:


(150, 4)
(150, 2)
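The shapes alone do not say which two columns survived. A short sketch that prints the chi2 score of every iris feature and the selection mask, using the `scores_` attribute and `get_support()` of SelectKBest:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# One chi2 score per feature; a higher score means stronger dependence on the class
for name, score, keep in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name:20s} chi2={score:8.2f} selected={keep}")

# Column indices of the features that were kept
selected_idx = np.flatnonzero(selector.get_support())
print(selected_idx)
```

On iris, the two petal measurements get the highest scores, so those are the columns that are kept.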

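II. Wrapper: Recursive Feature Elimination (RFE)

Recursive feature elimination trains a base model over multiple rounds. After each round it removes the features with the smallest weight coefficients, then trains the next round on the reduced feature set.

The official sklearn explanation: given a predictive model that assigns weights to features (for example, the coefficients of a linear model), RFE selects features by recursively considering smaller and smaller feature sets. First the model is trained on the original features and each feature receives a weight. Then the features with the smallest absolute weights are pruned from the feature set. This repeats recursively until the desired number of features remains.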
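The loop described above can be sketched by hand. This is only an illustration of the idea, not sklearn's actual implementation: it drops one feature per round, and it scores each feature by the sum of squared coefficients across classes (an assumption about how to collapse the per-class weights into one number). `max_iter=1000` is set only so the solver converges quietly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))  # indices of the features still in play
n_features_to_select = 2

while len(remaining) > n_features_to_select:
    # Train the base model on the current feature subset
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # coef_ has one row per class; collapse it to one importance per feature
    importance = (model.coef_ ** 2).sum(axis=0)
    # Remove the feature with the smallest importance, then repeat
    weakest = int(np.argmin(importance))
    print("dropping feature", remaining[weakest])
    del remaining[weakest]

print("kept features:", remaining)
```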

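RFECV runs RFE inside a cross-validation loop to choose the best number of features. Exhaustive search is not practical: a set of d features has 2^d - 1 non-empty subsets (2^d if the empty set is counted). Instead, RFECV takes an external learning algorithm such as a linear SVM, performs RFE on each training fold, scores every candidate number of features on the corresponding validation fold, and keeps the number of features with the best average validation score.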


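from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# x, y are the iris features and labels loaded in the chi2 example above
print(x[:5])
# keep the 2 features that survive recursive elimination
x_filter = RFE(estimator=LogisticRegression(),
   n_features_to_select=2).fit_transform(x,y)
print(x_filter[:5])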

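Output (which columns are kept may depend on the scikit-learn version and on the default solver of LogisticRegression):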


[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[[3.5 0.2]
 [3.  0.2]
 [3.2 0.2]
 [3.1 0.2]
 [3.6 0.2]]
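The RFE call above fixes the number of features at 2. If that number should be chosen by cross-validation instead, RFECV can be used. Here is a minimal sketch on the same iris data; the choices of 5 folds, accuracy as the score, and `max_iter=1000` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# RFE with the number of features chosen by 5-fold cross-validation
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,              # remove one feature per elimination round
    cv=5,
    scoring="accuracy",
)
selector.fit(X, y)

print("number of features chosen:", selector.n_features_)
print("selection mask:", selector.support_)
print("ranking (1 = selected):", selector.ranking_)
```

After fitting, `selector.transform(X)` returns only the selected columns, in the same way as the RFE example.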

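III. Embedded methods are the most widely used and usually the most powerful. Here is an example that combines xgboost with SelectFromModel to show how to choose the best features. We first split the data into a training set and a test set, train an XGBoost model on the training set, and check its accuracy on the test set. Then, based on the feature_importances_ obtained from the trained XGBoost model, we use SelectFromModel to select features and compare the accuracy under different importance thresholds.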


# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
	# select features using threshold
	selection = SelectFromModel(model, threshold=thresh, prefit=True)
	select_X_train = selection.transform(X_train)
	# train model
	selection_model = XGBClassifier()
	selection_model.fit(select_X_train, y_train)
	# eval model
	select_X_test = selection.transform(X_test)
	y_pred = selection_model.predict(select_X_test)
	predictions = [round(value) for value in y_pred]
	accuracy = accuracy_score(y_test, predictions)
	print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

The results are as follows:


Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

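The results show that as the importance threshold rises and fewer features are selected, the model's accuracy tends to fall, from 77.95% with all 8 features to 63.78% with a single feature.
We therefore have to trade off model complexity (number of features) against accuracy. In some cases, though, removing features actually raises accuracy because the removed features were noise. The step from 7 to 6 features above, where accuracy goes from 76.38% to 77.56%, is one case where fewer features did better.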
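The XGBoost example needs the xgboost package and a local copy of pima-indians-diabetes.csv. To try the same threshold sweep without either, here is a rough sketch that swaps in scikit-learn's RandomForestClassifier and the built-in breast cancer dataset. Both substitutions are my own assumptions for illustration, so the numbers will not match the table above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

# Fit the model once on all features to obtain the importances
model = RandomForestClassifier(n_estimators=100, random_state=7)
model.fit(X_train, y_train)

results = []
# Use every 5th sorted importance as a threshold (30 features -> 6 thresholds)
for thresh in np.sort(model.feature_importances_)[::5]:
    # Keep the features whose importance is >= thresh
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # Retrain on the reduced feature set and evaluate on the test set
    selection_model = RandomForestClassifier(n_estimators=100, random_state=7)
    selection_model.fit(select_X_train, y_train)
    y_pred = selection_model.predict(selection.transform(X_test))
    accuracy = accuracy_score(y_test, y_pred)
    results.append((thresh, select_X_train.shape[1], accuracy))
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%"
          % (thresh, select_X_train.shape[1], accuracy * 100.0))
```

The lowest threshold keeps every feature, and each higher threshold keeps fewer, so the printed rows can be read the same way as the XGBoost table.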

Thanks to the original blog post for its help: https://www.cnblogs.com/stevenlk/p/6543628.html
