Thinking back to last year's painful experience at the math modeling contest, I realized I still had plenty of gaps in data mining. So I went back through last year's blog posts, studied them again, and came away with some new insights. My earlier feature engineering was far too rough, so this time I dug into feature extraction properly, drew on a few blog posts I found online, and consolidated and polished the material. Now let's set off on a new feature extraction journey!
Existing feature extraction (selection) methods can be roughly divided into three directions: filter methods, wrapper methods, and embedded methods.
The example below uses the VarianceThreshold class to filter features by setting a variance threshold: features whose variance falls below the threshold are dropped.
- from sklearn.feature_selection import VarianceThreshold
- x = [[0,1,0],[0,0,1],[1,1,0],[0,0,2]]
- model = VarianceThreshold(threshold=0.2)
- x_filter = model.fit_transform(x)
- print(x_filter)
The output is:
- [[1 0]
- [0 1]
- [1 0]
- [0 2]]
You can see that VarianceThreshold filtered out the first column: it is almost all zeros, so its variance is small and it does not pass the filter.
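As a quick sanity check (a minimal sketch, not from the original post), the per-column variances can be computed directly with numpy; VarianceThreshold compares the population variance (ddof=0) of each column against the threshold:
- import numpy as np
- x = np.array([[0,1,0],[0,0,1],[1,1,0],[0,0,2]])
- # population variance of each column, the same quantity VarianceThreshold uses
- print(x.var(axis=0))  # [0.1875 0.25 0.6875] -> only the first column falls below 0.2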
For classification problems (discrete y), you can use:
the chi-squared test (chi2), f_classif, mutual_info_classif (mutual information).
For regression problems (continuous y), you can use:
the Pearson correlation coefficient, f_regression, mutual_info_regression, the maximal information coefficient.
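The regression criteria are used through SelectKBest in exactly the same way as the classification ones. Here is a minimal sketch; the diabetes dataset is just an assumed stand-in for any problem with a continuous target:
- from sklearn.feature_selection import SelectKBest, f_regression
- from sklearn.datasets import load_diabetes
- X, y = load_diabetes(return_X_y=True)   # y is continuous
- # keep the 3 features with the highest F statistic against y
- X_new = SelectKBest(f_regression, k=3).fit_transform(X, y)
- print(X.shape, X_new.shape)   # (442, 10) (442, 3)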
Chi-squared test (chi2):
The classic chi-squared test measures the dependence of a categorical target on a categorical feature. For example, we can run a chi2 test on the iris samples to select the two best features:
- from sklearn.feature_selection import chi2
- from sklearn.feature_selection import SelectKBest
- from sklearn.datasets import load_iris
- iris = load_iris()
- x,y = iris.data,iris.target
- print(x.shape)
- x_filter = SelectKBest(chi2,k=2).fit_transform(x,y)
- print(x_filter.shape)
The output:
- (150, 4)
- (150, 2)
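If you also want to know which of the four iris features were kept, rather than only their values, you can inspect the fitted selector (a small sketch continuing from the code above):
- selector = SelectKBest(chi2, k=2).fit(x, y)
- print(selector.scores_)        # chi2 score of each original feature
- print(selector.get_support())  # boolean mask marking the two selected columns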
The sklearn documentation explains it as follows: for predictive models that assign a weight to each feature (for example, the coefficients of a linear model), RFE selects features by recursively shrinking the feature set under consideration. The model is first trained on the full set of features and each feature receives a weight; the features with the smallest absolute weights are then removed from the set. This is repeated recursively until the number of remaining features reaches the desired number.
RFECV performs RFE within cross-validation in order to choose the best number of features: you specify an external learning algorithm (an SVM, for example), the candidate feature subsets produced along the RFE path are scored by cross-validated error, and the subset with the smallest error is kept as the selected set of features.
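The original post shows no RFECV code, so here is a minimal sketch on the iris data loaded above (logistic regression is an assumed choice of base estimator); the post's own RFE example follows right after:
- from sklearn.feature_selection import RFECV
- from sklearn.linear_model import LogisticRegression
- # 5-fold cross-validation picks the number of features with the best score
- selector = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5).fit(x, y)
- print(selector.n_features_)   # number of features chosen by cross-validation
- print(selector.support_)      # mask of the selected features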
- from sklearn.feature_selection import RFE
- from sklearn.linear_model import LogisticRegression
- print(x[:5])
- x_filter = RFE(estimator=LogisticRegression(),
-                n_features_to_select=2).fit_transform(x, y)
- print(x_filter[:5])
The output:
- [[5.1 3.5 1.4 0.2]
- [4.9 3. 1.4 0.2]
- [4.7 3.2 1.3 0.2]
- [4.6 3.1 1.5 0.2]
- [5. 3.6 1.4 0.2]]
- [[3.5 0.2]
- [3. 0.2]
- [3.2 0.2]
- [3.1 0.2]
- [3.6 0.2]]
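The last example takes the embedded route: an XGBoost model is trained once, its feature_importances_ are used as candidate thresholds, and SelectFromModel keeps only the features whose importance reaches each threshold before a fresh model is trained and evaluated on the reduced feature set: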
- # use feature importance for feature selection
- from numpy import loadtxt
- from numpy import sort
- from xgboost import XGBClassifier
- from sklearn.model_selection import train_test_split
- from sklearn.metrics import accuracy_score
- from sklearn.feature_selection import SelectFromModel
- # load data
- dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
- # split data into X and y
- X = dataset[:,0:8]
- Y = dataset[:,8]
- # split data into train and test sets
- X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
- # fit model on all training data
- model = XGBClassifier()
- model.fit(X_train, y_train)
- # make predictions for test data and evaluate
- y_pred = model.predict(X_test)
- predictions = [round(value) for value in y_pred]
- accuracy = accuracy_score(y_test, predictions)
- print("Accuracy: %.2f%%" % (accuracy * 100.0))
- # Fit model using each importance as a threshold
- thresholds = sort(model.feature_importances_)
- for thresh in thresholds:
-     # select features using threshold
-     selection = SelectFromModel(model, threshold=thresh, prefit=True)
-     select_X_train = selection.transform(X_train)
-     # train model
-     selection_model = XGBClassifier()
-     selection_model.fit(select_X_train, y_train)
-     # eval model
-     select_X_test = selection.transform(X_test)
-     y_pred = selection_model.predict(select_X_test)
-     predictions = [round(value) for value in y_pred]
-     accuracy = accuracy_score(y_test, predictions)
-     print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

The output is as follows:
- Accuracy: 77.95%
- Thresh=0.071, n=8, Accuracy: 77.95%
- Thresh=0.073, n=7, Accuracy: 76.38%
- Thresh=0.084, n=6, Accuracy: 77.56%
- Thresh=0.090, n=5, Accuracy: 76.38%
- Thresh=0.128, n=4, Accuracy: 76.38%
- Thresh=0.160, n=3, Accuracy: 74.80%
- Thresh=0.186, n=2, Accuracy: 71.65%
- Thresh=0.208, n=1, Accuracy: 63.78%
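Reading the table, accuracy stays roughly level as features are removed down to about four or five, and only drops sharply once just one or two features remain, so most of the columns here add little extra signal.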
Many thanks to the original post https://www.cnblogs.com/stevenlk/p/6543628.html for its help.