Recursive Feature Elimination (RFE)

Updated: 2023-07-31 20:46:18

## Overview

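Feature selection methods fall into three broad families: filter, wrapper, and embedded. Among the wrapper methods, sklearn implements only two, both based on recursive feature elimination: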

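Recursive feature elimination (RFE) uses the coef_ or feature_importances_ attribute returned by the learner to obtain the importance of each feature, then removes the least important feature(s) from the current feature set. This step is repeated on the shrinking feature set until the desired number of features is reached.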
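The elimination loop described above can be sketched by hand. This is only an illustration, not sklearn's implementation: `manual_rfe` is a hypothetical helper, LogisticRegression is an arbitrary choice of learner, and the importance measure (squared weights summed over classes) is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def manual_rfe(estimator, X, y, n_keep):
    """Hypothetical helper: drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_keep:
        estimator.fit(X[:, remaining], y)
        coef = np.atleast_2d(estimator.coef_)
        # Assumed importance measure: squared weights summed over classes.
        importance = (coef ** 2).sum(axis=0)
        weakest = int(np.argmin(importance))
        del remaining[weakest]
    return remaining


X, y = load_iris(return_X_y=True)
kept = manual_rfe(LogisticRegression(max_iter=1000), X, y, n_keep=2)
print("Kept feature indices:", kept)
```

The learner is refitted after every removal, which is why the procedure gets expensive as the number of features grows.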

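RFECV uses cross-validation to find the optimal number of features. If removing features would cost performance, no features are removed. This works quite well when selecting features for a single model, but it has two drawbacks: first, it is computationally expensive; second, the best feature subset changes when the learner (estimator) changes, which can sometimes be harmful.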

RFE

Performance gains and losses

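RFE's design makes it well suited to hands-on feature selection, but it shares a weakness with other selection methods: the original model may score worse on the reduced data set than on the full one, because the removed features still carried useful information. The code below shows this effect.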

from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn import model_selection

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection: keep the 2 highest-ranked features
estimator = LinearSVC()
selector = RFE(estimator=estimator, n_features_to_select=2)
X_t = selector.fit_transform(X, y)

# Split the original and the reduced data set in the same way
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
X_train_t, X_test_t, y_train_t, y_test_t = model_selection.train_test_split(
    X_t, y, test_size=0.25, random_state=0, stratify=y)

# Train and evaluate on both versions
clf = LinearSVC()
clf_t = LinearSVC()
clf.fit(X_train, y_train)
clf_t.fit(X_train_t, y_train_t)
print("Original DataSet: test score=%s" % (clf.score(X_test, y_test)))
print("Selected DataSet: test score=%s" % (clf_t.score(X_test_t, y_test_t)))

Original DataSet: test score=0.973684210526

Selected DataSet: test score=0.947368421053

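As the code shows, the model's score did drop after applying RFE. Just as with variance filtering and univariate feature selection, this method needs to be used with some care.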
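One caveat about the comparison above: RFE is fitted on all samples before the split, so the selection step has already seen the test rows. A possible way to avoid that is to put RFE inside a Pipeline and cross-validate the whole thing. The sketch below assumes 5-fold cross-validation and a raised max_iter; neither setting comes from the original code.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Baseline: the classifier on all four features.
baseline = LinearSVC(max_iter=10000)

# RFE lives inside the pipeline, so it is refitted on each training fold only.
reduced = Pipeline([
    ("rfe", RFE(estimator=LinearSVC(max_iter=10000), n_features_to_select=2)),
    ("clf", LinearSVC(max_iter=10000)),
])

baseline_scores = cross_val_score(baseline, X, y, cv=5)
reduced_scores = cross_val_score(reduced, X, y, cv=5)
print("All features : mean accuracy=%.3f" % baseline_scores.mean())
print("RFE (2 kept) : mean accuracy=%.3f" % reduced_scores.mean())
```

Averaging over several folds also makes the comparison less sensitive to one particular train/test split.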

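Some important attributes and parameters

n_features_to_select: an integer gives the number of features to keep; None keeps half of the features

step: an integer (>= 1) is the number of features removed at each iteration; a float between 0 and 1 is the fraction of features removed at each iteration (rounded down)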

print("N_features %s" % selector.n_features_)  # number of features kept
print("Support is %s" % selector.support_)  # mask: whether each feature is kept
print("Ranking %s" % selector.ranking_)  # importance rank (1 = kept)

N_features 2

Support is [False True False True]

Ranking [3 1 2 1]
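To see how n_features_to_select and step interact, here is a small sketch on synthetic data. The data set size and the LogisticRegression learner are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Integer step: drop one feature per iteration; None keeps half (10 -> 5).
by_one = RFE(LogisticRegression(max_iter=1000),
             n_features_to_select=None, step=1)
by_one.fit(X, y)

# Fractional step: drop int(0.3 * 10) = 3 features per iteration.
by_fraction = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=4, step=0.3)
by_fraction.fit(X, y)

print("step=1   ->", by_one.n_features_, by_one.ranking_)
print("step=0.3 ->", by_fraction.n_features_, by_fraction.ranking_)
```

With a larger step, features removed in the same iteration share a rank, so ranking_ becomes coarser.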

RFECV

How it works and what it offers

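RFECV uses cross-validation to keep the best-performing set of features. Building on RFE, it cross-validates the feature subsets produced along the elimination path while the learner itself stays the same. Feature importance still comes from the learner's coefficients; each candidate number of features gets a cross-validated score, and the best-scoring subset is kept. Its way of partitioning features is somewhat similar to column subsampling in random forests.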
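As a rough mental model, the idea can be imitated by cross-validating an RFE pipeline for every candidate feature count and keeping the best one. This is a simplified sketch, not how sklearn implements RFECV internally; the LogisticRegression learner and the 3-fold setting are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

mean_scores = {}
for k in range(1, n_features + 1):
    # RFE down to k features, then the same learner on the reduced data.
    model = make_pipeline(
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
        LogisticRegression(max_iter=1000),
    )
    mean_scores[k] = cross_val_score(model, X, y, cv=3).mean()

best_k = max(mean_scores, key=mean_scores.get)
print("Mean CV score per feature count:", mean_scores)
print("Best number of features:", best_k)
```

The loop refits the learner many times, which illustrates the computational cost mentioned earlier.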

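Some important attributes and parameters

step: an integer (>= 1) is the number of features removed at each iteration; a float between 0 and 1 is the fraction of features removed at each iteration (rounded down)

scoring: a string naming one of sklearn's scorers, or a scorer callable

cv:

None: the default number of folds (3 in older releases, 5 since scikit-learn 0.22)

integer: the number of folds

object: a cross-validation generator

iterable: yields train/test splits

For None or an integer, sklearn.model_selection.StratifiedKFold is used if the learner is a classifier and y is binary or multiclass. If the learner is not a classifier, or y is neither binary nor multiclass, sklearn.model_selection.KFold is used.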
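The different forms of cv listed above might be passed like this. It is a minimal sketch; the LogisticRegression learner, the fold counts, and the random seed are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
learner = LogisticRegression(max_iter=1000)

# cv as an integer: number of folds (stratified here, since y is multiclass).
by_int = RFECV(estimator=learner, step=1, cv=3, scoring="accuracy").fit(X, y)

# cv as a splitter object: full control over shuffling and the random seed.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
by_object = RFECV(estimator=learner, step=1, cv=splitter,
                  scoring="accuracy").fit(X, y)

# cv as an iterable of (train_indices, test_indices) pairs.
splits = list(splitter.split(X, y))
by_iterable = RFECV(estimator=learner, step=1, cv=splits,
                    scoring="accuracy").fit(X, y)

print(by_int.n_features_, by_object.n_features_, by_iterable.n_features_)
```

Because the splitter object and the precomputed iterable describe the same folds, the last two selectors should agree.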

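If the description above is not entirely clear, have a look at the examples below, or try it out yourself.

Example 1

Check, on the data set used for RFE above, which features should be kept: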

iris = load_iris()
X = iris.data
y = iris.target
estimator = LinearSVC()
selector = RFECV(estimator=estimator, cv=3)
selector.fit(X, y)
print("N_features %s" % selector.n_features_)
print("Support is %s" % selector.support_)
print("Ranking %s" % selector.ranking_)
# grid_scores_ exists in older scikit-learn releases;
# newer ones expose cv_results_["mean_test_score"] instead
print("Grid Scores %s" % selector.grid_scores_)

N_features 4

Support is [ True True True True]

Ranking [1 1 1 1]

Grid Scores [ 0.91421569 0.94689542 0.95383987 0.96691176]

Well, it looks like all of them should be kept.

Example 2

The power of RFECV:

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
print("Ranking of features : %s" % rfecv.ranking_)

# Plot number of features VS. cross-validation scores
# (grid_scores_ exists in older scikit-learn releases;
# newer ones expose cv_results_["mean_test_score"] instead)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Optimal number of features : 3

Ranking of features : [ 5 1 12 19 15 6 17 1 2 21 23 11 16 10 13 22 8 14 1 20 7 9 3 4 18]

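(Ahem, this is the key point.)

RFECV tells us that three features are enough. This matches the data we constructed, and it also shows the potential of RFECV; it looks set to become an important helper for feature selection from here on.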
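Depending on the installed scikit-learn version, the per-step scores live in different attributes (grid_scores_ in older releases, cv_results_ in newer ones). A version-tolerant sketch follows; the smaller synthetic data set is an assumption made to keep it quick, and transform is used to actually apply the selection.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(2), scoring="accuracy")
rfecv.fit(X, y)

# Newer releases expose cv_results_; older ones expose grid_scores_.
if hasattr(rfecv, "cv_results_"):
    scores = rfecv.cv_results_["mean_test_score"]
else:
    scores = rfecv.grid_scores_

# Keep only the selected columns for downstream training.
X_selected = rfecv.transform(X)
print("Scores per feature count:", scores)
print("Reduced shape:", X_selected.shape)
```

The scores array has one entry per candidate feature count, so it can be plotted the same way as in the example above.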

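Three special selectors based on multiple-comparison error rates

False positive rate: SelectFpr

False discovery rate: SelectFdr

Family-wise error: SelectFwe

For their practical meaning, please refer to

The code below demonstrates them.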

from sklearn.feature_selection import SelectFdr, f_classif, SelectFpr, SelectFwe, chi2, mutual_info_classif

iris = load_iris()
X = iris.data
y = iris.target

selector1 = SelectFpr(score_func=mutual_info_classif, alpha=0.5)
# alpha is the p-value threshold below which a feature is kept (default 0.05);
# score_func defaults to f_classif
selector1.fit(X, y)
print("\nScores of features %s" % selector1.scores_)
print("p-values of feature scores is %s" % selector1.pvalues_)
# mutual_info_classif returns no p-values, so there is nothing to threshold
# and the transform is left commented out
# print("Shape after transform is ", selector1.transform(X).shape)

# alpha is the upper bound on the expected false discovery rate
selector2 = SelectFdr(score_func=f_classif, alpha=4.37695696e-80)
selector2.fit(X, y)
print("\nScores of features %s" % selector2.scores_)
print("p-values of feature scores is %s" % selector2.pvalues_)
print("Shape after transform is ", selector2.transform(X).shape)

# alpha bounds the family-wise error rate: p-values are compared with alpha / n_features
selector3 = SelectFwe(score_func=chi2, alpha=1)
selector3.fit(X, y)
print("\nScores of features %s" % selector3.scores_)
print("p-values of feature scores is %s" % selector3.pvalues_)
print("Shape after transform is ", selector3.transform(X).shape)

Output:

Scores of features [ 0.54158942 0.21711645 0.99669173 0.99043692]

p-values of feature scores is None

Scores of features [ 119.26450218 47.3644614 1179.0343277 959.32440573]

p-values of feature scores is [ 1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]

Shape after transform is (150, 2)

Scores of features [ 10.81782088 3.59449902 116.16984746 67.24482759]

p-values of feature scores is [ 4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]

Shape after transform is (150, 4)
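To make the three thresholds more concrete, here is a plain-Python sketch of the decision rules as I understand them (a direct p-value cut for FPR, a Bonferroni-style cut for FWE, and the Benjamini-Hochberg procedure for FDR). The function names are hypothetical, and sklearn's own implementation may differ in detail. The p-values are copied from the output above.

```python
def select_fpr(pvalues, alpha):
    """Keep features whose p-value is below alpha."""
    return [p < alpha for p in pvalues]


def select_fwe(pvalues, alpha):
    """Bonferroni-style rule: compare each p-value with alpha / n."""
    n = len(pvalues)
    return [p < alpha / n for p in pvalues]


def select_fdr(pvalues, alpha):
    """Benjamini-Hochberg: the k-th smallest p-value must be <= alpha * k / n."""
    n = len(pvalues)
    ranked = sorted(pvalues)
    passing = [p for k, p in enumerate(ranked, start=1) if p <= alpha * k / n]
    if not passing:
        return [False] * n
    cutoff = max(passing)
    return [p <= cutoff for p in pvalues]


# p-values copied from the output above
p_f_classif = [1.66966919e-31, 1.32791652e-16, 3.05197580e-91, 4.37695696e-85]
p_chi2 = [4.47651499e-03, 1.65754167e-01, 5.94344354e-26, 2.50017968e-15]

fdr_mask = select_fdr(p_f_classif, alpha=4.37695696e-80)
fwe_mask = select_fwe(p_chi2, alpha=1)
fpr_mask = select_fpr(p_chi2, alpha=0.05)
print(fdr_mask)  # [False, False, True, True]
print(fwe_mask)  # [True, True, True, True]
print(fpr_mask)  # [True, False, True, True]
```

The first two masks keep 2 and 4 features respectively, which agrees with the shapes (150, 2) and (150, 4) printed above.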

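Generic univariate selection: GenericUnivariateSelect

Besides the selectors above, sklearn also wraps a generic univariate selector, GenericUnivariateSelect. Its hyperparameters decide which selection strategy it applies; there are only three of them, so it is very easy to use.

score_func: the scoring function (same meaning as before)

mode: the selection strategy to apply, one of those sklearn already wraps

param: each wrapped strategy has a hyperparameter that controls its threshold; param plays the same role here

Here is a simple example.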

from sklearn.feature_selection import GenericUnivariateSelect, f_classif

iris = load_iris()
X = iris.data
y = iris.target
selector = GenericUnivariateSelect(score_func=f_classif, mode='fpr', param=0.5)
# mode : {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}
selector.fit(X, y)
print("\nScores of features %s" % selector.scores_)
print("p-values of feature scores is %s" % selector.pvalues_)
print("Shape after transform is ", selector.transform(X).shape)
print("Support is ", selector.get_support())
print("Params is ", selector.get_params())

Scores of features [ 119.26450218 47.3644614 1179.0343277 959.32440573]

p-values of feature scores is [ 1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]

Shape after transform is (150, 4)

Support is [ True True True True]

Params is {'mode': 'fpr', 'param': 0.5, 'score_func': <function f_classif at 0x7f6ecee7d7b8>}
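One way to check how mode and param map onto the dedicated selector classes is to fit both and compare the resulting masks. This is a small sketch; k=2 and alpha=0.05 are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import (GenericUnivariateSelect, SelectFwe,
                                       SelectKBest, f_classif)

X, y = load_iris(return_X_y=True)

# mode='k_best' with param=2 should match SelectKBest(k=2).
generic_k = GenericUnivariateSelect(score_func=f_classif, mode="k_best",
                                    param=2).fit(X, y)
kbest = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# mode='fwe' with param=0.05 should match SelectFwe(alpha=0.05).
generic_fwe = GenericUnivariateSelect(score_func=f_classif, mode="fwe",
                                      param=0.05).fit(X, y)
fwe = SelectFwe(score_func=f_classif, alpha=0.05).fit(X, y)

k_agrees = np.array_equal(generic_k.get_support(), kbest.get_support())
fwe_agrees = np.array_equal(generic_fwe.get_support(), fwe.get_support())
print("k_best agrees:", k_agrees)
print("fwe agrees   :", fwe_agrees)
```

Having a single class with a mode switch is convenient when the strategy itself is something to tune, for example inside a grid search.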

