Feature importance -- feature_importance
There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy".
Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, "Classification and regression trees", 1984.
In other words, the scikit-learn developers point out that there is no single precise definition of feature importance.
There are two common ways to implement feature importance:
(1) mean decrease in node impurity:
feature importance is calculated by looking at the splits of each tree.
The importance of the splitting variable is proportional to the improvement to the gini index given by that split and it is accumulated (for each variable) over all the trees in the forest.
In other words, for every tree we compute the improvement in the splitting criterion (gini or entropy) contributed by each splitting feature, and then aggregate these improvements over all trees to obtain the feature weights.
(2) mean decrease in accuracy:
This method, proposed in the original paper, passes the OOB samples down the tree and records prediction accuracy.
A variable is then selected and its values in the OOB samples are randomly permuted. The OOB samples are passed down the tree and accuracy is computed again.
A decrease in accuracy obtained by this permutation is averaged over all trees for each variable and it provides the importance of that variable (the higher the decrease the higher the importance). Simply put, if a feature is very important, then even a small change to its values has a large effect on the model.
Since generating new data is cumbersome, we can simply shuffle that feature's column in the OOB data and evaluate again: the accuracy before shuffling minus the accuracy after shuffling is the importance of that feature. This method is also called permutation importance (a sketch is shown below).
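Below is a minimal sketch of this permutation idea. It is not the randomForest or eli5 implementation; an ordinary hold-out set stands in for the OOB samples, and all names are illustrative.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# a plain hold-out split stands in for the OOB samples in this sketch
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = rf.score(X_test, y_test)

rng = np.random.RandomState(0)
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle one feature column
    print(f"feature {j}: accuracy drop = {baseline - rf.score(X_perm, y_test):.3f}")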
Taking random forest as an example
Take random forest as an example: the feature importance attribute helps make the model interpretable. Even for a decision tree, a model usually regarded as highly interpretable, a human can hardly explain its predictions once the tree becomes very large. A random forest is typically made up of hundreds of trees, which is even harder to interpret. Fortunately, we can identify which features are more important and use that to help explain the model. More importantly, we can drop some unimportant features to reduce noise; compared with the components obtained from PCA dimensionality reduction, the result is far easier for humans to understand.
scikit-learn implements the first method (mean decrease in node impurity). Let's see how it behaves on the iris dataset.
First, how are the weights computed for a single decision tree?
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3)
dt.fit(X, y)
Use Graphviz to draw the resulting decision tree.
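One way to do this, continuing from the dt and iris objects above (the output file name is arbitrary and rendering requires the Graphviz dot tool):

from sklearn.tree import export_graphviz

export_graphviz(dt, out_file='iris_tree.dot',
                feature_names=iris.feature_names,
                class_names=iris.target_names, filled=True)
# render with: dot -Tpng iris_tree.dot -o iris_tree.png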
The tree code in scikit-learn is written in Cython; the relevant part of the source lives in the compute_feature_importances method.
Based on the tree generated above, we can compute the weights of feature 2 and feature 3 by hand:
Feature 2: 1.585*150 - 0*50 - 1.0*100 = 137.75
Feature 3: 1.0*100 - 0.445*54 - 0.151*46 = 69.024
After normalization this gives [0, 0, 0.666, 0.334], which matches the output of scikit-learn.
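A quick self-contained check of this arithmetic (the impurity values above are rounded, so the exact output may differ slightly across scikit-learn versions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3).fit(iris.data, iris.target)

manual = np.array([0.0, 0.0, 137.75, 69.024])  # unnormalized values computed by hand above
print(manual / manual.sum())                   # ~[0. 0. 0.666 0.334]
print(dt.feature_importances_)                 # should closely match the manual result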
Once we have the feature importance vector of each tree, we sum and average them to get the forest-level result; the corresponding code in scikit-learn is as follows:
def feature_importances_(self):
    """Return the feature importances (the higher, the more important the
    feature).

    Returns
    -------
    feature_importances_ : array, shape = [n_features]
    """
    check_is_fitted(self, 'n_outputs_')

    if self.estimators_ is None or len(self.estimators_) == 0:
        raise ValueError("Estimator not fitted, "
                         "call `fit` before `feature_importances_`.")

    all_importances = Parallel(n_jobs=self.n_jobs, backend="threading")(
        delayed(getattr)(tree, 'feature_importances_')
        for tree in self.estimators_)

    return sum(all_importances) / len(self.estimators_)
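In practice this is exposed as the feature_importances_ attribute of the fitted ensemble, for example:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)  # per-feature impurity-decrease scores, averaged over trees and summing to 1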
Selecting variables with permutation importance
Why introduce permutation importance:
Decision trees are well suited to finding interpretable non-linear prediction rules, although concerns have been raised about their instability and lack of smoothness (Hastie et al.). The random forest (RF; Breiman) classifier was designed to overcome these problems, and it has become widely popular because it combines the interpretability of decision trees with the performance of modern learning algorithms such as artificial neural networks and SVMs. The authors of RF proposed two measures for feature ranking: variable importance (VI) and Gini importance (GI). A recent study showed that, if the predictors are categorical, both measures are biased towards variables with more categories (Strobl et al.). The authors of that paper attribute the bias to the use of bootstrap sampling and the Gini splitting criterion when training classification and regression trees (CART; Breiman et al.). Bias induced by the Gini coefficient has been reported in the literature for many years (Bourguignon; Pyatt et al.), and it generally affects not only categorical variables but also grouped variables (i.e. variables whose values cluster into several distinct groups, such as a multimodal Gaussian distribution). In biology, predictors often take categorical or grouped values (e.g. microarrays and sequence mutations). Strobl et al. proposed a new algorithm (cforest) that builds RF models from conditional inference trees (Hothorn et al.) and computes bias-corrected VI values.
Learning problems on biological data typically involve a very large number of features and few available samples. It is common practice to filter out unimportant features before fitting a model, for example by rejecting the features least associated with the outcome. Mutual information (MI) is an association measure frequently used in this setting (Guyon and Elisseeff). It is closely related to the Gini index, and it has likewise been shown to be biased towards variables with more categories (Achard et al.).
When building tree models (XGBoost, LightGBM, etc.), if we want to know which variables matter, we can obtain feature importances through the model's feature_importances_ attribute.
For example, LightGBM's feature_importances_ can measure a feature either by the number of splits that use it or by the gain obtained from splits on it. In general, different criteria yield different feature importance orderings.
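As a brief sketch, assuming a recent lightgbm version, the two criteria can be selected through the importance_type parameter of the scikit-learn interface:

from sklearn.datasets import load_iris
from lightgbm import LGBMClassifier

X, y = load_iris(return_X_y=True)

# importance measured as the number of splits that use each feature
clf_split = LGBMClassifier(importance_type='split').fit(X, y)
print(clf_split.feature_importances_)

# importance measured as the total gain contributed by each feature's splits
clf_gain = LGBMClassifier(importance_type='gain').fit(X, y)
print(clf_gain.feature_importances_)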
A common practice is therefore to cross-check features against several criteria: if a feature ranks as important under different criteria, it is likely to have good predictive power for the label.
Here we introduce another way to evaluate feature importance: PermutationImportance.
The documentation describes the method as follows:
eli5 provides a way to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available;
the method is also known as "permutation importance" or "Mean Decrease Accuracy (MDA)".
If replacing a feature with random values makes the model perform much worse, that feature is important; otherwise it is not.
Below is a simple example: we use different models (RF, LightGBM, LR) to rank variables, pick the 30 most important ones (out of 200+ in total), and then build the model on them.
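The following is a minimal sketch of that workflow with eli5's PermutationImportance, using a random forest and synthetic data as a stand-in for the real 200+-feature dataset (all names are illustrative):

import numpy as np
from eli5.sklearn import PermutationImportance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for a dataset with 200+ candidate features
X, y = make_classification(n_samples=1000, n_features=200, n_informative=30, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# permutation importance is estimated on held-out data with the already-fitted model
perm = PermutationImportance(rf, random_state=0).fit(X_valid, y_valid)

# indices of the 30 features with the largest mean accuracy drop
top30 = np.argsort(perm.feature_importances_)[::-1][:30]
print(top30)

In a notebook, eli5.show_weights(perm) renders the same scores as a table.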