Outputting feature importance with xgboost in Python: weight, gain, and the other importance metrics
Official description
In Python, xgboost exposes feature importance through get_fscore / get_score. First, here is the official description of the method:
get_score(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
'weight': the number of times a feature is used to split the data across all trees.
'gain': the average gain across all splits the feature is used in.
'cover': the average coverage across all splits the feature is used in.
'total_gain': the total gain across all splits the feature is used in.
'total_cover': the total coverage across all splits the feature is used in.
The definitions alone are not very intuitive, so below we train a simple model, print these importance metrics, and interpret them against the definitions.
Code walkthrough
First, construct a dataset of 10 examples, each with two features and a 0/1 label, i.e. a binary classification problem:
import numpy as np
sample_num = 10
feature_num = 2
np.random.seed(0)
data = np.random.randn(sample_num, feature_num)
np.random.seed(0)
label = np.random.randint(0, 2, sample_num)
Print data and label:
# data:
array([[ 1.76405235, 0.40015721],
[ 0.97873798, 2.2408932 ],
[ 1.86755799, -0.97727788],
[ 0.95008842, -0.15135721],
[-0.10321885, 0.4105985 ],
[ 0.14404357, 1.45427351],
[ 0.76103773, 0.12167502],
[ 0.44386323, 0.33367433],
[ 1.49407907, -0.20515826],
[ 0.3130677 , -0.85409574]])
# label:
array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])
Now train the model. To keep the calculations below easy to follow, the tree depth is set to 3 ('max_depth': 3) and only one tree is used (num_boost_round=1):
import xgboost as xgb
train_data = xgb.DMatrix(data, label=label)
params = {'max_depth': 3}
bst = xgb.train(params, train_data, num_boost_round=1)
Print the importance metrics:
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type=importance_type))
The output:
weight: {'f0': 1, 'f1': 2}
gain: {'f0': 0.265151441, 'f1': 0.375000015}
cover: {'f0': 10.0, 'f1': 4.0}
total_gain: {'f0': 0.265151441, 'f1': 0.75000003}
total_cover: {'f0': 10.0, 'f1': 8.0}
Plot the single tree in the model:
xgb.to_graphviz(bst, num_trees=0)
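As a side note, to_graphviz relies on the graphviz package being installed; xgb.plot_tree can render the same tree onto a matplotlib figure instead (a minimal sketch, assuming matplotlib and graphviz are both available):
import matplotlib.pyplot as plt
# Draw the first (and only) tree of the booster on a matplotlib axis.
xgb.plot_tree(bst, num_trees=0)
plt.show()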
Using this tree plot, each metric is explained below.
weight: {'f0': 1, 'f1': 2}
weight is the number of times a feature is used to split a node, counted over all trees. In this example, f0 is used to split the 1st node and f1 is used to split the 2nd and 3rd nodes, so weight_f0 = 1 and weight_f1 = 2.
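As a sanity check, weight can be reproduced by counting the split features in the model's text dump, where each internal node is written like [f0<0.855563045] (a small sketch using the bst object from above):
import re
from collections import Counter
# Count how many times each feature appears as a split across all dumped trees.
splits = Counter()
for tree in bst.get_dump():
    splits.update(re.findall(r'\[(f\d+)<', tree))
print(splits)  # expected to match weight: Counter({'f1': 2, 'f0': 1})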
total_cover: {'f0': 10.0, 'f1': 8.0}
At the 1st node, f0 is used to split all 10 examples, and f0 is not used at any later node, so f0's total_cover is 10.0; at this node, 5 examples satisfy f0 >= 0.855563045 and fall into the right subtree.
At the 2nd node, f1 is used to split those 5 examples in the right subtree; 3 of them satisfy f1 >= -0.178257734 and fall into the right subtree.
At the 3rd node, f1 is used to split those 3 examples in the right subtree.
To summarize, f0 splits 10 examples at the 1st node, so total_cover_f0 = 10; f1 splits 5 and 3 examples at the 2nd and 3rd nodes respectively, so total_cover_f1 = 5 + 3 = 8. total_cover is therefore the total number of examples a feature handles (covers) over all the node splits it is used in, across all trees.
cover: {'f0': 10.0, 'f1': 4.0}
cover = total_cover / weight. In this example, cover_f0 = 10 / 1 = 10 and cover_f1 = 8 / 2 = 4.
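These averages can be verified directly from the dictionaries returned by get_score (a small sketch, again using the bst object from above):
weight = bst.get_score(importance_type='weight')
total_cover = bst.get_score(importance_type='total_cover')
total_gain = bst.get_score(importance_type='total_gain')
for f in weight:
    # cover and gain are the per-split averages of their total_* counterparts.
    print(f, 'cover =', total_cover[f] / weight[f], 'gain =', total_gain[f] / weight[f])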
total_gain: {'f0': 0.265151441, 'f1': 0.75000003}
total_gain is the total gain a feature contributes over all the node splits it is used in, across all trees. Intuitively, if the information content before and after a split is i0 and i1 (measured, say, with entropy or Gini impurity), the gain of that split is (i0 - i1).
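Strictly speaking, xgboost does not use entropy or Gini here; the gain of a single split is the reduction in its regularized training objective, which the XGBoost paper writes as
Gain = 1/2 * [ G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - (G_L + G_R)^2 / (H_L + H_R + lambda) ] - gamma
where G_L, G_R (H_L, H_R) are the sums of first-order (second-order) gradients of the examples falling into the left and right child, and lambda, gamma are regularization parameters. The intuition is the same as the (i0 - i1) description above: a larger value means a more useful split.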
gain: {'f0': 0.265151441, 'f1': 0.375000015}
gain = total_gain / weight. In this example, gain_f0 = 0.265151441 / 1 and gain_f1 = 0.75000003 / 2 = 0.375000015.
In everyday use, total_gain is the metric most often used to rank feature importance.
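For example, ranking features by total_gain only needs a sort over the dictionary returned by get_score (a small sketch):
total_gain = bst.get_score(importance_type='total_gain')
# Sort features from most to least important by their total gain.
ranked = sorted(total_gain.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # e.g. [('f1', 0.75000003), ('f0', 0.265151441)]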
There is another way to build an xgboost classifier, similar to the estimators in sklearn, where the model is trained through the fit interface:
from xgboost import XGBClassifier
cls = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                    colsample_bytree=1, gamma=0, learning_rate=0.07, max_delta_step=0,
                    max_depth=3, min_child_weight=1, missing=None, n_estimators=300,
                    n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
                    reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                    silent=True, subsample=1)
# train the model
cls.fit(data, label)
The feature importance metrics are then obtained in much the same way:
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, cls.get_booster().get_score(importance_type=importance_type))
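The sklearn-style wrapper also exposes a feature_importances_ attribute, which returns one normalized score per input column; which of the metrics above it is derived from is controlled by the classifier's importance_type argument, and its default differs between xgboost versions (a minimal sketch, assuming a recent xgboost release):
# One normalized importance value per feature column, derived from importance_type.
print(cls.feature_importances_)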