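Obtaining feature importance with xgboost: principles and practice
xgboost decides which feature to split on according to the gain in the structure score. The importance of a feature is then the total number of times it appears across all trees (the weight type described below). In other words, the more often an attribute is used to build the decision trees in the model, the more important it is considered to be.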
2 Methods for ranking feature importance in xgboost
1. xgboost can return feature importances through get_score
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type=importance_type))
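weight - the number of times the feature is used to split the samples, counted across all trees.
gain - the average gain of the splits that use the feature, across all trees.
cover - the average coverage of the splits that use the feature, across all trees. (I do not fully understand this one yet; xgboost measures the coverage of a split by the training samples that reach it, weighted by their second-order gradients.)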
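2. Use plot_importance to plot the features ranked by importance
3. Test several thresholds on the feature importances and use a classifier evaluation metric to choose which features to keep.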
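The five importance types in item 1 are all aggregations of per-split statistics. The sketch below shows how they relate, in plain Python. The split records and the helper `importance` are hypothetical and made up for illustration; this is not xgboost's internal implementation.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, cover of the split).
# The numbers are invented for illustration and do not come from a real model.
splits = [
    ("age", 0.9, 30.0),
    ("chol", 0.5, 20.0),
    ("age", 0.3, 10.0),
    ("slope", 4.0, 100.0),
]

def importance(splits, importance_type="weight"):
    """Aggregate per-split statistics into one score per feature."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    if importance_type == "weight":
        return dict(count)
    if importance_type == "total_gain":
        return dict(total_gain)
    if importance_type == "total_cover":
        return dict(total_cover)
    if importance_type == "gain":
        # average gain = total gain / number of splits using the feature
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        # average cover = total cover / number of splits using the feature
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError("unknown importance_type: %s" % importance_type)

for t in ("weight", "gain", "cover", "total_gain", "total_cover"):
    print("%s: " % t, importance(splits, t))
```

The point of the sketch is that gain and cover are per-split averages, so a feature used in only a few high-gain splits can rank above one that is used often.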
The following is an empirical analysis using the heartdisease data from Kaggle.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, roc_auc_score
from xgboost import plot_importance
from matplotlib import pyplot as plt
import xgboost as xgb
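# Use get_score on the Booster returned by xgb.train to obtain weight, gain, and cover.
# xgtrain is assumed to be an xgb.DMatrix built from the training split (its construction is not shown here).
# Note: 'n_estimators' belongs to the sklearn wrapper; xgb.train takes the number of trees from num_boost_round.
params = {'max_depth': 7,
          'n_estimators': 80,
          'learning_rate': 0.1,
          'nthread': 4,
          'subsample': 1.0,
          'colsample_bytree': 0.5,
          'min_child_weight': 3,
          'seed': 1301}
bst = xgb.train(params, xgtrain, num_boost_round=1)
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type=importance_type))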
import graphviz
xgb.plot_tree(bst)
plt.show()
The results are as follows:
weight: {'slope': 2, 'sex': 2, 'age': 7, 'chol': 13, 'trestbps': 9, 'restecg': 2}
gain: {'slope': 4.296458304, 'sex': 2.208011625, 'age': 0.8395543860142858, 'chol': 0.6131722695384615, 'trestbps': 0.49512829022222227, 'restecg': 0.679761901}
cover: {'slope': 116.5, 'sex': 106.0, 'age': 24.714285714285715, 'chol': 22.846153846153847, 'trestbps': 18.555555555555557, 'restecg': 18.0}
total_gain: {'slope': 8.592916608, 'sex': 4.41602325, 'age': 5.8768807021, 'chol': 7.971239503999999, 'trestbps': 4.456154612000001, 'restecg': 1.359523802}
total_cover: {'slope': 233.0, 'sex': 212.0, 'age': 173.0, 'chol': 297.0, 'trestbps': 167.0, 'restecg': 36.0}
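As a quick sanity check on the numbers above, the sketch below confirms that total_gain equals gain times weight for every feature, and shows that the ranking depends on which importance type is used. The values are copied from the printed output, with the second key read as `sex`; the helper `ranking` is hypothetical.

```python
# Values copied from the get_score output above (second key read as 'sex').
weight = {"slope": 2, "sex": 2, "age": 7, "chol": 13, "trestbps": 9, "restecg": 2}
gain = {"slope": 4.296458304, "sex": 2.208011625, "age": 0.8395543860142858,
        "chol": 0.6131722695384615, "trestbps": 0.49512829022222227,
        "restecg": 0.679761901}
total_gain = {"slope": 8.592916608, "sex": 4.41602325, "age": 5.8768807021,
              "chol": 7.971239503999999, "trestbps": 4.456154612000001,
              "restecg": 1.359523802}

# Largest deviation between gain * weight and the reported total_gain.
max_err = max(abs(gain[f] * weight[f] - total_gain[f]) for f in weight)
print("max |gain * weight - total_gain| =", max_err)

def ranking(scores):
    """Hypothetical helper: feature names sorted from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

print("by weight:    ", ranking(weight))
print("by gain:      ", ranking(gain))
print("by total_gain:", ranking(total_gain))
```

Here chol ranks first by weight, while slope ranks first by gain and by total_gain, so it is worth stating which importance type a ranking is based on.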
from sklearn.feature_selection import SelectFromModel
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# plot_importance: plot the features ranked by importance
from xgboost import plot_importance
plot_importance(model)
plt.show()
The result is as follows:
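# We can select features from the feature importances by testing several thresholds.
# Specifically, the importance of each input variable lets us test feature subsets ordered by importance.
# make predictions for test data and evaluate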
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = np.sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
Accuracy: 84.62%
Thresh=0.025, n=13, Accuracy: 84.62%
Thresh=0.026, n=12, Accuracy: 80.22%
Thresh=0.026, n=11, Accuracy: 79.12%
Thresh=0.028, n=10, Accuracy: 76.92%
Thresh=0.032, n=9, Accuracy: 78.02%
Thresh=0.036, n=8, Accuracy: 80.22%
Thresh=0.041, n=7, Accuracy: 76.92%
Thresh=0.066, n=6, Accuracy: 76.92%
Thresh=0.085, n=5, Accuracy: 84.62%
Thresh=0.146, n=4, Accuracy: 80.22%
Thresh=0.151, n=3, Accuracy: 76.92%
Thresh=0.163, n=2, Accuracy: 74.73%
Thresh=0.174, n=1, Accuracy: 78.02%
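As the results above show, accuracy tends to decrease as the threshold rises and the number of features falls, although not monotonically (with n=5 it returns to 84.62%). In general, using cross-validation as the model evaluation scheme is probably a more useful strategy. This will be implemented with xgb.cv in the next article.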
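The thresholding rule used in the loop can be sketched in plain Python: SelectFromModel keeps the features whose importance is greater than or equal to the threshold, so iterating over the sorted importances removes one feature at a time (when there are no ties). The importances and the helper `select_features` below are hypothetical and only mimic that rule.

```python
# Hypothetical importances for illustration; not taken from the fitted model.
importances = {"a": 0.10, "b": 0.40, "c": 0.20, "d": 0.30}

def select_features(importances, threshold):
    """Keep the features whose importance is at least the threshold."""
    return [name for name, score in importances.items() if score >= threshold]

# Using each importance value as a threshold, from smallest to largest.
for thresh in sorted(importances.values()):
    kept = select_features(importances, thresh)
    print("Thresh=%.3f, n=%d, kept=%s" % (thresh, len(kept), kept))
```

This is why the printed n drops from 13 down to 1 in the results: each successive threshold excludes the least important feature still remaining.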