Obtaining feature importance with xgboost: principle and practice


1 The principle
xgboost decides which feature to split on by computing the gain in the structure score for each candidate split. A feature's (weight) importance is then the total number of times it appears as a split point across all trees; in other words, the more often an attribute is used to build the decision trees in the model, the higher its relative importance.
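For reference, the structure-score gain that xgboost maximizes when evaluating a candidate split is the standard formula from the xgboost paper, where G_L, H_L (resp. G_R, H_R) are the sums of first- and second-order gradients of the loss in the left (resp. right) child, and λ, γ are the regularization parameters:

\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma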
2 Methods for ranking xgboost feature importance
1. xgboost can obtain feature importance through get_score:
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type=importance_type))
weight - the number of times the feature is used to split samples, summed over all trees.
gain - the average gain of the splits that use the feature, over all trees.
cover - the average coverage of the splits that use the feature, i.e. the average number of samples affected by those splits.
2. Use plot_importance to plot the features ranked by importance.
3. Test multiple thresholds and use a metric of classifier quality to select features from the importance ranking.
Below is an empirical analysis using the Kaggle heart disease dataset.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, roc_auc_score
from xgboost import plot_importance
from matplotlib import pyplot as plt
import xgboost as xgb
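The original post does not show the data being loaded or split. A minimal sketch, assuming the Kaggle heart-disease data sits in a local heart.csv with a binary target column (both the filename and the column name are assumptions):

# Load the heart-disease data and build the DMatrix used by xgb.train below
# ("heart.csv" and the "target" column are assumptions, not from the original post)
data = pd.read_csv('heart.csv')
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1301)
xgtrain = xgb.DMatrix(X_train, label=y_train)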
# Use xgb.train and then bst.get_score to obtain weight, gain, and cover
params = {'max_depth': 7,
          'n_estimators': 80,
          'learning_rate': 0.1,
          'nthread': 4,
          'subsample': 1.0,
          'colsample_bytree': 0.5,
          'min_child_weight': 3,
          'seed': 1301}
# note: with the native xgb.train API the number of trees is set by
# num_boost_round, so the 'n_estimators' entry above is ignored
bst = xgb.train(params, xgtrain, num_boost_round=1)
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type=importance_type))
import graphviz
xgb.plot_tree(bst)
plt.show()
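Note that plot_tree needs the graphviz executable installed. As an alternative sketch, xgb.to_graphviz (a real xgboost helper) returns the tree as a graphviz object that can be rendered to a file directly:

# Alternative: render the first boosted tree to a PDF without matplotlib
# (num_trees=0 selects the first tree; assumes the graphviz binary is on PATH)
g = xgb.to_graphviz(bst, num_trees=0)
g.render('tree0')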
The results are as follows:
weight: {'slope': 2, 'sex': 2, 'age': 7, 'chol': 13, 'trestbps': 9, 'restecg': 2}
gain: {'slope': 4.296458304, 'sex': 2.208011625, 'age': 0.8395543860142858, 'chol': 0.6131722695384615, 'trestbps': 0.49512829022222227, 'restecg': 0.679761901}
cover: {'slope': 116.5, 'sex': 106.0, 'age': 24.714285714285715, 'chol': 22.846153846153847, 'trestbps': 18.555555555555557, 'restecg': 18.0}
total_gain: {'slope': 8.592916608, 'sex': 4.41602325, 'age': 5.8768807021, 'chol': 7.971239503999999, 'trestbps': 4.456154612000001, 'restecg': 1.359523802}
total_cover: {'slope': 233.0, 'sex': 212.0, 'age': 173.0, 'chol': 297.0, 'trestbps': 167.0, 'restecg': 36.0}
Note that total_gain = weight × gain and total_cover = weight × cover (for slope: 2 × 4.296458304 = 8.592916608), so the totals are simply the per-split averages multiplied by the split count.
from sklearn.feature_selection import SelectFromModel
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# plot_importance: plot the features ranked by importance
from xgboost import plot_importance
plot_importance(model)
plt.show()
The result is a bar chart of the features ranked by importance (figure not reproduced here).
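By default, plot_importance ranks by 'weight' (split counts); its importance_type parameter accepts the same strings as get_score, so ranking by average gain instead is one line:

# Rank features by average gain instead of split count
plot_importance(model, importance_type='gain')
plt.show()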
We can select features from the importance ranking by testing multiple thresholds. In effect, the per-feature importance scores let us evaluate each importance-ranked feature subset against a quality metric.
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = np.sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
Accuracy: 84.62%
Thresh=0.025, n=13, Accuracy: 84.62%
Thresh=0.026, n=12, Accuracy: 80.22%
Thresh=0.026, n=11, Accuracy: 79.12%
Thresh=0.028, n=10, Accuracy: 76.92%
Thresh=0.032, n=9, Accuracy: 78.02%
Thresh=0.036, n=8, Accuracy: 80.22%
Thresh=0.041, n=7, Accuracy: 76.92%
Thresh=0.066, n=6, Accuracy: 76.92%
Thresh=0.085, n=5, Accuracy: 84.62%
Thresh=0.146, n=4, Accuracy: 80.22%
Thresh=0.151, n=3, Accuracy: 76.92%
Thresh=0.163, n=2, Accuracy: 74.73%
Thresh=0.174, n=1, Accuracy: 78.02%
From the results above, as the threshold rises and the number of features falls, accuracy tends to decrease. In general, using cross-validation as the model-evaluation scheme is likely a more useful strategy; this will be implemented with xgb.cv in the next post.
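As a rough preview, here is a minimal sketch of the cross-validated version of the threshold sweep using sklearn's cross_val_score (the next post uses xgb.cv instead; the 5-fold split and accuracy scoring are assumptions):

from sklearn.model_selection import cross_val_score

# Re-run the threshold sweep, scoring each feature subset with 5-fold CV
# instead of a single train/test split
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X = selection.transform(X)
    scores = cross_val_score(xgb.XGBClassifier(), select_X, y, cv=5, scoring='accuracy')
    print("Thresh=%.3f, n=%d, CV Accuracy: %.2f%% (+/- %.2f%%)"
          % (thresh, select_X.shape[1], scores.mean() * 100.0, scores.std() * 100.0))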