Interpreting the Output of an XGBoost Model

Updated: 2023-05-15 07:48:51

杨雪

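1. Problem description

Recently I have been using the xgboost algorithm in Python for several machine learning tasks. Along the way I also used its built-in functions to visualize the trees, but I only half understood what the leaf values meant. I also found that when scoring a test set with xgboost's built-in predict, many samples received exactly the same output score. How should this be explained? After reading through related questions online and the answers to the issues I found, I think I now have a preliminary understanding of the problem.

2. Dataset

Here we use the classic iris dataset. Since we want a binary classification problem, we take only the first 100 rows.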

from sklearn import datasets

iris = datasets.load_iris()

data = iris.data[:100]

print(data.shape)

# (100, 4)

# 100 samples in total, each with 4 features

label = iris.target[:100]

print(label)

# The first 100 rows contain exactly the samples labelled 0 and 1

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

3. Training and test sets

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(data, label, random_state=0)

4. Modeling with XGBoost

4.1 Model initialization settings

import xgboost as xgb

dtrain=xgb.DMatrix(train_x,label=train_y)

dtest=xgb.DMatrix(test_x)

params={'booster':'gbtree',

'objective': 'binary:logistic',

'eval_metric': 'auc',

'max_depth':4,

'lambda':10,

'subsample':0.75,

'colsample_bytree':0.75,

'min_child_weight':2,

'eta': 0.025,

'seed':0,

'nthread':8,

'silent':1}

watchlist = [(dtrain,'train')]

4.2 Training and prediction

bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)

ypred = bst.predict(dtest)

# Apply a threshold and print some evaluation metrics

y_pred = (ypred >= 0.5)*1

from sklearn import metrics

print('AUC: %.4f' % metrics.roc_auc_score(test_y, ypred))

print('ACC: %.4f' % metrics.accuracy_score(test_y, y_pred))

print('Recall: %.4f' % metrics.recall_score(test_y, y_pred))

print('F1-score: %.4f' % metrics.f1_score(test_y, y_pred))

print('Precision: %.4f' % metrics.precision_score(test_y, y_pred))

metrics.confusion_matrix(test_y, y_pred)

Out[23]:

AUC: 1.0000

ACC: 1.0000

Recall: 1.0000

F1-score: 1.0000

Precision: 1.0000

array([[13, 0],

[ 0, 12]], dtype=int64)

Yeah, a perfect model and perfect predictions!

4.3 Visualizing the output

# There are three kinds of prediction output

bst.predict

Signature: bst.predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False)

pred_leaf : bool

When this option is on, the output will be a matrix of (nsample, ntrees)

with each record indicating the predicted leaf index of each sample in each tree.

Note that the leaf index of a tree is unique per tree, so you may find leaf 1

in both tree 1 and tree 0.

pred_contribs : bool

When this option is on, the output will be a matrix of (nsample, nfeats+1)

with each record indicating the feature contributions (SHAP values) for that

prediction. The sum of all feature contributions is equal to the prediction.

Note that the bias is added as the final column, on top of the regular features.

4.3.1 Scores

The default output is the score. There is not much to explain here, so straight to the code.

ypred = bst.predict(dtest)

ypred

Out[32]:

array([ 0.20081411, 0.80391562, 0.20081411, 0.80391562, 0.80391562,

0.80391562, 0.20081411, 0.80391562, 0.80391562, 0.80391562,

0.80391562, 0.80391562, 0.80391562, 0.20081411, 0.20081411,

0.20081411, 0.20081411, 0.20081411, 0.20081411, 0.20081411,

0.20081411, 0.80391562, 0.20081411, 0.80391562, 0.20081411], dtype=float32)

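Here we can see the problem raised at the start of the article: why do so many samples get the same score? In fact there are only two distinct values. Before answering, let's look at the other two kinds of output.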
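As a quick check of how few distinct scores there are, the sketch below counts them and applies the 0.5 threshold from section 4.2. The 25 values are copied from the output above; the names low, high, score_counts and class_counts are illustrative and not part of the original code.

```python
from collections import Counter

# Scores copied from the bst.predict(dtest) output above (25 test samples).
low, high = 0.20081411, 0.80391562
ypred = [
    low, high, low, high, high,
    high, low, high, high, high,
    high, high, high, low, low,
    low, low, low, low, low,
    low, high, low, high, low,
]

# Count how many samples share each distinct score.
score_counts = Counter(ypred)
print(score_counts)

# Apply the same 0.5 threshold used in section 4.2.
y_pred = [int(score >= 0.5) for score in ypred]
class_counts = Counter(y_pred)
print(class_counts)
```

The two counts, 13 and 12, line up with the diagonal of the confusion matrix printed in section 4.2.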

4.3.2 Leaf node assignments

When pred_leaf=True is set, the output gives the leaf node that each sample falls into in every tree.

ypred_leaf = bst.predict(dtest, pred_leaf=True)

ypred_leaf

Out[33]:

array([[1, 1, 1, ..., 1, 1, 1],

[2, 2, 2, ..., 2, 2, 2],

[1, 1, 1, ..., 1, 1, 1],

...,

[1, 1, 1, ..., 1, 1, 1],

[2, 2, 2, ..., 2, 2, 2],

[1, 1, 1, ..., 1, 1, 1]])

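The output has shape [number of samples, number of trees]. We trained with num_boost_round=100 and the test set has 25 samples, so ypred_leaf has shape [25, 100].

Reading the first row: in all 100 xgboost trees, the first sample is predicted to land in leaf 1 (leaf indices are relative to each tree).

So how do we look at each tree and the values of its leaf nodes? There are two ways: visualize the tree, or dump the model directly.

xgb.to_graphviz(bst, num_trees=0)

# Visualize the first tree

# Dump the model's boosting iterations directly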

bst.dump_model("")

booster[0]:

0:[f3<0.75] yes=1,no=2,missing=1

1:leaf=-0.019697

2:leaf=0.0214286

booster[1]:

0:[f2<2.35] yes=1,no=2,missing=1

1:leaf=-0.0212184

2:leaf=0.0212

booster[2]:

0:[f2<2.35] yes=1,no=2,missing=1

1:leaf=-0.0197404

2:leaf=0.0197235

booster[3]: ……

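The command above dumps every boosting iteration. Each tree has just two leaf nodes (the trees are very simple). If we sum the value of leaf 1 over all trees and then apply the appropriate transformation function, we get the predicted value for the first sample.

Take the first sample: it falls into leaf 1 in every tree, so summing those leaf values gives the leaf1 figure below.

Likewise, the second sample falls into leaf 2 in every tree, and summing those leaf values gives the leaf2 figure below.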

leaf1 -1.381214

leaf2 1.410950
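The summation can be sketched in plain Python by parsing the text dump. The snippet below is only an illustration: it uses the three trees shown above (the other 97 are omitted in the text, so the partial sums will not match the leaf1 and leaf2 totals), and the helper sum_leaf_values is a hypothetical name. Summing by node id only works here because every tree has the same single-split shape and each sample lands on the same node id in every tree.

```python
import re
from collections import defaultdict

# First three trees, copied from the dump above; the remaining trees are
# omitted in the text, so these are partial sums only.
dump_text = """\
booster[0]:
0:[f3<0.75] yes=1,no=2,missing=1
1:leaf=-0.019697
2:leaf=0.0214286
booster[1]:
0:[f2<2.35] yes=1,no=2,missing=1
1:leaf=-0.0212184
2:leaf=0.0212
booster[2]:
0:[f2<2.35] yes=1,no=2,missing=1
1:leaf=-0.0197404
2:leaf=0.0197235
"""

# Match lines such as "1:leaf=-0.019697" and capture (node id, leaf value).
leaf_pattern = re.compile(
    r"^\s*(\d+):leaf=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def sum_leaf_values(text):
    """Sum leaf values per node id across all trees in a text dump."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = leaf_pattern.match(line)
        if match:
            totals[int(match.group(1))] += float(match.group(2))
    return dict(totals)


partial_sums = sum_leaf_values(dump_text)
print(partial_sums)
```

Run over the full 100-tree dump, the same loop should reproduce the two totals listed above.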

When we initialized the model we set 'objective': 'binary:logistic', so the summed value is converted into the actual score with the logistic (sigmoid) function:

f(x)=1/(1+exp(−x))

import numpy as np

1/float(1+np.exp(1.38121416))

Out[24]: 0.20081407112186503

1/float(1+np.exp(-1.410950))

Out[25]: 0.8039157403338895

These match the scores returned by ypred = bst.predict(dtest).
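The conversion can also be written as a small pair of functions using only the standard library. This is a minimal sketch that assumes, as the numbers above suggest, that the raw margin is just the sum of leaf values; the names sigmoid and logit are chosen here for illustration.

```python
import math


def sigmoid(margin):
    """Logistic function: map a raw margin (sum of leaf values) to a score."""
    return 1.0 / (1.0 + math.exp(-margin))


def logit(probability):
    """Inverse of the logistic function: recover the margin from a score."""
    return math.log(probability / (1.0 - probability))


# Leaf-value sums reported above for the two groups of samples.
margin_leaf1 = -1.381214
margin_leaf2 = 1.410950

score_leaf1 = sigmoid(margin_leaf1)
score_leaf2 = sigmoid(margin_leaf2)
print(score_leaf1, score_leaf2)

# Going the other way recovers the margins from the predicted scores.
print(logit(0.20081411), logit(0.80391562))
```

Because the mapping is one-to-one, two samples receive the same score exactly when their summed leaf values are the same.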

4.3.3 Feature contributions

Next, let's look at the third kind of output, which gives each feature's contribution to the score.

ypred_contribs = bst.predict(dtest, pred_contribs=True)

ypred_contribs

Out[37]:

array([[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663],

[ 0. , 0. , 0.96967536, 0.39522746, 0.04604663],

[ 0. , 0. , -1.01448286, -0.41277751, 0.04604663]], dtype=float32)

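The output ypred_contribs has shape [25, 5] (25 test samples). As the docstring quoted earlier explains, the last column is the bias and the first four columns are each feature's contribution to the final score. We can see that the first two features contribute nothing.

How does this output relate to the final score? The principle is the same as before. Again take the first two rows as an example.

score_a = sum(ypred_contribs[0])

print(score_a)

# -1.38121373579

score_b = sum(ypred_contribs[1])

print(score_b)

# 1.41094945744

These are the same values as the leaf sums, and the same treatment applies: passing them through the logistic function gives the scores seen earlier.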
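To make the link explicit, the sketch below adds up the two distinct contribution rows from the output above, compares the totals with the leaf-value sums from the leaf-node section, and converts them to scores. All numbers are copied from the text; the variable names are illustrative.

```python
import math

# Two distinct rows from the pred_contribs output above:
# four feature contributions followed by the bias term.
contribs_row_a = [0.0, 0.0, -1.01448286, -0.41277751, 0.04604663]
contribs_row_b = [0.0, 0.0, 0.96967536, 0.39522746, 0.04604663]

# Leaf-value sums reported in the leaf-node section.
leaf_sum_a = -1.381214
leaf_sum_b = 1.410950

margin_a = sum(contribs_row_a)
margin_b = sum(contribs_row_b)

# The contributions plus the bias add up to the same raw margin.
print(abs(margin_a - leaf_sum_a) < 1e-5, abs(margin_b - leaf_sum_b) < 1e-5)

# Passing the margin through the logistic function gives the final score.
final_a = 1.0 / (1.0 + math.exp(-margin_a))
final_b = 1.0 / (1.0 + math.exp(-margin_b))
print(round(final_a, 4), round(final_b, 4))
```

Samples that share a contribution row therefore share a margin and a score, which is why only two distinct scores appear in the test set.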

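That concludes this walkthrough of a simple xgboost workflow in Python and of three things encountered along the way: the output scores, the mapping from samples to tree leaf nodes, and each feature's individual contribution to a sample's score, together with how the three relate to one another.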
