Python get_score and gain: how feature importance is actually computed in machine learning

Updated: 2023-05-20 10:27:57

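I recently read through the sklearn code of the mainstream models, looking especially closely at how feature importance is computed. Common algorithms such as xgboost, gbdt, randomforest and tree can all output feature importance scores. Below I focus on how xgboost and gbdt compute feature importance; randomforest works much like gbdt, so I will not cover it separately.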

1. Introduction

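xgboost is a popular boosting algorithm; its base learner can be either gbtree or gblinear.

Feature importance can be computed when the base learner is gbtree.

In the native xgboost module, feature importance is computed by calling get_score().

In the xgboost sklearn API, feature importance is read from feature_importances_.

feature_importances_ is still derived from get_score(), so reading the source of xgboost's get_score() explains how everything works.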

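weight: the number of times a feature is used as a split node across all trees

gain: (the sum of a feature's information gain over every split node in all trees, divided by the number of times the feature appears)

total_gain: (reads the same gain values as gain, as the code shows, but returns their sum without dividing by the number of appearances)

cover and total_cover: (the same pair of statistics, computed from each split's cover value instead of its gain)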

The code repeatedly uses get_dump to fetch the tree rules, so here is an example of what tree rules look like:

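# booster is a trained xgboost Booster (the variable name is assumed here)
trees = booster.get_dump(with_stats=True)
for tree in trees:
    print(tree)

# The output below shows the decision tree rules from 2 boosting iterations; each rule contains the feature name, gain and cover.
# The source code extracts these 3 values to compute feature importance.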

[out]:

0:[inteval<1] yes=1,no=2,missing=1,gain=923.585938,cover=7672
    1:[limit<9850] yes=3,no=4,missing=3,gain=90.4335938,cover=6146.5
        3:leaf=-1.86464596,cover=5525.25
        4:leaf=-1.45520294,cover=621.25
    2:[days<3650] yes=5,no=6,missing=5,gain=164.527832,cover=1525.5
        5:leaf=-1.36227047,cover=598
        6:leaf=-0.688206792,cover=927.5

0:[days<7850] yes=1,no=2,missing=1,gain=528.337646,cover=4162.56592
    1:[frequency<4950] yes=3,no=4,missing=3,gain=64.1247559,cover=2678.6853
        3:leaf=-0.978122056,cover=1715.49646
        4:leaf=-0.653981686,cover=963.188965
    2:[interval<4] yes=5,no=6,missing=5,gain=179.725327,cover=1483.88074
        5:leaf=-0.256728679,cover=1280.68018
        6:leaf=0.753442943,cover=203.200531
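To make the rule format concrete, here is a minimal sketch (not xgboost's own code) that pulls the feature name, gain and cover out of dump text with a regular expression. The helper name parse_splits and the regex are assumptions made for illustration; leaf lines are skipped because they contain no "[feature<threshold]" part.

```python
import re

# Matches split lines such as
# "1:[limit<9850] yes=3,no=4,missing=3,gain=90.4335938,cover=6146.5"
SPLIT_RE = re.compile(
    r"\[(?P<feature>[^<\]]+)<[^\]]*\].*?gain=(?P<gain>[^,]+),cover=(?P<cover>[^,\s]+)"
)


def parse_splits(tree_dump):
    """Return a list of (feature, gain, cover) tuples, one per split node."""
    splits = []
    for line in tree_dump.split("\n"):
        match = SPLIT_RE.search(line)
        if match is None:  # leaf lines and blank lines carry no split condition
            continue
        splits.append((match.group("feature"),
                       float(match.group("gain")),
                       float(match.group("cover"))))
    return splits


# Part of the second tree from the dump above
tree = "\n".join([
    "0:[days<7850] yes=1,no=2,missing=1,gain=528.337646,cover=4162.56592",
    "\t1:[frequency<4950] yes=3,no=4,missing=3,gain=64.1247559,cover=2678.6853",
    "\t\t3:leaf=-0.978122056,cover=1715.49646",
    "\t2:[interval<4] yes=5,no=6,missing=5,gain=179.725327,cover=1483.88074",
])
splits = parse_splits(tree)
for feature, gain, cover in splits:
    print(feature, gain, cover)
```

The xgboost source shown in this article uses plain string splitting rather than a regex; the regex is only a compact way to show which three fields matter.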

Here is the source code:

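# When the importance type is "weight":
if importance_type == 'weight':
    # get_dump returns the rule information of every tree in the model
    trees = self.get_dump(fmap, with_stats=False)
    fmap = {}
    # The for loops below extract the feature name from every tree rule
    for tree in trees:
        for line in tree.split('\n'):
            arr = line.split('[')
            if len(arr) == 1:
                continue
            fid = arr[1].split(']')[0].split('<')[0]
            # The if statement below counts how often each feature appears in the tree rules,
            # i.e. how many times each feature is used for a split
            if fid not in fmap:
                fmap[fid] = 1
            else:
                fmap[fid] += 1
    return fmap
else:
    # The code below shows that total_gain reads the same "gain" field as gain,
    # and total_cover the same "cover" field as cover;
    # the only difference is that the totals are not averaged over splits
    average_over_splits = True
    if importance_type == 'total_gain':
        importance_type = 'gain'
        average_over_splits = False
    elif importance_type == 'total_cover':
        importance_type = 'cover'
        average_over_splits = False
    # Again, start by dumping the rule information of every tree
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            arr = line.split('[')
            if len(arr) == 1:
                continue
            fid = arr[1].split(']')
            # This step uses the importance_type parameter to compute g:
            # when importance_type="gain", it extracts the gain value from the rule
            # when importance_type="cover", it extracts the cover value from the rule
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # fid is obtained the same way as in the "weight" branch: the feature name taken from the rule
            fid = fid[0].split('<')[0]
            # This if block maintains two statistics:
            # 1) how often each feature appears across all tree rules
            # 2) the sum of each feature's gain (or cover) over all tree nodes
            if fid not in fmap:
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    # Average the statistic to obtain the final importance value:
    # when importance_type="gain", the importance of feature a is its total gain divided by the number of times a appears in the tree rules
    # when importance_type="cover", the importance of feature a is its total cover divided by the number of times a appears in the tree rules
    if average_over_splits:
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]
    return gmap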

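Note that the raw get_score() method only returns the statistics computed according to weight, gain or cover. They have not been converted into proportions, so at this point they are not yet the final importance scores.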
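As a worked example of the bookkeeping, this sketch recomputes weight, gain and total_gain from the gain values in the two-tree dump shown earlier. The splits list is copied by hand from that dump (feature names exactly as printed there) and the function name importance is hypothetical; the logic mirrors the fmap/gmap counting in get_score() and does not call xgboost.

```python
from collections import defaultdict

# (feature, gain) pairs copied by hand from the two-tree dump
splits = [
    ("inteval", 923.585938), ("limit", 90.4335938), ("days", 164.527832),
    ("days", 528.337646), ("frequency", 64.1247559), ("interval", 179.725327),
]


def importance(splits, importance_type="weight"):
    """Mimic the get_score() bookkeeping for 'weight', 'gain' and 'total_gain'."""
    counts = defaultdict(int)    # plays the role of fmap: number of splits per feature
    totals = defaultdict(float)  # plays the role of gmap: summed gain per feature
    for feature, gain in splits:
        counts[feature] += 1
        totals[feature] += gain
    if importance_type == "weight":
        return dict(counts)
    if importance_type == "total_gain":
        return dict(totals)
    if importance_type == "gain":
        return {f: totals[f] / counts[f] for f in totals}
    raise ValueError("unsupported importance_type: " + importance_type)


print(importance(splits, "weight")["days"])      # 2
print(importance(splits, "total_gain")["days"])  # sum of the two gains for "days"
print(importance(splits, "gain")["days"])        # that sum divided by 2
```

"days" is the only feature used in both trees, so it is the only one whose gain and total_gain differ.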

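So let us continue with feature_importances_ in the xgboost sklearn API. It is easy to see that feature_importances_ normalizes the importance statistics, converting them into proportions; the denominator is simply the sum of the importance statistics of all features.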

def feature_importances_(self):
    if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
        raise AttributeError('Feature importance is not defined for Booster type {}'
                             .format(self.booster))
    b = self.get_booster()
    score = b.get_score(importance_type=self.importance_type)
    # Features that were never used in a split are missing from score and get 0
    all_features = [score.get(f, 0.) for f in b.feature_names]
    all_features = np.array(all_features, dtype=np.float32)
    # The key is this last step: a normalization that converts the importance statistics into proportions
    return all_features / all_features.sum()
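The normalization step can be illustrated without xgboost. In this sketch the score dictionary and feature list are made-up values, and normalize_scores is a hypothetical helper; a feature that never appears in a split is absent from the score dictionary and therefore gets 0.

```python
def normalize_scores(score, feature_names):
    """Turn raw importance statistics into fractions that sum to 1."""
    raw = [score.get(name, 0.0) for name in feature_names]
    total = sum(raw)
    return [value / total for value in raw]


# Made-up raw statistics; "unused" was never chosen as a split feature
score = {"days": 3.0, "limit": 1.0}
feature_names = ["days", "limit", "unused"]
print(normalize_scores(score, feature_names))  # [0.75, 0.25, 0.0]
```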

Also note what the constructor shows: the xgboost sklearn API defaults to importance_type="gain" when computing feature importance, whereas the raw get_score method defaults to importance_type="weight".

def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
             verbosity=1, silent=None, objective="reg:linear", booster='gbtree',
             n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
             max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
             colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             base_score=0.5, random_state=0, seed=None, missing=None,
             # The default is declared here
             importance_type="gain", **kwargs):

2. gbdt

First locate the BaseGradientBoosting class; the source of its feature_importances_ method is as follows:

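def feature_importances_(self):
    """Return the feature importances (the higher, the more important the
       feature).

    Returns
    -------
    feature_importances_ : array, shape = [n_features]
    """
    self._check_initialized()
    total_sum = np.zeros((self.n_features_, ), dtype=np.float64)
    # This for loop iterates over the regression trees of each stage in self.estimators_
    # The estimators_ attribute of the main class holds the decision trees built during gbdt training
    for stage in self.estimators_:
        # feature_importances_ is called on every individual tree,
        # which means each tree in the ensemble has its own set of feature importances
        # It also means that BaseGradientBoosting's feature_importances_ only transforms the trees' feature_importances_; it is not the underlying logic
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        # Accumulate the per-tree feature importance arrays into a single array
        total_sum += stage_sum
    # Divide the accumulated array by the number of stages; the result is the average feature importance per tree
    importances = total_sum / len(self.estimators_)
    return importances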
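The two levels of averaging (within a stage, then across stages) can be sketched with plain lists. The importance values below are made up, and average_importances is a hypothetical stand-in for the method above, not sklearn code.

```python
def average_importances(stages):
    """stages: list of boosting stages, each a list of per-tree importance lists."""
    n_features = len(stages[0][0])
    total_sum = [0.0] * n_features
    for stage in stages:
        # Average the trees that belong to the same stage
        for j in range(n_features):
            total_sum[j] += sum(tree[j] for tree in stage) / len(stage)
    # Average over the stages
    return [value / len(stages) for value in total_sum]


# Two stages with one tree each, three features (made-up numbers)
stages = [
    [[0.5, 0.5, 0.0]],
    [[1.0, 0.0, 0.0]],
]
print(average_importances(stages))  # [0.75, 0.25, 0.0]
```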

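Since that still does not show the actual computation, keep digging into the tree's feature_importances_ source. It turns out that the tree's feature_importances_ comes from the tree_.compute_feature_importances() method: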

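cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        # The while loop visits every node; the if skips leaf nodes, so only split nodes (root included) contribute
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]
                # For each split node, the importance credited to its split feature =
                # impurity before the split minus the impurity of the left and right children after the split
                # Each impurity is multiplied by its weight (the number of samples in that node)
                # A feature can be used for several splits within one tree, so accumulating here
                # is equivalent to grouping by feature and summing the importance statistic
                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances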
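The accumulation of weighted impurity decreases can be checked by hand on a tiny tree. The sketch below is plain Python rather than Cython; the node dictionaries, their field names and all numbers are made up for illustration, and only the accumulation step is reproduced.

```python
LEAF = -1

# A hand-built tree: node 0 splits on feature 1, node 1 splits on feature 0,
# nodes 2, 3 and 4 are leaves. "n" stands in for weighted_n_node_samples.
nodes = [
    {"left": 1, "right": 2, "feature": 1, "n": 100.0, "impurity": 0.5},
    {"left": 3, "right": 4, "feature": 0, "n": 60.0, "impurity": 0.25},
    {"left": LEAF, "right": LEAF, "feature": -2, "n": 40.0, "impurity": 0.0},
    {"left": LEAF, "right": LEAF, "feature": -2, "n": 30.0, "impurity": 0.0},
    {"left": LEAF, "right": LEAF, "feature": -2, "n": 30.0, "impurity": 0.0},
]


def impurity_importances(nodes, n_features):
    """Sum the weighted impurity decrease of every split node, grouped by feature."""
    importances = [0.0] * n_features
    for node in nodes:
        if node["left"] == LEAF:  # leaves do not split, so they contribute nothing
            continue
        left, right = nodes[node["left"]], nodes[node["right"]]
        importances[node["feature"]] += (
            node["n"] * node["impurity"]
            - left["n"] * left["impurity"]
            - right["n"] * right["impurity"]
        )
    return importances


print(impurity_importances(nodes, n_features=2))  # [15.0, 35.0]
```

Node 0 contributes 100 * 0.5 - 60 * 0.25 - 40 * 0 = 35 to feature 1, and node 1 contributes 60 * 0.25 = 15 to feature 0.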
