A Summary of Boosting Algorithms (AdaBoost, GBDT, XGBoost)

Updated: 2023-05-20 09:10:57

I have organized the material I collected while learning XGBoost and am sharing it here so that anyone who needs it can look things up easily. Likes are appreciated, haha.

Author: tangg, QQ: 577305810

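I. Boosting Algorithms

Boosting covers many concrete algorithms, including but not limited to AdaBoost, GBDT, and XGBoost.

Boosting is a method for combining weak classifiers f_i(x) into a strong classifier F(x).

Each sub-model tries to boost the performance of the ensemble as a whole: the models are trained one after another, and the sample weights are updated between iterations.

AdaBoost has no out-of-bag (OOB) samples, so a train_test_split is needed to hold out data for evaluation.

The procedure: fit some algorithm to the original dataset, which produces errors → update the sample weights based on the previous model's predictions (misclassified samples get larger weights) → fit a model again on the reweighted data → repeat, so that training keeps concentrating on the samples that were predicted wrongly.

Each newly generated sub-model aims at a better overall fit.

(Every sub-model uses the same data points, but the points carry different weights.)

A base estimator must be specified: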

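from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# X_train, X_test, y_train, y_test come from an earlier train_test_split
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)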
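To make the reweighting loop described above more tangible, here is a minimal hand-written sketch of discrete AdaBoost using only the standard library. The ten-point 1-D dataset, the threshold "stump" weak learner, and all function names are illustrative assumptions of mine; this is not how scikit-learn implements AdaBoostClassifier internally.

```python
import math

# Toy 1-D dataset with labels in {+1, -1} (illustrative values only).
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, thr, sign):
    """A threshold stump: predicts `sign` when x < thr, otherwise `-sign`."""
    return sign if x < thr else -sign


def best_stump(X, y, w):
    """Return (weighted_error, thr, sign) of the stump with the lowest weighted error."""
    best = None
    for k in range(len(X) + 1):
        thr = k - 0.5
        for sign in (1, -1):
            err = sum(wi for wi, x, t in zip(w, X, y)
                      if stump_predict(x, thr, sign) != t)
            if best is None or err < best[0] - 1e-12:
                best = (err, thr, sign)
    return best


n = len(X)
w = [1.0 / n] * n   # start with uniform sample weights
ensemble = []       # list of (alpha, thr, sign)

for _ in range(3):
    err, thr, sign = best_stump(X, y, w)
    alpha = 0.5 * math.log((1 - err) / err)   # voting weight of this weak learner
    ensemble.append((alpha, thr, sign))
    # Misclassified points become heavier, correctly classified ones lighter.
    w = [wi * math.exp(-alpha * t * stump_predict(x, thr, sign))
         for wi, x, t in zip(w, X, y)]
    total = sum(w)
    w = [wi / total for wi in w]


def predict(x):
    """Weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, thr, s) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1


train_errors = sum(predict(x) != t for x, t in zip(X, y))
print("weak learners (alpha, threshold, sign):", ensemble)
print("training errors:", train_errors)
```

No single stump can classify these ten points correctly, but the weighted vote of three stumps can, which is the "weak learners combined into a strong one" idea in miniature.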

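Gradient Boosting is also known as GBDT (gradient boosting decision tree).

Train a model m1, which produces error e1.

Train a second model m2 on e1, which produces error e2.

Train a third model m3 on e2, which produces error e3.

......

The final prediction model is: m1 + m2 + m3 + ...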
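The m1 + m2 + m3 + ... scheme can be sketched in a few lines. The example below assumes squared-error loss (where the "error" each new model fits is simply the residual) and uses a hand-written one-split regression stump as the weak learner. The toy data and the helper names fit_stump, stump_predict, and mse are my own illustrative choices, not part of any library.

```python
# Toy 1-D regression data (illustrative values only).
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]


def fit_stump(X, target):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for i in range(1, len(X)):
        thr = (X[i - 1] + X[i]) / 2
        left = [t for x, t in zip(X, target) if x < thr]
        right = [t for x, t in zip(X, target) if x >= thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x < thr else right_mean


def mse(pred):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


models = []                   # m1, m2, m3, ...
prediction = [0.0] * len(X)   # running sum m1 + m2 + ...
history = []                  # training MSE after each round

for _ in range(5):
    # e_k: the part of y that the current ensemble still gets wrong.
    residual = [t - p for t, p in zip(y, prediction)]
    stump = fit_stump(X, residual)   # the next model is trained on that error
    models.append(stump)
    prediction = [p + stump_predict(stump, x) for p, x in zip(prediction, X)]
    history.append(mse(prediction))

print("training MSE per round:", [round(h, 4) for h in history])
```

Each round can only lower (or keep) the training error, because the new stump is the least-squares fit to whatever the previous models left unexplained.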

Gradient Boosting is built on decision trees, so there is no base estimator to specify.

from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=30)
gb_clf.fit(X_train, y_train)
gb_clf.score(X_test, y_test)

The base estimator of this algorithm is a decision tree.

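XGBoost builds on GBDT with improvements that make it more powerful and applicable to a wider range of problems.

XGBoost can also be used to determine feature importance.

I strongly recommend an article on cnblogs (博客园) by 【战争热诚】 that introduces the XGBoost algorithm.

It covers in great detail the advantages of XGBoost, installation, the meaning of the XGBoost parameters, example code, saving a trained model, and the general workflow for tuning XGBoost parameters.

However, I found that this author seems to have reposted it too, which explains why some parts are hard to follow and some code is missing. Still, it is in Chinese, which makes it easier to understand.

The link to the original article:

GitHub repository for the data mentioned in the article:

Another good article, on Juejin (掘金):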

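3.1 XGBoost model parameters

Model parameters fall into three groups (this part covers the params of the native interface):

1. General parameters

booster [default=gbtree]

Two models are available: gbtree and gblinear. gbtree boosts with tree-based models; gblinear boosts with linear models. The default is gbtree.

silent [default=0]

0 prints run-time messages; 1 runs silently without printing them. The default is 0.

nthread

Number of threads XGBoost runs with. The default is the maximum number of threads available on the system.

num_pbuffer

Size of the prediction buffer, normally set to the number of training instances. The buffer stores the prediction results of the last boosting step.

num_feature

Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically; no manual setting is needed.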

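2. Booster parameters

Booster parameters depend on the chosen booster and fall into two categories, described below:

2.1 Tree booster parameters

eta [default=0.3]

Step-size shrinkage used in the update to prevent overfitting. After each boosting step the weights of the new features are available directly; eta shrinks these weights to make the boosting process more conservative. The default is 0.3.

Range: [0,1]

In the final model eta is usually set to 0.01~0.2.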
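To make the role of eta concrete: for the tree booster, shrinkage amounts to scaling each newly added tree before it joins the ensemble. The notation below is introduced here for illustration and does not appear in the original text: \hat{y}_i^{(t)} is the prediction for sample i after t boosting rounds, and f_t is the tree fitted in round t.

```latex
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i), \qquad 0 \le \eta \le 1
```

With a smaller eta each tree contributes less, so more boosting rounds are usually needed to reach the same training fit.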

gamma [default=0]

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

Range: [0,∞]

By default a node is split only when the split reduces the loss function by more than 0; gamma sets the minimum loss reduction a split must achieve.

A larger gamma makes the algorithm more conservative. A suitable value depends on the loss function, so gamma should be tuned.
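For reference, this is the split gain as it is usually written for XGBoost's second-order objective. The symbols are not defined in the text above, so treat them as assumptions of this note: G_L, G_R are the sums of first-order gradients and H_L, H_R the sums of second-order gradients of the samples falling into the left and right child, and \lambda is the L2 regularization coefficient.

```latex
\text{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is kept only when this gain is positive, so gamma acts as a fixed penalty that every additional leaf has to earn back.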

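max_depth [default=6]

Maximum depth of a tree. The default is 6.

Range: [1,∞]

The deeper the tree, the more closely it fits the data (and the more it can overfit), so this parameter also controls overfitting.

Tuning via cross-validation (xgb.cv) is recommended.

Typical values: 3-10

min_child_weight [default=1]

Minimum sum of instance weights in a child node. If a split would leave a leaf node whose sum of instance weights is below min_child_weight, the splitting process stops. In linear regression mode this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm, so increasing it helps control overfitting.

Range: [0,∞]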
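In XGBoost the "instance weight" summed here is the second-order gradient (hessian) of the loss. As a rough illustration, assuming logistic loss, where the hessian of a sample with predicted probability p is p * (1 - p), the snippet below estimates how many samples a leaf needs before its hessian sum reaches min_child_weight. The function names are mine, and the calculation assumes all samples in the leaf share the same p.

```python
import math


def logistic_hessian(p):
    """Second derivative of the logistic loss with respect to the raw score."""
    return p * (1.0 - p)


def min_samples_for_leaf(p, min_child_weight=1.0):
    """Smallest number of samples (all with probability p) whose hessian sum
    reaches min_child_weight."""
    return math.ceil(min_child_weight / logistic_hessian(p))


# Confident predictions carry small hessians, so many samples are needed.
samples_rare = min_samples_for_leaf(0.01)
# Uncertain predictions (p = 0.5) carry the largest hessian, 0.25.
samples_uncertain = min_samples_for_leaf(0.5)

print(samples_rare, samples_uncertain)
```

This is why min_child_weight behaves like a minimum leaf size for squared-error regression (hessian 1 per sample) but like something much stricter on heavily imbalanced classification data.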

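max_delta_step [default=0]

Range: [0,∞]

A value of 0 means no constraint. A positive value makes the XGBoost update step more conservative.

This parameter usually does not need to be set, but it may help in logistic regression when the classes are extremely imbalanced.

subsample [default=1]

Fraction of the training set sampled to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the samples to grow a tree, which helps prevent overfitting.

Range: (0,1]

colsample_bytree [default=1]

Fraction of features (columns, since each column is one feature) sampled at random when building each tree. The default is 1.

Range: (0,1]

colsample_bylevel [default=1]

Fraction of columns sampled for each split, at each level of the tree.

Usually left unused, because subsample and colsample_bytree already serve the same purpose.

scale_pos_weight [default=1]

A value greater than 0 helps with class imbalance and lets the model converge faster.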
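A common starting point for scale_pos_weight, suggested in the XGBoost documentation, is the ratio of negative to positive instances. A minimal sketch with a made-up label list:

```python
# Hypothetical, heavily imbalanced binary labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5

n_pos = sum(labels)
n_neg = len(labels) - n_pos

# Heuristic: weight each positive sample by (number of negatives / number of positives).
scale_pos_weight = n_neg / n_pos

print("scale_pos_weight =", scale_pos_weight)
```

The resulting value is only a first guess; like the other parameters it is worth checking with cross-validation.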

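2.2 Linear booster parameters

lambda [default=0]

L2 regularization penalty coefficient.

Handles the regularization part of XGBoost. It is often left unused, but it can reduce overfitting.

alpha [default=0]

L1 regularization penalty coefficient.

Useful when the data is very high-dimensional, since it makes the algorithm run faster.

lambda_bias

L2 regularization on the bias. The default is 0 (there is no L1 regularization on the bias, because the bias does not matter for L1).

3. Learning task parameters

These parameters define the optimization objective and the metric computed at each step.

objective [default=reg:linear]

Defines the learning task and the corresponding learning objective. The available objective functions are:

“reg:linear”: linear regression (renamed reg:squarederror in later XGBoost releases).

“reg:logistic”: logistic regression.

“binary:logistic”: logistic regression for binary classification; the output is a probability.

“multi:softmax”: makes XGBoost use the softmax objective for multiclass classification; num_class (the number of classes) must also be set.

“multi:softprob”: same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probabilities of the sample belonging to each class.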
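The reshape mentioned for multi:softprob looks like this with NumPy. The flat vector below is made-up data standing in for a model's output; recent versions of the Python package may already return the matrix form, in which case the reshape is unnecessary.

```python
import numpy as np

ndata, nclass = 4, 3

# Hypothetical flat softprob output: ndata * nclass probabilities, sample by sample.
flat = np.array([0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.2, 0.3, 0.5,
                 0.6, 0.3, 0.1])

prob = flat.reshape(ndata, nclass)   # one row per sample, one column per class
labels = prob.argmax(axis=1)         # the class multi:softmax would report

print(prob.shape)
print(labels)
```

Taking the argmax of each row recovers the hard labels, which is the only difference between the softprob and softmax outputs.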

base_score [default=0.5]

The initial prediction score of all instances (global bias).

eval_metric [default according to objective]

Evaluation metric for the validation data; each objective function has its own default metric.

Several metrics can be given at once; Python users should pass them as a list of parameter pairs.

The choices are listed below:

“rmse”: root mean square error; the default for regression.

“logloss”: negative log-likelihood.

“error”: binary classification error rate, calculated as #(wrong cases)/#(all cases). Instances with a prediction value larger than 0.5 are regarded as positive, the others as negative. The default for classification.

“merror”: multiclass classification error rate, calculated as #(wrong cases)/#(all cases).

“mlogloss”: multiclass logloss.

“auc”: area under the curve, for ranking evaluation.

“ndcg”: normalized discounted cumulative gain.

“map”: mean average precision.
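To show what three of these metrics actually compute, here is a small standard-library sketch of rmse, logloss, and error following the definitions above. The labels and predicted probabilities are invented for illustration, and the functions are plain re-implementations, not XGBoost's own code.

```python
import math

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6]   # hypothetical predicted probabilities


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def logloss(y_true, y_prob):
    """Negative log-likelihood, averaged over samples."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)


def error(y_true, y_prob, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum((p > threshold) != (t == 1) for t, p in zip(y_true, y_prob))
    return wrong / len(y_true)


rmse_value = rmse(y_true, y_prob)
logloss_value = logloss(y_true, y_prob)
error_value = error(y_true, y_prob)

print("rmse: %.4f  logloss: %.4f  error: %.2f" % (rmse_value, logloss_value, error_value))
```

Here two of the five samples land on the wrong side of 0.5, so the error rate is 0.4 even though logloss and rmse also penalize how far off each probability is.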

seed [default=0]

Random number seed. The default is 0.

It makes results reproducible (using the same seed every time gives the same random splits).

3.2 xgboost

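XGBoost has two kinds of interfaces, the native interface and the scikit-learn interface. Only the sklearn-based interface is covered here.

Because the scikit-learn interface is used, some parameter names differ from those above.

XGBoost supports both classification and regression tasks.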

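from xgboost.sklearn import XGBClassifier

# note: newer XGBoost releases use verbosity and random_state in place of silent and seed
clf = XGBClassifier(
    silent=0,  # whether to print messages while running; 1 suppresses output, 0 is recommended
    # nthread=4,  # number of CPU threads, defaults to the maximum
    learning_rate=0.3,  # the learning rate (eta in the native interface)
    min_child_weight=1,
    # Defaults to 1. It is the minimum sum of h (the second-order gradient) in each leaf.
    # For 0-1 classification with imbalanced positive and negative samples,
    # if h is around 0.01, min_child_weight=1 means a leaf must contain at least about 100 samples.
    # This parameter strongly affects the result: it bounds the sum of second derivatives
    # in a leaf from below, and the smaller it is, the easier it is to overfit.
    max_depth=6,  # depth of the trees; larger values overfit more easily
    gamma=0,  # minimum loss reduction needed for a further split on a leaf; larger is more conservative, typically 0.1 or 0.2
    subsample=1,  # subsample ratio of the training instances
    # max_delta_step=0,  # maximum delta step allowed for each tree's weight estimate
    colsample_bytree=1,  # column subsampling when building each tree
    reg_lambda=1,  # L2 regularization on the weights, controls model complexity; larger values make overfitting less likely
    # reg_alpha=0,  # L1 regularization term
    # scale_pos_weight=1,  # values above 0 help fast convergence on imbalanced classes by balancing positive and negative weights
    # objective='multi:softmax',  # multiclass problem; specifies the learning task and the corresponding objective
    # num_class=10,  # number of classes, used together with multi:softmax
    n_estimators=100,  # number of trees
    seed=1000,  # random seed
    # eval_metric='auc'
)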

XGBoost classification example on the iris dataset

This is a multiclass problem. Instantiate and train the model:

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the sample dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12343)

# train the model
model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='multi:softmax')
model.fit(X_train, y_train)

# predict on the test set
y_pred = model.predict(X_test)

# compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('accuracy: %.2f%%' % (accuracy * 100))

# show feature importance
plot_importance(model)
plt.show()

XGBoost regression example on the Boston housing dataset

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2

# load the dataset
boston = load_boston()
X, y = boston.data, boston.target

# XGBoost training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')
model.fit(X_train, y_train)

# predict on the test set
ans = model.predict(X_test)

# show feature importance
plot_importance(model)
plt.show()

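3.3 General approach to parameter tuning

Tuning steps:

1. Choose a relatively high learning rate. A learning rate of 0.1 generally works, but depending on the problem the ideal value can range from 0.05 to 0.3. Then determine the ideal number of trees for this learning rate. XGBoost has a very useful function, "cv", which runs cross-validation at each boosting iteration and returns the ideal number of trees.

2. With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). These are the choices that shape how each individual tree is built.

3. Tune XGBoost's regularization parameters (lambda, alpha). They reduce model complexity and thereby improve performance.

4. Lower the learning rate and settle on the final parameters.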

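See the following example for the concrete tuning steps.

II. XGBoost Example (Classification + Tuning)

Apply XGBoost to a simple binary classification problem:

The Jupyter notebook is uploaded to this repository as well.

Predict whether a subject will develop diabetes within 5 years.

The first 8 columns of the data are features; the last column is whether the subject has diabetes (0 or 1).

Part 1: the default XGBoost configuration

1. Import the required packages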

import pandas as pd
import numpy as np
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

The following function is used later to compare the effect of tuning.

# inspect a trained model (after the fit step)
# cross-validation scores on the training set and on the test set
def cv_score_train_test(model):
    num_cv = 5
    score_list = ["neg_log_loss", "accuracy", "f1", "roc_auc"]
    train_scores = []
    test_scores = []
    for score in score_list:
        train_scores.append(cross_val_score(model, X_train, y_train, cv=num_cv, scoring=score).mean())
        test_scores.append(cross_val_score(model, X_test, y_test, cv=num_cv, scoring=score).mean())
    scores = np.array(train_scores + test_scores).reshape(2, -1)
    scores_df = pd.DataFrame(scores, index=['Train', 'Test'], columns=score_list)
    print(scores_df)

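2. Basic data processing

Separate the features from the labels:

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:, 0:8]  # columns 0 to 7; the slice is closed on the left and open on the right
Y = dataset[:, 8]

Split the data into a training set and a test set.

The training set is used to fit the model; the test set is used to evaluate its predictions.

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

3. Use XGBoost's ready-made classifier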
