XGBoost parameter-tuning steps

Parameter | Meaning | Tune? / typical range
booster [default gbtree] | which booster to use, gbtree or gblinear | no
silent [default 0] | set to 1 to suppress running messages | no
nthread [default: maximum number of available threads] | number of parallel threads | no
eta [default 0.3] | learning rate | yes, 0.01-0.2
min_child_weight [default 1] | minimum sum of instance weights in a leaf; in regression problems it roughly means a node stops splitting once the number of samples under it falls below the given threshold | yes
max_depth [default 6] | maximum depth of a tree | yes, 3-10
max_leaf_nodes | maximum number of leaves in a tree |
gamma [default 0] | minimum loss reduction required to split a node | yes
max_delta_step [default 0] | maximum step size allowed for each tree's weight update; usually not needed | no
subsample | fraction of samples randomly drawn for each tree | 0.5-1
colsample_bytree [default 1] | fraction of columns randomly sampled for each tree | 0.5-1
colsample_bylevel [default 1] | fraction of columns sampled for each split at every level of the tree |
lambda [default 1] | weight of the L2 regularization term |
alpha | L1 regularization term on the weights |
scale_pos_weight | balances positive and negative samples; helps the model converge faster |
objective [default reg:linear] | learning objective, e.g. reg:linear / binary:logistic / multi:softmax / multi:softprob |
eval_metric | evaluation metric; default rmse for regression, error for classification |
seed [default 0] | random number seed |
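For orientation, here is a minimal sketch (my own illustration, not part of the original notes) of how the names in the table are passed to the native xgb.train API as a params dict; the sklearn wrapper XGBClassifier used later in these notes accepts the same names as constructor keyword arguments, with eta exposed as learning_rate. The values shown are illustrative starting points, not tuned results.

    import xgboost as xgb

    # Illustrative params dict for the native API; every value is a placeholder.
    params = {
        'booster': 'gbtree',
        'eta': 0.1,                    # learning rate
        'max_depth': 5,
        'min_child_weight': 1,
        'gamma': 0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'lambda': 1,                   # L2 regularization
        'alpha': 0,                    # L1 regularization
        'scale_pos_weight': 1,
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'nthread': 4,
        'seed': 27,
    }
    # dtrain would be an xgb.DMatrix built from the training features and labels:
    # bst = xgb.train(params, dtrain, num_boost_round=100)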
The tuning procedure is:
1. Choose a relatively high learning rate. A value of 0.1 usually works, although for some problems the ideal learning rate falls anywhere between 0.05 and 0.3. Then find the ideal number of trees for that learning rate. XGBoost provides a very useful function, cv, which runs cross-validation at every boosting iteration and returns the ideal number of trees.
2. With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
3. Tune XGBoost's regularization parameters (lambda, alpha). They reduce model complexity and thereby improve performance.
4. Lower the learning rate and settle on the final parameters.
Define a helper function to make the later cross-validation runs convenient:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics
from sklearn.grid_search import GridSearchCV   # grid search
import matplotlib.pylab as plt
%matplotlib inline

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    # dtest, test_results and target are assumed to be defined globally in the data-setup step.
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        xgtest = xgb.DMatrix(dtest[predictors].values)
        # Cross-validate to find the best number of boosting rounds, then write it back.
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    # Predict on the training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)

    # Predict on testing data:
    dtest['predprob'] = alg.predict_proba(dtest[predictors])[:, 1]
    results = test_results.merge(dtest[['ID', 'predprob']], on='ID')
    print 'AUC Score (Test): %f' % metrics.roc_auc_score(results['Disbursed'], results['predprob'])

    # Plot feature importances
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
Step 1: determine the learning rate and the number of trees
Give the other parameters reasonable initial values.
With the learning rate at 0.1, find the ideal number of trees.
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,                  # typically 3-10
    min_child_weight=1,
    gamma=0,                      # minimum loss reduction required to split a node
    subsample=0.8,                # typical values 0.5-0.9
    colsample_bytree=0.8,         # typical values 0.5-0.9
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)
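Because modelfit writes the cross-validated round count back into the estimator via set_params, the selected number of trees can be read off afterwards. A small check, assuming modelfit was run with useTrainCV=True as above:

    # After modelfit has run, the estimator carries the CV-selected tree count.
    print xgb1.get_params()['n_estimators']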
Step 2: tune max_depth and min_child_weight
Do a coarse search first, then a fine one.
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=1,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
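With the (older) sklearn.grid_search API used here, the fitted search object exposes the per-combination scores and the winning setting, which is how the best max_depth and min_child_weight are read off:

    # Inspect all tried combinations and the best one.
    print gsearch1.grid_scores_
    print gsearch1.best_params_, gsearch1.best_score_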
Once a rough optimum has been found for the two values, run a finer search around it.
param_test2 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=2,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
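If the best value lands on the edge of the grid (for example min_child_weight = 6 here), it is worth probing a little further in that direction before moving on. A hedged follow-up search, with grid values chosen only for illustration:

    param_test2b = {
        'min_child_weight': [6, 8, 10, 12]
    }
    gsearch2b = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=2,
                                gamma=0, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch2b.fit(train[predictors], train[target])
    print gsearch2b.best_params_, gsearch2b.best_score_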
Step 3: tune gamma
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=6,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch3.fit(train[predictors], train[target])
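The later steps use n_estimators=177 rather than 140; that number would come from re-running the cv-based modelfit after the tree parameters have been updated. A sketch of that recalibration step (xgb2 is a name introduced here for illustration, and the parameter values should be whatever the searches above actually returned):

    # Re-run cross-validation with the tuned tree parameters to refresh the tree count.
    xgb2 = XGBClassifier(
        learning_rate=0.1,
        n_estimators=1000,
        max_depth=4,
        min_child_weight=6,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
    modelfit(xgb2, train, predictors)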
Step 4: tune subsample and colsample_bytree
param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}
gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=3, min_child_weight=4,
                            gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch4.fit(train[predictors], train[target])
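The grid above moves in steps of 0.1; once it points to a region, a finer pass in 0.05 steps around the best values is common, which would also explain why the next search is numbered param_test6. A hedged sketch of that intermediate pass, with ranges centred on an illustrative optimum of 0.8:

    param_test5 = {
        'subsample': [i / 100.0 for i in range(75, 90, 5)],
        'colsample_bytree': [i / 100.0 for i in range(75, 90, 5)]
    }
    gsearch5 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                                gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch5.fit(train[predictors], train[target])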
Step 5: tune the regularization parameters
These reduce overfitting and play a role similar to gamma.
param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                            gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch6.fit(train[predictors], train[target])
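The candidate values above span several orders of magnitude, so a second pass zooming in around the coarse optimum is usually worthwhile; the reg_alpha=0.005 used in the final model below would come from such a pass. A hedged sketch, with candidate values chosen only for illustration:

    param_test7 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch7 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                                gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test7, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch7.fit(train[predictors], train[target])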
Step 6: lower the learning rate
xgb4 = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb4, train, predictors)
The xgb.cv function
def cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None,
       metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None,
       fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
       seed=0, callbacks=None, shuffle=True)
xgb_param can be obtained with xgb.XGBClassifier().get_xgb_params().
dtrain is obtained with xgb.DMatrix(x_train, y_train).
num_boost_round is the maximum number of boosting rounds.
early_stopping_rounds stops training when the test metric has not improved for that many rounds (e.g. 50) and reports the best round; verbose_eval=10 prints the evaluation metric every 10 rounds.
show_stdv=False suppresses printing the standard deviation across folds.
nfold is the number of folds.
folds can take a KFold or StratifiedKFold object.
metrics is a string or list naming the evaluation metric, usually 'auc'.
Note that xgb.cv returns a DataFrame.
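Putting those arguments together, a minimal usage sketch (x_train and y_train stand in for whatever training arrays are at hand; the specific values are illustrative):

    xgb_param = xgb.XGBClassifier().get_xgb_params()
    dtrain = xgb.DMatrix(x_train, y_train)
    cvresult = xgb.cv(xgb_param, dtrain,
                      num_boost_round=1000,        # maximum number of boosting rounds
                      nfold=5,                     # 5-fold cross-validation
                      metrics='auc',
                      early_stopping_rounds=50,    # stop if AUC has not improved for 50 rounds
                      verbose_eval=10,             # print the metric every 10 rounds
                      show_stdv=False,
                      seed=27)
    print cvresult.shape[0]      # number of rounds actually kept
    print cvresult.tail()        # the returned object is a pandas DataFrame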