XGBoost parameter-tuning steps

Parameter | Meaning | Tune? / typical range
booster [default gbtree] | which booster to use, gbtree or gblinear | no
silent [default 0] | set to 1 to suppress running messages | no
nthread [default: maximum number of available threads] | number of parallel threads | no
eta [default 0.3] | learning rate | yes, 0.01-0.2
min_child_weight [default 1] | minimum sum of instance weights in a leaf; in regression problems it roughly means a node stops splitting once the number of samples under it falls below the given threshold | yes
max_depth [default 6] | maximum depth of a tree | yes, 3-10
max_leaf_nodes | maximum number of leaves in a tree |
gamma [default 0] | minimum loss reduction required to split a node | yes
max_delta_step [default 0] | maximum step size allowed for each tree's weight update; usually not needed | no
subsample | fraction of samples randomly drawn for each tree | 0.5-1
colsample_bytree [default 1] | fraction of columns randomly sampled for each tree | 0.5-1
colsample_bylevel [default 1] | fraction of columns sampled for each split at every level of the tree |
lambda [default 1] | weight of the L2 regularization term |
alpha | L1 regularization term on the weights |
scale_pos_weight | balances positive and negative samples; helps the model converge faster |
objective [default reg:linear] | learning objective, e.g. reg:linear / binary:logistic / multi:softmax / multi:softprob |
eval_metric | evaluation metric; default rmse for regression, error for classification |
seed [default 0] | random number seed |
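For orientation, here is a minimal sketch (my own illustration, not part of the original notes) of how the names in the table are passed to the native xgb.train API as a params dict; the sklearn wrapper XGBClassifier used later in these notes accepts the same names as constructor keyword arguments, with eta exposed as learning_rate. The values shown are illustrative starting points, not tuned results.

    import xgboost as xgb

    # Illustrative params dict for the native API; every value is a placeholder.
    params = {
        'booster': 'gbtree',
        'eta': 0.1,                    # learning rate
        'max_depth': 5,
        'min_child_weight': 1,
        'gamma': 0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'lambda': 1,                   # L2 regularization
        'alpha': 0,                    # L1 regularization
        'scale_pos_weight': 1,
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'nthread': 4,
        'seed': 27,
    }
    # dtrain would be an xgb.DMatrix built from the training features and labels:
    # bst = xgb.train(params, dtrain, num_boost_round=100)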
The tuning procedure is:
1. Choose a relatively high learning rate. A value of 0.1 usually works, although for some problems the ideal learning rate falls anywhere between 0.05 and 0.3. Then find the ideal number of trees for that learning rate. XGBoost provides a very useful function, cv, which runs cross-validation at every boosting iteration and returns the ideal number of trees.
2. With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
3. Tune XGBoost's regularization parameters (lambda, alpha). They reduce model complexity and thereby improve performance.
4. Lower the learning rate and settle on the final parameters.
Define a helper function to make the later cross-validation runs convenient:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics
from sklearn.grid_search import GridSearchCV   # grid search
import matplotlib.pylab as plt
%matplotlib inline

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    # dtest, test_results and target are assumed to be defined globally in the data-setup step.
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        xgtest = xgb.DMatrix(dtest[predictors].values)
        # Cross-validate to find the best number of boosting rounds, then write it back.
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    # Predict on the training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)

    # Predict on testing data:
    dtest['predprob'] = alg.predict_proba(dtest[predictors])[:, 1]
    results = test_results.merge(dtest[['ID', 'predprob']], on='ID')
    print 'AUC Score (Test): %f' % metrics.roc_auc_score(results['Disbursed'], results['predprob'])

    # Plot feature importances
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
Step 1: determine the learning rate and the number of trees
Give the other parameters reasonable initial values.
With the learning rate at 0.1, find the ideal number of trees.
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,                  # typically 3-10
    min_child_weight=1,
    gamma=0,                      # minimum loss reduction required to split a node
    subsample=0.8,                # typical values 0.5-0.9
    colsample_bytree=0.8,         # typical values 0.5-0.9
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)
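Because modelfit writes the cross-validated round count back into the estimator via set_params, the selected number of trees can be read off afterwards. A small check, assuming modelfit was run with useTrainCV=True as above:

    # After modelfit has run, the estimator carries the CV-selected tree count.
    print xgb1.get_params()['n_estimators']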
Step 2: tune max_depth and min_child_weight
Do a coarse search first, then a fine one.
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=1,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
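With the (older) sklearn.grid_search API used here, the fitted search object exposes the per-combination scores and the winning setting, which is how the best max_depth and min_child_weight are read off:

    # Inspect all tried combinations and the best one.
    print gsearch1.grid_scores_
    print gsearch1.best_params_, gsearch1.best_score_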
Once a rough optimum has been found for the two values, run a finer search around it.
param_test2 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]
}
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=2,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
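If the best value lands on the edge of the grid (for example min_child_weight = 6 here), it is worth probing a little further in that direction before moving on. A hedged follow-up search, with grid values chosen only for illustration:

    param_test2b = {
        'min_child_weight': [6, 8, 10, 12]
    }
    gsearch2b = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=2,
                                gamma=0, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch2b.fit(train[predictors], train[target])
    print gsearch2b.best_params_, gsearch2b.best_score_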
Step 3: tune gamma
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
gsearch3 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=6,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch3.fit(train[predictors], train[target])
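The later steps use n_estimators=177 rather than 140; that number would come from re-running the cv-based modelfit after the tree parameters have been updated. A sketch of that recalibration step (xgb2 is a name introduced here for illustration, and the parameter values should be whatever the searches above actually returned):

    # Re-run cross-validation with the tuned tree parameters to refresh the tree count.
    xgb2 = XGBClassifier(
        learning_rate=0.1,
        n_estimators=1000,
        max_depth=4,
        min_child_weight=6,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
    modelfit(xgb2, train, predictors)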
Step 4: tune subsample and colsample_bytree
param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}
gsearch4 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=3, min_child_weight=4,
                            gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch4.fit(train[predictors], train[target])
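The grid above moves in steps of 0.1; once it points to a region, a finer pass in 0.05 steps around the best values is common, which would also explain why the next search is numbered param_test6. A hedged sketch of that intermediate pass, with ranges centred on an illustrative optimum of 0.8:

    param_test5 = {
        'subsample': [i / 100.0 for i in range(75, 90, 5)],
        'colsample_bytree': [i / 100.0 for i in range(75, 90, 5)]
    }
    gsearch5 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                                gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch5.fit(train[predictors], train[target])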
Step 5: tune the regularization parameters
These reduce overfitting and play a role similar to gamma.
param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                            gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch6.fit(train[predictors], train[target])
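The candidate values above span several orders of magnitude, so a second pass zooming in around the coarse optimum is usually worthwhile; the reg_alpha=0.005 used in the final model below would come from such a pass. A hedged sketch, with candidate values chosen only for illustration:

    param_test7 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch7 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4, min_child_weight=6,
                                gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test7, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
    gsearch7.fit(train[predictors], train[target])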
Step 6: lower the learning rate
xgb4 = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb4, train, predictors)
The xgb.cv function
def cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None,
       metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None,
       fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
       seed=0, callbacks=None, shuffle=True)
xgb_param can be obtained with xgb.XGBClassifier().get_xgb_params().
dtrain is obtained with xgb.DMatrix(x_train, y_train).
num_boost_round is the maximum number of boosting rounds.
early_stopping_rounds stops training when the test metric has not improved for that many rounds (e.g. 50) and reports the best round; verbose_eval=10 prints the evaluation metric every 10 rounds.
show_stdv=False suppresses printing the standard deviation across folds.
nfold is the number of folds.
folds can take a KFold or StratifiedKFold object.
metrics is a string or list naming the evaluation metric, usually 'auc'.
Note that xgb.cv returns a DataFrame.
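Putting those arguments together, a minimal usage sketch (x_train and y_train stand in for whatever training arrays are at hand; the specific values are illustrative):

    xgb_param = xgb.XGBClassifier().get_xgb_params()
    dtrain = xgb.DMatrix(x_train, y_train)
    cvresult = xgb.cv(xgb_param, dtrain,
                      num_boost_round=1000,        # maximum number of boosting rounds
                      nfold=5,                     # 5-fold cross-validation
                      metrics='auc',
                      early_stopping_rounds=50,    # stop if AUC has not improved for 50 rounds
                      verbose_eval=10,             # print the metric every 10 rounds
                      show_stdv=False,
                      seed=27)
    print cvresult.shape[0]      # number of rounds actually kept
    print cvresult.tail()        # the returned object is a pandas DataFrame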