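LightGBM tuning practice: using a built-in dataset
Tuning workflow
1. Choose a relatively high learning rate, around 0.1, to speed up convergence.
2. Tune the basic decision tree parameters.
3. Tune the regularization parameters.
4. Lower the learning rate at the end to improve final accuracy.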
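1. Learning rate and number of iterations
learning_rate = 0.1
→ boosting/boost/boosting_type: the learner type; gbdt is the usual choice.
→ n_estimators/num_iterations/num_round/num_boost_round: the number of iterations. Set it large, then read the best count from the cv results.
→ The initial values matter little in themselves; they are only a convenient starting point for settling the other parameters.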
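Parameters fixed by the project requirements:
'boosting_type'/'boosting': 'gbdt'
'objective': 'binary'
'metric': 'auc'
Their exact meanings can be looked up in the LightGBM documentation.
Initial values:
'max_depth': 5  # the dataset is not very large, so a moderate value; anything from 4 to 10 works
'num_leaves': 30  # LightGBM grows trees leaf-wise; the official advice is to keep this below 2^max_depth
'subsample'/'bagging_fraction': 0.8  # row (data) sampling
'colsample_bytree'/'feature_fraction': 0.8  # feature (column) sampling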
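Several of the names above are aliases of the same parameter, which is easy to trip over when mixing the sklearn-style and native spellings. The following is a minimal sketch of a helper that folds aliases into one spelling. The helper name `canonicalize` and the `ALIASES` table are hypothetical, the table only covers aliases mentioned in this article, and the choice of which spelling counts as canonical is arbitrary.

```python
# Hypothetical alias table, built only from the names listed in this article.
ALIASES = {
    "boosting": "boosting_type",
    "boost": "boosting_type",
    "num_iterations": "n_estimators",
    "num_round": "n_estimators",
    "num_boost_round": "n_estimators",
    "subsample": "bagging_fraction",
    "sub_row": "bagging_fraction",
    "colsample_bytree": "feature_fraction",
    "sub_feature": "feature_fraction",
    "subsample_freq": "bagging_freq",
    "min_data_per_leaf": "min_data_in_leaf",
    "min_data": "min_data_in_leaf",
    "min_child_samples": "min_data_in_leaf",
    "min_gain_to_split": "min_split_gain",
}


def canonicalize(params):
    """Rename known aliases to one spelling; reject conflicting duplicates."""
    result = {}
    for name, value in params.items():
        canonical = ALIASES.get(name, name)
        if canonical in result and result[canonical] != value:
            raise ValueError(f"conflicting values for {canonical}")
        result[canonical] = value
    return result


print(canonicalize({"subsample": 0.8, "colsample_bytree": 0.8, "max_depth": 5}))
# {'bagging_fraction': 0.8, 'feature_fraction': 0.8, 'max_depth': 5}
```

Passing both `subsample=0.8` and `bagging_fraction=0.7` to this helper raises an error, which is the kind of silent conflict it is meant to surface.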
Some other templates, for reference:
import lightgbm as lgb
print("LGB test")
clf = lgb.LGBMClassifier(
    boosting_type='gbdt', num_leaves=55, reg_alpha=0.0, reg_lambda=1,
    max_depth=15, n_estimators=6000, objective='binary',
    subsample=0.8, colsample_bytree=0.8, subsample_freq=1,
    learning_rate=0.06, min_child_weight=1, random_state=20, n_jobs=4
)
clf.fit(X_train, y_train)
pre = clf.predict(testdata)  # testdata: the feature matrix to predict on (not defined in this template)
print("starting ")
clf = lgb.LGBMClassifier(
    boosting_type='gbdt', num_leaves=50, reg_alpha=0.0, reg_lambda=1,
    max_depth=-1, n_estimators=1500, objective='binary',
    subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
    learning_rate=0.05, min_child_weight=50, random_state=2018, n_jobs=100
)
# eval_set takes a list of (X, y) pairs; early stopping monitors the AUC on it.
# Note: LightGBM 4.0 removed early_stopping_rounds from fit(); there, pass callbacks=[lgb.early_stopping(1000)] instead.
clf.fit(X_train, y_train, eval_set=[(X_train, y_train)], eval_metric='auc', early_stopping_rounds=1000)
pre1 = clf.predict(X_test)
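Determine the number of iterations with LightGBM's cv function
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load the built-in breast cancer dataset and hold out 20% for testing
df = load_breast_cancer()
X = df.data
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 5,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}
data_train = lgb.Dataset(X_train, y_train)
# 5-fold cv with early stopping; the length of the returned metric list is the best round count.
# Written against the pre-4.0 API: LightGBM 4.0 moved early stopping to callbacks=[lgb.early_stopping(50)]
# and prefixes the result keys, so 'auc-mean' becomes 'valid auc-mean'.
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True,
                    metrics='auc', early_stopping_rounds=50, seed=0)
print('best n_estimator is {}'.format(len(cv_results['auc-mean'])))
print('best cv score is {}'.format(pd.Series(cv_results['auc-mean']).max()))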
# Output
best n_estimator is 52
best cv score is 0.9885844075545289
Based on this result, take n_estimators=52.
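The two numbers above are read off the dictionary returned by the cv call. As a hedged sketch, the helper below does the same lookup on a plain dict and tolerates both key spellings (`'auc-mean'` and a prefixed form such as `'valid auc-mean'`). The function name `best_round_and_score` and the sample numbers are made up for illustration. It picks the round with the highest mean score, which matches the list length only when early stopping has already truncated the list at the best round.

```python
def best_round_and_score(cv_results, metric="auc"):
    """Return (best_round, best_score) from a dict shaped like a cv result."""
    suffix = f"{metric}-mean"
    # Accept both 'auc-mean' and a prefixed form such as 'valid auc-mean'.
    key = next(k for k in cv_results if k.endswith(suffix))
    scores = list(cv_results[key])
    best_score = max(scores)
    best_round = scores.index(best_score) + 1  # rounds are counted from 1
    return best_round, best_score


# Made-up numbers, only to show the shape of the result.
fake_cv = {
    "valid auc-mean": [0.95, 0.97, 0.98, 0.979],
    "valid auc-stdv": [0.02, 0.01, 0.01, 0.01],
}
print(best_round_and_score(fake_cv))  # (3, 0.98)
```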
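2. Determine max_depth and num_leaves
These two are the key parameters for improving accuracy.
Approach: use the GridSearchCV() function from sklearn.
Bayesian optimization would also work (still to be learned).
max_depth: the maximum depth of each tree, i.e. the upper limit on the number of levels it can grow.
num_leaves: the maximum number of leaves in each tree.
The theoretical relationship between the two:
num_leaves should be set to a value smaller than 2^(max_depth); otherwise it may lead to overfitting. LightGBM itself does not tie num_leaves to max_depth: changing one does not adjust the other, so each has to be set explicitly rather than assumed from the other.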
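To make the 2^(max_depth) guideline concrete, here is a small sketch that lists, for each depth in range(3, 8), which num_leaves candidates from range(5, 100, 5) stay below the limit. The helper name `leaf_limit` is only for illustration.

```python
def leaf_limit(max_depth):
    """Largest number of leaves a binary tree of this depth can have."""
    return 2 ** max_depth


for depth in range(3, 8):
    allowed = [n for n in range(5, 100, 5) if n < leaf_limit(depth)]
    print(f"max_depth={depth}: limit={leaf_limit(depth)}, allowed num_leaves={allowed}")
```

With the initial values used earlier, max_depth=5 gives a limit of 32, so num_leaves=30 sits just inside it.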
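from sklearn.model_selection import GridSearchCV
Full code
from sklearn.model_selection import GridSearchCV
params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}
# bagging_fraction only takes effect when bagging_freq is non-zero (see the note in step 4)
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', learning_rate=0.1, n_estimators=52,
                           max_depth=6, bagging_fraction=0.8, feature_fraction=0.8)
garch11 = GridSearchCV(estimator=model, param_grid=params_test1, scoring='roc_auc', cv=5)
garch11.fit(X_train, y_train)
print('Parameter grid searched: {0}'.format(garch11.param_grid))
print('Best parameter values: {0}'.format(garch11.best_params_))
print('Best model score: {0}'.format(garch11.best_score_))
# Output
Parameter grid searched: {'max_depth': range(3, 8), 'num_leaves': (5, 100, 5)}
Best parameter values: {'max_depth': 3, 'num_leaves': 100}
Best model score: 0.9900505209889217
Based on this result, take max_depth=3, num_leaves=100.
That result was wrong: I had forgotten to write range after 'num_leaves', so only the values 5, 100 and 5 were searched. The line has to read as follows (the code above is already corrected):
params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}
Parameter grid searched: {'max_depth': range(3, 8), 'num_leaves': range(5, 100, 5)}
Best parameter values: {'max_depth': 3, 'num_leaves': 10}
Best model score: 0.9900505209889217
Based on this result, take max_depth=3, num_leaves=10.
The score is identical in both runs, presumably because a tree of depth 3 has at most 2^3 = 8 leaves, so every num_leaves of 8 or more yields the same trees.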
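Meaning of the other parameters:
bagging_fraction = 0.8: the fraction of the training data sampled for each iteration; bagging is used to get results faster.
feature_fraction = 0.8: the fraction of features sampled for each iteration.
A few bugs:
Leave out n_jobs=-1; with it, the run failed for me.
A silly mistake of my own: forgetting to train the model with garch11.fit(X_train, y_train).
I wrote params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': (5, 100, 5)}
It should be: params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}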
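The tuple-versus-range slip is easy to miss because both versions run without error. A quick sketch of how to catch it is to count the combinations each grid would try before fitting anything. The helper `grid_size` is hypothetical and just multiplies out the candidate lists.

```python
from itertools import product

wrong = {"max_depth": range(3, 8, 1), "num_leaves": (5, 100, 5)}
right = {"max_depth": range(3, 8, 1), "num_leaves": range(5, 100, 5)}


def grid_size(param_grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return len(list(product(*param_grid.values())))


print(grid_size(wrong))  # 15
print(grid_size(right))  # 95
```

With cv=5, the corrected grid means 95 x 5 = 475 model fits, so an unexpectedly small count is a cheap early warning.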
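3. Determine min_data_in_leaf and max_bin
min_data_in_leaf:
default=20
type=int
alias=min_data_per_leaf, min_data, min_child_samples
Meaning: the minimum number of data points in one leaf. Can be used to deal with overfitting.
max_bin:
default=255
type=int
The maximum number of bins that feature values are bucketed into.
A small number of bins may reduce training accuracy but may improve generalization (deals with overfitting).
LightGBM automatically compresses memory according to max_bin. For example, with max_bin=255 LightGBM uses uint8_t for feature values.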
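The uint8_t remark follows from simple counting: 255 bins can be indexed with one byte. The sketch below picks the smallest standard unsigned width that can index a given number of bins. It is only an illustration of the arithmetic, not LightGBM's actual storage logic, and the function name `smallest_uint_bits` is made up.

```python
def smallest_uint_bits(max_bin):
    """Smallest standard unsigned width (8, 16 or 32 bits) able to index max_bin bins."""
    for bits in (8, 16, 32):
        # Bin indices run from 0 to max_bin - 1, so 2**bits distinct values suffice.
        if max_bin <= 2 ** bits:
            return bits
    raise ValueError("max_bin too large for this sketch")


for max_bin in (35, 255, 1000):
    print(max_bin, smallest_uint_bits(max_bin))
```

Under this counting, every max_bin candidate in range(5, 256, 10) fits in one byte, so within that range the choice affects accuracy and regularization rather than the storage width.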
params_test2 = {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}
# bagging_fraction only takes effect when bagging_freq is non-zero (see the note in step 4)
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', learning_rate=0.1, n_estimators=52,
                           max_depth=3, num_leaves=10, bagging_fraction=0.8, feature_fraction=0.8)
garch12 = GridSearchCV(estimator=model, param_grid=params_test2, scoring='roc_auc', cv=5)
garch12.fit(X_train, y_train)
print('Parameter grid searched: {0}'.format(garch12.param_grid))
print('Best parameter values: {0}'.format(garch12.best_params_))
print('Best model score: {0}'.format(garch12.best_score_))
# Output
Parameter grid searched: {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}
Best parameter values: {'max_bin': 35, 'min_data_in_leaf': 101}
Best model score: 0.9919433294366571
Take min_data_in_leaf=101, max_bin=35. Note that 101 is the upper edge of the searched range, so a wider range may be worth trying.
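It helps to put min_data_in_leaf=101 next to the amount of data each fit actually sees. The arithmetic below is a rough sketch under stated assumptions: the breast cancer dataset has 569 rows, the 20% test split is rounded up, and each of the 5 folds holds out about one fifth of the training rows (stratified folds may differ by a row or two).

```python
import math

n_samples = 569                       # rows in sklearn's breast cancer dataset
n_test = math.ceil(0.2 * n_samples)   # rows held out by test_size=0.2
n_train = n_samples - n_test          # rows passed to the grid search
rows_per_fit = n_train - n_train // 5 # rows left after holding out one of 5 folds

min_data_in_leaf = 101
# Every leaf needs at least min_data_in_leaf rows, which bounds the leaf count.
max_leaves_possible = rows_per_fit // min_data_in_leaf

print(n_train, rows_per_fit, max_leaves_possible)  # 455 364 3
```

So with this setting each tree can have only about three leaves, which makes the model far more constrained than num_leaves=10 suggests.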
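4. Determine feature_fraction, bagging_fraction and bagging_freq
feature_fraction: the fraction of features used in each iteration.
default=1.0
type=double
constraint: 0.0 < feature_fraction <= 1.0
alias=sub_feature, colsample_bytree
If feature_fraction is smaller than 1.0, LightGBM randomly selects a subset of the features on each iteration.
For example, if set to 0.8, 80% of the features are selected before each tree is trained.
Can be used to speed up training.
Can be used to deal with overfitting.
bagging_fraction: the fraction of data used in each iteration; usually used to speed up training and avoid overfitting.
default=1.0
type=double
alias=sub_row, subsample
Like feature_fraction, but it randomly selects part of the data without resampling.
Can be used to speed up training.
Can be used to deal with overfitting.
Note: to enable bagging, bagging_freq should be set to a non-zero value.
bagging_freq:
default=0
type=int
alias=subsample_freq
The frequency of bagging. 0 disables bagging; k means bagging is performed every k iterations.
Note: to enable bagging, bagging_fraction should also be set to a value smaller than 1.0.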
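The two notes above amount to one condition: bagging is active only when bagging_freq is non-zero and bagging_fraction is below 1.0. Here is a minimal sketch of that rule, plus the iterations at which a new sample would be drawn. It assumes resampling happens at iterations 0, k, 2k and so on; both function names are hypothetical.

```python
def bagging_enabled(bagging_fraction, bagging_freq):
    """Bagging needs a non-zero frequency and a fraction below 1.0."""
    return bagging_freq > 0 and bagging_fraction < 1.0


def resample_iterations(n_iterations, bagging_freq):
    """0-based iterations at which a fresh bag would be drawn (assumed schedule)."""
    if bagging_freq <= 0:
        return []
    return list(range(0, n_iterations, bagging_freq))


print(bagging_enabled(0.8, 0))      # False: fraction set, but frequency left at 0
print(bagging_enabled(0.8, 10))     # True
print(resample_iterations(52, 10))  # [0, 10, 20, 30, 40, 50]
```

The first call mirrors the models in steps 2 and 3, which pass bagging_fraction=0.8 but leave bagging_freq at its default of 0.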
params_test3 = {
    'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
    'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
    'bagging_freq': range(0, 81, 10)
}
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', learning_rate=0.1, n_estimators=52,
                           max_depth=3, num_leaves=10, max_bin=35, min_data_in_leaf=101)
garch13 = GridSearchCV(estimator=model, param_grid=params_test3, scoring='roc_auc', cv=5)
garch13.fit(X_train, y_train)
print('Parameter grid searched: {0}'.format(garch13.param_grid))
print('Best parameter values: {0}'.format(garch13.best_params_))
print('Best model score: {0}'.format(garch13.best_score_))
# Output:
Parameter grid searched: {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0], 'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0], 'bagging_freq': range(0, 81, 10)}
Best parameter values: {'bagging_fraction': 0.9, 'bagging_freq': 30, 'feature_fraction': 0.8}
Best model score: 0.9927539170199572
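Because bagging is off whenever bagging_freq is 0 or bagging_fraction is 1.0, part of this grid repeats the same configuration. The sketch below counts how many of the combinations are actually distinct under that rule. The helper `effective` is hypothetical and collapses every bagging-disabled combination onto one representative.

```python
from itertools import product

feature_fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
bagging_fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
bagging_freqs = range(0, 81, 10)


def effective(feature_fraction, bagging_fraction, bagging_freq):
    """Map all bagging-disabled settings to the same (fraction=1.0, freq=0) pair."""
    if bagging_freq == 0 or bagging_fraction >= 1.0:
        return (feature_fraction, 1.0, 0)
    return (feature_fraction, bagging_fraction, bagging_freq)


combos = list(product(feature_fractions, bagging_fractions, bagging_freqs))
distinct = {effective(*combo) for combo in combos}
print(len(combos), len(distinct))  # 225 165
```

Dropping 0 from bagging_freq and 1.0 from bagging_fraction, then adding a single no-bagging baseline, would cover the same ground with fewer fits.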
5. Determine lambda_l1 and lambda_l2
params_test4 = {
    'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
    'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
}
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', learning_rate=0.1, n_estimators=52,
                           max_depth=3, num_leaves=10, max_bin=35, min_data_in_leaf=101,
                           bagging_fraction=0.9, bagging_freq=30, feature_fraction=0.8)
garch14 = GridSearchCV(estimator=model, param_grid=params_test4, scoring='roc_auc', cv=5)
garch14.fit(X_train, y_train)
print('Parameter grid searched: {0}'.format(garch14.param_grid))
print('Best parameter values: {0}'.format(garch14.best_params_))
print('Best model score: {0}'.format(garch14.best_score_))
# Output
Parameter grid searched: {'lambda_l1': [1e-05, 0.001, 0.1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], 'lambda_l2': [1e-05, 0.001, 0.1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]}
Best parameter values: {'lambda_l1': 1e-05, 'lambda_l2': 1e-05}
Best model score: 0.9927539170199572
The score is the same as in step 4, so regularization of this strength brings no measurable gain here.
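One detail in the candidate lists: 1e-1 and 0.1 are the same number, so the grid evaluates that value twice per parameter. A short sketch of removing the duplicate before searching:

```python
lambda_values = [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

# 1e-1 and 0.1 parse to the same float, so a set drops one of them.
unique_values = sorted(set(lambda_values))

print(len(lambda_values), len(unique_values))            # 10 9
print(len(lambda_values) ** 2, len(unique_values) ** 2)  # 100 81
```

The two-parameter grid shrinks from 100 to 81 combinations with no change in what is searched.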
6. Determine min_split_gain
min_split_gain
default=0
type=double
alias=min_gain_to_split
The minimum gain required to perform a split.
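As a toy illustration of what this threshold does, the sketch below filters a list of candidate split gains. The gains are made-up numbers, the function name `keep_split` is hypothetical, and the strict comparison is an assumption rather than a statement about LightGBM's internals.

```python
def keep_split(gain, min_split_gain=0.0):
    """Return True when a candidate split's gain clears the threshold (assumed strict)."""
    return gain > min_split_gain


candidate_gains = [0.05, 0.15, 0.25, 1.3]  # made-up gains for four candidate splits
kept = [g for g in candidate_gains if keep_split(g, min_split_gain=0.2)]
print(kept)  # [0.25, 1.3]
```

Raising the threshold prunes low-value splits, which is why it acts as a regularizer.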
params_test5 = {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', learning_rate=0.1, n_estimators=52,
                           max_depth=3, num_leaves=10, max_bin=35, min_data_in_leaf=101,
                           bagging_fraction=0.9, bagging_freq=30, feature_fraction=0.8,
                           lambda_l1=1e-05, lambda_l2=1e-05)
garch15 = GridSearchCV(estimator=model, param_grid=params_test5, scoring='roc_auc', cv=5)
garch15.fit(X_train, y_train)
print('Parameter grid searched: {0}'.format(garch15.param_grid))
print('Best parameter values: {0}'.format(garch15.best_params_))
print('Best model score: {0}'.format(garch15.best_score_))
# Output
Parameter grid searched: {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
Best parameter values: {'min_split_gain': 0.2}
Best model score: 0.9929603153687704
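To see the whole walk-through at a glance, the sketch below gathers the values chosen in each step into one parameter dict and prints how the reported AUC moved. The scores are copied from the outputs above. Keep in mind that the first one comes from lgb.cv and the rest from GridSearchCV, so they are not strictly comparable. The layout of `steps` is just one possible way to record this.

```python
steps = [
    ("n_estimators (cv)", {"n_estimators": 52}, 0.9885844075545289),
    ("max_depth, num_leaves", {"max_depth": 3, "num_leaves": 10}, 0.9900505209889217),
    ("min_data_in_leaf, max_bin", {"min_data_in_leaf": 101, "max_bin": 35}, 0.9919433294366571),
    ("feature/bagging sampling",
     {"feature_fraction": 0.8, "bagging_fraction": 0.9, "bagging_freq": 30},
     0.9927539170199572),
    ("lambda_l1, lambda_l2", {"lambda_l1": 1e-5, "lambda_l2": 1e-5}, 0.9927539170199572),
    ("min_split_gain", {"min_split_gain": 0.2}, 0.9929603153687704),
]

# Start from the fixed settings and layer each step's choice on top.
final_params = {"boosting_type": "gbdt", "objective": "binary", "learning_rate": 0.1}
previous = None
for name, chosen, score in steps:
    final_params.update(chosen)
    change = 0.0 if previous is None else score - previous
    print(f"{name:28s} auc={score:.6f} change={change:+.6f}")
    previous = score

print(final_params)
```

The resulting dict is the natural starting point for the last workflow step, lowering learning_rate and re-running cv to pick a new iteration count.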