Hyperparameter Tuning (LightGBM)
Hyperparameter optimization: automating the search process.
Goal: use informed search (a heuristic search with a strategy) to find the optimal hyperparameters in less time; beyond the initial setup, no additional manual work is required.
Hands-on part
A Bayesian optimization problem has four components:
1. Objective function: the quantity we want to minimize; here, the validation error of a machine learning model as a function of its hyperparameters
2. Domain space: the hyperparameter values to search over
3. Optimization algorithm: the method for building the surrogate model and choosing the next hyperparameter values to evaluate
4. Results history: the stored objective-function evaluations, i.e. the hyperparameters tried and their validation losses
With these four pieces we can optimize (find the minimum of) any real-valued function. This is a powerful abstraction that, beyond tuning machine learning hyperparameters, can help us solve many other problems.
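As a concrete illustration of these four components, here is a minimal sketch using hyperopt (the library this article relies on) to minimize a simple quadratic; the toy function and search range are made up purely for demonstration:

from hyperopt import fmin, tpe, hp, Trials

# 1. Objective function: what we want to minimize (a toy quadratic here)
def objective(x):
    return (x - 2) ** 2

# 2. Domain space: where to look for x
space = hp.uniform('x', -10, 10)

# 4. Results history: hyperopt records every evaluation in a Trials object
trials = Trials()

# 3. Optimization algorithm: the Tree-structured Parzen Estimator (TPE)
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)

print(best)  # e.g. {'x': 2.0003}, close to the true minimum at x = 2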
Code example
What is an imbalanced classification problem?
hyperropt1125.py
- Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
MAX_EVALS = 500
N_FOLDS = 10
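One note: the code shown here never loads `data`, even though the next cell uses it. Judging by the ORIGIN and CARAVAN columns this is the Kaggle caravan insurance challenge dataset, so a plausible loading step (the file name below is only an assumption) would be:

# Assumption: the file name/path is a guess; point it at wherever the dataset actually lives
data = pd.read_csv('caravan-insurance-challenge.csv')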
Looking at the data, the dataset is already split into train and test sets by the ORIGIN column.
print(data.ORIGIN.value_counts())
train 5822
test 4000
Name: ORIGIN, dtype: int64
As a quick refresher, draw a simple plot of the target column.
(Figure: CARAVAN, the target)
- Prepare the dataset
Be careful here, otherwise the bugs can be really baffling.
# After loading, split into train and test sets first
train = data[data['ORIGIN']=='train']
test = data[data['ORIGIN']=='test']
# Extract the labels
train_labels = np.array(train['CARAVAN'].astype(np.int32)).reshape((-1,))
test_labels = np.array(test['CARAVAN'].astype(np.int32)).reshape((-1,))
Drop the label columns, keeping only the features:
train = train.drop(columns =['ORIGIN','CARAVAN'])
test = test.drop(columns =['ORIGIN','CARAVAN'])
features = np.array(train)
test_features = np.array(test)
labels = train_labels[:]
print('Train shape: {}'.format(train.shape))
print("Test shape :{}".format(test.shape))
train.head()
# Output
Train shape: (5822, 85)
Test shape :(4000, 85)
(The reshape((-1,)) call above flattens the labels into a single 1-D array, with no row/column structure.)
- Label distribution
plt.hist(labels, edgecolor ='k')
plt.xlabel('Label')
plt.ylabel('Count')
plt.title('Counts of Labels')
plt.show()
As we can see, this is an imbalanced classification problem.
Therefore we choose a gradient boosting model and use ROC AUC as the validation metric (see the original article for details).
This article uses LightGBM.
- The model and its default values
from lightgbm import LGBMClassifier
model = LGBMClassifier()#Model with default hyperparameters
print(model)
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Fit the model on the training set and evaluate with roc_auc.
from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer
start = timer()
model.fit(features,labels)
train_time = timer()- start
predictions = model.predict_proba(test_features)[:,1]
auc = roc_auc_score(test_labels,predictions)
print('The baseline score on the test set is {:.4f}.'.format(auc))
print('The baseline training time is {:.4f} seconds'.format(train_time))
# Results
The baseline score on the test set is 0.7092.
The baseline training time is 0.1888 seconds
Due to the small size of the dataset (less than 6000 observations), hyperparameter tuning will have a modest but noticeable effect on the performance (a better investment of time might be to gather more data!).
(The timer measures how long the code takes to run.)
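As a side note, KFold was imported and N_FOLDS = 10 was defined earlier but neither appears in the code shown so far. A minimal sketch of the cross-validated ROC AUC used as the metric in the rest of this article, written with sklearn's generic cross_val_score helper (my own choice, not necessarily how the original article implements it), could look like:

from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

# 10-fold cross-validated ROC AUC on the training data (N_FOLDS = 10 was set earlier)
cv_scores = cross_val_score(LGBMClassifier(), features, labels,
                            cv=N_FOLDS, scoring='roc_auc', n_jobs=-1)
print('Mean CV ROC AUC: {:.4f}'.format(cv_scores.mean()))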
Random Search
import random
Random search also has four parts:
Domain: values over which to search
Optimization algorithm: pick the next values at random! (yes, this qualifies as an algorithm)
Objective function to minimize: in this case our metric is cross-validation ROC AUC
Results history that tracks the hyperparameters tried and the cross-validation metric
Let's take a look at which parameters need tuning.
- Domain for Random Search
Both random search and Bayesian optimization search for hyperparameters over a domain; for random (or grid) search, this domain is called a hyperparameter grid, and discrete values are used for the hyperparameters.
print(LGBMClassifier())
# Results
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Based on these defaults, we can build the hyperparameter grid below. It is hard to say in advance which values will work better, so for most parameters we simply use the default value or values spread around the default.
**Note:** subsample_dist holds the candidate values for subsample, but boosting_type='goss' does not support random subsampling.
If boosting_type were restricted to other values, subsample_dist could go straight into param_grid and be searched randomly with everything else (a sketch of the conditional handling appears after the grid below).
param_grid ={
'class_weight':[None,'balanced'],
'boosting_type':['gbdt','goss','dart'],
'num_leaves':list(range(30,150)),
'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base=np.exp(1), num=1000)),
'subsample_for_bin':list(range(20000,300000,20000)),
'min_child_samples':list(range(20,500,5)),
'reg_alpha':list(np.linspace(0,1)),
'reg_lambda':list(np.linspace(0,1)),
'colsample_bytree':list(np.linspace(0.6,1,10))
}
# Subsampling (only applicable with 'goss')
subsample_dist =list(np.linspace(0.5,1,100))
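With param_grid and subsample_dist defined, a single random draw of hyperparameters, including the conditional handling of subsample for goss described in the note above, can be sketched roughly as follows (each such draw would then be scored with the cross-validated ROC AUC discussed earlier):

import random

random.seed(50)  # seed chosen arbitrarily, just to make the sketch reproducible

# Pick one value at random for every hyperparameter in the grid
params = {key: random.sample(values, 1)[0] for key, values in param_grid.items()}

# 'goss' does not support row subsampling, so only draw 'subsample' for other boosting types
if params['boosting_type'] == 'goss':
    params['subsample'] = 1.0
else:
    params['subsample'] = random.sample(subsample_dist, 1)[0]

print(params)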
Let's take a look at the distributions of learning_rate and num_leaves. The learning rate is typically drawn on a log scale (see the Quora post referenced in the original article) because it can vary over several orders of magnitude.
Therefore, we can use np.logspace for the search.
np.logspace returns values evenly spaced on a log scale (so if we take the log of the resulting values, their distribution will be uniform).
plt.hist(param_grid['learning_rate'], color='g', edgecolor='k')
plt.xlabel('Learning Rate', size=14)
plt.ylabel('Count', size =14)
plt.title('Learning Rate Distribution', size =18)
plt.show()
Most values lie between 0.005 and 0.2, and it is recommended to pick values in this range. Now let's look at num_leaves.
plt.hist(param_grid['num_leaves'], color='b', edgecolor='k')
plt.xlabel('Number of Leaves', size=14)
plt.ylabel('Count', size =14)
plt.title('Number of Leaves Distribution', size=18)
plt.show()