Hyperparameter Tuning (LightGBM)
Hyperparameter optimization: automating the search process.
Goal: use informed search (a heuristic search with a strategy) to find the optimal hyperparameters in less time; beyond the initial setup, no additional manual work is required.
Hands-on part
A Bayesian optimization problem has four components:
1. Objective function: the quantity we want to minimize; here, the validation error of a machine learning model as a function of its hyperparameters
2. Domain space: the hyperparameter values to search over
3. Optimization algorithm: the method for building the surrogate model and choosing the next hyperparameter values to evaluate
4. Results history: the stored objective-function evaluations, i.e. the hyperparameters tried and their validation losses
With these four pieces we can optimize (find the minimum of) any real-valued function. This is a powerful abstraction that, beyond tuning machine learning hyperparameters, can help us solve many other problems.
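As a concrete illustration of these four components, here is a minimal sketch using hyperopt (the library this article relies on) to minimize a simple quadratic; the toy function and search range are made up purely for demonstration:

from hyperopt import fmin, tpe, hp, Trials

# 1. Objective function: what we want to minimize (a toy quadratic here)
def objective(x):
    return (x - 2) ** 2

# 2. Domain space: where to look for x
space = hp.uniform('x', -10, 10)

# 4. Results history: hyperopt records every evaluation in a Trials object
trials = Trials()

# 3. Optimization algorithm: the Tree-structured Parzen Estimator (TPE)
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)

print(best)  # e.g. {'x': 2.0003}, close to the true minimum at x = 2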
Code example
What is an imbalanced classification problem?
hyperropt1125.py
- Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
MAX_EVALS = 500
N_FOLDS = 10
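One note: the code shown here never loads `data`, even though the next cell uses it. Judging by the ORIGIN and CARAVAN columns this is the Kaggle caravan insurance challenge dataset, so a plausible loading step (the file name below is only an assumption) would be:

# Assumption: the file name/path is a guess; point it at wherever the dataset actually lives
data = pd.read_csv('caravan-insurance-challenge.csv')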
Looking at the data, the dataset is already split into train and test sets by the ORIGIN column.
print(data.ORIGIN.value_counts())
train 5822
test 4000
Name: ORIGIN, dtype: int64
As a quick refresher, draw a simple plot of the target column.
(Figure: CARAVAN, the target)
- Prepare the dataset
Be careful here, otherwise the bugs can be really baffling.
# After loading, split into train and test sets first
train = data[data['ORIGIN']=='train']
test = data[data['ORIGIN']=='test']
# Extract the labels
train_labels = np.array(train['CARAVAN'].astype(np.int32)).reshape((-1,))
test_labels = np.array(test['CARAVAN'].astype(np.int32)).reshape((-1,))
Drop the label columns, keeping only the features:
train = train.drop(columns =['ORIGIN','CARAVAN'])
test = test.drop(columns =['ORIGIN','CARAVAN'])
features = np.array(train)
test_features = np.array(test)
labels = train_labels[:]
print('Train shape: {}'.format(train.shape))
print("Test shape :{}".format(test.shape))
train.head()
# Output
Train shape: (5822, 85)
Test shape :(4000, 85)
(The reshape((-1,)) call above flattens the labels into a single 1-D array, with no row/column structure.)
- Label distribution
plt.hist(labels, edgecolor ='k')
plt.xlabel('Label')
plt.ylabel('Count')
plt.title('Counts of Labels')
plt.show()
As we can see, this is an imbalanced classification problem.
Therefore we choose a gradient boosting model and use ROC AUC as the validation metric (see the original article for details).
This article uses LightGBM.
- The model and its default values
from lightgbm import LGBMClassifier
model = LGBMClassifier()#Model with default hyperparameters
print(model)
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Fit the model on the training set and evaluate with roc_auc.
from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer
start = timer()
model.fit(features,labels)
train_time = timer()- start
predictions = model.predict_proba(test_features)[:,1]
auc = roc_auc_score(test_labels,predictions)
print('The baseline score on the test set is {:.4f}.'.format(auc))
print('The baseline training time is {:.4f} seconds'.format(train_time))
# Results
The baseline score on the test set is 0.7092.
The baseline training time is 0.1888 seconds
Due to the small size of the dataset (less than 6000 observations), hyperparameter tuning will have a modest but noticeable effect on the performance (a better investment of time might be to gather more data!).
(The timer measures how long the code takes to run.)
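As a side note, KFold was imported and N_FOLDS = 10 was defined earlier but neither appears in the code shown so far. A minimal sketch of the cross-validated ROC AUC used as the metric in the rest of this article, written with sklearn's generic cross_val_score helper (my own choice, not necessarily how the original article implements it), could look like:

from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

# 10-fold cross-validated ROC AUC on the training data (N_FOLDS = 10 was set earlier)
cv_scores = cross_val_score(LGBMClassifier(), features, labels,
                            cv=N_FOLDS, scoring='roc_auc', n_jobs=-1)
print('Mean CV ROC AUC: {:.4f}'.format(cv_scores.mean()))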
Random Search
import random
Random search also has four parts:
Domain: values over which to search
Optimization algorithm: pick the next values at random! (yes, this qualifies as an algorithm)
Objective function to minimize: in this case our metric is cross-validation ROC AUC
Results history that tracks the hyperparameters tried and the cross-validation metric
Let's take a look at which parameters need tuning.
- Domain for Random Search
Both random search and Bayesian optimization search for hyperparameters over a domain; for random (or grid) search, this domain is called a hyperparameter grid, and discrete values are used for the hyperparameters.
print(LGBMClassifier())
# Results
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
Based on these defaults, we can build the hyperparameter grid below. It is hard to say in advance which values will work better, so for most parameters we simply use the default value or values spread around the default.
**Note:** subsample_dist holds the candidate values for subsample, but boosting_type='goss' does not support random subsampling.
If boosting_type were restricted to other values, subsample_dist could go straight into param_grid and be searched randomly with everything else (a sketch of the conditional handling appears after the grid below).
param_grid ={
'class_weight':[None,'balanced'],
'boosting_type':['gbdt','goss','dart'],
'num_leaves':list(range(30,150)),
'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base=np.exp(1), num=1000)),
'subsample_for_bin':list(range(20000,300000,20000)),
'min_child_samples':list(range(20,500,5)),
'reg_alpha':list(np.linspace(0,1)),
'reg_lambda':list(np.linspace(0,1)),
'colsample_bytree':list(np.linspace(0.6,1,10))
}
# Subsampling (only applicable with 'goss')
subsample_dist =list(np.linspace(0.5,1,100))
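With param_grid and subsample_dist defined, a single random draw of hyperparameters, including the conditional handling of subsample for goss described in the note above, can be sketched roughly as follows (each such draw would then be scored with the cross-validated ROC AUC discussed earlier):

import random

random.seed(50)  # seed chosen arbitrarily, just to make the sketch reproducible

# Pick one value at random for every hyperparameter in the grid
params = {key: random.sample(values, 1)[0] for key, values in param_grid.items()}

# 'goss' does not support row subsampling, so only draw 'subsample' for other boosting types
if params['boosting_type'] == 'goss':
    params['subsample'] = 1.0
else:
    params['subsample'] = random.sample(subsample_dist, 1)[0]

print(params)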
Let's take a look at the distributions of learning_rate and num_leaves. The learning rate is typically drawn on a log scale (see the Quora post referenced in the original article) because it can vary over several orders of magnitude.
Therefore, we can use np.logspace for the search.
np.logspace returns values evenly spaced on a log scale (so if we take the log of the resulting values, their distribution will be uniform).
plt.hist(param_grid['learning_rate'], color='g', edgecolor='k')
plt.xlabel('Learning Rate', size=14)
plt.ylabel('Count', size =14)
plt.title('Learning Rate Distribution', size =18)
plt.show()
Most values lie between 0.005 and 0.2, and it is recommended to pick values in this range. Now let's look at num_leaves.
plt.hist(param_grid['num_leaves'], color='b', edgecolor='k')
plt.xlabel('Number of Leaves', size=14)
plt.ylabel('Count', size =14)
plt.title('Number of Leaves Distribution', size=18)
plt.show()