Using LightGBM


LightGBM is a gradient boosting framework, open-sourced by Microsoft, that uses tree-based learning algorithms.
Documentation:
Source code:
Chinese documentation:
Paper:
Reference blogs:
Loading Data
LightGBM can load data of the following types:
libsvm/tsv/csv/txt format file
NumPy 2D array(s), pandas DataFrame, H2O DataTable's Frame, SciPy sparse matrix
LightGBM binary file
Load a LibSVM text file or a LightGBM binary file:
train_data = lgb.Dataset('train.svm.bin')
Load a NumPy array:
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
Load a scipy.sparse.csr_matrix:
csr = scipy.sparse.csr_matrix((dat, (row, col)))
train_data = lgb.Dataset(csr)
Save a Dataset into a LightGBM binary file (this can speed up later loading):
train_data = lgb.Dataset('')
train_data.save_binary('train.bin')
Create validation data:
validation_data = train_data.create_valid('validation.svm')
When constructing a Dataset, categorical features need to be converted to integers first. Also, set free_raw_data=True (the default is True) to free the raw data after the Dataset is built.
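For example, a minimal sketch of constructing a Dataset with an integer-encoded categorical column (the column index and encoding here are illustrative):

import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)
data[:, 0] = np.random.randint(4, size=500)  # a categorical column, already encoded as integers
label = np.random.randint(2, size=500)
# mark column 0 as categorical; free_raw_data=True releases the raw array once the Dataset is built
train_data = lgb.Dataset(data, label=label, categorical_feature=[0], free_raw_data=True)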
Parameter Settings
Sample weights are often initialized as well:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
Booster parameters
param = {'num_leaves':31, 'num_trees':100, 'objective':'binary'}
param['metric'] = 'auc'  # or 'binary_logloss', or a list of several metrics
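A minimal sketch of training and prediction with these parameters (the data is random and purely illustrative):

import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)
label = np.random.randint(2, size=500)
train_data = lgb.Dataset(data, label=label)
param = {'num_leaves': 31, 'objective': 'binary', 'metric': 'auc'}
bst = lgb.train(param, train_data, num_boost_round=100)  # num_boost_round plays the role of num_trees
ypred = bst.predict(np.random.rand(7, 10))  # predicted probabilities for the binary objective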
Core parameters
config , default = "", type = string, alias: config_file
path of config file
Note: can be used only in CLI version
task , default = train, type = enum, options: train, predict, convert_model, refit, alias: task_type
train, for training, alias: training
predict, for prediction, alias: prediction, test
convert_model, for converting model file into if-else format, see the documentation for more information
refit, for refitting existing models with new data, alias: refit_tree
Note: can be used only in CLI version; for language-specific packages you can use the corresponding functions
objective , default = regression, type = enum,
options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, xentropy, xentlambda, lambdarank, alias: objective_type, app, application
regression application
regression_l2, L2 loss, alias: regression, mean_squared_error, mse, l2_root, root_mean_squared_error, rmse
regression_l1, L1 loss, alias: mean_absolute_error, mae
huber, Huber loss
fair, Fair loss
poisson, Poisson regression
quantile, Quantile regression
mape, MAPE loss, alias: mean_absolute_percentage_error
gamma, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be gamma-distributed
tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be tweedie-distributed
binary, binary classification (or logistic regression). Requires labels in {0, 1}; see the cross-entropy application for general probability labels in [0, 1]
multi-class classification application
multiclass, softmax objective function, alias: softmax
multiclassova, One-vs-All binary objective function, alias: multiclass_ova, ova, ovr
num_class should be set as well (see the sketch after this list)
cross-entropy application
xentropy, objective function for cross-entropy (with optional linear weights), alias: cross_entropy
xentlambda, alternative parameterization of cross-entropy, alias: cross_entropy_lambda
label is anything in interval [0, 1]
lambdarank, lambdarank application
label should be int type in lambdarank tasks, and a larger number represents higher relevance (e.g. 0: bad, 1: fair, 2: good, 3: perfect)
label_gain can be used to set the gain (weight) of int labels
all values in label must be smaller than the number of elements in label_gain
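As noted above, multiclass objectives require num_class to be set alongside them; a minimal sketch (the class count is illustrative):

param = {
    'objective': 'multiclass',
    'num_class': 5,               # required for multiclass / multiclassova
    'metric': 'multi_logloss',
}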
boosting , default = gbdt, type = enum, options: gbdt, gbrt, rf, random_forest, dart, goss, alias: boosting_type, boost
gbdt, traditional Gradient Boosting Decision Tree, alias: gbrt
rf, Random Forest, alias: random_forest
dart, Dropouts meet Multiple Additive Regression Trees
goss, Gradient-based One-Side Sampling
data , default = "", type = string, alias: train, train_data, train_data_file, data_filename
path of training data, LightGBM will train from this data
Note: can be used only in CLI version
valid , default = "", type = string, alias: test, valid_data, valid_data_file, test_data, test_data_file, valid_filenames
path(s) of validation/test data, LightGBM will output metrics for the data
support multiple validation data, separated by ,
Note: can be used only in CLI version
num_iterations , default = 100, type = int, alias: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
number of boosting iterations
Note: internally, LightGBM constructs num_class * num_iterations trees for multi-class classification problems
learning_rate , default = 0.1, type = double, alias: shrinkage_rate, eta, constraints: learning_rate > 0.0
shrinkage rate
in dart, it also affects the normalization weights of dropped trees
num_leaves , default = 31, type = int, alias: num_leaf, max_leaves, max_leaf, constraints: num_leaves > 1
max number of leaves in one tree
tree_learner , default = serial, type = enum, options: serial, feature, data, voting, alias: tree, tree_type, tree_learner_type
serial, single machine tree learner
feature, feature parallel tree learner, alias: feature_parallel
data, data parallel tree learner, alias: data_parallel
voting, voting parallel tree learner, alias: voting_parallel
refer to the Parallel Learning Guide to get more details
num_threads , default = 0, type = int, alias: num_thread, nthread, nthreads, n_jobs
number of threads for LightGBM
0 means default number of threads in OpenMP
for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
be aware a task manager or any similar CPU monitoring tool might report that cores are not being fully utilized. This is normal
for parallel learning, do not use all CPU cores, because this will cause poor performance for the network communication
device_type , default = cpu, type = enum, options: cpu, gpu, alias: device
device for the tree learning; you can use a GPU to achieve faster learning
Note: it is recommended to use a smaller max_bin (e.g. 63) to get a better speedup
Note: for faster speed, the GPU uses 32-bit floating point to sum up by default, so this may affect the accuracy for some tasks. You can set gpu_use_dp=true to enable 64-bit floating point, but it will slow down training
Note: refer to the Installation Guide to build LightGBM with GPU support
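A minimal sketch of a GPU configuration, assuming a GPU-enabled build of LightGBM:

param = {
    'objective': 'binary',
    'device_type': 'gpu',
    'max_bin': 63,  # smaller max_bin, per the speed note above
}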
seed , default = None, type = int, alias: random_seed, random_state
this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.
by default, this seed is unused in favor of default values of the other seeds
this seed has lower priority in comparison with other seeds, which means that it will be overridden if you set other seeds explicitly
Learning control parameters
max_depth , default = -1, type = int
limit the max depth for the tree model. This is used to deal with over-fitting when #data is small. The tree still grows leaf-wise
< 0 means no limit
min_data_in_leaf , default = 20, type = int, alias: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
minimal number of data in one leaf. Can be used to deal with over-fitting
min_sum_hessian_in_leaf , default = 1e-3, type = double, alias: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
minimal sum of the Hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting
bagging_fraction , default = 1.0, type = double, alias: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
like feature_fraction, but this will randomly select part of the data without resampling
can be used to speed up training
can be used to deal with over-fitting
Note: to enable bagging, bagging_freq should be set to a non-zero value as well (a combined example follows below)
bagging_freq , default = 0, type = int, alias: subsample_freq
frequency for bagging
0 means disable bagging; k means perform bagging at every k iterations
Note: to enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well
bagging_seed , default = 3, type = int, alias: bagging_fraction_seed
random seed for bagging
feature_fraction , default = 1.0, type = double, alias: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
LightGBM will randomly select part of the features on each iteration if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree
can be used to speed up training
can be used to deal with over-fitting
feature_fraction_seed , default = 2, type = int
random seed for feature_fraction
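A minimal sketch combining feature subsampling and bagging; note that bagging_fraction only takes effect when bagging_freq is non-zero:

param = {
    'objective': 'binary',
    'feature_fraction': 0.8,  # use 80% of features for each tree
    'bagging_fraction': 0.8,  # use 80% of the data in each bag
    'bagging_freq': 5,        # re-draw the bag every 5 iterations
}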
early_stopping_round , default = 0, type = int, alias: early_stopping_rounds, early_stopping
will stop training if one metric of one validation data doesn’t improve in last early_stopping_round rounds
<= 0 means disable
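A minimal sketch of early stopping against a held-out validation set (random data, purely illustrative; in recent versions of the Python package early_stopping_round can be passed via params, while older versions used the early_stopping_rounds argument of lgb.train):

import numpy as np
import lightgbm as lgb

X_train, y_train = np.random.rand(400, 10), np.random.randint(2, size=400)
X_val, y_val = np.random.rand(100, 10), np.random.randint(2, size=100)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

param = {'objective': 'binary', 'metric': 'auc', 'early_stopping_round': 10}
bst = lgb.train(param, train_data, num_boost_round=500, valid_sets=[valid_data])
print(bst.best_iteration)  # the iteration with the best validation score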
max_delta_step , default = 0.0, type = double, alias: max_tree_output, max_leaf_output
used to limit the max output of tree leaves
<= 0 means no constraint
the final max output of leaves is learning_rate * max_delta_step
lambda_l1 , default = 0.0, type = double, alias: reg_alpha, constraints: lambda_l1 >= 0.0
L1 regularization
lambda_l2 , default = 0.0, type = double, alias: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
L2 regularization
min_gain_to_split , default = 0.0, type = double, alias: min_split_gain, constraints: min_gain_to_split >= 0.0
the minimal gain to perform split
drop_rate , default = 0.1, type = double, alias: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
used only in dart
dropout rate: a fraction of previous trees to drop during the dropout
max_drop , default = 50, type = int
used only in dart
max number of dropped trees during one boosting iteration
<=0 means no limit
skip_drop , default = 0.5, type = double, constraints: 0.0 <= skip_drop <= 1.0
used only in dart
probability of skipping the dropout procedure during a boosting iteration
xgboost_dart_mode , default = false, type = bool
used only in dart
set this to true, if you want to use xgboost dart mode
uniform_drop , default = false, type = bool
used only in dart
set this to true, if you want to use uniform drop
drop_seed , default = 4, type = int
used only in dart
random seed to choose dropping models
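Putting the dart parameters together, a minimal sketch (values are illustrative):

param = {
    'objective': 'binary',
    'boosting': 'dart',
    'drop_rate': 0.1,  # fraction of previous trees to drop
    'max_drop': 50,    # cap on trees dropped per iteration
    'skip_drop': 0.5,  # probability of skipping the dropout entirely
}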
top_rate , default = 0.2, type = double, constraints: 0.0 <= top_rate <= 1.0
used only in goss
the retain ratio of large gradient data
other_rate , default = 0.1, type = double, constraints: 0.0 <= other_rate <= 1.0
used only in goss
the retain ratio of small gradient data
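Similarly, a minimal sketch of a GOSS configuration (top_rate and other_rate together should not exceed 1.0):

param = {
    'objective': 'binary',
    'boosting': 'goss',
    'top_rate': 0.2,    # keep the 20% of samples with the largest gradients
    'other_rate': 0.1,  # randomly sample 10% of the remaining data
}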
min_data_per_group , default = 100, type = int, constraints: min_data_per_group > 0
minimal number of data per categorical group
max_cat_threshold , default = 32, type = int, constraints: max_cat_threshold > 0
used for the categorical features
limit the max threshold points in categorical features
cat_l2 , default = 10.0, type = double, constraints: cat_l2 >= 0.0
used for the categorical features
L2 regularization in categorical split
cat_smooth , default = 10.0, type = double, constraints: cat_smooth >= 0.0
used for the categorical features
this can reduce the effect of noises in categorical features, especially for categories with few data
max_cat_to_onehot , default = 4, type = int, constraints: max_cat_to_onehot > 0
when the number of categories of one feature is smaller than or equal to max_cat_to_onehot, the one-vs-other split algorithm will be used
top_k , default = 20, type = int, alias: topk, constraints: top_k > 0
used in the voting parallel tree learner
set this to a larger value for more accurate results, but it will slow down the training speed
monotone_constraints , default = None, type = multi-int, alias: mc, monotone_constraint
used for constraints of monotonic features
1 means increasing, -1 means decreasing, 0 means non-constraint
you need to specify all features in order. For example, mc=-1,0,1 means decreasing for the 1st feature, no constraint for the 2nd feature and increasing for the 3rd feature (a Python sketch follows below)
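In the Python package the same constraint can be passed as a list; a minimal sketch for a 3-feature dataset:

param = {
    'objective': 'regression',
    'monotone_constraints': [-1, 0, 1],  # decreasing, unconstrained, increasing
}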
feature_contri , default = None, type = multi-double, alias: feature_contrib, fc, fp, feature_penalty
used to control a feature's split gain; LightGBM will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of the i-th feature
you need to specify all features in order
forcedsplits_filename , default = "", type = string, alias: fs, forced_splits_filename, forced_splits_file, forced_splits
path to a .json file that specifies splits to force at the top of every decision tree before best-first learning commences
the .json file can be arbitrarily nested, and each split contains feature and threshold fields, as well as left and right fields representing subsplits
categorical splits are forced in a one-hot fashion, with left representing the split containing the feature value and right representing other values
Note: the forced split logic will be ignored if the split makes the gain worse
see the documentation for an example
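A minimal sketch of writing such a file and pointing LightGBM at it; the feature indices and thresholds are purely illustrative:

import json

# force a split on feature 0 at the root, then on feature 1 in its left child
forced = {
    'feature': 0,
    'threshold': 0.5,
    'left': {'feature': 1, 'threshold': 10.0},
}
with open('forced_splits.json', 'w') as f:
    json.dump(forced, f)

param = {'objective': 'binary', 'forcedsplits_filename': 'forced_splits.json'}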
refit_decay_rate , default = 0.9, type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
decay rate of the refit task; LightGBM will use leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output to refit trees
used only in the refit task in the CLI version, or as an argument of the refit function in language-specific packages
