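Using LightGBM
LightGBM is a gradient boosting framework, open-sourced by Microsoft, that uses tree-based learning algorithms.
Documentation:
Source code:
Chinese documentation:
Paper:
Reference blog: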
Loading data
LightGBM can load data from the following sources:
libsvm/tsv/csv/txt format file
NumPy 2D array(s), pandas DataFrame, H2O DataTable's Frame, SciPy sparse matrix
LightGBM binary file
Load a LibSVM text file or a LightGBM binary file:
import lightgbm as lgb
train_data = lgb.Dataset('train.svm.bin')
Load a NumPy array:
import numpy as np
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
Load a scipy.sparse.csr_matrix array:
import scipy.sparse
csr = scipy.sparse.csr_matrix((dat, (row, col)))
train_data = lgb.Dataset(csr)
Save a Dataset into a LightGBM binary file (this can speed up later runs):
train_data = lgb.Dataset('')
train_data.save_binary('train.bin')
Create validation data:
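validation_data = train_data.create_valid('validation.svm')
When constructing the Dataset, categorical features must first be converted to integers. Also set free_raw_data=True (this is the default), so that the raw data is released once the Dataset has been built.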
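The integer conversion mentioned above could be done with a small helper like the one below. This is only a sketch: `encode_categories` is a hypothetical name, not part of LightGBM, and the color values are made-up sample data. Codes are assigned in order of first appearance.

```python
def encode_categories(values):
    """Map each distinct category to a non-negative integer code.

    Codes start at 0 and follow the order of first appearance.
    Returns the encoded list and the category -> code mapping.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Made-up sample column for illustration.
colors = ["red", "green", "red", "blue", "green"]
encoded, mapping = encode_categories(colors)
print(encoded)
print(mapping)
```

The same mapping has to be reused for validation and prediction data, otherwise the integer codes would no longer refer to the same categories.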
Setting parameters
Sample weights are usually initialized as well:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
Booster parameters
param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'  # or 'binary_logloss'
Core parameters
config , default = "", type = string, alias: config_file
path of config file
Note: can be used only in CLI version
task , default = train, type = enum, options: train, predict, convert_model, refit, alias: task_type
train, for training, alias: training
predict, for prediction, alias: prediction, test
convert_model, for converting model file into if-else format (see the LightGBM documentation for more information)
refit, for refitting existing models with new data, alias: refit_tree
Note: can be used only in CLI version; for language-specific packages you can use the corresponding functions
objective , default = regression, type = enum, options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, xentropy, xentlambda, lambdarank, alias: objective_type, app, application
regression application
regression_l2, L2 loss, alias: regression, mean_squared_error, mse, l2_root, root_mean_squared_error, rmse
regression_l1, L1 loss, alias: mean_absolute_error, mae
huber, Huber loss
fair, Fair loss
poisson, Poisson regression
quantile, Quantile regression
mape, MAPE loss, alias: mean_absolute_percentage_error
gamma, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be gamma-distributed
tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be tweedie-distributed
binary, binary classification (or logistic regression). Requires labels in {0, 1}; see cross-entropy application for general probability labels in [0, 1]
multi-class classification application
multiclass, softmax objective function, alias: softmax
multiclassova, One-vs-All binary objective function, alias: multiclass_ova, ova, ovr
num_class should be set as well
cross-entropy application
xentropy, objective function for cross-entropy (with optional linear weights), alias: cross_entropy
xentlambda, alternative parameterization of cross-entropy, alias: cross_entropy_lambda
label is anything in interval [0, 1]
lambdarank, lambdarank application
label should be int type in lambdarank tasks, and a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)
label_gain can be used to set the gain (weight) of int label
all values in label must be smaller than number of elements in label_gain
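The constraint between labels and label_gain can be checked before training. The following is a minimal sketch, not LightGBM code: `labels_outside_gain` is a hypothetical helper, and the gain values (2^i - 1 for relevance level i) are chosen purely for illustration.

```python
def labels_outside_gain(labels, label_gain):
    """Return the sorted distinct labels that have no entry in label_gain.

    A label y is valid only if 0 <= y < len(label_gain).
    """
    return sorted({y for y in labels if not 0 <= y < len(label_gain)})


# Illustrative gains: relevance level i gets gain 2**i - 1.
label_gain = [2 ** i - 1 for i in range(4)]

ok_labels = [0, 1, 2, 3, 3, 1]   # 0:bad, 1:fair, 2:good, 3:perfect
bad_labels = [0, 2, 4]           # 4 has no matching gain entry

print(labels_outside_gain(ok_labels, label_gain))
print(labels_outside_gain(bad_labels, label_gain))
```

An empty result means every label has a gain; anything else lists the labels that would need either a longer label_gain or remapping.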
boosting , default = gbdt, type = enum, options: gbdt, gbrt, rf, random_forest, dart, goss, alias: boosting_type, boost
gbdt, traditional Gradient Boosting Decision Tree, alias: gbrt
rf, Random Forest, alias: random_forest
dart, Dropouts meet Multiple Additive Regression Trees
goss, Gradient-based One-Side Sampling
data , default = "", type = string, alias: train, train_data, train_data_file, data_filename
path of training data, LightGBM will train from this data
Note: can be used only in CLI version
valid , default = "", type = string, alias: test, valid_data, valid_data_file, test_data, test_data_file, valid_filenames
path(s) of validation/test data, LightGBM will output metrics for the data
support multiple validation data, separated by ,
Note: can be used only in CLI version
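Since config, task, data and valid only apply to the CLI version, they usually end up in a config file of `key = value` lines. The sketch below renders such a file from a dict. `render_config` and all file names are hypothetical and only serve the illustration; several validation files are joined with commas as described above.

```python
def render_config(params):
    """Render a dict as 'key = value' lines, one parameter per line.

    List or tuple values are joined with commas, e.g. several
    validation files for the 'valid' parameter.
    """
    lines = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


config_text = render_config({
    "task": "train",
    "objective": "binary",
    "data": "train.svm",                      # hypothetical file name
    "valid": ["validation.svm", "test.svm"],  # hypothetical file names
    "num_iterations": 100,
})
print(config_text)
```

The resulting text could be written to a file such as train.conf and passed to the CLI through the config parameter.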
num_iterations , default = 100, type = int, alias: num_iteration, n_iter, num_tree, num_trees, num_round, num_rounds, num_boost_round, n_estimators, constraints: num_iterations >= 0
number of boosting iterations
Note: internally, LightGBM constructs num_class * num_iterations trees for multi-class classification problems
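The note on tree counts is simple arithmetic, but it matters when estimating model size or training time. A small sketch, with `total_trees` as a hypothetical helper name:

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: one per class per boosting iteration."""
    if num_iterations < 0 or num_class < 1:
        raise ValueError("num_iterations must be >= 0 and num_class >= 1")
    return num_class * num_iterations


print(total_trees(100))               # single-output objective
print(total_trees(100, num_class=3))  # 3-class problem
```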
learning_rate , default = 0.1, type = double, alias: shrinkage_rate, eta, constraints: learning_rate > 0.0
shrinkage rate
in dart, it also affects the normalization weights of dropped trees
num_leaves , default = 31, type = int, alias: num_leaf, max_leaves, max_leaf, constraints: num_leaves > 1
max number of leaves in one tree
tree_learner , default = serial, type = enum, options: serial, feature, data, voting, alias: tree, tree_type, tree_learner_type
serial, single machine tree learner
feature, feature parallel tree learner, alias: feature_parallel
data, data parallel tree learner, alias: data_parallel
voting, voting parallel tree learner, alias: voting_parallel
refer to the LightGBM documentation for more details
num_threads , default = 0, type = int, alias: num_thread, nthread, nthreads, n_jobs
number of threads for LightGBM
0 means default number of threads in OpenMP
for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
be aware that a task manager or any similar CPU monitoring tool might report that cores are not being fully utilized. This is normal
for parallel learning, do not use all CPU cores because this will cause poor performance for the network communication
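One rough way to follow the advice on real CPU cores is to halve the logical CPU count. This is a heuristic sketch only: `guess_physical_cores` is a hypothetical helper, and the assumption of 2 threads per core is not detected from the hardware, so it may be wrong on machines without hyper-threading.

```python
import os


def guess_physical_cores(logical_cores=None, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    threads_per_core=2 is an assumption (typical hyper-threading),
    not something this function measures.
    """
    if logical_cores is None:
        # os.cpu_count() reports logical CPUs and may return None.
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores // threads_per_core)


param = {"num_threads": guess_physical_cores()}
print(param)
```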
device_type , default = cpu, type = enum, options: cpu, gpu, alias: device
device for the tree learning, you can use GPU to achieve faster learning
Note: it is recommended to use a smaller max_bin (e.g. 63) to get a better speed-up
Note: for faster speed, GPU uses 32-bit floating point to sum up by default, so this may affect the accuracy for some tasks. You can set gpu_use_dp=true to enable 64-bit floating point, but it will slow down the training
Note: refer to the LightGBM documentation to build LightGBM with GPU support
seed , default = None, type = int, alias: random_seed, random_state
this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc.
by default, this seed is unused in favor of default values of other seeds
this seed has lower priority in comparison with other seeds, which means that it will be overridden, if you set other seeds explicitly
Learning control parameters
max_depth , default = -1, type = int
limit the max depth for tree model. This is used to deal with over-fitting when #data is small. Tree still grows leaf-wise
< 0 means no limit
min_data_in_leaf , default = 20, type = int, alias: min_data_per_leaf, min_data, min_child_samples, constraints: min_data_in_leaf >= 0
minimal number of data in one leaf. Can be used to deal with over-fitting
min_sum_hessian_in_leaf , default = 1e-3, type = double, alias: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
minimal sum hessian in one leaf. Like min_data_in_leaf, it can be used to deal with over-fitting
bagging_fraction , default = 1.0, type = double, alias: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
like feature_fraction, but this will randomly select part of data without resampling
can be used to speed up training
can be used to deal with over-fitting
Note: to enable bagging, bagging_freq should be set to a non zero value as well
bagging_freq , default = 0, type = int, alias: subsample_freq
frequency for bagging
0 means disable bagging; k means perform bagging at every k iteration
Note: to enable bagging, bagging_fraction should be set to value smaller than 1.0 as well
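The two notes mean that bagging only takes effect when both parameters are set together, which is easy to get wrong. A minimal sketch of that check, with `bagging_enabled` as a hypothetical helper operating on a plain parameter dict:

```python
def bagging_enabled(params):
    """True only when both conditions from the notes hold:
    bagging_freq is non-zero and bagging_fraction is below 1.0.
    Missing keys fall back to the documented defaults.
    """
    fraction = params.get("bagging_fraction", 1.0)
    freq = params.get("bagging_freq", 0)
    return freq > 0 and 0.0 < fraction < 1.0


print(bagging_enabled({"bagging_fraction": 0.8}))                     # freq missing
print(bagging_enabled({"bagging_freq": 5}))                           # fraction missing
print(bagging_enabled({"bagging_fraction": 0.8, "bagging_freq": 5}))  # both set
```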
bagging_seed , default = 3, type = int, alias: bagging_fraction_seed
random seed for bagging
feature_fraction , default = 1.0, type = double, alias: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
LightGBM will randomly select part of features on each iteration if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree
can be used to speed up training
can be used to deal with over-fitting
feature_fraction_seed , default = 2, type = int
random seed for feature_fraction
early_stopping_round , default = 0, type = int, alias: early_stopping_rounds, early_stopping
will stop training if one metric of one validation data doesn’t improve in last early_stopping_round rounds
<= 0 means disable
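The stopping rule can be illustrated on a list of per-round validation scores. The sketch below is a simplified stand-in, not LightGBM's implementation: it tracks a single metric on a single validation set, `early_stopping_iteration` is a hypothetical name, and the AUC values are made up.

```python
def early_stopping_iteration(scores, early_stopping_round, higher_is_better=True):
    """Return the 1-based round at which training would stop,
    or None if it runs through all rounds.

    Training stops once the score has not improved during the
    last `early_stopping_round` rounds.
    """
    if early_stopping_round <= 0:
        return None  # early stopping disabled
    best = None
    best_round = 0
    for rnd, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best, best_round = score, rnd
        elif rnd - best_round >= early_stopping_round:
            return rnd
    return None


# Made-up validation AUC per boosting round.
auc = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79]
print(early_stopping_iteration(auc, early_stopping_round=3))
```

With these values the best score so far is reached in round 3, so a patience of 3 stops in round 6 and never sees the later improvement in round 7. That is the usual trade-off when choosing a small early_stopping_round.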
max_delta_step , default = 0.0, type = double, alias: max_tree_output, max_leaf_output
used to limit the max output of tree leaves
<= 0 means no constraint
the final max output of leaves is learning_rate * max_delta_step
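The relation between max_delta_step and learning_rate is a one-line formula. A tiny sketch, with `max_leaf_output` as a hypothetical helper name:

```python
def max_leaf_output(learning_rate, max_delta_step):
    """Largest leaf output after shrinkage, or None when unconstrained
    (max_delta_step <= 0 means no constraint)."""
    if max_delta_step <= 0:
        return None
    return learning_rate * max_delta_step


print(max_leaf_output(0.1, 0.0))  # default: no constraint
print(max_leaf_output(0.1, 5.0))
```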
lambda_l1 , default = 0.0, type = double, alias: reg_alpha, constraints: lambda_l1 >= 0.0
L1 regularization
lambda_l2 , default = 0.0, type = double, alias: reg_lambda, lambda, constraints: lambda_l2 >= 0.0
L2 regularization
min_gain_to_split , default = 0.0, type = double, alias: min_split_gain, constraints: min_gain_to_split >= 0.0
the minimal gain to perform split
drop_rate , default = 0.1, type = double, alias: rate_drop, constraints: 0.0 <= drop_rate <= 1.0
used only in dart
dropout rate: a fraction of previous trees to drop during the dropout
max_drop , default = 50, type = int
used only in dart
max number of dropped trees during one boosting iteration
<=0 means no limit
skip_drop , default = 0.5, type = double, constraints: 0.0 <= skip_drop <= 1.0
used only in dart
probability of skipping the dropout procedure during a boosting iteration
xgboost_dart_mode , default = false, type = bool
used only in dart
set this to true, if you want to use xgboost dart mode
uniform_drop , default = false, type = bool
used only in dart
set this to true, if you want to use uniform drop
drop_seed , default = 4, type = int
used only in dart
random seed to choose dropping models
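How drop_rate, max_drop and skip_drop interact can be sketched as follows. This is a deliberately simplified illustration that assumes every previous tree is dropped independently with the same probability; it is not LightGBM's exact procedure, and `select_dropped_trees` is a hypothetical name.

```python
import random


def select_dropped_trees(num_trees, drop_rate=0.1, max_drop=50,
                         skip_drop=0.5, rng=None):
    """Pick the indices of previous trees to drop in one iteration.

    Simplified assumptions: uniform, independent dropping per tree,
    then a cap of max_drop trees (max_drop <= 0 means no cap).
    """
    if rng is None:
        rng = random.Random()
    # With probability skip_drop the dropout step is skipped entirely.
    if rng.random() < skip_drop:
        return []
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    if max_drop > 0 and len(dropped) > max_drop:
        dropped = sorted(rng.sample(dropped, max_drop))
    return dropped


rng = random.Random(4)  # fixed seed so repeated runs pick the same trees
print(select_dropped_trees(200, rng=rng))
```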
top_rate , default = 0.2, type = double, constraints: 0.0 <= top_rate <= 1.0
used only in goss
the retain ratio of large gradient data
other_rate , default = 0.1, type = double, constraints: 0.0 <= other_rate <= 1.0
used only in goss
the retain ratio of small gradient data
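The roles of top_rate and other_rate are easier to see in a small sampling sketch. The code follows the GOSS idea as described in the LightGBM paper: keep the instances with the largest absolute gradients, sample from the rest, and up-weight the sampled ones by (1 - top_rate) / other_rate. It is an illustration in plain Python, not the library's implementation; `goss_sample` and the gradient values are made up.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (indices, weights) of the instances kept by GOSS."""
    if rng is None:
        rng = random.Random()
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(n * top_rate)
    other_n = int(n * other_rate)
    top = order[:top_n]          # always kept
    rest = order[top_n:]         # randomly sampled
    sampled = rng.sample(rest, min(other_n, len(rest)))
    # Sampled small-gradient instances are amplified to compensate
    # for the instances that were thrown away.
    amplify = (1.0 - top_rate) / other_rate if other_rate > 0 else 0.0
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Made-up gradients with distinct absolute values.
gradients = [(-1) ** i * (i + 1) / 20 for i in range(20)]
indices, weights = goss_sample(gradients, rng=random.Random(0))
print(indices)
print(weights)
```

With 20 instances and the default rates, 4 large-gradient instances are kept with weight 1 and 2 of the remaining 16 are sampled with a larger weight.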
min_data_per_group , default = 100, type = int, constraints: min_data_per_group > 0
minimal number of data per categorical group
max_cat_threshold , default = 32, type = int, constraints: max_cat_threshold > 0
used for the categorical features
limit the max threshold points in categorical features
cat_l2 , default = 10.0, type = double, constraints: cat_l2 >= 0.0
used for the categorical features
L2 regularization in categorical split
cat_smooth , default = 10.0, type = double, constraints: cat_smooth >= 0.0
used for the categorical features
this can reduce the effect of noises in categorical features, especially for categories with few data
max_cat_to_onehot , default = 4, type = int, constraints: max_cat_to_onehot > 0
when number of categories of one feature is smaller than or equal to max_cat_to_onehot, one-vs-other split algorithm will be used
top_k , default = 20, type = int, alias: topk, constraints: top_k > 0
used in voting parallel
set this to larger value for more accurate result, but it will slow down the training speed
monotone_constraints , default = None, type = multi-int, alias: mc, monotone_constraint
used for constraints of monotonic features
1 means increasing, -1 means decreasing, 0 means non-constraint
you need to specify all features in order. For example, mc=-1,0,1 means decreasing for 1st feature, non-constraint for 2nd feature and increasing for the 3rd feature
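Because every feature needs an entry and the order must match the columns, it can be less error-prone to build the list from a mapping. A minimal sketch: `monotone_constraints_string` and the column names are hypothetical, and unlisted features default to 0 (no constraint).

```python
def monotone_constraints_string(feature_names, directions):
    """Build the comma-separated constraint list, one value per
    feature, in column order. Unlisted features get 0."""
    values = []
    for name in feature_names:
        direction = directions.get(name, 0)
        if direction not in (-1, 0, 1):
            raise ValueError(f"invalid constraint {direction!r} for {name!r}")
        values.append(str(direction))
    return ",".join(values)


features = ["price", "age", "income"]  # hypothetical column names
mc = monotone_constraints_string(features, {"price": -1, "income": 1})
print(mc)
```

This reproduces the mc=-1,0,1 example: decreasing in the 1st feature, unconstrained in the 2nd, increasing in the 3rd.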
feature_contri , default = None, type = multi-double, alias: feature_contrib, fc, fp, feature_penalty
used to control feature's split gain, will use gain[i] = max(0, feature_contri[i]) * gain[i] to replace the split gain of i-th feature
you need to specify all features in order
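The gain formula can be applied to a list of per-feature gains to see its effect. This is a plain-Python sketch of the formula only; `apply_feature_contri` is a hypothetical name and the gain values are made up.

```python
def apply_feature_contri(gains, feature_contri):
    """Scale each feature's split gain:
    gain[i] = max(0, feature_contri[i]) * gain[i]."""
    if len(gains) != len(feature_contri):
        raise ValueError("one feature_contri value is needed per feature")
    return [max(0.0, c) * g for g, c in zip(gains, feature_contri)]


gains = [10.0, 4.0, 6.0]            # made-up split gains
feature_contri = [1.0, 0.5, -2.0]   # negative values are clipped to 0
print(apply_feature_contri(gains, feature_contri))
```

A value of 1.0 leaves a feature untouched, a value between 0 and 1 penalizes it, and 0 or a negative value makes its gain zero.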
forcedsplits_filename , default = "", type = string, alias: fs, forced_splits_filename, forced_splits_file, forced_splits
path to a .json file that specifies splits to force at the top of every decision tree before best-first learning commences
.json file can be arbitrarily nested, and each split contains feature, threshold fields, as well as left and right fields representing subsplits
categorical splits are forced in a one-hot fashion, with left representing the split containing the feature value and right representing other values
Note: the forced split logic will be ignored, if the split makes gain worse
see the LightGBM documentation for an example
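A nested forced-splits structure with the fields named above could be produced like this. It is a sketch under assumptions: the feature indices and thresholds are invented, feature values are assumed to be column indices, and `split_depth` is a hypothetical helper that only inspects the structure.

```python
import json

# Invented example: force a root split on feature 0, then split
# both children on feature 1 with different thresholds.
forced_splits = {
    "feature": 0,
    "threshold": 0.5,
    "left": {"feature": 1, "threshold": 2.0},
    "right": {"feature": 1, "threshold": 3.5},
}


def split_depth(node):
    """Depth of the forced-split tree (a split without children has depth 1)."""
    children = [node[key] for key in ("left", "right") if key in node]
    return 1 + max((split_depth(child) for child in children), default=0)


forced_splits_json = json.dumps(forced_splits, indent=2)
print(forced_splits_json)
print(split_depth(forced_splits))
```

The JSON text would be saved to a file whose path is then given as forcedsplits_filename.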
refit_decay_rate , default = 0.9, type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
decay rate of refit task, will use leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output to refit trees
used only in refit task in CLI version or as argument in refit function in language-specific package
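The refit formula is a weighted average of the old and new leaf outputs. A small sketch of the formula, with `refit_leaf_output` as a hypothetical helper and invented leaf values:

```python
def refit_leaf_output(old_leaf_output, new_leaf_output, refit_decay_rate=0.9):
    """Blend old and new leaf outputs:
    refit_decay_rate * old + (1.0 - refit_decay_rate) * new."""
    if not 0.0 <= refit_decay_rate <= 1.0:
        raise ValueError("refit_decay_rate must be in [0.0, 1.0]")
    return (refit_decay_rate * old_leaf_output
            + (1.0 - refit_decay_rate) * new_leaf_output)


# Invented leaf values for illustration.
print(refit_leaf_output(1.0, 2.0))                        # default rate 0.9
print(refit_leaf_output(1.0, 2.0, refit_decay_rate=0.0))  # new data only
```

A rate of 1.0 keeps the existing model unchanged, while 0.0 replaces the leaf outputs entirely with values fitted on the new data.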