LightGBM rank training notes

Parameter tuning
num_leaves and max_depth
num_leaves is the main parameter for controlling the complexity of the tree model. In theory, by analogy with a depth-wise tree, we could set num_leaves = 2^(max_depth). However, this simple conversion works poorly in practice, because for the same number of leaves a leaf-wise tree is usually much deeper than a depth-wise tree, which can lead to over-fitting. Therefore, when tuning num_leaves, keep it smaller than 2^(max_depth). For example, when max_depth=7 a depth-wise tree can reach good accuracy, but setting num_leaves to 127 may cause over-fitting, while setting it to 70 or 80 may give better accuracy than the depth-wise tree. In fact, the concept of depth does not mean much for a leaf-wise tree, since there is no sensible mapping from leaves to depth.
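A minimal sketch of how these two parameters might be set together in a LightGBM params dict (the concrete numbers are only illustrative, not values from this experiment):

# Sketch: cap the depth explicitly and keep num_leaves well below 2^max_depth
# so the leaf-wise tree cannot over-specialize.
params_tree = {
    'max_depth': 7,     # explicit depth limit
    'num_leaves': 70,   # noticeably less than 2^7 = 128
}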
min_data_in_leaf and min_sum_hessian_in_leaf
min_data_in_leaf is a very important parameter for preventing over-fitting in leaf-wise trees. Its optimal value depends on the number of training samples and on num_leaves. Setting it large avoids growing overly deep trees, but may cause under-fitting. In practice, a value of a few hundred or a few thousand is usually enough for large datasets. You can also use max_depth to limit the tree depth explicitly.
min_sum_hessian_in_leaf, default = 1e-3, type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0. It is the minimal sum of the Hessian in one leaf and, like min_data_in_leaf, can be used to deal with over-fitting.
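A minimal sketch of where these leaf-level constraints go (the values are illustrative only):

# Sketch: both parameters guard against tiny, over-specialized leaves.
params_leaf = {
    'min_data_in_leaf': 500,           # e.g. a few hundred on a large dataset
    'min_sum_hessian_in_leaf': 1e-3,   # default; raise it to regularize harder
}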
monotonic constraints
It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model. A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response.
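In LightGBM this is exposed through the monotone_constraints parameter, one value per feature: 1 for a non-decreasing relationship, -1 for non-increasing, 0 for unconstrained. A hedged sketch, assuming a hypothetical three-feature dataset:

# Sketch: force the prediction to be non-decreasing in feature 0,
# unconstrained in feature 1, and non-increasing in feature 2.
params_mono = {
    'monotone_constraints': [1, 0, -1],
}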
group_column and ignore_column
group_column, default = "", type = int or string, aliases: group, group_id, query_column, query, query_id. Used to specify the query/group id column. Use a number for the index, e.g. query=0 means column_0 is the query id; add the prefix name: for a column name, e.g. query=name:query_id.
Note: works only in case of loading data directly from file
Note: data should be grouped by query_id; for more information, see Query Data
Note: the index starts from 0 and it does not count the label column when the passed type is int, e.g. when the label is column_0 and query_id is column_1, the correct parameter is query=0
ignore_column, default = "", type = multi-int or string, aliases: ignore_feature, blacklist
used to specify columns to be ignored during training
use a number for the index, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be ignored
add the prefix name: for column names, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be ignored
Note: works only in case of loading data directly from file
Note: the index starts from 0 and it does not count the label column when the passed type is int
Note: even though the specified columns are completely ignored during training, they should still have a valid format so that LightGBM can load the file successfully
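For ranking, the query/group column is what tells LightGBM which rows belong to the same query. A hedged sketch of passing these dataset parameters when the Dataset is built directly from a CSV file (the column names query_id and sid are hypothetical, and the sketch assumes the file has a header row):

import lightgbm as lgb

# Sketch: declare the query column and a column to ignore through dataset params.
train_set = lgb.Dataset(
    'data/train.csv',
    params={
        'header': True,               # the file is assumed to have a header row
        'query': 'name:query_id',     # hypothetical query-id column
        'ignore_column': 'name:sid',  # hypothetical column to skip during training
    },
)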
categorical_feature
categorical_feature, default = "", type = multi-int or string, aliases: cat_feature, categorical_column, cat_column. Used to specify categorical features.
use a number for the index, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features; add the prefix name: for column names, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
Note: only supports categorical features of int type (not applicable for data represented as a pandas DataFrame in the Python package)
Note: the index starts from 0 and it does not count the label column when the passed type is int
Note: all values should be less than Int32.MaxValue (2147483647)
Note: using large values can be memory consuming. The tree decision rule works best when categorical features are represented by consecutive integers starting from zero
Note: all negative values will be treated as missing values
Note: the output cannot be monotonically constrained with respect to a categorical feature
lambda_l1 and lambda_l2
For comparison, in XGBoost the regularization term includes both the number of leaves and the leaf weights.
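In LightGBM, lambda_l1 and lambda_l2 add L1 and L2 penalties on the leaf weights. A minimal sketch (the values mirror the search grid used later and are only examples):

# Sketch: L1/L2 regularization on leaf weights.
params_reg = {
    'lambda_l1': 0.001,
    'lambda_l2': 0.1,
}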
bagging_fraction and bagging_freq
bagging_fraction, default = 1.0, type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
like feature_fraction, but this will randomly select part of the data without resampling
can be used to speed up training
can be used to deal with over-fitting
Note: to enable bagging, bagging_freq should be set to a non-zero value as well
bagging_freq, default = 0, type = int, aliases: subsample_freq
frequency for bagging
0 means disable bagging; k means perform bagging every k iterations. Every k-th iteration, LightGBM will randomly select bagging_fraction * 100 % of the data to use for the next k iterations
Note: to enable bagging, bagging_fraction should be set to a value smaller than 1.0 as well
feature_fraction, default = 1.0, type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree
can be used to speed up training
can be used to deal with over-fitting
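A minimal sketch enabling both kinds of subsampling together (the values are illustrative and fall inside the search grid used later):

# Sketch: row subsampling needs both bagging_fraction < 1.0 and bagging_freq > 0;
# feature_fraction subsamples columns independently for every tree.
params_sub = {
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 10,   # re-draw the bagged subset every 10 iterations
}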
Categorical Feature Support
LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories as described here. This often performs better than one-hot encoding.
Use categorical_feature to specify the categorical features. Refer to the parameter categorical_feature in Parameters.
Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647). It is best to use a contiguous range of integers starting from zero.
Use min_data_per_group and cat_smooth to deal with over-fitting (when #data is small or #category is large).
For a categorical feature with high cardinality (#category is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.
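A hedged sketch of declaring categorical columns plus the two anti-over-fitting knobs mentioned above (the column indices 18 and 31 follow the training script at the end of this post; the other values are illustrative):

import lightgbm as lgb

# Sketch: columns 18 and 31 hold integer-encoded categorical features.
train_set = lgb.Dataset('data/train.csv', categorical_feature=[18, 31], free_raw_data=False)

params_cat = {
    'min_data_per_group': 100,  # minimum number of rows per category group
    'cat_smooth': 10.0,         # smoothing applied to categorical splits
}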
lambdaRank label
The label should be of type int, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
Use label_gain to set the gain (weight) of each int label.
Use lambdarank_truncation_level to truncate the max DCG.
label_gain can be used to set the gain of each label; how to choose these gain values is a question worth studying in its own right.
label_gain, default = 0,1,3,7,15,31,63,…,2^30-1, type = multi-double
used only in the lambdarank application
relevance gain for labels. For example, the gain of label 2 is 3 in case of the default label gains. Separate the values by ,
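A minimal sketch of a lambdarank parameter dict that makes the label gains explicit (the gains shown are simply the defaults for four relevance levels and the truncation level is the library default; treat them as a starting point rather than this experiment's settings):

# Sketch: lambdarank objective, NDCG at 10 and 40, explicit gains for labels 0..3.
params_rank = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'eval_at': [10, 40],
    'label_gain': [0, 1, 3, 7],         # gains for relevance labels 0, 1, 2, 3
    'lambdarank_truncation_level': 30,  # positions used when computing the max DCG
}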
Tuning procedure
Learning rate and number of iterations
First fix the learning rate at a relatively high value, say learning_rate = 0.1, and then use cross validation together with early_stopping_round to find the best number of iterations.
n_estimators / num_iterations / num_round / num_boost_round: set this to a fairly large number first, then read the optimal iteration count from the cv result; see the sketch below and the full code afterwards.
early_stopping_round stops training when a metric on the validation data has not improved in the last early_stopping_round rounds.
cv can be configured to return either the models or the evaluation results; the evaluation results look like {'ndcg@10-mean': [...], 'ndcg@10-stdv': [...]}.
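A minimal sketch of this step, assuming params is the base lambdarank parameter dict and train_data is an lgb.Dataset that already carries query/group information (the early-stopping settings are illustrative):

import lightgbm as lgb

# Sketch: fix learning_rate, give boosting a generous budget, and let
# cross-validated early stopping pick the iteration count.
params['learning_rate'] = 0.1
cv_res = lgb.cv(params, train_data, num_boost_round=1000, nfold=5,
                stratified=False, early_stopping_rounds=50)
best_rounds = len(cv_res['ndcg@10-mean'])  # list length = best iteration count
print(best_rounds, max(cv_res['ndcg@10-mean']))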
max_depth and num_leaves
max_bin and min_data_in_leaf
feature_fraction, bagging_fraction and bagging_freq
lambda_l1 and lambda_l2
monotonic constraints
learning_rate
Feature importance
Code
import sys
import lightgbm as lgb

# Base training parameters. The original post does not show this dict, so the
# values below are assumptions consistent with the rest of the post (lambdarank
# objective, NDCG evaluated at 10 and 40); the query/group column is assumed to
# be configured as described in the group_column section above.
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'eval_at': [10, 40],
    'learning_rate': 0.1,
}

class GridSearch(object):
    def __init__(self):
        self.gbm = None
        self.optimal_params = {}

    def grid_search(self, grids, paral=False, is_print=True):
        """Search one group of parameters. grids: {param_name: [candidate values]}"""
        res_d = {}        # combination string -> (ndcg@10, ndcg@40)
        res_params = {}   # combination string -> raw parameter dict
        grid_l = list(grids.items())
        params.update(self.optimal_params)
        # Feature-parallel search, not used for now
        if paral:
            pass
        # Enumerate every combination of the given parameters
        else:
            # all possible parameter combinations
            param_l = [[x] for x in grid_l[0][1]]  # initialize with the first parameter
            for i in range(1, len(grid_l)):
                param_l = [l + [p] for l in param_l for p in grid_l[i][1]]
            name_l = [tup[0] for tup in grid_l]
            for i in range(len(param_l)):
                _d = dict(zip(name_l, param_l[i]))
                params_copy = params.copy()
                params_copy.update(_d)
                # e.g. k = 'lambda_l1: 0.001,lambda_l2: 0.1'
                k = ','.join([str(tup[0]) + ": " + str(tup[1]) for tup in zip(name_l, param_l[i])])
                print("Combination {}: {}".format(i, k))
                self.gbm = lgb.train(params_copy, train_data, num_boost_round=400,
                                     valid_sets=[valid_data],
                                     early_stopping_rounds=40, verbose_eval=20)
                res_d[k] = (self.gbm.best_score['valid_0']['ndcg@10'],
                            self.gbm.best_score['valid_0']['ndcg@40'])
                res_params[k] = _d
            ndcg_10 = sorted(res_d.items(), key=lambda kv: kv[1][0], reverse=True)
            # keep the parameter values of the best combination (ranked by ndcg@10)
            self.optimal_params.update(res_params[ndcg_10[0][0]])
            print(self.optimal_params)
            ndcg_40 = sorted(res_d.items(), key=lambda kv: kv[1][1], reverse=True)
            if is_print:
                for k, v in ndcg_10:
                    print(k, v)
                print("-" * 40)
                for k, v in ndcg_40:
                    print(k, v)
            return ndcg_10, ndcg_40

    def all_parameters_search(self, grids_l):
        """Automatically search all parameter groups, one group after another"""
        for grids in grids_l:
            # grids.update(self.optimal_params)  # carry over the best parameters of the previous step
            ndcg_10, ndcg_40 = self.grid_search(grids, paral=False, is_print=False)
            for k, v in ndcg_10:
                print('{}\t{},{}\n'.format(k, v[0], v[1]))
            for k, v in ndcg_40:
                print('{}\t{},{}\n'.format(k, v[1], v[0]))
            print("-" * 40 + '\n')
        print(self.optimal_params)

    def print_feature_importance(self):
        """Print feature importances (split counts and their normalized share)"""
        importances = self.gbm.feature_importance(importance_type='split')
        feature_names = self.gbm.feature_name()
        total = 0.
        for value in importances:
            total += value
        name_impo = sorted(list(zip(feature_names, importances)), key=lambda x: x[1], reverse=True)
        for name, impo in name_impo:
            print('{} : {} : {}'.format(name, impo, impo / total))

grids_l = [
    # {'max_depth': list(range(3, 8, 1)), 'num_leaves': list(range(5, 100, 5))},
    # {'max_bin': list(range(5, 256, 10)), 'min_data_in_leaf': list(range(50, 1001, 50))},
    {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0], 'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0], 'bagging_freq': list(range(0, 81, 10))},
    {'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], 'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]},
    {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
    {'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1.0]}
]

if __name__ == "__main__":
    t = int(sys.argv[1])
    # t = 1
    # categorical_feature='name:u_fs,up_x' raises an error here because no feature names were supplied
    train_data = lgb.Dataset("data/train.csv", categorical_feature=[18, 31], free_raw_data=False)
    valid_data = lgb.Dataset("data/valid.csv", categorical_feature=[18, 31], free_raw_data=False)
    if t == 0:
        cv_res = lgb.cv(params, train_data, num_boost_round=1000, nfold=5, stratified=False, early_stopping_rounds=50)
        print("iteration num: {}".format(len(cv_res['ndcg@10-mean'])))
        print("ndcg@10: {} ndcg@40: {}".format(max(cv_res['ndcg@10-mean']), max(cv_res['ndcg@40-mean'])))
    elif t == 1:  # grid search
        gs = GridSearch()
        gs.grid_search({'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1.0]})
        gs.print_feature_importance()
    elif t == 2:
        gs = GridSearch()
        gs.all_parameters_search(grids_l)
Startup log: worth a close look, since it reflects the parameter settings.
It mainly contains:
the parameter settings, usually together with some warnings
an overall analysis of the data; for ranking, this includes the number of categories, the number of queries, the average number of rows per query, and so on
Construct bin mappers from text data time 0.33 seconds
Number of queries in train3.csv: 118403. Average number of rows per query: 33.868660.
Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.769392
Dataset::GetMultiBinFromAllFeatures: sparse rate 0.492899
init for col-wise cost 0.265492 seconds, init for row-wise cost 0.571826 seconds
Auto-choosing row-wise multi-threading, the overhead of testing was 0.318483 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Using Sparse Multi-Val Bin
Total Bins 5951
Number of data points in the train set: 4010151, number of used features: 31
Commonly used gbm attributes
gbm = lgb.train(params, train_data, num_boost_round=400, valid_sets=[valid_data])
gbm.best_score
>>> defaultdict(<class 'collections.OrderedDict'>, {'valid_0': OrderedDict([('ndcg@10', 0.49198254166476096), ('ndcg@30', 0.5681340145051615)])})
predict
At prediction time, whether or not the data contains the query column does not affect the final scores.
>>> ypred4 = gbm.predict('test4.csv', data_has_header=True)
[LightGBM] [Warning] Feature (sid) is missed in data file. If it is weight/query/group/ignore_column, you can ignore this warning.
>>> ypred4[:10]
array([ 0.54267792,  0.39272917,  0.31842769,  0.10324354, -0.05312303,
        0.10855625,  0.0766676 ,  0.1336972 ,  1.57561062,  0.14458557])
Deployment
Model file
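A hedged sketch of saving the trained booster to a text model file and loading it back for scoring (the file names are hypothetical):

# Sketch: persist the trained ranker, then reload it for prediction.
gbm.save_model('lgb_rank_model.txt')
booster = lgb.Booster(model_file='lgb_rank_model.txt')
scores = booster.predict('data/test.csv', data_has_header=True)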
