
Machine Learning: Feature Selection with xgboost

Updated: 2023-05-15 07:23:13


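This post covers feature selection with xgboost. Many people know xgboost as a very popular choice at the later model-selection stage, but it is also quite useful earlier on, for screening features.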

# coding=utf-8
import os, random, pickle

import pandas as pd
import xgboost as xgb

# Directory that receives one feature-score CSV per run
os.makedirs('featurescore', exist_ok=True)

# Training features, joined with the labels on 'Idx'
train = pd.read_csv('../../data/train/train_x_rank.csv')
train_target = pd.read_csv('../../data/train/train_master.csv', encoding='gb18030')[['Idx', 'target']]
train = pd.merge(train, train_target, on='Idx')
train_y = train.target
train_x = train.drop(['Idx', 'target'], axis=1)
dtrain = xgb.DMatrix(train_x, label=train_y)

# Test features
test = pd.read_csv('../../data/test/test_x_rank.csv')
test_Idx = test.Idx
test = test.drop('Idx', axis=1)
dtest = xgb.DMatrix(test)

# Save the combined train/test feature table
train_test = pd.concat([train, test])
train_test.to_csv('rank_feature.csv', index=None)
print(train_test.shape)

# Disabled cross-validation run (newer xgboost releases use verbose_eval in place of show_progress)
"""
params = {
    'booster': 'gbtree',
    'objective': 'rank:pairwise',
    'scale_pos_weight': float(len(train_y) - sum(train_y)) / float(sum(train_y)),
    'eval_metric': 'auc',
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 500,
    'subsample': 0.6,
    'colsample_bytree': 0.3,
    'min_child_weight': 0.2,
    'eta': 0.04,
    'seed': 1024,
    'nthread': 8
}
xgb.cv(params, dtrain, num_boost_round=1100, nfold=10, metrics='auc', verbose_eval=3, seed=1024)  #733
"""

def pipeline(iteration, random_seed, gamma, max_depth, lambd, subsample, colsample_bytree, min_child_weight):
    # One training run with the given parameter combination
    params = {
        'booster': 'gbtree',
        'objective': 'rank:pairwise',
        # ratio of negative to positive samples
        'scale_pos_weight': float(len(train_y) - sum(train_y)) / float(sum(train_y)),
        'eval_metric': 'auc',
        'gamma': gamma,
        'max_depth': max_depth,
        'lambda': lambd,
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
        'min_child_weight': min_child_weight,
        'eta': 0.2,
        'seed': random_seed,
        'nthread': 8
    }
    watchlist = [(dtrain, 'train')]
    model = xgb.train(params, dtrain, num_boost_round=700, evals=watchlist)
    #model.save_model('./model/xgb{0}.model'.format(iteration))

    #predict test set
    #test_y = model.predict(dtest)
    #test_result = pd.DataFrame(test_Idx,columns=["Idx"])
    #test_result["score"] = test_y
    #test_result.to_csv("./preds/xgb{0}.csv".format(iteration),index=None,encoding='utf-8')

    #save feature score
    feature_score = model.get_fscore()
    feature_score = sorted(feature_score.items(), key=lambda x: x[1], reverse=True)
    fs = []
    for (key, value) in feature_score:
        fs.append("{0},{1}\n".format(key, value))
    with open('./featurescore/feature_score_{0}.csv'.format(iteration), 'w') as f:
        f.writelines("feature,score\n")
        f.writelines(fs)

if __name__ == "__main__":
    # Candidate values for each parameter; list(...) so the ranges can be shuffled in place on Python 3
    random_seed = list(range(10000, 20000, 100))
    gamma = [i / 1000.0 for i in range(0, 300, 3)]
    max_depth = [5, 6, 7]
    lambd = list(range(400, 600, 2))
    subsample = [i / 1000.0 for i in range(500, 700, 2)]
    colsample_bytree = [i / 1000.0 for i in range(550, 750, 4)]
    min_child_weight = [i / 1000.0 for i in range(250, 550, 3)]

    random.shuffle(random_seed)
    random.shuffle(gamma)
    random.shuffle(max_depth)
    random.shuffle(lambd)
    random.shuffle(subsample)
    random.shuffle(colsample_bytree)
    random.shuffle(min_child_weight)

    # Record the shuffled parameter lists; pickle needs a binary file handle
    with open('params.pkl', 'wb') as f:
        pickle.dump((random_seed, gamma, max_depth, lambd, subsample, colsample_bytree, min_child_weight), f)

    # 36 runs, each with a different parameter combination
    for i in range(36):
        pipeline(i, random_seed[i], gamma[i], max_depth[i % 3], lambd[i], subsample[i], colsample_bytree[i], min_child_weight[i])
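The shuffle-and-index scheme in the script above amounts to sampling each parameter without replacement from its grid. Here is a minimal sketch of that idea using only the standard library. The function name `sample_param_sets` and the `seed` argument are my own illustrative additions, not part of the original script; the grids are copied from it. Note that `n_runs` cannot exceed the smallest grid (50 values for `colsample_bytree`).

```python
import random


def sample_param_sets(n_runs, seed=0):
    """Draw n_runs hyperparameter sets from the same grids as the script above."""
    rng = random.Random(seed)  # private RNG, so the draw is reproducible
    grids = {
        'seed': list(range(10000, 20000, 100)),
        'gamma': [i / 1000.0 for i in range(0, 300, 3)],
        'lambda': list(range(400, 600, 2)),
        'subsample': [i / 1000.0 for i in range(500, 700, 2)],
        'colsample_bytree': [i / 1000.0 for i in range(550, 750, 4)],
        'min_child_weight': [i / 1000.0 for i in range(250, 550, 3)],
    }
    # Sampling without replacement mirrors "shuffle, then take the first n_runs".
    columns = {name: rng.sample(values, n_runs) for name, values in grids.items()}
    # max_depth has only three candidates, so it is cycled as in the original loop.
    depths = [5, 6, 7]
    rng.shuffle(depths)
    return [
        dict({name: col[i] for name, col in columns.items()},
             max_depth=depths[i % len(depths)])
        for i in range(n_runs)
    ]


param_sets = sample_param_sets(36, seed=2023)
print(len(param_sets))  # 36
```

Each dictionary can be merged into the fixed part of `params` before training. Fixing the seed of the sampler makes the 36 combinations repeatable without needing the pickle file.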

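Because xgboost's results depend heavily on its parameters, the parameters are shuffled so that each run uses a different combination. The features and scores produced by these differently parameterized xgboost models can then be averaged, and the high-scoring features kept.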

import pandas as pd
import os

files = os.listdir('featurescore')
fs = {}
for f in files:
    t = pd.read_csv('featurescore/' + f)
    t.index = t.feature
    t = t.drop(['feature'], axis=1)
    d = t.to_dict()['score']
    # Accumulate each feature's score across all runs
    for key in d:
        if key in fs:
            fs[key] += d[key]
        else:
            fs[key] = d[key]

# Sort features by total score, highest first
fs = sorted(fs.items(), key=lambda x: x[1], reverse=True)
t = []
for (key, value) in fs:
    t.append("{0},{1}\n".format(key, value))
with open('rank_feature_score.csv', 'w') as f:
    f.writelines("feature,score\n")
    f.writelines(t)

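This gives each feature's total score; dividing each by 36 gives the average. Finally, take the top n features by average score. That is my understanding, at least. I also find this method too time-consuming.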
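The averaging and top-n step described above could be sketched as follows, using only the standard library. The function names `average_feature_scores` and `top_n` are hypothetical, and the sketch assumes every file has the `feature,score` header written by the training script. As far as I know, `get_fscore()` leaves out features that were never used in a split, so a feature can be missing from some files; dividing by the fixed number of runs treats those runs as a score of 0.

```python
import csv
import os
from collections import defaultdict


def average_feature_scores(score_dir, n_runs):
    """Average the 'score' column of every CSV in score_dir over n_runs runs."""
    totals = defaultdict(float)
    for name in sorted(os.listdir(score_dir)):
        if not name.endswith('.csv'):
            continue
        with open(os.path.join(score_dir, name), newline='') as f:
            for row in csv.DictReader(f):
                totals[row['feature']] += float(row['score'])
    # A feature absent from a file contributes 0 for that run, so the divisor
    # is n_runs rather than the number of files the feature appears in.
    return {feature: total / n_runs for feature, total in totals.items()}


def top_n(averages, n):
    """Return the n (feature, average) pairs with the highest average score."""
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    import tempfile

    # Tiny made-up example: two runs, 'f2' is missing from the second one.
    with tempfile.TemporaryDirectory() as d:
        for i, rows in enumerate([[('f1', 10), ('f2', 4)], [('f1', 6)]]):
            path = os.path.join(d, 'feature_score_{0}.csv'.format(i))
            with open(path, 'w', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(['feature', 'score'])
                writer.writerows(rows)
        print(top_n(average_feature_scores(d, n_runs=2), 1))  # [('f1', 8.0)]
```

For the runs in this post the call would be `top_n(average_feature_scores('featurescore', 36), n)`, with `n` chosen to suit the downstream model.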
