
xgboost Binary Classification Demo (Gender Prediction)

Updated: 2023-05-15 08:12:12


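1. xgboost parameters

The parameters below largely match those of XGBClassifier, the scikit-learn wrapper in xgboost; a few names differ (for example, the native eta is called learning_rate in the wrapper).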

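1.1 General parameters

1.1.1 booster

1. gbtree: use tree models as the base learners (default)

2. gblinear: use linear models as the base learners

1.1.2 silent

1. 0: print running messages (default)

2. 1: silent mode, no intermediate output

Recent xgboost releases replace silent with verbosity.

1.1.3 nthread

1. -1: use all CPU cores for parallel computation (default)

2. 1: use a single CPU core

1.1.4 scale_pos_weight

Weight of the positive class, default 1. In binary classification with imbalanced classes, raising the positive-class weight usually improves the model; a common choice is the ratio of negative to positive samples (for example, with positive:negative = 1:10, set scale_pos_weight=10).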
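The ratio rule above can be computed directly from the training labels. The following is a minimal sketch, assuming binary labels coded as 0 and 1; the helper name suggest_scale_pos_weight is hypothetical and not part of xgboost.

```python
from collections import Counter


def suggest_scale_pos_weight(labels):
    """Return (#negative / #positive) for labels coded as 0 and 1.

    Hypothetical helper for illustration; xgboost does not ship it.
    """
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples in labels")
    return counts[0] / counts[1]


# 10 positives and 100 negatives, i.e. positive:negative = 1:10
labels = [1] * 10 + [0] * 100
weight = suggest_scale_pos_weight(labels)
print(weight)  # 10.0
```

The result can then be passed as scale_pos_weight in the params dictionary or to XGBClassifier.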

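1.2 Model parameters

1.2.1 n_estimators

Total number of boosting rounds (that is, the number of decision trees).

1.2.2 early_stopping_rounds

Stop training early once the evaluation metric on the validation set has failed to improve for n consecutive rounds, which guards against overfitting (10 is a common choice).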
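The stopping rule can be illustrated without xgboost. The sketch below applies the same idea to a plain list of validation scores; the function name early_stopping_round and the example scores are hypothetical, and this mirrors the idea rather than xgboost's internal implementation.

```python
def early_stopping_round(scores, patience, maximize=True):
    """Return (stop_round, best_round), both 0-based.

    Training stops at the first round where the best score seen so far
    is already `patience` rounds old. If that never happens, the last
    round is returned as the stop round.
    """
    best_score = None
    best_round = -1
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return i, best_round
    return len(scores) - 1, best_round


# Made-up validation AUC per round; the best value is reached at round 2.
val_auc = [0.70, 0.74, 0.76, 0.75, 0.76, 0.755, 0.75]
print(early_stopping_round(val_auc, patience=3))  # (5, 2)
```

With patience=3, rounds 3 to 5 bring no improvement over round 2, so training stops at round 5 and the model from round 2 is the best one.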

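1.2.3 max_depth

Maximum tree depth, default 6, typical values 3-10. Larger values overfit more easily; smaller values underfit more easily.

1.2.4 min_child_weight

Minimum sum of instance weights required in a child node, default 1. Larger values underfit more easily; smaller values overfit more easily. A larger value keeps the model from learning patterns that are specific to a few local samples.

1.2.5 subsample

Fraction of the training set sampled to grow each tree, default 1. A value of 0.8 is common and helps prevent overfitting.

1.2.6 colsample_bytree

Fraction of the features sampled to grow each tree, default 1. A value of 0.8 is common and helps prevent overfitting.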

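1.3 Learning task parameters

1.3.1 learning_rate

Learning rate (eta in the native API): the step size used when updating weights at each iteration. Default 0.3, typically 0.01-0.2. Smaller values make training slower.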

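1.3.2 objective

Objective function.

1. Regression

reg:linear (default; renamed reg:squarederror in recent releases)

reg:logistic

2. Binary classification

binary:logistic returns probabilities

binary:logitraw returns the raw score before the logistic transformation

3. Multi-class classification

multi:softmax num_class=n returns class labels

multi:softprob num_class=n returns probabilities

4. Ranking

rank:pairwise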
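The two binary objectives differ only in whether the logistic (sigmoid) function has been applied to the output. A minimal sketch of that relationship, using made-up raw scores rather than real model output:

```python
import math


def sigmoid(margin):
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical raw scores, as binary:logitraw would return them
raw_scores = [-2.0, 0.0, 2.0]

# What binary:logistic would report for the same scores
probs = [sigmoid(m) for m in raw_scores]

# Class labels come from thresholding the probability, here at 0.5
labels = [int(p >= 0.5) for p in probs]
print(labels)  # [0, 1, 1]
```

Neither objective returns class labels directly; the threshold step is always done by the caller, as in the training script's y_pred = (pred >= 0.5) * 1.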

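1.3.3 eval_metric

1. Regression

rmse: root mean squared error (default)

mae: mean absolute error

2. Classification

auc: area under the ROC curve

error: error rate (binary, default)

merror: error rate (multi-class)

logloss: negative log-likelihood (binary)

mlogloss: negative log-likelihood (multi-class)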
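To make the two binary metrics concrete, the sketch below computes error and logloss by hand on four made-up predictions. The function names are hypothetical; the 0.5 threshold follows xgboost's definition of error, which treats predictions above 0.5 as positive.

```python
import math


def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction is wrong."""
    wrong = sum(int(p > threshold) != y for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # made-up predicted probabilities

print(error_rate(y_true, y_prob))  # 0.5
print(round(log_loss(y_true, y_prob), 4))
```

The last two predictions fall on the wrong side of the threshold, so the error rate is 2/4. logloss additionally penalizes how confident each wrong prediction is.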

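1.3.4 gamma

Penalty coefficient: the minimum loss reduction required to split a node. Default 0, typically 0.1-0.2.

1.3.5 alpha

L1 regularization coefficient, default 0.

1.3.6 lambda

L2 regularization coefficient, default 1.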
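The roles of gamma, alpha and lambda are easiest to see in the regularized objective that gradient-boosted trees minimize. The notation below is the standard one and is not taken from this article: T is the number of leaves in a tree f, w_j the weight of leaf j, and G, H the sums of first and second derivatives of the loss over the samples in the left (L) or right (R) child.

```latex
\mathrm{Obj} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert

% Gain of a candidate split (shown for \alpha = 0):
\mathrm{Gain} = \frac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is kept only when its gain is positive, which is why gamma acts as the minimum loss reduction required to split a node, while lambda and alpha shrink the leaf weights.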

1.4 Main functions

1. Load data: load_digits()

2. Split data: train_test_split()

3. Build the model: XGBClassifier()

4. Train the model: fit()

5. Predict: predict()

6. Measure performance: accuracy_score()

7. Feature importance: plot_importance()

2. Training the binary classification model

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import plot_importance
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
# from matplotlib import pyplot as plt

# 1. load data
# data = np.loadtxt('', delimiter=',')
data = pd.read_csv('', sep='\t')  # path to the tab-separated data file
data.drop(["uid", "random", "7d_retention", "life_cycle"], axis=1, inplace=True)
data.fillna(0, inplace=True)
data_num, feature_num = data.shape
print("data_num: ", data_num)
print("feature_num: ", feature_num)

# 2. shuffle data
rng = np.random.RandomState(830041)
index = list(range(data_num))
rng.shuffle(index)
data = data.values[index]  # convert to a NumPy array so positional indexing works

# 3. split data (the last column is the label)
X, Y = data[:, 0:feature_num-1], data[:, feature_num-1]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)

# 4. train
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 10,
    'lambda': 10,
    'subsample': 0.85,
    'colsample_bytree': 0.85,
    'min_child_weight': 2,
    # 'eta': 0.025,
    'eta': 0.1,
    'seed': 0,
    'nthread': 8,
    'silent': 1  # on recent xgboost releases use 'verbosity': 0 instead
}
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 50
bst = xgb.train(params, xg_train, num_round, watchlist)
bst.save_model('del')

# 5. evaluate
pred = bst.predict(xg_test)  # predicted probabilities of the positive class
print('predicting, classification error=%f'
      % (sum(int(pred[i] > 0.5) != y_test[i] for i in range(len(y_test))) / float(len(y_test))))
y_pred = (pred >= 0.5) * 1
print('AUC: %.4f' % metrics.roc_auc_score(y_test, pred))
print('ACC: %.4f' % metrics.accuracy_score(y_test, y_pred))
print('Recall: %.4f' % metrics.recall_score(y_test, y_pred))
print('F1-score: %.4f' % metrics.f1_score(y_test, y_pred))
print('Precision: %.4f' % metrics.precision_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
# plot_importance(bst)
# plt.show()

# print feature importances
importance = bst.get_score(importance_type='gain')
sorted_importance = sorted(importance.items(), key=lambda x: x[1], reverse=True)
print('feature importances[gain]:')
print(sorted_importance)

3. Computing OE (observed/expected) and AUC

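import xgboost as xgb
from sklearn.metrics import roc_auc_score

sample_rate = 0.6
pos = 100    # number of positive samples
count = 200  # total number of samples
dtest = xgb.DMatrix('')
labelList = dtest.get_label()  # assumption: the test file carries the labels
bst = xgb.Booster(model_file="del")
test_preds = bst.predict(dtest)
preds = 0.0
predList = []
for pred in test_preds:
    # rescale each prediction to correct for sampling at sample_rate
    pred = pred*sample_rate/(1-pred*(1-sample_rate))
    preds += pred
    predList.append(pred)
oe = pos/preds  # observed positives over the sum of predicted probabilities
auc = roc_auc_score(labelList, predList)
with open("", 'w') as f:
    f.write("count=%d,\tauc=%f,\toe=%f\n" % (count, auc, oe))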
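The rescaling inside the loop has the same form as the usual correction for a model trained on data in which negative samples were kept with probability sample_rate: p * w / (1 - p * (1 - w)) is algebraically equal to p / (p + (1 - p) / w). That reading of sample_rate is an assumption, since the article does not say how the data was sampled. A minimal sketch with hypothetical function names and made-up predictions:

```python
def recalibrate(p, sample_rate):
    """Rescale probability p with the formula from the loop above.

    Equivalent to p / (p + (1 - p) / sample_rate).
    """
    return p * sample_rate / (1.0 - p * (1.0 - sample_rate))


def observed_over_expected(num_pos, probs):
    """O/E: observed positives divided by the sum of predicted probabilities."""
    return num_pos / sum(probs)


raw = [0.5, 0.8, 0.2]  # made-up model outputs
calibrated = [recalibrate(p, 0.6) for p in raw]
print(calibrated[0])  # 0.375

# A sample rate of 1.0 (no downsampling) leaves predictions unchanged
print(recalibrate(0.3, 1.0))  # 0.3
```

An O/E close to 1 means the predicted probabilities sum to roughly the number of observed positives, that is, the model is well calibrated on average.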
