Tmall Repeat-Buyer Prediction Competition — Model Training, Validation, and Evaluation


Tianchi competition page:
Theoretical background
Classification is a supervised learning task: given a large amount of labeled data, a model is trained to predict the labels of unseen samples. Problems are either binary or multi-class.

Logistic regression
Although it has "regression" in its name, logistic regression is a classification algorithm: the output of a linear function is mapped through the Sigmoid function to estimate a probability, which is then thresholded into a class. The Sigmoid function $f(x) = \frac{1}{1+e^{-x}}$ is a normalizing function that squashes any continuous value into the range (0, 1), turning a continuous score into a discrete label.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = LogisticRegression(random_state=2020, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
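To make the Sigmoid mapping concrete, here is a minimal NumPy sketch (toy scores, not competition data) showing how raw linear-function outputs become probabilities and then class labels:

import numpy as np

def sigmoid(x):
    # maps any real score into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # raw linear-function outputs
probs = sigmoid(scores)                          # ~[0.018, 0.269, 0.5, 0.731, 0.982]
labels = (probs >= 0.5).astype(int)              # threshold at 0.5 -> [0, 0, 1, 1, 1]
print(probs, labels)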
K-nearest-neighbor classification
1. Compute the distance (e.g. Euclidean distance) between the query point and every point in the sample data.
2. Take the k training samples most similar to the query point and collect their class labels.
3. Count how often each class appears among those k points.
4. Return the most frequent class among the k points as the prediction for the query point. (A from-scratch version of these steps is sketched after the code.)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
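A from-scratch sketch of the four steps on toy data (the helper knn_predict is illustrative, not sklearn's implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # 1. Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2. labels of the k nearest training samples
    nearest = y_train[np.argsort(dists)[:k]]
    # 3./4. most frequent class among the k neighbours
    return Counter(nearest).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([4, 5])))  # -> 1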
Gaussian naive Bayes classification
Naive Bayes predicts with Bayes' theorem: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$. The Gaussian variant assumes each feature follows a normal distribution within each class.

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = GaussianNB().fit(X_train, y_train)
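A quick numeric sanity check of Bayes' theorem with made-up probabilities:

# made-up probabilities: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.05
p_b_given_a, p_a, p_b = 0.9, 0.01, 0.05
p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem -> 0.18
print(p_a_given_b)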
Ensemble learning classification models
Bagging: draw m bootstrap samples from the training set, train a learner on each, and combine the learners with an aggregation strategy.
Boosting: train on a weighted training set, update the sample weights based on each learner's error rate, and retrain. Random forest is a bagging method; LightGBM is a boosting method.
Extremely randomized trees (Extra Trees, ET)
Both random forest and Extra Trees are built from multiple decision trees. Random forest applies the bagging scheme, whereas Extra Trees computes every tree on the full training set. Random forest searches for the best split attribute within a random feature subset, whereas Extra Trees chooses its split values completely at random. (A side-by-side sketch follows below.)
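As a side-by-side sketch of the two ensembles, using sklearn's bundled iris data as a stand-in for the competition features (variable names are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# random forest: bootstrap samples + best split inside a random feature subset
rf = RandomForestClassifier(n_estimators=100, random_state=2020)
# extra trees: whole training set per tree + completely random split points
et = ExtraTreesClassifier(n_estimators=100, random_state=2020)
print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(et, X, y, cv=5).mean())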
Model validation metrics
All of the following live in sklearn.metrics:

Metric                 Description                Function
Accuracy               accuracy                   accuracy_score
Precision              precision                  precision_score
Recall                 recall                     recall_score
F1                     F1 score                   f1_score
Classification Report  per-class report           classification_report
Confusion Matrix       confusion matrix           confusion_matrix
ROC                    ROC curve                  roc_curve
AUC                    area under the ROC curve   auc
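Each of these is a function in sklearn.metrics; a minimal sketch on hand-made labels (the arrays here are illustrative, not competition data):

from sklearn import metrics

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # made-up predictions
print(metrics.accuracy_score(y_true, y_pred))        # fraction of correct predictions
print(metrics.precision_score(y_true, y_pred))       # precision
print(metrics.recall_score(y_true, y_pred))          # recall
print(metrics.f1_score(y_true, y_pred))              # F1
print(metrics.classification_report(y_true, y_pred)) # per-class precision/recall/F1
print(metrics.confusion_matrix(y_true, y_pred))      # 2x2 count table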
Precision and recall
Imagine a slightly unreliable banknote validator: it is supposed to block fake notes and store genuine ones, but it sometimes gets this wrong.
Precision: the proportion of genuine notes among everything stored = stored genuine notes / (stored genuine notes + stored fake notes). Recall: the proportion of all genuine notes that got stored = stored genuine notes / (stored genuine notes + genuine notes wrongly blocked).
F1 score
The weighted harmonic mean of precision and recall: $F_a = \frac{(a^2+1)\,P\,R}{a^2 P + R}$. When a = 1 this reduces to the familiar F1 value, $F_1 = \frac{2PR}{P+R}$.
Classification report
Reports all three metrics (precision, recall, and F1) at once.
Confusion matrix
              Predicted = 1         Predicted = 0
Actual = 1    TP (True Positive)    FN (False Negative)
Actual = 0    FP (False Positive)   TN (True Negative)
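To tie the table to the banknote example above, here is a quick worked computation with made-up counts (90 genuine notes stored, 10 fakes stored, 30 genuine notes wrongly blocked):

# made-up counts for the banknote validator ("positive" = genuine note stored)
tp = 90   # genuine notes stored correctly
fp = 10   # fake notes stored by mistake
fn = 30   # genuine notes blocked by mistake

precision = tp / (tp + fp)   # 90 / 100 = 0.90
recall = tp / (tp + fn)      # 90 / 120 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # ~0.818
print(precision, recall, f1)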
ROC
The x-axis is the FPR (False Positive Rate) and the y-axis is the TPR (True Positive Rate). The ideal target is TPR = 1 and FPR = 0, i.e. the point (0, 1): the closer the ROC curve hugs (0, 1) and the further it departs from the 45-degree diagonal, the better the classifier.
AUC
The area under the ROC curve.
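roc_curve and auc work on predicted scores rather than hard labels; a minimal sketch with made-up scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # made-up ground truth
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # made-up predicted probabilities
# fpr/tpr trace the ROC curve as the decision threshold sweeps over the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # 1.0 is perfect, 0.5 is no better than random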
1. Setting up cross-validation

# 1. Simple cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5, scoring='f1_macro')
print(scores)
# 2. Split the data with ShuffleSplit
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=2020)
scores = cross_val_score(clf, train, target, cv=cv)
# 3. Split the data with KFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
# 4. Split the data with StratifiedKFold, which preserves label proportions in each fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for k, (train_index, test_index) in enumerate(skf.split(train, target)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
2. Model tuning

from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_jobs=-1)
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5]
}
clf = GridSearchCV(clf, param_grid=parameters, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)   # the grid search must be fitted before reading results
print(clf.cv_results_)
print(clf.best_params_)
3. Different classification models

# Logistic regression (LR) model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)
# KNN model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)
# Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)
# Bagging model
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# Random forest model
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# Extra Trees (ET) model
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
# AdaBoost model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=100)
# GBDT model
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# LightGBM model
import lightgbm as lgb
train_matrix = lgb.Dataset(X_train, label=y_train)
test_matrix = lgb.Dataset(X_test, label=y_test)
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'min_child_weight': 1.5,
    'num_leaves': 2**5,
    'lambda_l2': 1,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'learning_rate': 0.03,
    'seed': 2020,
    'num_class': 2,
    'verbose': -1,
}
num_round = 10000
early_stopping_rounds = 100
model = lgb.train(params,
                  train_matrix,
                  num_round,
                  valid_sets=test_matrix,
                  early_stopping_rounds=early_stopping_rounds)
# X_valid is a held-out validation set prepared earlier
pre = model.predict(X_valid, num_iteration=model.best_iteration)
# XGBoost model
import xgboost as xgb
train_matrix = xgb.DMatrix(X_train, label=y_train, missing=-1)
test_matrix = xgb.DMatrix(X_test, label=y_test, missing=-1)
# X_valid / y_valid form a held-out validation set prepared earlier
z = xgb.DMatrix(X_valid, label=y_valid, missing=-1)
params = {'booster': 'gbtree',
          'objective': 'multi:softprob',
          'eval_metric': 'mlogloss',
          'gamma': 1,
          'min_child_weight': 1.5,
          'max_depth': 5,
          'lambda': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.03,
          'tree_method': 'exact',
          'seed': 2020,
          'num_class': 2
          }
num_round = 10000
early_stopping_rounds = 100
watchlist = [(train_matrix, 'train'),
             (test_matrix, 'eval')]
model = xgb.train(params,
                  train_matrix,
                  num_boost_round=num_round,
                  evals=watchlist,
                  early_stopping_rounds=early_stopping_rounds)
pre = model.predict(z, ntree_limit=model.best_ntree_limit)
4. Model fusion
import numpy as np

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf):
    splits = kf.get_n_splits()   # number of folds
    valid_y_pre = np.zeros((train_y.shape[0], 1))           # out-of-fold predictions on the training set
    test = np.zeros((test_x.shape[0], 1))                   # averaged predictions on the test set
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))   # per-fold test predictions
    cv_scores = []
    for i, (train_idx, test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        # ... fit clf on tr_x / tr_y here and fill in the prediction arrays (see the sketch below)
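The function body is truncated above; as a hedged completion, here is a minimal sketch of how the loop usually continues, assuming clf follows the sklearn fit/predict interface (the original code also dispatches on clf_name to handle lightgbm/xgboost learners):

import numpy as np
from sklearn.metrics import mean_squared_error

def stacking_reg_sklearn(clf, train_x, train_y, test_x, kf):
    # hedged sketch, not the original function: clf is assumed to be a
    # sklearn-style regressor; arrays are assumed to be NumPy arrays
    splits = kf.get_n_splits()
    valid_y_pre = np.zeros((train_x.shape[0], 1))           # out-of-fold predictions
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))   # per-fold test predictions
    cv_scores = []
    for i, (train_idx, test_idx) in enumerate(kf.split(train_x)):
        clf.fit(train_x[train_idx], train_y[train_idx])
        pred = clf.predict(train_x[test_idx]).reshape(-1, 1)
        valid_y_pre[test_idx] = pred                        # fill the out-of-fold slots
        test_y_pre_k[i] = clf.predict(test_x).reshape(-1, 1)
        cv_scores.append(mean_squared_error(train_y[test_idx], pred.ravel()))
    test = test_y_pre_k.mean(axis=0)                        # average the per-fold test predictions
    return valid_y_pre, test, cv_scores

valid_y_pre then serves as a new feature for the second-stage model, and the averaged test predictions line up as its test-time counterpart.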
