Tmall Repeat-Buyer Prediction Competition — Model Training, Validation, and Evaluation


Tianchi competition page:
Theoretical background
Classification is a supervised learning task: given a large amount of labeled data, a model is trained to predict the labels of unseen samples. Problems are either binary or multi-class.

Logistic regression
Although it has "regression" in its name, logistic regression is a classification algorithm: the output of a linear function is mapped through the Sigmoid function to estimate a probability, which is then thresholded into a class. The Sigmoid function $f(x) = \frac{1}{1+e^{-x}}$ is a normalizing function that squashes any continuous value into the range (0, 1), turning a continuous score into a discrete label.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = LogisticRegression(random_state=2020, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
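To make the Sigmoid mapping concrete, here is a minimal NumPy sketch (toy scores, not competition data) showing how raw linear-function outputs become probabilities and then class labels:

import numpy as np

def sigmoid(x):
    # maps any real score into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # raw linear-function outputs
probs = sigmoid(scores)                          # ~[0.018, 0.269, 0.5, 0.731, 0.982]
labels = (probs >= 0.5).astype(int)              # threshold at 0.5 -> [0, 0, 1, 1, 1]
print(probs, labels)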
K-nearest-neighbor classification
1. Compute the distance (e.g. Euclidean distance) between the query point and every point in the sample data.
2. Take the k training samples most similar to the query point and collect their class labels.
3. Count how often each class appears among those k points.
4. Return the most frequent class among the k points as the prediction for the query point. (A from-scratch version of these steps is sketched after the code.)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
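A from-scratch sketch of the four steps on toy data (the helper knn_predict is illustrative, not sklearn's implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # 1. Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2. labels of the k nearest training samples
    nearest = y_train[np.argsort(dists)[:k]]
    # 3./4. most frequent class among the k neighbours
    return Counter(nearest).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([4, 5])))  # -> 1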
Gaussian naive Bayes classification
Naive Bayes predicts with Bayes' theorem: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$. The Gaussian variant assumes each feature follows a normal distribution within each class.

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# features need to be standardized first
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
clf = GaussianNB().fit(X_train, y_train)
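A quick numeric sanity check of Bayes' theorem with made-up probabilities:

# made-up probabilities: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.05
p_b_given_a, p_a, p_b = 0.9, 0.01, 0.05
p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem -> 0.18
print(p_a_given_b)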
Ensemble learning classification models
Bagging: draw m bootstrap samples from the training set, train a learner on each, and combine the learners with an aggregation strategy.
Boosting: train on a weighted training set, update the sample weights based on each learner's error rate, and retrain. Random forest is a bagging method; LightGBM is a boosting method.
Extremely randomized trees (Extra Trees, ET)
Both random forest and Extra Trees are built from multiple decision trees. Random forest applies the bagging scheme, whereas Extra Trees computes every tree on the full training set. Random forest searches for the best split attribute within a random feature subset, whereas Extra Trees chooses its split values completely at random. (A side-by-side sketch follows below.)
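As a side-by-side sketch of the two ensembles, using sklearn's bundled iris data as a stand-in for the competition features (variable names are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# random forest: bootstrap samples + best split inside a random feature subset
rf = RandomForestClassifier(n_estimators=100, random_state=2020)
# extra trees: whole training set per tree + completely random split points
et = ExtraTreesClassifier(n_estimators=100, random_state=2020)
print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(et, X, y, cv=5).mean())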
Model validation metrics
All of the following live in sklearn.metrics:

Metric                 Description                Function
Accuracy               accuracy                   accuracy_score
Precision              precision                  precision_score
Recall                 recall                     recall_score
F1                     F1 score                   f1_score
Classification Report  per-class report           classification_report
Confusion Matrix       confusion matrix           confusion_matrix
ROC                    ROC curve                  roc_curve
AUC                    area under the ROC curve   auc
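Each of these is a function in sklearn.metrics; a minimal sketch on hand-made labels (the arrays here are illustrative, not competition data):

from sklearn import metrics

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # made-up predictions
print(metrics.accuracy_score(y_true, y_pred))        # fraction of correct predictions
print(metrics.precision_score(y_true, y_pred))       # precision
print(metrics.recall_score(y_true, y_pred))          # recall
print(metrics.f1_score(y_true, y_pred))              # F1
print(metrics.classification_report(y_true, y_pred)) # per-class precision/recall/F1
print(metrics.confusion_matrix(y_true, y_pred))      # 2x2 count table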
Precision and recall
Imagine a slightly unreliable banknote validator: it is supposed to block fake notes and store genuine ones, but it sometimes gets this wrong.
Precision: the proportion of genuine notes among everything stored = stored genuine notes / (stored genuine notes + stored fake notes). Recall: the proportion of all genuine notes that got stored = stored genuine notes / (stored genuine notes + genuine notes wrongly blocked).
F1 score
The weighted harmonic mean of precision and recall: $F_a = \frac{(a^2+1)\,P\,R}{a^2 P + R}$. When a = 1 this reduces to the familiar F1 value, $F_1 = \frac{2PR}{P+R}$.
Classification report
Reports all three metrics (precision, recall, and F1) at once.
Confusion matrix
              Predicted = 1         Predicted = 0
Actual = 1    TP (True Positive)    FN (False Negative)
Actual = 0    FP (False Positive)   TN (True Negative)
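To tie the table to the banknote example above, here is a quick worked computation with made-up counts (90 genuine notes stored, 10 fakes stored, 30 genuine notes wrongly blocked):

# made-up counts for the banknote validator ("positive" = genuine note stored)
tp = 90   # genuine notes stored correctly
fp = 10   # fake notes stored by mistake
fn = 30   # genuine notes blocked by mistake

precision = tp / (tp + fp)   # 90 / 100 = 0.90
recall = tp / (tp + fn)      # 90 / 120 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # ~0.818
print(precision, recall, f1)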
ROC
The x-axis is the FPR (False Positive Rate) and the y-axis is the TPR (True Positive Rate). The ideal target is TPR = 1 and FPR = 0, i.e. the point (0, 1): the closer the ROC curve hugs (0, 1) and the further it departs from the 45-degree diagonal, the better the classifier.
AUC
The area under the ROC curve.
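roc_curve and auc work on predicted scores rather than hard labels; a minimal sketch with made-up scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # made-up ground truth
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # made-up predicted probabilities
# fpr/tpr trace the ROC curve as the decision threshold sweeps over the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # 1.0 is perfect, 0.5 is no better than random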
1. Setting up cross-validation

# 1. Simple cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5, scoring='f1_macro')
print(scores)
# 2. Split the data with ShuffleSplit
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=2020)
scores = cross_val_score(clf, train, target, cv=cv)
# 3. Split the data with KFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
# 4. Split the data with StratifiedKFold, which preserves label proportions in each fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for k, (train_index, test_index) in enumerate(skf.split(train, target)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
2. Model tuning

from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_jobs=-1)
parameters = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5]
}
clf = GridSearchCV(clf, param_grid=parameters, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)   # the grid search must be fitted before reading results
print(clf.cv_results_)
print(clf.best_params_)
3. Different classification models

# Logistic regression (LR) model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)
# KNN model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)
# Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)
# Bagging model
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# Random forest model
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# Extra Trees (ET) model
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
# AdaBoost model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=100)
# GBDT model
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# LightGBM model
import lightgbm as lgb
train_matrix = lgb.Dataset(X_train, label=y_train)
test_matrix = lgb.Dataset(X_test, label=y_test)
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'min_child_weight': 1.5,
    'num_leaves': 2**5,
    'lambda_l2': 1,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'learning_rate': 0.03,
    'seed': 2020,
    'num_class': 2,
    'verbose': -1,
}
num_round = 10000
early_stopping_rounds = 100
model = lgb.train(params,
                  train_matrix,
                  num_round,
                  valid_sets=test_matrix,
                  early_stopping_rounds=early_stopping_rounds)
# X_valid is a held-out validation set prepared earlier
pre = model.predict(X_valid, num_iteration=model.best_iteration)
# XGBoost model
import xgboost as xgb
train_matrix = xgb.DMatrix(X_train, label=y_train, missing=-1)
test_matrix = xgb.DMatrix(X_test, label=y_test, missing=-1)
# X_valid / y_valid form a held-out validation set prepared earlier
z = xgb.DMatrix(X_valid, label=y_valid, missing=-1)
params = {'booster': 'gbtree',
          'objective': 'multi:softprob',
          'eval_metric': 'mlogloss',
          'gamma': 1,
          'min_child_weight': 1.5,
          'max_depth': 5,
          'lambda': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.03,
          'tree_method': 'exact',
          'seed': 2020,
          'num_class': 2
          }
num_round = 10000
early_stopping_rounds = 100
watchlist = [(train_matrix, 'train'),
             (test_matrix, 'eval')]
model = xgb.train(params,
                  train_matrix,
                  num_boost_round=num_round,
                  evals=watchlist,
                  early_stopping_rounds=early_stopping_rounds)
pre = model.predict(z, ntree_limit=model.best_ntree_limit)
4. Model fusion
import numpy as np

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf):
    splits = kf.get_n_splits()   # number of folds
    valid_y_pre = np.zeros((train_y.shape[0], 1))           # out-of-fold predictions on the training set
    test = np.zeros((test_x.shape[0], 1))                   # averaged predictions on the test set
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))   # per-fold test predictions
    cv_scores = []
    for i, (train_idx, test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        # ... fit clf on tr_x / tr_y here and fill in the prediction arrays (see the sketch below)
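The function body is truncated above; as a hedged completion, here is a minimal sketch of how the loop usually continues, assuming clf follows the sklearn fit/predict interface (the original code also dispatches on clf_name to handle lightgbm/xgboost learners):

import numpy as np
from sklearn.metrics import mean_squared_error

def stacking_reg_sklearn(clf, train_x, train_y, test_x, kf):
    # hedged sketch, not the original function: clf is assumed to be a
    # sklearn-style regressor; arrays are assumed to be NumPy arrays
    splits = kf.get_n_splits()
    valid_y_pre = np.zeros((train_x.shape[0], 1))           # out-of-fold predictions
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))   # per-fold test predictions
    cv_scores = []
    for i, (train_idx, test_idx) in enumerate(kf.split(train_x)):
        clf.fit(train_x[train_idx], train_y[train_idx])
        pred = clf.predict(train_x[test_idx]).reshape(-1, 1)
        valid_y_pre[test_idx] = pred                        # fill the out-of-fold slots
        test_y_pre_k[i] = clf.predict(test_x).reshape(-1, 1)
        cv_scores.append(mean_squared_error(train_y[test_idx], pred.ravel()))
    test = test_y_pre_k.mean(axis=0)                        # average the per-fold test predictions
    return valid_y_pre, test, cv_scores

valid_y_pre then serves as a new feature for the second-stage model, and the averaged test predictions line up as its test-time counterpart.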
