[Stacking Improvement] A Stacking Algorithm Based on Random Sampling and Accuracy Weighting

Updated: 2023-06-18 23:25:32

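Abstract

In recent years the rapid rise of artificial intelligence has shown how much potential the technology holds, and machine learning has been applied widely across many fields with good results. This article uses the house-price data from the Kaggle House Prices competition as its experimental sample. Borrowing the bootstrap sampling idea from Bagging and k-fold cross-validation, it builds a Stacking ensemble model based on pseudo-random sampling for house-price prediction. First, a GBDT is trained on the dataset to obtain the importance of each feature. The dataset is then randomly sampled several times, and the features are perturbed according to their importance, which yields multiple training subsets and validation subsets. The base models are trained on these subsets, the root mean squared error and predictions on each validation subset are computed, and weights are assigned according to the error. The predictions of the base models form the input of the second-layer meta-model, which finally predicts house prices on the test set. The experiments show that the Stacking ensemble based on random sampling and accuracy weighting achieves a lower root mean squared error than every base model and than the classic Stacking ensemble with the same structure.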

Theoretical Basis of the Stacking Algorithm

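Stacking is a layered model-ensemble framework proposed by Wolpert in 1992. A Stacking ensemble can have many layers, but it is usually designed with two. The first layer consists of several base models whose input is the original training set and whose output is each base model's predictions. The second layer contains a single meta-model, which is trained on the first-layer predictions together with the true values to produce the complete ensemble. Predicting the test set works the same way: the data first pass through all base models, their predictions form the second-layer features, and the meta-model then produces the final result. To guard against overfitting, Stacking usually combines first-layer training with k-fold cross-validation. The original post illustrates the procedure with a figure for the five-fold case: the training set is split into five parts, each base model is trained on four parts and predicts the remaining one, and the held-out predictions from the five rounds together form that model's second-layer feature.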
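For comparison, scikit-learn ships this two-layer scheme as `StackingRegressor`. The sketch below is not the article's code: the synthetic data, the choice of base models and the `Ridge` meta-model are illustrative assumptions. With `cv=5`, the meta-model is trained on five-fold out-of-fold predictions of the base models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic regression data standing in for the house-price features (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# First layer: several different base models.
base_models = [
    ("lasso", Lasso(alpha=0.01)),
    ("enet", ElasticNet(alpha=0.01, l1_ratio=0.9)),
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
]

# Second layer: a single meta-model trained on 5-fold out-of-fold predictions.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
stack.fit(X, y)

# transform() returns the first-layer predictions: one column per base model.
meta_features = stack.transform(X)
predictions = stack.predict(X)
```

One difference from the hand-written class in this article: `StackingRegressor` refits each base model on the full training set for prediction, whereas the class shown in the next section keeps the per-fold copies and averages their predictions.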

Traditional Stacking Code

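# Cross-validation scoring function
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# train (feature DataFrame) and target_variable (target vector) are assumed
# to have been prepared earlier; the article does not show that step.
n_folds = 5

def rmsle_cv(model):
    # get_n_splits returns the integer n_folds, so cv receives 5 and the
    # shuffle/random_state settings are not applied by cross_val_score.
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rm = np.sqrt(-cross_val_score(model, train.values, target_variable,
                                  scoring="neg_mean_squared_error", cv=kf))
    return rm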
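One detail of `rmsle_cv` is worth checking: `KFold.get_n_splits` returns the number of folds as an integer, not a splitter object. A minimal sketch on made-up data (the array `X_demo` is an assumption) shows what ends up being passed as `cv`:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # toy data, not the house-price set

kf = KFold(5, shuffle=True, random_state=42)
n_splits = kf.get_n_splits(X_demo)  # an int; shuffle/random_state are not carried along

# Iterating over the KFold object itself yields the shuffled, seeded splits.
first_train, first_test = next(iter(kf.split(X_demo)))
```

When `cv` is an integer, `cross_val_score` builds a plain unshuffled `KFold` for regressors, so `shuffle=True, random_state=42` has no effect in `rmsle_cv` as written. Passing the `KFold` object directly as `cv` would apply them, though the scores quoted in this article may then differ.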

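# Stacking model (Stacking Averaged Models) score: 0.1087 (0.0061)
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        # n_models x n_folds list holding the fitted copies
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train the cloned base models and collect the out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            # split into a training part (train_index) and a held-out part (holdout_index)
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Use the base-model predictions as features to train meta_model
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def predict(self, X):
        # Average the predictions of each base model's fold copies, one column per base model
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        # meta_features = np.c_[meta_features, X[:, 1]]
        return self.meta_model_.predict(meta_features)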

from sklearn.linear_model import LinearRegression  # imported in the original, unused here

# ENet, GBoost, KRR and lasso are defined in the base-model section at the end
meta_model = KRR
stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR, lasso),
                                                 meta_model=meta_model,
                                                 n_folds=10)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
# Output: Stacking Averaged models score: 0.1087 (0.0061)

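Improving Stacking

Approach

This article studies and improves the traditional Stacking algorithm, using the house-price data from the Kaggle House Prices competition as the experimental sample and the Kaggle test score to check whether the changes work. The traditional algorithm is modified in the following three ways: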

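1. Build the data subsets by sampling without replacement. Traditional Stacking uses five-fold cross-validation: the training data are split into 5 equal parts, each part in turn serves as the validation subset while the other four form the training subset, the trained base model predicts the validation subset, and those predictions become the second-layer features. The model in this article instead draws random samples without replacement, for example 20 successive draws of 80% of the samples each, giving 20 separately drawn data subsets; the rows left out of a draw form its validation subset.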

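2. Select features at random according to a probability. After data processing and feature engineering the dataset contains many features. One-hot encoding of discrete features in particular makes the feature space very large and introduces many redundant features. This article therefore trains a GBDT on the dataset to obtain the importance of each feature and turns the importances into a list of probabilities that sum to 1. Drawing features according to this list filters out redundant ones and produces separate predictive models that are more efficient and cheaper to train.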

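3. Weight the test-set predictions by the accuracy measured on the training data. Traditional Stacking with five-fold cross-validation splits the dataset into five equal parts, so each base model is fitted 5 times on 5 groups of subsets. When the first layer predicts the test set, the mean of these 5 predictions is used as the second-layer feature. An uneven split of the data can then lead to poor predictions. This article instead takes a weighted average of the predictions, with weights derived from each fitted copy's validation accuracy, and uses the result as the second-layer feature.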
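The first two changes can be sketched on toy data. This is an illustration rather than the article's implementation: it uses scikit-learn's `ShuffleSplit` for the repeated 80% draws and NumPy's `Generator.choice` for importance-weighted feature selection, and the array sizes and the `importances` vector are made-up assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_rows, n_cols = 100, 10
X = np.random.default_rng(0).normal(size=(n_rows, n_cols))  # toy feature matrix

# Hypothetical importances; in the article they come from a fitted GBDT.
importances = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05, 0.0, 0.0])
probs = importances / importances.sum()  # probabilities must sum to 1

# 20 separate draws: 80% of rows for training, the remaining 20% for validation.
splitter = ShuffleSplit(n_splits=20, train_size=0.8, random_state=0)

rng = np.random.default_rng(0)
# Without replacement, only features with non-zero probability can be drawn,
# so the requested count is capped at the number of such features.
n_feature = min(round(n_cols * 0.6), int(np.count_nonzero(probs)))

subsets = []
for train_idx, valid_idx in splitter.split(X):
    cols = rng.choice(n_cols, size=n_feature, replace=False, p=probs)
    subsets.append((train_idx, valid_idx, cols))
```

Columns 8 and 9 have zero importance and are never selected, which is how redundant features get filtered out. The cap on `n_feature` matters because asking for more columns without replacement than there are non-zero probabilities raises an error.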

Improved Stacking Code

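The subsample function samples both the rows and the features of the dataset and returns what was sampled, because the same feature selection has to be applied to the test set at prediction time.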

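The improved algorithm has three hyperparameters:

1. n_tree: the number of copies of each base model, T

2. ratio_sample: the row sampling ratio, a

3. ratio_feature: the feature sampling ratio, b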

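First, GBoost is trained once on the data to obtain the list of feature importances.

In the first layer of the Stacking framework, every base model is copied T times. The rows are randomly sampled T times with sampling ratio a, and features are then selected according to the feature sampling ratio b and the feature importance list, which gives T training subsets and T validation subsets. Each copy of the base model is trained on its training subset, and its predictions on the matching validation subset become input features for the second-layer meta-model. At the same time, the T copies are assigned weights according to the error between predicted and true values on the validation subset: the larger the error, the lower the weight.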

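Weight formula (as implemented in the fit code below, where e_j is the validation mean squared error of the j-th copy and T is the number of copies):

$$ w_j = \frac{S - e_j}{S\,(T - 1)}, \qquad S = \sum_{k=1}^{T} e_k $$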
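A small numeric check of this weighting, assuming made-up error values rather than results from the article: each weight is computed as (S - e_j) / S / (T - 1), the expression used in the `fit` code, where S is the sum of the T validation errors. The function name `accuracy_weights` and the prediction matrix are hypothetical.

```python
import numpy as np

def accuracy_weights(errors):
    """Turn per-copy validation errors into weights that sum to 1."""
    errors = np.asarray(errors, dtype=float)
    total = errors.sum()
    n = len(errors)
    # Same expression as in the fit code: (S - e_j) / S / (T - 1)
    return (total - errors) / total / (n - 1)

# Hypothetical validation errors of T = 4 copies of one base model.
errors = [0.010, 0.020, 0.030, 0.040]
weights = accuracy_weights(errors)

# Hypothetical predictions of the 4 copies for 2 test rows (rows x copies).
preds = np.array([[12.0, 12.2, 11.8, 12.4],
                  [11.0, 11.1, 10.9, 11.2]])
weighted = preds @ weights   # accuracy-weighted average
plain = preds.mean(axis=1)   # unweighted average used by classic Stacking
```

The weights always sum to 1, and the copy with the smallest error gets the largest weight. Because each weight equals (1 - e_j / S) / (T - 1), the weights stay close to 1/T when T is large and the errors are of similar size, so in that regime the result departs only mildly from a plain average.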

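import random

import numpy as np
from numpy import median

def subsample(dataset_x, ratio_sample, ratio_feature, i, list_fearure):
    """Draw a random subsample of rows and features from the dataset.

    Args:
        dataset_x: training feature matrix
        ratio_sample: fraction of rows to draw for training
        ratio_feature: fraction of features to draw
        i: seed for the row sampling
        list_fearure: per-feature selection probabilities (sum to 1)
    Returns:
        sample: row indices drawn for training
        test_list: remaining row indices, used for validation
        feature: column indices of the selected features
    """
    # Fix the seed. This seeds only Python's random module; np.random.choice
    # below draws from NumPy's global generator, which is not seeded here.
    random.seed(i)
    # Number of rows and features to draw; round() rounds to the nearest integer
    n_sample = round(len(dataset_x) * ratio_sample)
    n_feature = round(dataset_x.shape[1] * ratio_feature)
    sample = random.sample(range(len(dataset_x)), n_sample)
    # feature = random.sample(range(dataset_x.shape[1]), n_feature)
    feature = np.random.choice(a=range(dataset_x.shape[1]), size=n_feature,
                               replace=False, p=list_fearure)
    test_list = list(set(range(len(dataset_x))) - set(sample))
    return sample, test_list, feature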

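# RF stacking model (RF Stacking Averaged Models) 0.1158 (0.0048)
import pandas as pd
from sklearn.metrics import mean_squared_error

class RfStackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_tree=20, ratio_sample=1,
                 ratio_feature=1, list_fearure=[]):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_tree = n_tree
        self.ratio_sample = ratio_sample
        self.ratio_feature = ratio_feature
        self.list_fearure = list_fearure

    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]  # n_models x n_tree fitted copies
        self.meta_model_ = clone(self.meta_model)
        self.list_weight = [list() for x in self.base_models]   # n_models x n_tree weights
        self.list_feature = [list() for x in self.base_models]  # n_models x n_tree feature index arrays
        n_tree = self.n_tree
        ratio_sample = self.ratio_sample
        ratio_feature = self.ratio_feature
        list_fearure = self.list_fearure

        # Train the cloned base models and collect the validation predictions
        # that are needed to train the cloned meta-model
        rf_predictions = [list() for x in self.base_models]
        rf_y = []
        rf_x0 = []
        rf_x1 = []
        for i, model in enumerate(self.base_models):
            out__predictions = np.zeros((X.shape[0], n_tree))
            for j in range(n_tree):
                train_list, test_list, feature = subsample(X, ratio_sample, ratio_feature, j, list_fearure)
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[np.ix_(train_list, feature)], y[train_list])
                y_pred = instance.predict(X[np.ix_(test_list, feature)])
                # list(set(feature) | set(list_fearure))
                rf_predictions[i].extend(y_pred)
                if i == 0:
                    # The receivers of these three calls were lost in the source;
                    # rf_x0, rf_x1 and rf_y are assumed from the lists declared above.
                    rf_x0.extend(test_list)
                    rf_x1.extend(feature)
                    rf_y.extend(y[test_list])
                m = mean_squared_error(y_pred, y[test_list])
                self.list_weight[i].append(m)
                out__predictions[:, j] = instance.predict(X[np.ix_(range(X.shape[0]), feature)])
                self.list_feature[i].append(feature)

            # Weight computation: the larger the validation error, the lower the weight
            sum_weight = sum(self.list_weight[i])
            num_weight = len(self.list_weight[i])
            mid_weight = median(self.list_weight[i])
            for j in range(num_weight):
                self.list_weight[i][j] = (sum_weight - self.list_weight[i][j]) / sum_weight / (num_weight - 1)

        # Use the base-model predictions as features to train meta_model
        rf_x = pd.DataFrame(rf_predictions)
        rf_x = rf_x.T
        # rf_x['area'] = rf_x0
        # The fitting call is missing in the source; reconstructed here.
        self.meta_model_.fit(rf_x.values, np.asarray(rf_y))
        return self

    def predict(self, X):
        # The inner expression is missing in the source. Reconstructed from the text:
        # each copy predicts on its own feature subset, and the copies are combined
        # with the accuracy weights (which sum to 1) into one column per base model.
        meta_features = np.column_stack([
            np.column_stack([
                model.predict(X[:, self.list_feature[i][j]]) * self.list_weight[i][j]
                for j, model in enumerate(base_models)]).sum(axis=1)
            for i, base_models in enumerate(self.base_models_)])
        # meta_features = np.c_[meta_features, X[:, 1]]
        return self.meta_model_.predict(meta_features)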

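# Fit GBoost once to obtain the feature importances used as selection probabilities
GBoost.fit(train, target_variable)
list_fearure = GBoost.feature_importances_
# Indices of features with non-zero importance (call partly lost in the source;
# np.nonzero is assumed from the surviving fragment)
list_non = list(np.nonzero(list_fearure)[0])

meta_model = GBoost
rf_stacked_averaged_models = RfStackingAveragedModels(base_models=(ENet, KRR, lasso),
                                                      meta_model=meta_model,
                                                      n_tree=20, ratio_sample=0.8,
                                                      ratio_feature=0.6,
                                                      list_fearure=list_fearure)
score = rmsle_cv(rf_stacked_averaged_models)
print("RF Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
# Output: RF Averaged models score: 0.1158 (0.0048)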

The base models are configured as follows

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# LASSO regression, score: 0.1101 (0.0058)
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Kernel ridge regression, score: 0.1152 (0.0043)
KRR = make_pipeline(RobustScaler(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5))
score = rmsle_cv(KRR)
print("\nKernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Elastic net regression, score: 0.1100 (0.0059)
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
score = rmsle_cv(ENet)
print("\nElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Gradient boosting regression, score: 0.1180 (0.0088)
GBoost = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
score = rmsle_cv(GBoost)
print("\nGradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Results

Model                           RMSE
Kernel ridge regression         0.13666
LASSO regression                0.13181
Elastic net regression          0.13174
Gradient boosting trees         0.13278
Traditional Stacking ensemble   0.12254
Improved Stacking ensemble      0.12060
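The gap between the models can be put in relative terms with a short calculation on the scores listed above. This is only arithmetic on the table values; the dictionary keys and the helper `relative_reduction` are illustrative names.

```python
# RMSE values copied from the results table.
rmse = {
    "kernel ridge": 0.13666,
    "lasso": 0.13181,
    "elastic net": 0.13174,
    "gradient boosting": 0.13278,
    "traditional stacking": 0.12254,
    "improved stacking": 0.12060,
}

def relative_reduction(baseline, improved):
    """Fractional drop in RMSE going from baseline to improved."""
    return (baseline - improved) / baseline

vs_stacking = relative_reduction(rmse["traditional stacking"], rmse["improved stacking"])
best_single = min(v for k, v in rmse.items() if "stacking" not in k)
vs_best_single = relative_reduction(best_single, rmse["improved stacking"])

print(f"vs traditional stacking: {vs_stacking:.2%}")
print(f"vs best single model:    {vs_best_single:.2%}")
```

This works out to an RMSE roughly 1.6% lower than traditional Stacking and about 8.5% lower than the best single model (elastic net).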
