【Stacking Improvement】A Stacking Algorithm Based on Random Sampling and Accuracy Weighting
Abstract
In recent years, the strong rise of artificial intelligence has shown the enormous potential of AI technology, and machine learning has been widely applied across many fields with good results. This article takes the house price data of the Kaggle competition House Prices as the experimental sample and, borrowing the bootstrap sampling of Bagging and k-fold cross-validation, builds a Stacking ensemble model based on pseudo-random sampling for house price prediction. First, GBDT is trained briefly on the dataset to obtain each feature's importance. The dataset is then randomly sampled several times, and attribute perturbation is applied according to the feature importances, forming multiple training subsets and validation subsets. These subsets are used to train the base models; the root mean square error and the predictions on the validation sets are computed, and weights are assigned according to the errors. The base models' predictions form the input of the second-layer meta-model, and house prices are finally predicted on the test set. The experiments show that the root mean square error of the Stacking ensemble based on random sampling and accuracy weighting is lower than that of every base model and of a classic Stacking ensemble with the same structure.
Theoretical Basis of the Stacking Algorithm
Stacking is a layered model-ensembling framework proposed by Wolpert in 1992. A Stacking ensemble can have many layers, but it is usually designed with two. The first layer consists of several base models whose input is the original training set and whose output is each base model's predictions; the second layer has a single meta-model, which is trained on the first layer's predictions together with the true values, yielding the complete ensemble model. Likewise, predicting the test set first passes it through all the base models to form the second layer's features, and the second-layer meta-model then produces the final prediction. To guard against overfitting, the Stacking algorithm generally combines k-fold cross-validation when training the first-layer base models. Taking five-fold cross-validation as an example, each base model is trained five times on four of the folds and predicts the held-out fold, so every training sample receives an out-of-fold prediction that becomes a second-layer feature.
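For reference, recent versions of scikit-learn (0.22 and later) ship this two-layer scheme as a built-in estimator. A minimal sketch, assuming the base models lasso, KRR, ENet and GBoost that are defined in the parameter-settings section below:

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[("lasso", lasso), ("krr", KRR), ("enet", ENet), ("gboost", GBoost)],
    final_estimator=LinearRegression(),  # the second-layer meta-model
    cv=5)  # out-of-fold predictions from 5-fold CV, as described above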
Traditional Stacking Code
# Define a cross-validated evaluation function
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, target_variable,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
# Stacking averaged models, score: 0.1087 (0.0061)
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        # One list per base model, holding its n_folds fitted clones (e.g. a 4x5 grid)
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Train the base models; their out-of-fold predictions become the features
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):  # split into train_index and holdout_index
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Use the base models' out-of-fold predictions as features to train the meta_model
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def predict(self, X):
        # Average each base model's fold clones, then feed the meta-model
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
meta_model = KRR
stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR, lasso),
                                                 meta_model=meta_model,
                                                 n_folds=10)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
Stacking Averaged models score: 0.1087 (0.0061)
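Once the cross-validated score looks acceptable, the ensemble is fit once on the full training set and applied to the test set. A minimal usage sketch, assuming the preprocessed train, test and target_variable objects used by rmsle_cv above:

stacked_averaged_models.fit(train.values, target_variable)
test_predictions = stacked_averaged_models.predict(test.values)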
Improving Stacking
Improvement Approach
This article studies and improves the traditional Stacking algorithm, using the house price data of the Kaggle competition House Prices as the experimental sample and validating the feasibility of the improvements through the Kaggle test score. The traditional Stacking algorithm is improved in the following ways:
1. Use sampling without replacement to obtain data subsets. The traditional Stacking algorithm uses five-fold cross-validation: the training data is split into 5 equal parts, each part in turn serves as the validation subset while the other four parts train the model, and the trained base model's predictions on the validation subset become features for the second layer. The model in this article instead uses random sampling without replacement, for example drawing 80% of the samples 20 consecutive times to form 20 independent data subsets.
2. Randomly select features according to a probability distribution. After data processing and feature engineering the dataset contains many features; one-hot encoding of categorical features in particular makes the feature space very large and introduces many redundant features. This article therefore runs a quick GBDT training pass on the dataset to obtain each feature's importance, forming a probability list that sums to 1. Sampling features with this probability list filters out redundant features and builds independent predictive models that are more efficient and cheaper to train.
3. Weight the test-set predictions by the validation accuracy observed during training. The traditional Stacking algorithm uses five-fold cross-validation, splitting the dataset into five equal parts and training 5 copies of each base model; when predicting the test set in the first layer, the plain average of the 5 copies' predictions is taken as the second-layer feature. An uneven data split can hurt prediction quality here, so this article instead takes a weighted average of the predictions, with weights derived from each copy's validation accuracy, and uses the result as the second-layer feature (see the sketch after this list).
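To make the first idea concrete, the following sketch (toy sizes, purely illustrative) contrasts the five-fold split with 20 rounds of 80% sampling without replacement:

import random

n = 10  # toy number of training rows (hypothetical)
indices = list(range(n))

# Five-fold CV: every row appears in exactly one validation subset.
folds = [indices[i::5] for i in range(5)]

# Improved scheme: 20 independent draws of 80% of the rows. No row repeats
# within a draw, but the 20 subsets may overlap with one another.
subsets = []
for i in range(20):
    random.seed(i)  # one fixed seed per draw, as in subsample() below
    subsets.append(random.sample(indices, round(n * 0.8)))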
Improved Stacking Code
The subsample function samples both the rows and the features of the dataset, and records how the sampling was done, because the same feature selection must be applied to the test set at prediction time.
The improved algorithm has three hyperparameters (a toy illustration follows the list):
1. n_tree: the number of sub-models T trained per base model
2. ratio_sample: the row (sample) sampling ratio a
3. ratio_feature: the feature sampling ratio b
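With hypothetical sizes, the three hyperparameters determine the shape of each training subset as follows (toy numbers, not the actual House Prices dimensions):

N, M = 1000, 200  # hypothetical row and feature counts
T, a, b = 20, 0.8, 0.6  # n_tree, ratio_sample, ratio_feature
rows_per_subset = round(N * a)  # 800 rows sampled per training subset
features_per_subset = round(M * b)  # 120 features sampled per subset
# T = 20 such (row, feature) subsets are drawn for each base model.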
First, GBoost is trained briefly on the dataset to obtain the feature importance list.
In the first layer of the stacking framework, each base model is cloned T times. The rows are sampled T times at the sampling ratio a, and features are then selected according to the feature sampling ratio b and the feature importance list, yielding T training subsets and T validation subsets. The base models are trained on the training subsets, and their predictions on the corresponding validation subsets become the input features of the second-layer meta-model. At the same time, the T sub-models are assigned weights according to the error between the predicted and true values on the validation sets: the larger the error, the lower the weight.
Weight calculation formula: w_j = (S - m_j) / (S * (T - 1)), where m_j is the validation mean squared error of sub-model j, S = m_1 + m_2 + ... + m_T, and T is the number of sub-models. A larger error yields a smaller weight, and the T weights are non-negative and sum to 1.
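A quick numeric check of the formula, with hypothetical MSE values:

mse = [0.010, 0.012, 0.020]  # hypothetical validation MSEs for T = 3 sub-models
S, T = sum(mse), len(mse)
weights = [(S - m) / S / (T - 1) for m in mse]
# weights ≈ [0.381, 0.357, 0.262]: the smallest error receives the largest
# weight, and the weights sum to 1 (up to floating point).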
import random
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

def subsample(dataset_x, ratio_sample, ratio_feature, i, feature_probs):  # create a random sub-sample of the dataset
    """Randomly sample the rows and features of the training set.
    Args:
        dataset_x       training data
        ratio_sample    fraction of rows to sample
        ratio_feature   fraction of features to sample
        i               random seed for this draw
        feature_probs   feature-importance probabilities (sum to 1)
    Returns:
        sample          indices of the sampled training rows
        test_list       indices of the remaining (validation) rows
        feature         indices of the sampled features
    """
    random.seed(i)     # fix the seed so every base model sees the same draw i
    np.random.seed(i)  # the feature draw below uses numpy's RNG
    # Sample rows and features in proportion to the two ratios;
    # round() returns the value rounded to the nearest integer.
    n_sample = round(len(dataset_x) * ratio_sample)
    n_feature = round(dataset_x.shape[1] * ratio_feature)
    # Rows are drawn without replacement.
    sample = random.sample(range(len(dataset_x)), n_sample)
    # Features are drawn without replacement, weighted by importance.
    feature = np.random.choice(a=range(dataset_x.shape[1]), size=n_feature,
                               replace=False, p=feature_probs)
    test_list = list(set(range(len(dataset_x))) - set(sample))
    return sample, test_list, feature
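A quick check of subsample on toy data, with hypothetical sizes and uniform importance probabilities:

X_toy = np.random.rand(100, 10)
probs = np.full(10, 0.1)  # uniform "importances" that sum to 1
sample, test_list, feature = subsample(X_toy, 0.8, 0.6, 0, probs)
# len(sample) == 80, len(test_list) == 20, len(feature) == 6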
# RF stacking averaged models, score: 0.1158 (0.0048)
class RfStackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_tree=20, ratio_sample=1,
                 ratio_feature=1, feature_probs=None):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_tree = n_tree
        self.ratio_sample = ratio_sample
        self.ratio_feature = ratio_feature
        self.feature_probs = feature_probs

    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]  # len(base_models) x n_tree fitted sub-models
        self.list_weight = [list() for x in self.base_models]   # len(base_models) x n_tree prediction weights
        self.list_feature = [list() for x in self.base_models]  # len(base_models) x n_tree sampled feature indices
        n_tree = self.n_tree
        ratio_sample = self.ratio_sample
        ratio_feature = self.ratio_feature
        feature_probs = self.feature_probs
        # Train the base models; their validation-set predictions become the
        # features that are needed to train the cloned meta-model
        rf_predictions = [list() for x in self.base_models]
        rf_y = []
        for i, model in enumerate(self.base_models):
            for j in range(n_tree):
                # Seeded by j, so every base model sees the same j-th draw
                train_list, test_list, feature = subsample(X, ratio_sample, ratio_feature, j, feature_probs)
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[np.ix_(train_list, feature)], y[train_list])
                y_pred = instance.predict(X[np.ix_(test_list, feature)])
                rf_predictions[i].extend(y_pred)
                if i == 0:
                    # The row split depends only on j, so the true targets are
                    # collected once, aligned with the rows of rf_predictions
                    rf_y.extend(y[test_list])
                m = mean_squared_error(y_pred, y[test_list])
                self.list_weight[i].append(m)
                self.list_feature[i].append(feature)
            # Weight calculation: w_j = (S - m_j) / (S * (T - 1))
            sum_weight = sum(self.list_weight[i])
            num_weight = len(self.list_weight[i])
            for j in range(num_weight):
                self.list_weight[i][j] = (sum_weight - self.list_weight[i][j]) / sum_weight / (num_weight - 1)
        # Use the base models' validation predictions as features to train the meta_model
        rf_x = pd.DataFrame(rf_predictions).T  # rows: validation samples, columns: base models
        self.meta_model_ = clone(self.meta_model)
        self.meta_model_.fit(rf_x, rf_y)
        return self

    def predict(self, X):
        # For each base model, take the accuracy-weighted average over its
        # n_tree sub-models, each applied to its own sampled feature subset
        meta_features = np.column_stack([
            np.column_stack([
                model.predict(X[:, self.list_feature[i][j]]) * self.list_weight[i][j]
                for j, model in enumerate(base_models)]).sum(axis=1)
            for i, base_models in enumerate(self.base_models_)])
        return self.meta_model_.predict(meta_features)
# Train GBoost once to obtain the feature-importance probability list
GBoost.fit(train, target_variable)
feature_probs = GBoost.feature_importances_
list_non = list(np.nonzero(feature_probs)[0])  # features with non-zero importance (only these can be drawn with replace=False)
meta_model = GBoost
rf_stacked_averaged_models = RfStackingAveragedModels(base_models=(ENet, KRR, lasso),
                                                      meta_model=meta_model,
                                                      n_tree=20, ratio_sample=0.8,
                                                      ratio_feature=0.6,
                                                      feature_probs=feature_probs)
score = rmsle_cv(rf_stacked_averaged_models)
print("RF Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
RF Averaged models score: 0.1158 (0.0048)
The base model parameters are set as follows:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor

# LASSO regression, score: 0.1101 (0.0058)
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Kernel ridge regression, score: 0.1152 (0.0043)
KRR = make_pipeline(RobustScaler(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5))
score = rmsle_cv(KRR)
print("\nKernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Elastic net regression, score: 0.1100 (0.0059)
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
score = rmsle_cv(ENet)
print("\nElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# Gradient boosting regression, score: 0.1180 (0.0088)
GBoost = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
score = rmsle_cv(GBoost)
print("\nGradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Results
Model                            RMSE
Kernel Ridge Regression          0.13666
LASSO Regression                 0.13181
Elastic Net Regression           0.13174
Gradient Boosting                0.13278
Traditional Stacking ensemble    0.12254
Improved Stacking ensemble       0.12060