首页 > 美文阅读

股票-(五)如何用 Python 从头开始实现 Bagging 算法

更新时间:2023-05-15 08:10:32 阅读：评论：0

（五）如何用 Python 从头开始实现 Bagging 算法

datat [[randrange(10)] for i in range(20)]

print( True Mean: %.3f % mean([row[0] for row in datat]))

# Estimated means

ratio 0.10

for size in [1, 10, 100]:

sample_means list()

for i in range(size):

sample subsample(datat, ratio)

sample_mean mean([row[0] for row in sample])

sample_means.append(sample_mean)

print( Samples %d, Estimated Mean: %.3f % (size, mean(sample_means)))

运行该示例将打印我们要估计的原始数据平均值。

然后我们可以从各种不同数量的自举样本中看到估计的平均值。我们可以看到通过 100 个样本我们可以很好的估计平均值。

True Mean: 4.450

Samples 1, Estimated Mean: 4.500

Samples 10, Estimated Mean: 3.300

彬彬有礼造句Samples 100, Estimated Mean: 4.480

我们可以为每个子样本创建一个模型而不是简单的计算平均值。

接下来让我们看看如何组合多个 bootstrap 模型的预测。

2. 声呐数据集案例研究

在这一节中我们将随机森林算法应用于声呐数据集。

首先我们需要导入数据集将字符串值转换为数值型并将输出列从字符串转换为 0 和 1 的整数值。这是通过辅助函数 load_csv() str_column_to_float() 和 str_column_to_int() 来实现的以便预处理数据集。

我们将使用 k-fold 交叉验证来估计学习模型在未知数据上的性能。这意味着我们将构建和评估 k 个模型并将性能估计为平均模型误差。分类精度将评估每个模型这些算法都在 cross_validation_split() accuracy_metric() 和 evaluate_algoritm() 函数中得到解决。

我们使用 CART 算法来实现我们的 bagging 过程在实现的过程中我们还设计了一些辅助函数 test_split() 函数将数据集拆分成组 gini_index() 用于评估拆分点 get_split() 用于查找最佳拆分点 to_terminal() split() 和 build_tree() 用于创建单个决策树 predict() 用于使用决策树进行预测并使用前一步骤中描述的 subsample() 函数来创建训练的子样本训练集。

我们还开发了一个 bagging_predict() 函数该函数负责使用每个决策树进行预测并将预测组合成单个返回值。这是 bagging 方法常用的一种模式。

最后我们设计一个新的 bagging() 函数负责创建训练数据集的样本在每个样本上训练决策树然后使用bagging() 列表对测试数据集进行预测。

下面给出了完整代码

# Bagging Algorithm on the Sonar datat

from random import ed

from random import randrange

from csv import reader

# Load a CSV file

def load_csv(filename):

datat list()

with open(filename, r ) as file:

csv_reader reader(file)

for row in csv_reader:

if not row:

continue

datat.append(row)

return datat

# Convert string column to float

def str_column_to_float(datat, column):

for row in datat:

row[column] float(row[column].strip())

# Convert string

column to integer

def str_column_to_int(datat, column):

class_values [row[column] for row in datat]

unique t(class_values)

lookup dict()

for i, value in enumerate(unique):

lookup[value] i

for row in datat:

row[column] lookup[row[column]]

面包山

return lookup

# Split a datat into k folds

def cross_validation_split(datat, n_folds):

datat_split list()

datat_copy list(datat)

fold_size int(len(datat) / n_folds)

for i in range(n_folds):

fold list()

while len(fold) fold_size:

index randrange(len(datat_copy))

fold.append(datat_copy.pop(index))

datat_split.append(fold)

return datat_split

# Calculate accuracy percentage

def accuracy_metric(actual, predicted):

correct 0

for i in range(len(actual)):

if actual[i] predicted[i]:

correct 1

return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split

def evaluate_algorithm(datat, algorithm, n_folds, *args):

folds cross_validation_split(datat, n_folds)

scores list()

for fold in folds:

train_t list(folds)

ve(fold)

train_t sum(train_t, [])

test_t list()

for row in fold:

row_copy list(row)

test_t.append(row_copy)

row_copy[-1] None

predicted algorithm(train_t, test_t, *args)

actual [row[-1] for row in fold]

accuracy accuracy_metric(actual, predicted)

scores.append(accuracy)

return scores

繁花血景

# Split a datat bad on an attribute and an attribute value

def test_split(index, value, datat):

left, right list(), list()

for row in datat:

if row[index] value:

left.append(row)

el:

right.append(row)

return left, right

# Calculate the Gini index for a split datat

def gini_index(groups, class):

# count all samples at split point来时

n_instances float(sum([len(group) for group in groups]))

# sum weighted Gini index for each group

gini 0.0

for group in groups:

size float(len(group))

# avoid divide by zero

if size 0:

continue

score 0.0

# score the group bad on the score for each class

for class_val in class:

p [row[-1] for row in group].count(class_val) / size

score p * p

# weight the group score by its relative size

gini (1.0 - score) * (size / n_instances)

return gini

# Select the best split point for a datat

def get_split(datat):

class_values list(t(row[-1] for row in datat))

b_index, b_value, b_score, b_groups 999, 999, 999, None

for index in range(len(datat[0])-1):

for row in datat:

# for i in range(len(datat)):

# row datat[randrange(len(datat))]

groups test_split(index, row[index], datat)

gini gini_index(groups, class_values)

if gini b_score:

b_index, b_value, b_score, b_groups index, row[index], gini, groups

return { index :b_index, value :b_value, groups :b_groups}

# Create a terminal node value

def to_terminal(group):

outcomes [row[-1] for row in group]

return max(t(outcomes), unt)

# Create child splits for a node or make t

erminal

def split(node, max_depth, min_size, depth):

left, right node[ groups ]

del(node[ groups ])

# check for a no split

if not left or not right:

node[ left ] node[ right ] to_terminal(left right)

return

# check for max depth

快速除锈的方法最好用

if depth max_depth:

node[ left ], node[ right ] to_terminal(left), to_terminal(right)

return

# process left child

if len(left) min_size:

node[ left ] to_terminal(left)

el:

node[ left ] get_split(left)

split(node[ left ], max_depth, min_size, depth 1)

# process right child

if len(right) min_size:

node[ right ] to_terminal(right)

el:

node[ right ] get_split(right)

split(node[ right ], max_depth, min_size, depth 1)

# Build a decision tree

def build_tree(train, max_depth, min_size):

root get_split(train)

split(root, max_depth, min_size, 1)

return root

# Make a prediction with a decision tree

def predict(node, row):

if row[node[ index ]] node[ value ]:

if isinstance(node[ left ], dict):

return predict(node[ left ], row)

el:

return node[ left ]

el:

if isinstance(node[ right ], dict):

return predict(node[ right ], row)

el:

return node[ right ]

# Create a random subsample from the datat with replacement

def subsample(datat, ratio):

sample list()

n_sample round(len(datat) * ratio)

while len(sample) n_sample:

index randrange(len(datat))

sample.append(datat[index])

return sample

# Make a prediction with a list of bagged trees

def bagging_predict(trees, row):

predictions [predict(tree, row) for tree in trees]

return max(t(predictions), unt)

# Bootstrap Aggregation Algorithm

def bagging(train, test, max_depth, min_size, sample_size, n_trees):

trees list()

for i in range(n_trees):

sample subsample(train, sample_size)

tree build_tree(sample, max_depth, min_size)

trees.append(tree)

predictions [bagging_predict(trees, row) for row in test]

return(predictions)

# Test bagging on the sonar datat

ed(1)

# load and prepare data

filename sonar.all-data.csv

datat load_csv(filename)

# convert string attributes to integers

江南春古诗朗诵

for i in range(len(datat[0])-1):

str_column_to_float(datat, i)

# convert class column to integers王之才

str_column_to_int(datat, len(datat[0])-1)

# evaluate algorithm

n_folds 5

max_depth 6

min_size 2

sample_size 0.50

for n_trees in [1, 5, 10, 50]:

scores evaluate_algorithm(datat, bagging, n_folds, max_depth, min_size, sample_size, n_trees)

print( Trees: %d % n_trees)

print( Scores: %s % scores)

print( Mean Accuracy: %.3f%% % (sum(scores)/float(len(scores))))

k 值为 5 时用于交叉验证每次迭代评估的数据量为 208/5 41.6 或者直接使用 40 条进行循环迭代。

全部清除

构建树的最大深度我们设为 6 每个节点为 2 的最小训练行数。训练数据集的样本创建为原始数据集大小的 50% 。这是为了强制用于训练每棵树的训练集子样本中的某些变体。bagging 的默认设置是使样本

数据集的大小与原始训练数据集的大小相匹配。

接下来我们打印每个类别的结果

Trees: 1

Scores: [87.8048780487805, 65.85365853658537, 65.85365853658537, 65.85365853658537, 73.17073170731707]

Mean Accuracy: 71.707%

Trees: 5

Scores: [60.97560975609756, 80.48780487804879, 78.04878048780488, 82.92682926829268, 63.41463414634146]

Mean Accuracy: 73.171%

Trees: 10

Scores: [60.97560975609756, 73.17073170731707, 82.92682926829268, 80.48780487804879, 68.29268292682927]

Mean Accuracy: 73.171%

Trees: 50

Scores: [63.41463414634146, 75.60975609756098, 80.48780487804879, 75.60975609756098, 85.36585365853658]

Mean Accuracy: 76.098%

这种方法的一个难点是即使我们构建了一定深度的树但是 bagging 树得到的结果也是非常相似的。但是我们希望在训练的过程中可以降低高方差。这是因为我们在构造的过程中选择了相同或者相似的分裂节点这是一种贪婪算法。

本教程试图通过约束用于训练每棵树的样本大小来重新计算方差。更强大的技术是约束在创建每个分割点时对特征的评估。这是随机森林中使用的方法。

扩展

调整树调整树的大小深度以及单个树的配置 bagging 中构建不同的树结构我们可以通过使用不同的算法进行平均预测比如贝叶斯决策树神经网络等等

本文发布于:2023-05-15 08:10:32，感谢您对本站的认可！

本文链接：https://www.wtabcd.cn/fanwen/fan/82/638390.html

上一篇：学生会介绍信

下一篇：最新垃圾分类倡议书450(5篇)

标签：训练数据样本使用进行函数用于

留言与评论（共有 0 条评论）