Statistical Learning Methods, Chapter 5: Decision Trees (decision tree), the ID3 and C4.5 Algorithms, with a Python Implementation


A decision tree is a basic method for classification and regression. The model has a tree structure; in a classification problem it represents the process of classifying instances according to their features.
It can be viewed either as a collection of if-then rules or as a conditional probability distribution defined on the feature space and the class space.
Its main advantages are that the model is readable and that classification is fast.
During learning, a decision tree is built from the training data according to the principle of minimizing a loss function. During prediction, new data are classified with the learned tree.
Decision tree learning usually consists of three steps: feature selection, tree generation, and tree pruning.
The main decision tree algorithms are ID3, C4.5, and CART.
The core of the ID3 algorithm is to apply the information gain criterion at every node to select a splitting feature and to build the tree recursively. Concretely: starting from the root node, compute the information gain of every feature on the data at that node, choose the feature with the largest gain as the node's splitting feature, and create one child node per value of that feature; then apply the same procedure recursively to each child, until every remaining feature has very small information gain or no features are left. The result is a decision tree. ID3 is equivalent to selecting a probabilistic model by maximum likelihood.
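For reference, the quantities involved are the ones defined in Chapter 5 of the book. With $C_k$ the subset of the training set $D$ belonging to class $k$, and a feature $A$ whose values split $D$ into subsets $D_1,\dots,D_n$, the empirical entropy, the empirical conditional entropy, and the information gain maximized by ID3 are

$$H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|},\qquad H(D\mid A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i),\qquad g(D,A) = H(D) - H(D\mid A).$$

C4.5 keeps the same procedure but selects features by the information gain ratio $g_R(D,A) = g(D,A)/H_A(D)$, where $H_A(D) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$; this penalizes features with many distinct values.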
## The ID3 algorithm
The code is as follows:
```python
import cv2
import time
import logging
import random

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

total_class = 10


def log(func):
    """Decorator that logs when a function starts and how long it took."""
    def wrapper(*args, **kwargs):
        start_time = time.time()
        logging.debug('start %s()' % func.__name__)
        ret = func(*args, **kwargs)
        end_time = time.time()
        logging.debug('end %s(), cost %s seconds' % (func.__name__, end_time - start_time))
        return ret
    return wrapper


def binaryzation(img):
    """Binarize an image: every pixel above 50 becomes 1, the rest 0."""
    cv_img = img.astype(np.uint8)
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY, cv_img)
    return cv_img


@log
def binaryzation_features(trainset):
    """Binarize every image and flatten it into a 100-dimensional feature vector."""
    features = []
    for img in trainset:
        img = np.reshape(img, (10, 10))
        cv_img = img.astype(np.uint8)
        img_b = binaryzation(cv_img)
        # hog_feature = np.transpose(hog_feature)
        features.append(img_b)
    features = np.array(features)
    features = np.reshape(features, (-1, 100))
    return features
```
```python
class Tree(object):
    def __init__(self, node_type, Class=None, feature=None):
        self.node_type = node_type  # 'leaf' or 'internal'
        self.dict = {}              # feature value -> subtree
        self.Class = Class          # class label (meaningful for leaves)
        self.feature = feature      # index of the splitting feature (internal nodes)

    def add_tree(self, val, tree):
        self.dict[val] = tree

    def predict(self, features):
        if self.node_type == 'leaf':
            return self.Class
        if features[self.feature] in self.dict.keys():
            tree = self.dict[features[self.feature]]
        else:
            # Unseen feature value: fall back to a stored class or guess randomly.
            if self.Class is None:
                return random.randint(0, 1)
            else:
                return self.Class
        return tree.predict(features)


def calc_ent(x):
    """Calculate the empirical entropy of x."""
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp
    return ent


def calc_condition_ent(train_feature, train_label):
    """Calculate the empirical conditional entropy H(y|x)."""
    ent = 0.0
    train_feature_set = set(train_feature)
    for train_feature_value in train_feature_set:
        Di = train_feature[train_feature == train_feature_value]
        label_i = train_label[train_feature == train_feature_value]
        train_label_set = set(train_label)
        temp = 0.0
        for train_label_value in train_label_set:
            Dik = Di[label_i == train_label_value]
            if len(Dik) != 0:
                p = float(len(Dik)) / len(Di)
                logp = np.log2(p)
                temp -= p * logp
        ent += float(len(Di)) / len(train_feature) * temp
    return ent
```
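As a quick check that the two helpers behave the way the formulas above suggest, they can be run on a tiny hand-made example (the arrays `D` and `A` below are illustrative and not part of the original post):

```python
import numpy as np

D = np.array([0, 0, 1, 1, 1, 0])   # toy class labels
A = np.array([0, 0, 0, 1, 1, 1])   # one toy feature, splitting D into two halves

HD = calc_ent(D)                    # H(D)   = 1.0 (three 0s, three 1s)
HDA = calc_condition_ent(A, D)      # H(D|A) ≈ 0.918
print(HD, HDA, HD - HDA)            # information gain g(D, A) ≈ 0.082
```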
```python
def recur_train(train_set, train_label, features, epsilon):
    global total_class

    LEAF = 'leaf'
    INTERNAL = 'internal'

    # Step 1: if every instance in train_set belongs to the same class Ck,
    # return a single-node tree labelled Ck.
    label_set = set(train_label)
    if len(label_set) == 1:
        return Tree(LEAF, Class=label_set.pop())

    # Step 2: determine the majority class (the code assumes binary 0/1 labels),
    # used whenever splitting has to stop.
    class_count0 = 0
    class_count1 = 0
    for i in range(len(train_label)):
        if train_label[i] == 1:
            class_count1 += 1
        else:
            class_count0 += 1
    if class_count0 >= class_count1:
        max_class = 0
    else:
        max_class = 1

    # If the feature set is empty, return a leaf labelled with the majority class.
    if features is None or len(features) == 0:
        return Tree(LEAF, Class=max_class)

    # Step 3: compute the information gain g(D, A) = H(D) - H(D|A) of every
    # remaining feature and keep the feature with the largest gain.
    max_feature = 0
    max_gda = 0
    D = train_label
    HD = calc_ent(D)
    for feature in features:
        A = np.array(train_set[:, feature].flat)
        gda = HD - calc_condition_ent(A, D)
        if gda > max_gda:
            max_gda, max_feature = gda, feature

    # Step 4: if even the best gain is below the threshold epsilon, stop splitting.
    if max_gda < epsilon:
        return Tree(LEAF, Class=max_class)

    # Step 5: split on the chosen feature and recurse on each non-empty subset.
    sub_features = [f for f in features if f != max_feature]
    tree = Tree(INTERNAL, feature=max_feature)

    feature_col = np.array(train_set[:, max_feature].flat)
    feature_value_set = set([feature_col[i] for i in range(feature_col.shape[0])])
    for feature_value in feature_value_set:
        index = []
        for i in range(len(train_label)):
            if train_set[i][max_feature] == feature_value:
                index.append(i)
        sub_train_set = train_set[index]
        sub_train_label = train_label[index]
        sub_tree = recur_train(sub_train_set, sub_train_label, sub_features, epsilon)
        tree.add_tree(feature_value, sub_tree)
    return tree


@log
def train(train_set, train_label, features, epsilon):
    return recur_train(train_set, train_label, features, epsilon)


@log
def predict(test_set, tree):
    result = []
    for features in test_set:
        tmp_predict = tree.predict(features)
        result.append(tmp_predict)
    return np.array(result)
```
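A small smoke test of `train()` and `predict()` on a hand-made dataset (this snippet is an illustration added here, not part of the original code): the label simply equals the first feature, so the tree should split on feature 0 and reproduce it.

```python
import numpy as np

X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([0, 0, 1, 1])                              # label == feature 0

toy_tree = train(X, y, [0, 1], epsilon=0.001)
print(predict(np.array([[1, 0], [0, 1]]), toy_tree))    # expected output: [1 0]
```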
```python
if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    raw_data = pd.read_csv('../data/train_binary2.csv', header=0)
    data = raw_data.values

    images = data[0:, 1:]   # pixel columns
    labels = data[:, 0]     # first column is the label

    # Binarize the images
```
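The original post breaks off after the binarization comment. Below is a minimal sketch of how the remaining steps could be wired together inside the `__main__` block, using only the functions and imports defined above; the `test_size`, `random_state`, the `epsilon` value, and the use of all 100 pixel positions as candidate features are assumptions, not taken from the original.

```python
    # Reconstructed continuation (assumed, not from the original post).
    features = binaryzation_features(images)

    train_features, test_features, train_labels, test_labels = train_test_split(
        features, labels, test_size=0.33, random_state=0)

    tree = train(train_features, train_labels, list(range(100)), epsilon=0.1)
    test_predict = predict(test_features, tree)
    print('accuracy:', accuracy_score(test_labels, test_predict))
```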
