Hadoop Study Notes, Part 3: Classifying User Risk Levels with a Decision Tree Algorithm

Updated: 2023-07-29 15:32:31


Preface


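The year 2016, which has just ended, has been called the first year of artificial intelligence. After AlphaGo's landmark victory over Lee Sedol, neural networks and deep learning suddenly entered public view, and the major technology companies shifted their focus to AI, a field that still has no clear direction but that everyone expects to change the world once it matures. Along the way, people also realized that although big-data storage technology and hardware processing power had advanced steadily over the past few years, the output had been limited, mainly because nobody knew how to make use of such vast and complex data. The answer is in there, but we do not know how to find it. As a result, studying data mining and machine learning algorithms has once again become a very hot topic.

Following the KNN algorithm covered in the previous post, this post studies another relatively simple machine learning algorithm: the decision tree. KNN classifies samples by the Euclidean distance between them. It is simple and easy to understand, but its main drawbacks are a heavy computational cost and an inability to reveal the underlying meaning of the data. A decision tree, by contrast, is much better at exposing that meaning, and its results are easier to interpret in business terms.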

Decision Tree Algorithm

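A decision tree follows rules much like human decision-making: it works through a series of IF-ELSE questions to reach a final classification. Below is a very simple decision tree example.

[Figure: a very simple decision tree example]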
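To make the IF-ELSE idea concrete, here is a minimal sketch of a hand-written tree over three 0/1 features (stocks/funds/precious metals, wealth management products, deposits/money market). The class name, the method name, and the particular questions asked are assumptions made for illustration; they are not the tree learned in this post.

```java
// Illustrative only: a hand-written decision tree over three 0/1 features.
// The questions and their order are assumed, not learned from data.
class RiskTreeExample {
    static String classify(int stocks, int wealthProducts, int deposits) {
        if (stocks == 1) {
            return "high";          // holds stocks, funds, or precious metals
        } else if (wealthProducts == 1) {
            return "middle";        // no stocks, but holds wealth management products
        } else {
            return "low";           // deposits only; this toy tree never needs to ask about them
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(1, 1, 1));
        System.out.println(classify(0, 1, 1));
        System.out.println(classify(0, 0, 1));
    }
}
```

Each path from the first question to a return statement corresponds to one branch of the tree, which is why the result is easy to explain to a business user.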

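Running the decision tree algorithm is the same as constructing the tree. Faced with a large and messy data set, the first question to answer when building the tree is which feature of the current data set is decisive for separating the classes. In the user risk classification case from the previous post, each user had data in three areas: stocks, funds, and precious metals investment; wealth management product investment; and deposits and money-market investment. Real commercial users have many more dimensions of data. We must find the decisive feature to get the best split, so we need to evaluate the importance of every feature.

Once the first decision point is found, the data set is divided into several branches. We then check whether the data under each branch belong to the same class. If they do, splitting stops; if not, we keep looking for decision points. The pseudocode for creating a branch is as follows: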

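Check whether every item in the data set belongs to the same class:
    If so, return the class label
    Else
        find the best feature for splitting the data set
        split the data set
        create a branch node
        For each subset of the split
            call this procedure recursively and add the result to the branch node
        return the branch node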
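The pseudocode above can be sketched in plain Java (no Hadoop) as follows. This is a minimal illustration under stated assumptions: rows are string arrays with the class label in the last column, the class and method names are hypothetical, the "best" feature is taken to be the one with the lowest weighted entropy after the split, and a majority-vote fallback is added for the case where no features remain, which the pseudocode does not mention.

```java
import java.util.*;

class TreeBuilderSketch {
    // Assumption: the class label sits in the last column of each row.
    static String label(String[] row) { return row[row.length - 1]; }

    // Shannon entropy (base 2) of the labels in a set of rows.
    static double entropy(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(label(r), 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Split the data set: group rows by their value in one column.
    static Map<String, List<String[]>> split(List<String[]> rows, int col) {
        Map<String, List<String[]>> parts = new TreeMap<>();
        for (String[] r : rows) {
            parts.computeIfAbsent(r[col], k -> new ArrayList<>()).add(r);
        }
        return parts;
    }

    // Most frequent label; a fallback when no columns remain (not in the pseudocode).
    static String majority(List<String[]> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] r : rows) counts.merge(label(r), 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Returns a String label for a leaf, or a Map from "col<i>:<value>" to a subtree.
    static Object build(List<String[]> rows, Set<Integer> freeCols) {
        String first = label(rows.get(0));
        if (rows.stream().allMatch(r -> label(r).equals(first))) return first;
        if (freeCols.isEmpty()) return majority(rows);

        // Best feature = lowest weighted entropy after the split (highest gain).
        int best = -1;
        double bestEntropy = Double.MAX_VALUE;
        for (int col : freeCols) {
            double weighted = 0.0;
            for (List<String[]> part : split(rows, col).values()) {
                weighted += (double) part.size() / rows.size() * entropy(part);
            }
            if (weighted < bestEntropy) { bestEntropy = weighted; best = col; }
        }

        Set<Integer> rest = new TreeSet<>(freeCols);
        rest.remove(best);
        Map<String, Object> node = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> e : split(rows, best).entrySet()) {
            node.put("col" + best + ":" + e.getKey(), build(e.getValue(), rest));
        }
        return node;
    }

    // Builds a tree from three illustrative 0/1 rows.
    static Object demo() {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "1", "1", "high"},
            new String[] {"0", "1", "1", "middle"},
            new String[] {"0", "0", "1", "low"});
        return build(rows, new TreeSet<>(Arrays.asList(0, 1, 2)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The nested map is used only to keep the sketch short; a dedicated node class would be the more usual choice in a real implementation.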

Information Gain

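The guiding principle for splitting a data set is to make disordered data more ordered. The change in information before and after a split is called the information gain. After computing the information gain obtained by splitting on each feature, the feature with the highest gain is the best choice. The measure of the information in a set is called Shannon entropy, computed as

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where p(x_i) is the proportion of items in the set that belong to class x_i and n is the number of classes.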
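As a small worked example of these two quantities, the sketch below computes Shannon entropy from class counts and the information gain of a split as the entropy before the split minus the size-weighted entropy of the subsets after it. The class name, method names, and the counts used in `main` are assumptions chosen for illustration.

```java
class InfoGainExample {
    // Shannon entropy (in bits) of a list of class counts.
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;               // 0 * log(0) is treated as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain = entropy before the split - weighted entropy after it.
    // before: class counts of the whole set; after: class counts of each subset.
    static double informationGain(int[] before, int[][] after) {
        int total = 0;
        for (int c : before) total += c;
        double remainder = 0.0;
        for (int[] subset : after) {
            int size = 0;
            for (int c : subset) size += c;
            remainder += (double) size / total * entropy(subset);
        }
        return entropy(before) - remainder;
    }

    public static void main(String[] args) {
        int[] before = {2, 2};                  // two classes, two items each
        int[][] perfect = {{2, 0}, {0, 2}};     // each subset is pure
        int[][] useless = {{1, 1}, {1, 1}};     // each subset is as mixed as before
        System.out.println(entropy(before));
        System.out.println(informationGain(before, perfect));
        System.out.println(informationGain(before, useless));
    }
}
```

A split that produces pure subsets recovers all of the original entropy as gain, while a split that leaves every subset as mixed as the original set gains nothing.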

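The information measure for each dimension is computed in MapReduce. AttributeWritable and AttributeGainWritable are custom Writable classes that are not shown in this post.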

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CalcShannonEntMapper extends
        Mapper<LongWritable, Text, Text, AttributeWritable> {

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        super.setup(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds an id, a category, and then name:value tokens.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        Long id = Long.parseLong(tokenizer.nextToken());
        String category = tokenizer.nextToken();
        boolean isCategory = true;
        // Emit one record per attribute, keyed by the attribute name.
        while (tokenizer.hasMoreTokens()) {
            isCategory = false;
            String attribute = tokenizer.nextToken();
            String[] entry = attribute.split(":");
            context.write(new Text(entry[0]), new AttributeWritable(id,
                    category, entry[1]));
        }
        // A line with no attributes is emitted under its category instead.
        if (isCategory) {
            context.write(new Text(category), new AttributeWritable(id,
                    category, category));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        super.cleanup(context);
    }
}
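To see what the mapper emits without running a Hadoop job, the following sketch repeats its parsing logic in plain Java and returns the key/value pairs as strings. The class name, the output string layout, and the attribute names `stock`, `wealth`, and `deposit` are assumptions for illustration; the post does not show the actual input file.

```java
import java.util.*;

class MapperOutputSketch {
    // Mirrors the parsing in map(): returns "key -> (id, category, value)" strings.
    static List<String> emit(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        long id = Long.parseLong(tokenizer.nextToken());
        String category = tokenizer.nextToken();
        List<String> out = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String[] entry = tokenizer.nextToken().split(":");
            out.add(entry[0] + " -> (" + id + ", " + category + ", " + entry[1] + ")");
        }
        // A line with no attributes is emitted under its category.
        if (out.isEmpty()) {
            out.add(category + " -> (" + id + ", " + category + ", " + category + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input line: user 1, class "high", three 0/1 attributes.
        for (String pair : emit("1 high stock:1 wealth:1 deposit:1")) {
            System.out.println(pair);
        }
    }
}
```

Keying by attribute name means that all values of one attribute, across all users, arrive at the same reduce call, which is what lets the reducer score each attribute independently.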

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CalcShannonEntReducer extends
        Reducer<Text, AttributeWritable, Text, AttributeGainWritable> {

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        super.setup(context);
    }

    @Override
    protected void reduce(Text key, Iterable<AttributeWritable> values,
            Context context) throws IOException, InterruptedException {
        String attributeName = key.toString();
        double totalNum = 0.0;
        // attribute value -> (category -> number of records)
        Map<String, Map<String, Integer>> attrValueSplits = new HashMap<String, Map<String, Integer>>();
        Iterator<AttributeWritable> iterator = values.iterator();
        boolean isCategory = false;
        while (iterator.hasNext()) {
            AttributeWritable attribute = iterator.next();
            String attributeValue = attribute.getAttributeValue();
            // The mapper marks a category record by repeating the key as the value.
            if (attributeName.equals(attributeValue)) {
                isCategory = true;
                break;
            }
            Map<String, Integer> attrValueSplit = attrValueSplits
                    .get(attributeValue);
            if (null == attrValueSplit) {
                attrValueSplit = new HashMap<String, Integer>();
                attrValueSplits.put(attributeValue, attrValueSplit);
            }
            String category = attribute.getCategory();
            Integer categoryNum = attrValueSplit.get(category);
            attrValueSplit.put(category, null == categoryNum ? 1
                    : categoryNum + 1);
            totalNum++;
        }
        if (isCategory) {
            System.out.println("is Category");
            // The values of a reduce call can be traversed only once, so the
            // record consumed before the break is counted here.
            int sum = 1;
            while (iterator.hasNext()) {
                iterator.next();
                sum += 1;
            }
            System.out.println("sum: " + sum);
            context.write(key, new AttributeGainWritable(attributeName, sum, true, null));
        } else {
            // gainInfo accumulates the weighted entropy of the subsets after the split.
            double gainInfo = 0.0;
            double splitInfo = 0.0;
            for (Map<String, Integer> attrValueSplit : attrValueSplits.values()) {
                double totalCategoryNum = 0;
                for (Integer categoryNum : attrValueSplit.values()) {
                    totalCategoryNum += categoryNum;
                }
                double entropy = 0.0;
                for (Integer categoryNum : attrValueSplit.values()) {
                    double p = categoryNum / totalCategoryNum;
                    entropy -= p * (Math.log(p) / Math.log(2));
                }
                double dj = totalCategoryNum / totalNum;
                gainInfo += dj * entropy;
                splitInfo -= dj * (Math.log(dj) / Math.log(2));
            }
            // Ratio of the weighted entropy to the split information.
            double gainRatio = splitInfo == 0.0 ? 0.0 : gainInfo / splitInfo;
            StringBuilder splitPoints = new StringBuilder();
            for (String attrValue : attrValueSplits.keySet()) {
                splitPoints.append(attrValue).append(",");
            }
            splitPoints.deleteCharAt(splitPoints.length() - 1);
            context.write(key, new AttributeGainWritable(attributeName,
                    gainRatio, false, splitPoints.toString()));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        super.cleanup(context);
    }
}
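The arithmetic in the reducer can be checked outside Hadoop with a small stand-alone sketch. It takes the same nested count map (attribute value to category to count) and reports the weighted entropy after the split, the split information, and, as an addition not present in the reducer, the information gain and the gain ratio in the usual C4.5 sense (gain divided by split information). The class name, method names, and the sample counts are assumptions for illustration.

```java
import java.util.*;

class GainRatioSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Shannon entropy (base 2) of a collection of positive class counts.
    static double entropy(Collection<Integer> counts) {
        double total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            double p = c / total;
            h -= p * log2(p);
        }
        return h;
    }

    // Returns {weightedEntropy, splitInfo, informationGain, gainRatio}.
    static double[] evaluate(Map<String, Map<String, Integer>> splits) {
        Map<String, Integer> classTotals = new HashMap<>();
        double totalNum = 0;
        for (Map<String, Integer> split : splits.values()) {
            for (Map.Entry<String, Integer> e : split.entrySet()) {
                classTotals.merge(e.getKey(), e.getValue(), Integer::sum);
                totalNum += e.getValue();
            }
        }
        double weighted = 0.0;
        double splitInfo = 0.0;
        for (Map<String, Integer> split : splits.values()) {
            double size = 0;
            for (int c : split.values()) size += c;
            double dj = size / totalNum;            // share of records in this subset
            weighted += dj * entropy(split.values());
            splitInfo -= dj * log2(dj);
        }
        double gain = entropy(classTotals.values()) - weighted;
        double ratio = splitInfo == 0.0 ? 0.0 : gain / splitInfo;
        return new double[] {weighted, splitInfo, gain, ratio};
    }

    // Hypothetical counts: value "1" -> 2 high; value "0" -> 1 middle, 1 low.
    static double[] demo() {
        Map<String, Map<String, Integer>> splits = new HashMap<>();
        Map<String, Integer> ones = new HashMap<>();
        ones.put("high", 2);
        Map<String, Integer> zeros = new HashMap<>();
        zeros.put("middle", 1);
        zeros.put("low", 1);
        splits.put("1", ones);
        splits.put("0", zeros);
        return evaluate(splits);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The entropy of the whole set can be recovered inside one reduce call here only because every record contributes exactly one value for the attribute, so summing the per-value counts gives the overall class distribution.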

Experiment

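We again use the data from the user risk classification example in the previous post to test how well the decision tree algorithm performs. Because the decision tree used here can only branch on yes/no answers, the data from that case were converted to 0/1 values. A sample is shown below: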

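User | Stocks, funds, and precious metals investment | Wealth management product investment | Deposits and money-market investment | Risk class
1 | 1 | 1 | 1 | high
2 | 0 | 1 | 1 | middle
3 | 0 | 0 | 1 | low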
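The post does not say how the original amounts were turned into 0/1 values. One possible conversion, shown here purely as an assumption, is to treat any amount above a threshold as 1 and then write each user as a line of id, category, and name:value tokens, the layout that the mapper's parsing code implies. The class name, attribute names, threshold, and amounts are all hypothetical.

```java
class BinarizeSketch {
    // Assumption: an amount above the threshold means the user holds that asset class.
    static int toFlag(double amount, double threshold) {
        return amount > threshold ? 1 : 0;
    }

    // Builds a line in the "<id> <category> <name:value> ..." layout.
    static String toLine(long id, String category, String[] names, double[] amounts) {
        StringBuilder sb = new StringBuilder();
        sb.append(id).append(' ').append(category);
        for (int i = 0; i < names.length; i++) {
            sb.append(' ').append(names[i]).append(':').append(toFlag(amounts[i], 0.0));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"stock", "wealth", "deposit"};   // hypothetical attribute names
        double[] amounts = {0, 5000, 12000};               // hypothetical amounts
        System.out.println(toLine(2, "middle", names, amounts));
    }
}
```

A single threshold discards how much a user holds, which is one reason a 0/1 encoding can lose accuracy compared with a distance-based method that uses the raw amounts.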

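One set of labeled data was used as training data and another set of unlabeled data as test data. The test results were as follows:

[Figure: test results]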

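The results are highly readable and consistent with business common sense. However, because this decision tree accepts only 0/1 input, the error rate was 13.6%, which is higher than that of KNN. To reduce the error rate further, more dimensions can be added to the decision. For example, wealth management products come in different types, so several dimensions could be added according to product type.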
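For reference, an error rate like the one quoted above is the number of misclassified test records divided by the total number of test records. The sketch below shows that calculation on a few made-up labels; the class name and the sample arrays are assumptions and are not the data behind the 13.6% figure.

```java
class ErrorRateSketch {
    // Fraction of positions where the predicted label differs from the actual one.
    static double errorRate(String[] predicted, String[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int wrong = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (!predicted[i].equals(actual[i])) {
                wrong++;
            }
        }
        return (double) wrong / predicted.length;
    }

    public static void main(String[] args) {
        // Hypothetical labels: one of four predictions is wrong.
        String[] predicted = {"high", "low", "middle", "low"};
        String[] actual = {"high", "low", "low", "low"};
        System.out.println(errorRate(predicted, actual));
    }
}
```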
