
Implementing the Random Forest Algorithm in R (with Detailed Code)

Updated: 2023-05-15 08:16:51


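Random forest is a machine learning algorithm we use often. In this article we fit a random forest model that predicts, from a set of physical measurements, whether an abalone is "old" (defined in the code below as age greater than 10). The abalone data come from the UCI Machine Learning Repository, and we split them into a training set and a test set.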

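Contents:

1. Data preparation (input, data cleaning, etc.)

2. Data splitting (dividing the data into a training set and a test set)

3. Variable selection

4. Model fitting results and evaluation (confusion matrix, ROC curve, etc.)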

First, we load the data into R:

# load the required packages

library(caret)

library(ranger)

library(tidyverse)

library(e1071)

# read in the data

abalone_data <- read.table("../data/abalone.data", sep = ",")

# assign the column names

colnames(abalone_data) <- c("sex", "length", "diameter", "height",

                            "whole.weight", "shucked.weight",

                            "viscera.weight", "shell.weight", "age")

# define the outcome variable: TRUE when age is greater than 10

abalone_data <- abalone_data %>%

  mutate(old = age > 10) %>%

  # remove the "age" variable

  select(-age)

# split the data into a training set and a test set

set.seed(23489)

train_index <- sample(1:nrow(abalone_data), 0.9 * nrow(abalone_data))

abalone_train <- abalone_data[train_index, ]

abalone_test <- abalone_data[-train_index, ]
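The split can be checked on a stand-in data frame. The following is a minimal sketch in base R: `toy_data` and its columns are made up for illustration, with 4177 rows to match the size of the UCI abalone file, and only the `set.seed`/`sample` pattern comes from the code above.

```r
# Hypothetical stand-in for abalone_data (4177 rows, invented columns)
toy_data <- data.frame(id = 1:4177, value = sqrt(1:4177))

# fix the random seed so the same rows are drawn on every run
set.seed(23489)

# draw 90% of the row indices without replacement; the non-integer
# size 0.9 * 4177 = 3759.3 is truncated to 3759 by sample()
train_index <- sample(1:nrow(toy_data), 0.9 * nrow(toy_data))

toy_train <- toy_data[train_index, ]
toy_test <- toy_data[-train_index, ]

# the two sets are disjoint and together cover every row
nrow(toy_train)                               # 3759
nrow(toy_test)                                # 418
length(intersect(toy_train$id, toy_test$id))  # 0
```

Negative indexing with `-train_index` is what guarantees that no row ends up in both sets.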

# remove the original dataset

rm(abalone_data)

# view the first 6 rows of the training data

head(abalone_train)

Running this prints the first six rows of the training data.

Next, we fit the random forest model:

rf_fit <- train(as.factor(old) ~ .,

data = abalone_train,

method = "ranger")

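By default, calling train with no further arguments refits the model on 25 bootstrap samples for each of 3 candidate values of the tuning parameter. For ranger the tuning parameter is mtry, the number of predictors randomly selected as split candidates at each split in a tree. As the output shows, the splitting rule (splitrule) is varied as well.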
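To make the resampling idea concrete, here is a minimal base R sketch of what the bootstrap repetitions look like. It does not call caret, and the names `boot_index`, `oob_index`, and `oob_fraction` are illustrative. Each repetition draws n rows with replacement to fit the model, and the rows that were never drawn (the out-of-bag rows) serve as the hold-out set for estimating accuracy.

```r
set.seed(1)
n <- 3759     # number of training rows in this article
reps <- 25    # number of bootstrap repetitions

oob_fraction <- numeric(reps)
for (i in seq_len(reps)) {
  # draw n row indices with replacement: the bootstrap sample
  boot_index <- sample(n, n, replace = TRUE)
  # rows that were never drawn form the out-of-bag hold-out set
  oob_index <- setdiff(seq_len(n), boot_index)
  oob_fraction[i] <- length(oob_index) / n
}

# on average about exp(-1), roughly 37%, of the rows are out-of-bag
mean(oob_fraction)
```

Because every bootstrap sample has the same size as the training set, the resampling summary lists 3759 for each repetition.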

rf_fit

## Random Forest

## 3759 samples

## 8 predictor

## 2 class: 'FALSE', 'TRUE'

## No pre-processing

## Resampling: Bootstrapped (25 reps)

## Summary of sample sizes: 3759, 3759, 3759, 3759, 3759, 3759, ...

## Resampling results across tuning parameters:

## mtry splitrule Accuracy Kappa

## 2 gini 0.7828887 0.5112202

## 2 extratrees 0.7807373 0.4983028

## 5 gini 0.7750120 0.4958132

## 5 extratrees 0.7806244 0.5077483

## 9 gini 0.7681104 0.4819231

## 9 extratrees 0.7784264 0.5036977

## Tuning parameter 'min.node.size' was held constant at a value of 1

## Accuracy was used to select the optimal model using the largest value.

## The final values used for the model were mtry = 2, splitrule = gini

## and min.node.size = 1.

Testing the model on the independent test set is just as easy with the built-in predict function.

# predict the outcome on a test set

abalone_rf_pred <- predict(rf_fit, abalone_test)

# compare predicted outcome and true outcome

confusionMatrix(abalone_rf_pred, as.factor(abalone_test$old))

## Confusion Matrix and Statistics

## Reference

## Prediction FALSE TRUE

## FALSE 231 52

## TRUE 42 93

## Accuracy : 0.7751

## 95% CI : (0.732, 0.8143)

## No Information Rate : 0.6531

## P-Value [Acc > NIR] : 3.96e-08

## Kappa : 0.4955

## Mcnemar's Test P-Value : 0.3533

## Sensitivity : 0.8462

## Specificity : 0.6414

## Pos Pred Value : 0.8163

## Neg Pred Value : 0.6889

## Prevalence : 0.6531

## Detection Rate : 0.5526

## Detection Prevalence : 0.6770

## Balanced Accuracy : 0.7438

## 'Positive' Class : FALSE
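The headline statistics can be recomputed by hand from the four counts in the confusion matrix. The sketch below uses base R only; the variable names are illustrative, and the formulas are the standard definitions of accuracy, sensitivity, specificity, balanced accuracy, and Cohen's kappa, with FALSE treated as the positive class as in the output above.

```r
# counts copied from the confusion matrix above;
# rows are predictions, columns are the true (reference) classes
cm <- matrix(c(231, 42, 52, 93), nrow = 2,
             dimnames = list(Prediction = c("FALSE", "TRUE"),
                             Reference = c("FALSE", "TRUE")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# the positive class is FALSE, so sensitivity is the share of truly
# FALSE cases predicted FALSE, and specificity the share of truly
# TRUE cases predicted TRUE
sensitivity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])
specificity <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy,
        kappa = cohen_kappa), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Balanced Accuracy, and Kappa lines reported by confusionMatrix.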

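We have now seen how to fit a model, along with the default resampling scheme (the bootstrap) and parameter selection. That is already useful, but caret can do much more.

Pre-processing (preProcess)

caret makes many pre-processing steps easy to implement. Several standalone caret functions target specific problems that can arise when setting up a model. These include:

dummyVars: creates dummy variables from categorical variables with multiple categories

nearZeroVar: identifies zero-variance and near-zero-variance predictors (which can cause problems when subsampling)

findCorrelation: identifies correlated predictors

findLinearCombos: identifies linear dependencies between predictors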
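As a rough illustration of two of these ideas, the sketch below uses base R rather than caret. `model.matrix` builds indicator columns for a categorical variable, the kind of result dummyVars is meant to produce, and `cor` exposes a highly correlated pair of predictors, the kind of thing findCorrelation looks for. The data frame `toy`, its values, and the 0.9 cutoff are all made up for illustration.

```r
# Hypothetical toy data: one three-level factor and one numeric column
toy <- data.frame(
  sex = factor(c("M", "F", "I", "M", "F", "I")),
  length = c(0.45, 0.35, 0.53, 0.44, 0.33, 0.42)
)
# diameter is constructed to be almost perfectly correlated with length
toy$diameter <- 0.8 * toy$length + c(0.001, -0.001, 0.002, 0, -0.002, 0.001)

# one indicator column per level of sex (no intercept column)
dummies <- model.matrix(~ sex - 1, data = toy)
colnames(dummies)

# absolute pairwise correlation between the numeric predictors
cor_matrix <- abs(cor(toy[, c("length", "diameter")]))

# flag the pair if it exceeds an arbitrary cutoff of 0.9
cor_matrix["length", "diameter"] > 0.9
```

Each row of `dummies` contains exactly one 1, marking the level of sex for that row.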

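Besides these individual functions, there is also the preProcess function, which handles more common tasks such as centering and scaling, imputation, and transformation. preProcess takes the data frame to be processed and a method, which can be any of "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", "spatialSign", "corr", "zv", "nzv", and "conditionalX".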

# center and scale the predictors

# identify and remove variables with near zero variance

# perform pca

abalone_no_nzv_pca <- preProcess(select(abalone_train, -old),

                                 method = c("center", "scale", "nzv", "pca"))

abalone_no_nzv_pca

## Created from 3759 samples and 8 variables

## Pre-processing:

## - centered (7)
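The centering, scaling, and PCA steps can be illustrated without caret. The sketch below uses base R's `scale` and `prcomp` on made-up data. It shows the kind of transformation involved and is not a reproduction of what preProcess does internally; the matrix `x` and its columns are hypothetical.

```r
set.seed(42)
# Hypothetical numeric predictors: 100 rows, 3 columns on different scales
x <- cbind(a = rnorm(100, mean = 5, sd = 2),
           b = rnorm(100, mean = -1, sd = 10),
           c = runif(100))

# center each column to mean 0 and scale it to standard deviation 1
x_scaled <- scale(x, center = TRUE, scale = TRUE)
round(colMeans(x_scaled), 10)
apply(x_scaled, 2, sd)

# principal component analysis on the standardized columns
pca <- prcomp(x_scaled)

# share of the total variance carried by each component
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(variance_share)
```

Standardizing first matters here because PCA is driven by variance: without it, column `b` would dominate the components simply because it is measured on a larger scale.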
