
Machine Learning 1: Introduction to scikit-learn (1): The General Steps of Machine Learning (Handwritten Digits Dataset Case Study)

Updated: 2023-05-20 10:01:59


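General steps of machine learning

Introduction to scikit-learn

Installing scikit-learn:

Run the installation command in cmd.

Enable matplotlib so that plots are displayed inline in the notebook.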

%matplotlib inline

import matplotlib.pyplot as plt

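scikit-learn is an open-source machine learning toolkit for Python. It implements almost all mainstream machine learning algorithms and exposes them through a consistent calling interface.

It is built on Python numerical libraries such as NumPy and SciPy and provides efficient algorithm implementations.

In summary, scikit-learn has the following advantages:

(1) Complete documentation: the official documentation is thorough and kept up to date.

(2) Easy-to-use interface: every algorithm follows the same calling conventions, whether it is KNN, K-Means, or PCA.

(3) Broad algorithm coverage: it covers the mainstream machine learning tasks, including regression, classification, clustering, and dimensionality reduction.

That said, scikit-learn does not support distributed computing and is not suited to very large datasets.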
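To make the "consistent interface" point concrete, here is a minimal sketch that fits the three estimators named above (KNN, K-Means, PCA) on the digits data. The parameter values (n_neighbors=3, n_clusters=10, n_components=2) are assumptions chosen only for illustration, not settings from the article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Supervised classifier: fit(X, y), then predict(X).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_labels = knn.predict(X[:5])

# Unsupervised clustering: fit(X), then predict(X).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.predict(X[:5])

# Dimensionality reduction: fit(X), then transform(X).
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X[:5])

print(knn_labels.shape, cluster_ids.shape, X_2d.shape)
```

All three objects are created, fitted, and applied in the same way; only the final method (predict or transform) depends on the kind of task.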

I. General Steps of Machine Learning

1. Load the dataset

We use the digits dataset, a dataset of handwritten digits.

from sklearn.datasets import load_digits
from matplotlib import pyplot as plt

digits = load_digits()
# Pair each 8x8 image with its label
images_and_labels = list(zip(digits.images, digits.target))

plt.figure(figsize=(8, 6), dpi=200)
for index, (image, label) in enumerate(images_and_labels[:8]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')  # turn off the axes
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Digit:%i' % label, fontsize=20)

Notes to aid understanding:

zip() function:

enumerate() function:
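As a small illustration of how zip() and enumerate() behave in the loading code, the sketch below uses short placeholder lists instead of the real image arrays. The names images and labels are stand-ins introduced here, not part of the original code.

```python
# Placeholder data standing in for digits.images and digits.target.
images = ["img_a", "img_b", "img_c"]
labels = [0, 1, 2]

# zip() pairs up elements at the same position from each iterable.
pairs = list(zip(images, labels))
print(pairs)  # [('img_a', 0), ('img_b', 1), ('img_c', 2)]

# enumerate() yields (index, item); the item here is itself a pair,
# so it can be unpacked directly in the for statement.
for index, (image, label) in enumerate(pairs[:2]):
    print(index, image, label)
```

The index supplied by enumerate() is what lets the plotting loop pick the subplot position with index + 1.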

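plt.imshow() processes the image and prepares it for display, but does not render it by itself. It has to be followed by plt.show() for the image to actually appear.

cmap stands for colormap.

plt.cm is the built-in colormap module of the matplotlib library.

plt.cm.[colormap]('[dataset]') applies [colormap] to [dataset].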

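scikit-learn uses NumPy array objects to represent data. All the image data is stored in digits.images, where each element is an 8x8 image. For machine learning, the data needs to be stored as an array of shape (n_samples, n_features). For this handwritten digit recognition case, scikit-learn has already done the conversion for us: the feature data is stored in digits.data, and its shape can be inspected with digits.data.shape.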

print("shape of raw image_data: {0}".format(digits.images.shape))

print("shape of data: {0}".format(digits.data.shape))

Output:

shape of raw image_data: (1797, 8, 8)

shape of data: (1797, 64)
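The relationship between the two shapes can be checked directly: each 8x8 image flattened into a row of 64 values gives the feature matrix. The sketch below is only a verification aid; the variable names flattened and same are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8x8 image into a row vector of 64 pixel values.
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)
print(flattened.shape)  # (1797, 64)

# The flattened images match the ready-made feature matrix.
same = np.array_equal(flattened, digits.data)
print(same)  # True
```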

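2. Split into training and test sets

Before training the model, the dataset has to be split into a training set and a test set. The code below sets aside 20% of the data as the test set and keeps 80% as the training set.

from sklearn.model_selection import train_test_split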

X_train,X_test,y_train,y_test=train_test_split(digits.data,digits.target,test_size=0.20,random_state=2)
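A quick way to confirm what the split produced is to look at the resulting shapes. The sketch below repeats the same call and prints them; the stratify argument in the second call is an optional extra that is not used in the article, shown only as a possible variation that keeps class proportions similar in both parts.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same call as in the text: 20% of 1797 samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=2)
print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)

# Optional variation (not in the original): stratified split.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=2,
    stratify=digits.target)
print(Xs_train.shape, Xs_test.shape)
```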

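3. Train the model

Once we have separate training and test sets, we can train a machine learning model with the fit method. We will use the score method to test the model, and compare models by their accuracy.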

from sklearn.linear_model import LogisticRegression

# Compute the accuracy score of logistic regression

clf=LogisticRegression(solver='lbfgs',multi_class='ovr',max_iter=5000,random_state=42)

clf.fit(X_train,y_train)

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=5000, multi_class='ovr',
                   n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
                   tol=0.0001, verbose=0, warm_start=False)

4. Test the model

accuracy=clf.score(X_test,y_test)

print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__,accuracy))

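Output:

Accuracy score of the LogisticRegression is 0.94

In addition, we can display some of the images from the test set directly, with the predicted value in the lower-left corner of each image and the true value in the lower-right corner.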

y_pred = clf.predict(X_test)

fig, axes = plt.subplots(4, 4, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap=plt.cm.gray_r, interpolation='nearest')
    # Predicted value, lower left: green if correct, red if wrong
    ax.text(0.05, 0.05, str(y_pred[i]), fontsize=32, transform=ax.transAxes,
            color='green' if y_pred[i] == y_test[i] else 'red')
    # True value, lower right
    ax.text(0.8, 0.05, str(y_test[i]), fontsize=32, transform=ax.transAxes, color='black')
    ax.set_xticks([])
    ax.set_yticks([])

Output:

As the figure shows, the first image in the second row is misclassified: its predicted value differs from the true value.
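Besides inspecting the figure, the misclassified test samples can also be listed programmatically. The following is a minimal sketch, assuming the same split as in the text. It uses LogisticRegression with its default multiclass handling rather than the multi_class='ovr' setting from the article, so the exact count may differ slightly from the figure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=2)

# Assumption: default multiclass setting, not 'ovr' as in the article.
clf = LogisticRegression(max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Indices of test samples whose prediction differs from the true label.
wrong = np.flatnonzero(y_pred != y_test)
print("misclassified:", len(wrong), "of", len(y_test))
for i in wrong[:5]:
    print(f"index {i}: predicted {y_pred[i]}, true {y_test[i]}")
```

The accuracy reported by clf.score is one minus the fraction of misclassified samples found this way.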

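5. Saving and loading the model

Once we are satisfied with the model's accuracy, we can save it. The next time we need predictions, we can load the saved model directly instead of training it again.

The model can be saved with the following code:

import joblib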

joblib.dump(clf,'digits_svm.pkl')

Output:

['digits_svm.pkl']

When we need the model for prediction, we simply load it and predict.

clf2=joblib.load('digits_svm.pkl')

clf2.score(X_test,y_test)

Output:

0.9361111111111111
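One way to convince yourself that a reloaded model behaves like the original is to compare their predictions. The sketch below is an illustration only: it writes to a temporary directory, and the file name digits_model.pkl is a hypothetical name chosen here, not the one used in the article.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
clf = LogisticRegression(max_iter=5000, random_state=42).fit(X, y)

# Save and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "digits_model.pkl")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should give exactly the same predictions.
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)  # True
```

Saved models are generally best loaded with the same scikit-learn version that produced them.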

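6. Easily swapping the model

The model APIs in scikit-learn are used in much the same way. We can therefore easily replace the logistic regression classifier LogisticRegression with the random forest classifier RandomForestClassifier. The changes are small and only concern how the classifier instance is created.

from sklearn.ensemble import RandomForestClassifier  # RandomForestClassifier easily replaces the LogisticRegression classifier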

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)

print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the RandomForestClassifier is 0.96
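Because only the line that creates the classifier changes, several models can be compared in a single loop. The sketch below is one way to do this, assuming the same split as earlier. KNeighborsClassifier and the scores dictionary are additions made here for illustration, and the printed accuracies may differ from the values quoted in the article.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=2)

# Only the construction differs; fit and score are called identically.
models = [
    LogisticRegression(max_iter=5000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    KNeighborsClassifier(n_neighbors=3),  # extra model, not in the article
]

scores = {}
for model in models:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, acc in scores.items():
    print('Accuracy score of the {} is {:.2f}'.format(name, acc))
```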

Exercise

Load the breast cancer dataset. Import the load_breast_cancer function from sklearn.datasets.

from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

Use train_test_split to split the dataset, with 70% for the training set and 30% for the test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(breast.data,breast.target,test_size=0.30,random_state=2)

Train on the data with a gradient boosting classifier.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)

clf.fit(X_train, y_train)

Output:

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto', random_state=0,
                           subsample=1.0, tol=0.0001, validation_fraction=0.1,
                           verbose=0, warm_start=False)

Use the classifier to predict the class labels of the test set.

y_pred = clf.predict(X_test)

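You need to import balanced_accuracy_score from sklearn.metrics to compute the accuracy on the test set.

from sklearn.metrics import balanced_accuracy_score

from sklearn.metrics import accuracy_score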

accuracy = balanced_accuracy_score(y_test, y_pred)


# accuracy = accuracy_score(y_test, y_pred)

print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the GradientBoostingClassifier is 0.94

# Accuracy score of the GradientBoostingClassifier is 0.95

There are three ways to score predictions:

(1) model.score(X_test, y_test)

(2) accuracy_score(y_test, y_pred)

(3) balanced_accuracy_score(y_test, y_pred)
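The three methods do not always give the same number. For a classifier, model.score returns the same plain accuracy as accuracy_score, while balanced_accuracy_score averages the recall of each class, which matters when classes are imbalanced. The sketch below uses a tiny hand-made label set and a DummyClassifier purely for illustration; neither appears in the article.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hand-made, imbalanced labels: four samples of class 0, two of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)           # 5 of 6 correct
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
per_class_recall = recall_score(y_true, y_pred, average=None)
print(acc, bal, per_class_recall)

# For a classifier, score(X, y) is accuracy_score(y, predict(X)).
X = np.zeros((6, 1))  # placeholder features
dummy = DummyClassifier(strategy="most_frequent").fit(X, y_true)
score_matches = dummy.score(X, y_true) == accuracy_score(y_true, dummy.predict(X))
print(score_matches)
```

Here class 0 has recall 1.0 and class 1 has recall 0.5, so the balanced accuracy is 0.75 while the plain accuracy is about 0.83.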

(End.)

Published: 2023-05-20 10:01:59

Article link: https://www.wtabcd.cn/fanwen/fan/78/706060.html
