2.1 How to Evaluate a Model: Study Notes

Updated: 2023-07-05 00:36:19

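I. Judging Model Quality

1. Iris train/test split

The Iris dataset is one of the most commonly used datasets in the UCI repository. We can load it directly and take a first look at the data: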

import numpy as np

from sklearn import datasets

import matplotlib.pyplot as plt

iris = datasets.load_iris()

X = iris.data

y = iris.target

X.shape

(150, 4)

y.shape

(150,)

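Next we shuffle the dataset. The features and labels are stored in separate arrays, however, so shuffling each one independently would destroy the correspondence between a sample and its label. There are two ways around this:

Concatenate X and y into a single matrix, shuffle the matrix, then split it apart again.

Shuffle an array of indices, then use the same shuffled indices to select rows from both X and y.

# Method 1

# np.concatenate requires the arrays to have matching shapes along every axis except the one being joined,

# so the labels are reshaped first: reshape(-1, 1) means "infer the number of rows, 1 column". axis=1 appends y to X as an extra column.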

tempConcat = np.concatenate((X, y.reshape(-1,1)), axis=1)

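# Once concatenated, shuffle the rows in place

np.random.shuffle(tempConcat)

# Split the shuffled array back into features (first 4 columns) and labels (last column);

# note that the labels come back as a float column of shape (n, 1)

shuffle_X, shuffle_y = np.split(tempConcat, [4], axis=1)

# Fraction of the data held out for testing

test_ratio = 0.2

test_size = int(len(X) * test_ratio)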

X_train = shuffle_X[test_size:]

y_train = shuffle_y[test_size:]

X_test = shuffle_X[:test_size]

y_test = shuffle_y[:test_size]

print(X_train.shape)

print(X_test.shape)

print(y_train.shape)

print(y_test.shape)

(120, 4)

(30, 4)

(120, 1)

(30, 1)

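# Method 2

# np.random.permutation(len(X)) returns a new array holding the integers 0..len(X)-1 in random order.

# Note that its elements are shuffled indices, not the original data.

shuffle_index = np.random.permutation(len(X))

# Fraction of the data held out for testing

test_ratio = 0.2

test_size = int(len(X) * test_ratio)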

test_index = shuffle_index[:test_size]

train_index = shuffle_index[test_size:]

X_train = X[train_index]

X_test = X[test_index]

y_train = y[train_index]

y_test = y[test_index]

print(X_train.shape)

print(X_test.shape)

print(y_train.shape)

print(y_test.shape)

(120, 4)

(30, 4)

(120,)

(30,)

2. Writing our own train_test_split

# Usage

from model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)

print(X_test.shape)

print(y_train.shape)

print(y_test.shape)

(120, 4)

(30, 4)

(120,)

(30,)
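The source of the local model_selection module is not reproduced in these notes. The following is only a minimal sketch of what such a function might look like, built on the index-permutation approach of method 2. The parameter names test_ratio and seed, and the stand-in arrays X_demo and y_demo, are assumptions made for illustration.

```python
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into train and test parts using one shared index permutation."""
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must lie in [0, 1]"

    # A local generator keeps the split reproducible without touching global state
    rng = np.random.default_rng(seed)
    shuffle_index = rng.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]

    # Same return order as in the notes: X_train, X_test, y_train, y_test
    return X[train_index], X[test_index], y[train_index], y[test_index]


# Stand-in data with the same shapes as Iris: 150 samples, 4 features.
# Row i of X_demo starts with the value 4 * i, so the pairing is easy to check.
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.arange(150)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_ratio=0.2, seed=666)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Every feature row still sits next to its own label after the split
print(np.array_equal(X_tr[:, 0], 4 * y_tr))
```

Because both X and y are indexed with the same permutation, the sample-to-label correspondence survives the shuffle, which is the whole point of the approach.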

As a quick check, pass X_train and y_train to the algorithm through fit, then predict on X_test to obtain y_predict:

from kNN import kNNClassifier

my_kNNClassifier = kNNClassifier(k=3)

my_kNNClassifier.fit(X_train, y_train)

y_predict = my_kNNClassifier.predict(X_test)

y_predict

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2])

y_test

array([1, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0,

2, 0, 1, 2, 2, 1, 0, 0])

# Comparing the two vectors yields a boolean vector; summing it counts the matches, since True counts as 1 and False as 0

sum(y_predict == y_test)

sum(y_predict == y_test)/len(y_test)

0.23333333333333334
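The arithmetic behind this figure can be replayed directly from the two arrays printed above. The sketch below copies y_predict and y_test from the output shown in the notes; the names matches and n_correct are introduced only for illustration.

```python
import numpy as np

# Values copied from the output above: every prediction is class 2
y_predict = np.full(30, 2)
y_test = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0,
                   2, 0, 1, 2, 2, 1, 0, 0])

# Element-wise comparison gives a boolean array; True counts as 1 when summed
matches = (y_predict == y_test)
n_correct = int(np.sum(matches))
accuracy = n_correct / len(y_test)

print(n_correct)   # 7
print(accuracy)    # 0.23333333333333334
```

Since the classifier predicted class 2 everywhere, it is correct only on the 7 test samples whose true label is 2, which gives 7/30.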

3. train_test_split in sklearn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

print(X_train.shape)

print(X_test.shape)

print(y_train.shape)

print(y_test.shape)

(120, 4)

(30, 4)

(120,)

(30,)
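The random_state argument fixes the shuffle, so the same value should reproduce the same split on every call. A small sketch to confirm this on Iris; the names split_a, split_b and same are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=666)
split_b = train_test_split(X, y, test_size=0.2, random_state=666)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```

Fixing the seed is useful when comparing models, because any difference in scores then comes from the model and not from a different split.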

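II. Classification Accuracy

Accuracy is widely used because it is clearly defined and simple to compute. It is not always the best tool for evaluating a model, though: in some situations, metrics such as precision and recall measure a model's performance better than accuracy does.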
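One situation where accuracy can mislead is a heavily imbalanced dataset. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical classifier that always predicts the negative class; none of these numbers come from the notes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up, imbalanced ground truth: 95 samples of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)

# A useless classifier that never predicts the positive class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# No positive predictions exist, so precision is undefined; report it as 0
prec = precision_score(y_true, y_pred, zero_division=0)

print(acc, rec, prec)
```

Accuracy comes out at 0.95 even though the classifier finds none of the positive samples, while recall for the positive class is 0.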

1. Data exploration

import numpy as np

import matplotlib

import matplotlib.pyplot as plt

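from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

# The handwritten digits dataset comes wrapped in an object that behaves much like a dictionary

digits = datasets.load_digits()

# Use keys() to see what the dataset contains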

digits.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

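The description provided by sklearn.datasets:

# The original dataset has 5620 images; each image has 64 pixels, i.e. 64 features (an 8x8 image of integer pixels), each feature takes a value from 0 to 16, and the labels are the 10 digits. The copy shipped with sklearn is only a subset.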

print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset

--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 5620

:Number of Attributes: 64

:Attribute Information: 8x8 image of integer pixels in the range 0..16.

:Missing Attribute Values: None

:Creator: E. Alpaydin (alpaydin '@' )

:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets

archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where

each class refers to a digit.

Preprocessing programs made available by NIST were used to extract normalized bitmaps of handwritten digits from a preprinted form. From a

total of 43 people, 30 contributed to the training set and different 13

to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of

4x4 and the number of on pixels are counted in each block. This generates

an input matrix of 8x8 where each element is an integer in the range

0..16. This reduces dimensionality and gives invariance to small

distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.

T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.

L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994.

.. topic:: References

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their

Applications to Handwritten Digit Recognition, MSc Thesis, Institute of

Graduate Studies in Science and Engineering, Bogazici University.

E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.

- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.

Linear dimensionalityreduction using relevance weighted LDA. School of

Electrical and Electronic Engineering Nanyang Technological University.

2005.

- Claudio Gentile. A New Approximate Maximal Margin Classification

Algorithm. NIPS. 2000.

# Shape of the features

X = digits.data

X.shape

(1797, 64)

# Shape of the labels

y = digits.target

y.shape

(1797,)
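DESCR reports 5620 instances, while the shapes above show 1797 samples, so it is worth checking the bundled copy directly rather than relying on the description. A short sketch; the names n_samples, n_features, pixel_min and pixel_max are illustrative.

```python
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data

# Number of samples and features actually shipped with sklearn
n_samples, n_features = X.shape

# Observed range of the pixel values
pixel_min, pixel_max = X.min(), X.max()

print(n_samples, n_features)   # 1797 64
print(pixel_min, pixel_max)    # 0.0 16.0
```

The observed pixel range agrees with the "0..16" stated under Attribute Information.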

# Label classes

digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Pick one sample and look at its features and label

some_digit = X[666]

some_digit

array([ 0., 0., 5., 15., 14., 3., 0., 0., 0., 0., 13., 15., 9.,

15., 2., 0., 0., 4., 16., 12., 0., 10., 6., 0., 0., 8.,

16., 9., 0., 8., 10., 0., 0., 7., 15., 5., 0., 12., 11.,

0., 0., 7., 13., 0., 5., 16., 6., 0., 0., 0., 16., 12.,

15., 13., 1., 0., 0., 0., 6., 16., 12., 2., 0., 0.])

y[666]

# This sample can also be visualized

some_digit_image = some_digit.reshape(8, 8)

plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)

plt.show()

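2. Implementing classification accuracy ourselves

Once the classification task has finished, we can compute the accuracy of the classifier: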

X_train, X_test, y_train, y_test = train_test_split(X, y)

knn_clf = KNeighborsClassifier(n_neighbors=3)

knn_clf.fit(X_train, y_train)

y_predict = knn_clf.predict(X_test)

# Check whether y_predict agrees with y_test

sum(y_predict == y_test)/len(y_test)

0.9844444444444445

Add a metrics.py file to the project to hold the various performance metrics, with each one wrapped in a function.

shape(-1,1)

y_test.shape[0]== y_predict.shape[0]

True

# Usage

from metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.9844444444444445
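The body of metrics.py is not shown in these notes. A minimal sketch of an accuracy_score function, assuming it only has to check that the two label vectors have the same length and return the fraction of matches, might look like this. Flattening the inputs with ravel is an added assumption, so that an (n, 1) label column such as the one produced by method 1 is handled as well.

```python
import numpy as np


def accuracy_score(y_true, y_predict):
    """Fraction of positions where y_predict equals y_true."""
    # Flatten both inputs so that (n,) and (n, 1) labels compare element by element
    y_true = np.asarray(y_true).ravel()
    y_predict = np.asarray(y_predict).ravel()

    assert y_true.shape[0] == y_predict.shape[0], \
        "y_true and y_predict must have the same number of samples"

    return np.sum(y_true == y_predict) / len(y_true)


# Three of the four predictions match
acc_demo = accuracy_score([0, 1, 2, 2], [0, 1, 1, 2])
print(acc_demo)   # 0.75

# A column vector of labels is handled too
acc_column = accuracy_score(np.array([[0], [1]]), [0, 1])
print(acc_column)  # 1.0
```

Without the flattening step, comparing an (n, 1) array with an (n,) array would broadcast to an (n, n) matrix and give a wrong count.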

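So far we used the classifier to produce the predictions y_predict and then compared them with the true labels. Sometimes, however, we are not interested in the predictions themselves, only in the model's accuracy, so the kNN model additionally wraps this in a score function.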

knn_clf.score(X_test, y_test)
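The call above uses sklearn's classifier; the hand-written kNN class and its score method are not reproduced in these notes. The sketch below shows one way such a wrapper could be written. SimpleKNN, its internals, and the toy data are all hypothetical and are not the kNNClassifier imported earlier.

```python
import numpy as np
from collections import Counter


class SimpleKNN:
    """Hypothetical minimal kNN classifier, written only to illustrate score()."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # kNN has no real training step: it simply memorizes the data
        self._X = np.asarray(X_train, dtype=float)
        self._y = np.asarray(y_train)
        return self

    def _predict_one(self, x):
        # Euclidean distance from x to every training sample
        distances = np.sqrt(np.sum((self._X - x) ** 2, axis=1))
        nearest = np.argsort(distances)[: self.k]
        # Majority vote among the k nearest labels
        votes = Counter(self._y[nearest].tolist())
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return np.array([self._predict_one(x) for x in np.asarray(X, dtype=float)])

    def score(self, X_test, y_test):
        # Predict internally and return the accuracy, so callers never touch y_predict
        y_predict = self.predict(X_test)
        return np.sum(y_predict == np.asarray(y_test)) / len(y_test)


# Toy data: two well separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SimpleKNN(k=3).fit(X_toy, y_toy)

X_new = np.array([[0.05, 0.05], [5.05, 5.05]])
y_new = np.array([0, 1])
toy_score = clf.score(X_new, y_new)
print(toy_score)   # 1.0
```

Keeping predict and score separate means the accuracy computation lives in one place while callers who only need the number can skip the intermediate y_predict.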
