2.1 How to Evaluate a Model: Study Notes

Updated: 2023-06-17 09:30:43

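I. Judging How Good a Model Is
1. Iris train/test split
The iris dataset is a commonly used dataset from the UCI repository. We can load it directly and do some initial exploration: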
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape
(150, 4)
y.shape
(150,)
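Next we shuffle the dataset. The features and labels are stored separately, though, so shuffling each one independently would destroy the correspondence between them. There are two ways around this:
Method 1: merge X and y into one matrix, shuffle the matrix, then split it apart again.
Method 2: shuffle an array of indices and use the same shuffled indices to select rows from both X and y.
# Method 1
# Use concatenate to join the arrays. The arrays passed in must have the same number of dimensions,
# so reshape the labels first: reshape(-1, 1) means one column with the row count inferred. axis=1 joins them side by side (column-wise).
tempConcat = np.concatenate((X, y.reshape(-1,1)), axis=1)
# After concatenating, shuffle the rows in place
np.random.shuffle(tempConcat)
# Split the shuffled array back apart with split
shuffle_X,shuffle_y = np.split(tempConcat,[4], axis=1)
# Set the split ratio
test_ratio = 0.2
test_size = int(len(X) * test_ratio)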
X_train = shuffle_X[test_size:]
y_train = shuffle_y[test_size:]
X_test = shuffle_X[:test_size]
y_test = shuffle_y[:test_size]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120, 1)
(30, 1)
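# Method 2
# permutation(len(X)) returns a new array of len(X) integers in random order.
# Note that its elements are shuffled indices, not the original data.
shuffle_index = np.random.permutation(len(X))
# Set the proportion of test data
test_ratio = 0.2
test_size = int(len(X) * test_ratio)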
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]
X_train = X[train_index]
X_test = X[test_index]
y_train = y[train_index]
y_test = y[test_index]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
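The index-based approach can be checked on toy data where misalignment would be obvious. The sketch below is only an illustration: the arrays `X_demo` and `y_demo`, the seed 42, and the use of `np.random.default_rng` (instead of the global `np.random` functions used above) are assumptions, not part of the original notes.

```python
import numpy as np

# Toy data: row i of X_demo is [i, i] and its label is i,
# so any broken pairing between features and labels is easy to spot.
X_demo = np.arange(10).repeat(2).reshape(10, 2)
y_demo = np.arange(10)

# A seeded generator makes the shuffle reproducible (the seed is arbitrary).
rng = np.random.default_rng(42)
shuffle_index = rng.permutation(len(X_demo))

# Indexing both arrays with the same permutation keeps rows and labels paired.
X_shuffled = X_demo[shuffle_index]
y_shuffled = y_demo[shuffle_index]

aligned = bool(np.all(X_shuffled[:, 0] == y_shuffled))
print(aligned)
```

Because a single permutation drives both selections, the pairing survives no matter what order the generator produces.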
2. Writing our own train_test_split
# Usage
from model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
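The notes import `train_test_split` from a local `model_selection` module but do not show its source. A minimal sketch of what such a function might look like, following the index-shuffling method from the previous section, is given below. The parameter names `test_ratio` and `seed` and the zero-filled demo arrays are assumptions made for illustration.

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into train and test parts using one shared permutation.

    Hypothetical sketch; the signature is an assumption, not the notes' code.
    """
    assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be within [0, 1]"

    # seed=None gives a different shuffle on every call.
    rng = np.random.default_rng(seed)
    shuffle_index = rng.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]
    return X[train_index], X[test_index], y[train_index], y[test_index]

# Stand-in arrays with the same shapes as the iris data.
X_demo = np.zeros((150, 4))
y_demo = np.zeros(150, dtype=int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_ratio=0.2, seed=666)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With 150 samples and a ratio of 0.2 this yields the same 120/30 split shapes as the output shown above.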
As a quick check, pass X_train and y_train to the algorithm through fit, then predict on X_test to get y_predict.
from kNN import kNNClassifier
my_kNNClassifier = kNNClassifier(k=3)
my_kNNClassifier.fit(X_train, y_train)
y_predict = my_kNNClassifier.predict(X_test)
y_predict
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2])
y_test
array([1, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0,
2, 0, 1, 2, 2, 1, 0, 0])
# Comparing two vectors returns a boolean vector; summing it counts the matches (True = 1, False = 0)
sum(y_predict == y_test)
7
sum(y_predict == y_test)/len(y_test)
0.23333333333333334
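The `kNNClassifier` class imported from the local `kNN` module is not shown in the notes. The sketch below is one plausible shape for it, assuming Euclidean distance, majority voting, and 1-D label arrays; the internals are assumptions, and only the `k`, `fit`, and `predict` names come from the calls above. For reference, kNN normally scores well above 0.9 on iris, so the all-2 predictions and the 0.23 accuracy shown above probably reflect a problem in that particular run rather than in the method.

```python
import numpy as np
from collections import Counter

class kNNClassifier:
    """Hypothetical minimal kNN classifier (Euclidean distance, majority vote)."""

    def __init__(self, k):
        assert k >= 1, "k must be at least 1"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # kNN has no real training step: it just stores the data.
        assert X_train.shape[0] == y_train.shape[0]
        assert self.k <= X_train.shape[0]
        self._X_train = X_train
        self._y_train = y_train  # expected to be 1-D, shape (n,)
        return self

    def predict(self, X_predict):
        assert self._X_train is not None, "call fit before predict"
        return np.array([self._predict_one(x) for x in X_predict])

    def _predict_one(self, x):
        # Distance from x to every training sample.
        distances = np.sqrt(np.sum((self._X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)[: self.k]
        votes = Counter(self._y_train[nearest].tolist())
        return votes.most_common(1)[0][0]

# Two well-separated clusters, so the expected labels are unambiguous.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])
clf = kNNClassifier(k=3).fit(X_demo, y_demo)
preds = clf.predict(np.array([[0.2, 0.2], [5.5, 5.5]]))
print(preds)  # [0 1]
```

Note that this sketch needs 1-D labels: the `(120, 1)` label array produced by Method 1 earlier would have to be flattened first.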
3. train_test_split in sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
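II. Classification accuracy
Accuracy is used very often because its definition is clear and it is simple to compute. In some situations, however, it is not the best tool for evaluating a model:
metrics such as precision and recall can measure a model's performance better than accuracy in certain settings.
1. Data exploration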
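A small worked example shows one such situation. The numbers below are made up for illustration (95 negative samples, 5 positive samples, and a degenerate model that always predicts the negative class); they are not from the notes.

```python
import numpy as np

# Imbalanced toy labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_true == y_pred)

# Recall = TP / (TP + FN): the share of real positives the model finds.
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The model looks strong by accuracy alone, yet it never finds a single positive sample, which recall exposes immediately.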
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
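from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Handwritten digits dataset; the returned object behaves much like a dictionary
digits = datasets.load_digits()
# Use the keys() method to see what the dataset contains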
digits.keys()
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
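Dataset description provided by sklearn.datasets:
# The full UCI set has 5620 images; sklearn ships only part of it (1797 samples, see X.shape below).
# Each image has 64 pixel features (an 8x8 grid of integer pixels in the range 0..16), and the target is one of 10 digits.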
print(digits.DESCR)
.. _digits_dataset:
Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' )
:Date: July; 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469, 1994.
.. topic:: References
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
# Shape of the features
X = digits.data
X.shape
(1797, 64)
# Shape of the labels
y = digits.target
y.shape
(1797,)
# The label classes
digits.target_names
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Pick one specific sample and look at its features and label
some_digit = X[666]
some_digit
array([ 0.,  0.,  5., 15., 14.,  3.,  0.,  0.,  0.,  0., 13., 15.,  9.,
15.,  2.,  0.,  0.,  4., 16., 12.,  0., 10.,  6.,  0.,  0.,  8.,
16.,  9.,  0.,  8., 10.,  0.,  0.,  7., 15.,  5.,  0., 12., 11.,
0.,  0.,  7., 13.,  0.,  5., 16.,  6.,  0.,  0.,  0., 16., 12.,
15., 13.,  1.,  0.,  0.,  0.,  6., 16., 12.,  2.,  0.,  0.])
y[666]
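# We can also visualize this sample
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.show()
2. Implementing classification accuracy ourselves
Once the classification task is done, we can compute the accuracy of the classifier.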
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
# Check whether y_predict matches y_test
sum(y_predict == y_test)/len(y_test)
0.9844444444444445
Add a metrics.py file to the project to hold the various performance metrics, each wrapped as a function.
y_test.reshape(-1,1)
y_predict.reshape(-1,1)
y_test.shape[0] == y_predict.shape[0]
True
# Usage
from metrics import accuracy_score
accuracy_score(y_test, y_predict)
0.9844444444444445
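The body of `accuracy_score` in metrics.py is not shown in the notes. A minimal sketch consistent with the call above might look like the following; the flattening step with `ravel()` is an added assumption, included so that a `(n,)` array and a `(n, 1)` array are not broadcast against each other into an `(n, n)` comparison.

```python
import numpy as np

def accuracy_score(y_true, y_predict):
    """Fraction of predictions that match the true labels (hypothetical sketch)."""
    # Flatten both inputs so (n,) and (n, 1) arrays compare element by element.
    y_true = np.asarray(y_true).ravel()
    y_predict = np.asarray(y_predict).ravel()
    assert y_true.shape[0] == y_predict.shape[0], \
        "y_true and y_predict must have the same length"
    return np.sum(y_true == y_predict) / len(y_true)

# Three of the four predictions match.
demo_score = accuracy_score([1, 0, 2, 2], [1, 0, 1, 2])
print(demo_score)  # 0.75
```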
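The classifier first produces the predictions y_predict, and we then compare them with the true labels. Sometimes, though, we do not care about the predictions themselves and only want the model's accuracy, so we wrap one more function, score, into the kNN model.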
knn_clf.score(X_test, y_test)
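One way to add such a `score` method to a hand-written classifier is sketched below: predict internally, then return the accuracy. The `ScoreMixin` and `ThresholdClassifier` names are hypothetical, and the toy classifier only stands in for a kNN class so the example stays self-contained.

```python
import numpy as np

class ScoreMixin:
    """Hypothetical helper: adds score() to any class that defines predict()."""

    def score(self, X_test, y_test):
        # Predict internally so the caller never has to handle y_predict.
        y_predict = self.predict(X_test)
        return np.sum(np.asarray(y_test) == y_predict) / len(y_test)

class ThresholdClassifier(ScoreMixin):
    """Toy stand-in for a kNN class: predicts 1 when the first feature exceeds 0.5."""

    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0.5).astype(int)

clf = ThresholdClassifier()
X_demo = np.array([[0.1], [0.9], [0.7], [0.2]])
y_demo = np.array([0, 1, 0, 0])
demo_score = clf.score(X_demo, y_demo)
print(demo_score)  # 0.75
```

In scikit-learn, the `score` method of classifiers such as `KNeighborsClassifier` likewise returns the mean accuracy on the given test data and labels.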
