
Cross-validation with python + sklearn (using cross-validation for data splitting, model evaluation and parameter estimation, using ...)

Updated: 2023-07-18 08:15:37



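I. Background

First, an overview of the common methods for evaluating a model on a dataset:

1. Hold-out method

The hold-out method splits the dataset D into two mutually exclusive sets: one serves as the training set S and the other as the test set T, so that D = S ∪ T and S ∩ T = ∅. A model is trained on S and then evaluated on T; the resulting test error is used as an estimate of the generalization error.

When sampling, the split should keep the data distribution consistent. In a classification task, for example, a sampling scheme that preserves the class proportions is called "stratified sampling".

In addition, the result of a single hold-out split is often not stable or reliable enough. In practice the hold-out method is usually repeated over several random splits, and the average of the repeated evaluations is reported as the result.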
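The repeated hold-out procedure can be sketched as follows. This is a minimal illustration, not the author's code: the synthetic data from make_classification, the 70/30 split, the ten repetitions and the decision tree model are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the dataset D (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0)

holdout_scores = []
for seed in range(10):
    # stratify=y keeps the class ratio the same in S (train) and T (test)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# The average over the repeated splits is the hold-out estimate
mean_score = float(np.mean(holdout_scores))
print(f"mean accuracy over {len(holdout_scores)} hold-out splits: {mean_score:.3f}")
```

Each repetition uses a different random_state, so the ten splits differ from one another while the whole experiment stays reproducible.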

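2. Bootstrapping

This method is built on bootstrap sampling. (Bagging is derived from bootstrap sampling, which in turn leads to the sampling scheme used by random forests; that is not covered in detail here.)

Each time, one sample is picked at random from D and a copy of it is placed into D'; the sample is then put back into the original dataset D, so that it can be drawn again in a later round. Repeating this m times yields a dataset D' containing m samples, which is the result of bootstrap sampling.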
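A small sketch of bootstrap sampling with NumPy, under the assumption that D is simply the index range 0..m-1. It also measures the fraction of samples that are never drawn; for large m this is known to approach 1/e (about 0.368), and those samples are commonly used as a test set.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
m = 10_000
D = np.arange(m)                 # stand-in for the original dataset D (assumption)

# Draw m indices with replacement: the same sample may be picked several times
boot_idx = rng.integers(0, m, size=m)
D_prime = D[boot_idx]

# Samples that were never drawn ("out-of-bag")
oob_mask = np.ones(m, dtype=bool)
oob_mask[boot_idx] = False
oob_fraction = float(oob_mask.mean())

print(f"size of D': {len(D_prime)}, never-drawn fraction: {oob_fraction:.3f}")
```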

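3. Cross-validation

Cross-validation is usually called k-fold cross-validation. It first splits the dataset D into k mutually exclusive subsets of similar size, i.e. D = D1 ∪ D2 ∪ … ∪ Dk with Di ∩ Dj = ∅ (i ≠ j). Each subset Di should preserve the data distribution as far as possible, which means it is obtained from D by stratified sampling. In each of k rounds, the union of k-1 subsets is used as the training set and the remaining subset as the test set, and the k test results are averaged.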
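To make the k-fold idea concrete, here is a minimal sketch with StratifiedKFold on a toy label vector (15 samples of class 0 and 5 of class 1, an assumption for illustration). It checks that every sample lands in a test fold exactly once and that each test fold keeps the class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((20, 2))                 # feature values are irrelevant for the split itself
y = np.array([0] * 15 + [1] * 5)      # toy labels with a 3:1 class ratio

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(y), dtype=int)   # how often each sample is used for testing
fold_class1 = []                            # number of class-1 samples per test fold
for train_idx, test_idx in skf.split(X, y):
    test_counts[test_idx] += 1
    fold_class1.append(int(y[test_idx].sum()))

print("times each sample was in a test fold:", test_counts)
print("class-1 samples per test fold:", fold_class1)
```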

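10-fold cross-validation is the most commonly used variant.

[Figure: schematic of 10-fold cross-validation (hand-drawn in the original)]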

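Summary:

Bootstrapping is useful when the dataset is small and hard to split effectively into training and test sets. It can also generate many different training sets from the original dataset, which is a big benefit for ensemble methods (random forests use it). However, the datasets produced by bootstrapping have a different distribution from the original dataset, which introduces estimation bias. So when enough data is available, the hold-out method and cross-validation are used more often.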

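II. Splitting data with cross-validation

A note up front: in code, the classes mentioned in this part are simply instantiated and passed along as an argument.

Now, finally, back to the topic of the title. sklearn provides many cross-validation classes for splitting data:

KFold, GroupKFold, StratifiedKFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit, PredefinedSplit, TimeSeriesSplit. (There is also the train_test_split helper function; as I understand it, it performs a single random split and behaves like one StratifiedShuffleSplit split when the stratify argument is passed.)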

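Categories:

By principle:

1. K-fold cross-validation: KFold, GroupKFold, StratifiedKFold, RepeatedKFold

2. Leave-one-out: LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut. (Leave-one-out is a special case of k-fold cross-validation: if the data has m samples and k equals m, you get leave-one-out, where every single sample serves as its own test set. The method is considered to give very good estimates, but with a lot of data, say 1 million samples, it requires training 1 million models, which is very expensive. The "no free lunch" principle applies to evaluation methods too.)

3. Random splitting: ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit

By application:

1. For classification data the target may be unevenly distributed; in medical data, for example, far fewer people have cancer than do not. The splitters to use in that case are StratifiedKFold and StratifiedShuffleSplit.

2. Grouped data must be split differently, so that all samples of one group stay on the same side of the split; the main splitters are GroupKFold, LeaveOneGroupOut, LeavePGroupsOut and GroupShuffleSplit.

3. For time-dependent data there is TimeSeriesSplit.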
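The group-aware and time-aware splitters can be sketched like this. The toy data is an assumption: 12 samples in time order, belonging to 4 hypothetical groups (for example 4 patients with 3 records each). The sketch checks the property that makes each splitter useful.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)       # 12 samples, already in time order
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)    # hypothetical group label per sample

# GroupKFold: a group never appears in both the training and the test part
group_overlap = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    group_overlap.append(set(groups[train_idx]) & set(groups[test_idx]))

# TimeSeriesSplit: every training index comes before every test index
ts_ok = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    ts_ok.append(bool(train_idx.max() < test_idx.min()))

print("shared groups per split:", group_overlap)
print("train strictly before test:", ts_ok)
```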

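Here I go into detail on StratifiedShuffleSplit, since the examples below use it; the others can be studied on the sklearn website.

It has four parameters.

Parameters:

The original documentation reads: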

n_splits: int, default=10

Number of re-shuffling & splitting iterations.

test_size: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.

train_size: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state: int or RandomState instance, default=None

Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls. See Glossary.

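Explanation:

n_splits: int, default=10 (the number of re-shuffling and splitting iterations; by default the original dataset D is randomly split into a training part and a test part 10 separate times. Unlike 10-fold cross-validation, the test sets of different iterations may overlap.)

test_size and train_size are simply the sizes of the test and training sets: an int gives the absolute number of samples, a float gives the fraction of the whole dataset.

random_state matters a lot for parameter estimation. Set it to a fixed integer (any value works, including 0) so that every parameter setting being compared is evaluated on exactly the same splits. If it is left at the default None, each cross-validation run splits the data differently, and the parameter comparison is no longer reliable.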


Example:

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]

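The purpose of this class is to keep the proportion of each class the same in every split, so that no class ends up over- or under-represented. train_test_split does the same when it is called with stratify=y.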
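The effect of stratification can be checked directly by counting the minority class in the test part. A minimal sketch, assuming toy labels with 90 negatives and 10 positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)      # 10% positives (toy data)

# With stratify=y the 20-sample test set keeps the 10% positive rate
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
# Without it, the number of positives in the test set is left to chance
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

strat_positive = int(y_test_strat.sum())
plain_positive = int(y_test_plain.sum())
print("positives in test set, stratified:", strat_positive)
print("positives in test set, plain:", plain_positive)
```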

III. Model evaluation with cross-validation

sklearn offers three functions for model evaluation:

1. cross_val_score

2. cross_validate

3. cross_val_predict
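As a rough orientation, the three functions differ mainly in what they return. The sketch below runs all three on synthetic data with a decision tree; both the data and the choice of the accuracy and f1 metrics are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# cross_val_score: one score per fold, for a single metric
scores = cross_val_score(clf, X, y, cv=5)

# cross_validate: a dict with fit/score times and one entry per requested metric
results = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1"], return_train_score=True)

# cross_val_predict: one out-of-fold prediction per sample
preds = cross_val_predict(clf, X, y, cv=5)

print(scores.shape, sorted(results), preds.shape)
```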

Here I use the cross_val_score function as the example:

Original documentation (it is long; feel free to skip ahead to the explanation below):

Parameters:

estimator: estimator object implementing 'fit'

The object to use to fit the data.

X: array-like of shape (n_samples, n_features)

The data to fit. Can be for example a list, or an array.

y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

The target variable to try to predict in the case of supervised learning.

groups: array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold).

scoring: str or callable, default=None

A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y) which should return only a single value.

Similar to cross_validate but only a single metric is permitted.

If None, the estimator's default scorer (if available) is used.

cv: int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 5-fold cross validation,

int, to specify the number of folds in a (Stratified)KFold,

CV splitter,

An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

Refer User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.

n_jobs: int, default=None

The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbose: int, default=0

The verbosity level.

fit_params: dict, default=None

Parameters to pass to the fit method of the estimator.

pre_dispatch: int or str, default='2*n_jobs'

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

An int, giving the exact number of total jobs that are spawned

A str, giving an expression as a function of n_jobs, as in '2*n_jobs'

error_score: 'raise' or numeric, default=np.nan

Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error

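Start reading here!

In practice cross_val_score needs only a few of these parameters; most can stay at their defaults.

cross_val_score(clf, x, y, cv=cv): these four arguments are roughly all you need.

The first is the learning algorithm (estimator) to be evaluated. I use a decision tree; code samples follow below.

The second and third are the feature data x and the target labels y of the whole dataset; cross_val_score performs the train/test splitting itself.

The fourth is the most important one:

It determines the cross-validation splitting strategy. Possible inputs for cv are:

1. None, to use the default 5-fold cross-validation

2. an int, to specify the number of folds in a (Stratified)KFold

3. a CV splitter

4. an iterable yielding (train, test) splits as arrays of indices

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.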
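The four kinds of input accepted by cv can be tried side by side. This is a sketch on synthetic data (an assumption); the two hand-written index splits in custom_splits are hypothetical and only show the expected shape of an iterable of (train, test) pairs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# 1. None: default 5-fold (StratifiedKFold here, because clf is a classifier)
scores_default = cross_val_score(clf, X, y)

# 2. int: number of folds
scores_int = cross_val_score(clf, X, y, cv=3)

# 3. CV splitter instance
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
scores_splitter = cross_val_score(clf, X, y, cv=splitter)

# 4. iterable of (train_indices, test_indices) pairs
idx = np.arange(len(y))
custom_splits = [(idx[:150], idx[150:]), (idx[50:], idx[:50])]
scores_iterable = cross_val_score(clf, X, y, cv=custom_splits)

print(len(scores_default), len(scores_int), len(scores_splitter), len(scores_iterable))
```

The number of returned scores always equals the number of splits the chosen strategy produces.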

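Finally, I am feeling my way forward here too. I cannot guarantee that Parts II and III are complete or free of flaws; if you spot any, corrections in the comments are welcome. These notes were put together after consulting a lot of material and can be seen as a distillation of what many experts have written; see the reference articles in the last part for further reading.

Alternatively, go straight to the sample code that follows, which is easier to understand.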

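IV. Decision tree example

1. Simple data preprocessing

import numpy as np

# `data` initially holds the path to a tab-separated file and is then rebound to the loaded array
data = np.genfromtxt(data, delimiter='\t')
x = data[:, 1:]  # features: every column after the first
y = data[:, 0]   # target: the first column
# multiply the first six feature columns by per-feature weights
x[:, :6] *= [0.2, 0.2, 0.1, 0.1, 0.2, 0.2]

This mainly applies weights to the features (a decision tree does not need standardized inputs). Note that tree splits depend only on the ordering of the values within each feature, so rescaling by positive constants like this is not expected to change the fitted tree by itself.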

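2. Parameter analysis

To analyze the parameters we need to know which ones matter and which do not. A classification decision tree has 12 parameters in total that can be adjusted; memorizing them one by one is tedious, so we can sort them into three categories.

1. Parameters used for model tuning:

1) criterion (split criterion): two values are available, 'entropy' and 'gini' (Gini impurity); the default is 'gini'.

2) max_depth (maximum depth of the tree): the default is None, in which case the depth of subtrees is not limited while the tree is built. It can also be set to a specific integer. Generally, with little data or few features this value can be left alone. With many samples and many features, limiting the maximum depth is recommended; the right value depends on the distribution of the data. Values between 10 and 100 are common.

3) min_samples_split (minimum number of samples required to split an internal node): a node is considered for further splitting only if it contains at least this many samples. The default is 2, meaning that any node holding 2 or more samples keeps being split as long as those samples can still be separated.

4) min_samples_leaf (minimum number of samples at a leaf node): a split is only kept if it leaves at least this many samples in each resulting leaf; otherwise the would-be leaf is pruned away. This can remove some obviously anomalous, noisy data. The default is 1, meaning that splitting continues as long as two samples have different classes. If an int is given, min_samples_leaf is taken as the minimum number. If a float is given, min_samples_leaf is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.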
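To see what these tuning parameters do to the shape of a tree, one can compare an unconstrained tree with a constrained one. A minimal sketch on noisy synthetic data (an assumption); the specific limits 4, 10 and 5 are arbitrary illustration values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so that the unconstrained tree grows large
X, y = make_classification(n_samples=500, n_features=6, flip_y=0.1, random_state=0)

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
limited = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=4,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(X, y)

# Leaves are the nodes without children (children_left == -1)
leaf_mask = limited.tree_.children_left == -1
min_leaf_size = int(limited.tree_.n_node_samples[leaf_mask].min())

print("full tree:    depth", full.get_depth(), "leaves", full.get_n_leaves())
print("limited tree: depth", limited.get_depth(), "leaves", limited.get_n_leaves())
print("smallest leaf in limited tree:", min_leaf_size)
```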

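2. Parameter for handling imbalanced samples:

class_weight: used when the classes are imbalanced, to give positive and negative samples different weights. The default is None, i.e. no weights are set. For the underlying principle and usage, see the article "Machine Learning Ultra-Detailed Practical Guide (12): Three Moves to Beat Class Imbalance, Part 2: Solving Class Imbalance with Penalty Weights on Positive and Negative Samples".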
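As an illustration of what class_weight does, the sketch below computes the weights sklearn derives for class_weight='balanced', which follow n_samples / (n_classes * count_of_class). The toy labels (90 negatives, 10 positives) are an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights used by class_weight="balanced"
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
# The same values computed by hand
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), balanced.round(3).tolist())))

# Either form can be passed to the tree: the string, or an explicit dict
clf_auto = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf_dict = DecisionTreeClassifier(class_weight={0: 1.0, 1: 9.0}, random_state=0)
```

The rare class receives the larger weight, so misclassifying it costs more during training.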

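3. Less important parameters:

These parameters usually do not need to be set by hand. It is enough to know what they mean and to adjust them selectively when a special case comes up.

1) 'max_features': if the training set has too many features, this parameter limits the number of features considered when building the tree. It plays a role in random forests, but because the features are drawn at random, the algorithm may filter out features that are very important in the dataset. So even with too many features, I would rather apply a dimensionality-reduction algorithm, or compute feature importances and select features by hand, than set this parameter.

2) 'min_impurity_decrease' (minimum impurity decrease for splitting a node): a threshold for stopping tree growth early. A node is split only if the split reduces the impurity (based on Gini impurity or mean squared error) by at least this value; otherwise the node produces no child nodes. Changing it is generally not recommended.

3) 'max_leaf_nodes' (maximum number of leaf nodes): the default is None, i.e. the number of leaf nodes is not limited. If a limit is set, the algorithm builds the best tree it can within that number of leaf nodes. Limiting this value can prevent overfitting. With few features it can be ignored, but with many features a limit can be useful.

4) 'random_state' (random seed): the default is None. Setting it to any fixed value ensures that the random sampling is done the same way on every run.

5) 'splitter': controls the randomness of node splitting in the tree. The two options are "best" and "random", with "best" as the default. With "best", the tree still has some randomness when branching but gives priority to the more important features. With "random", branching is more random, which lowers the degree of fit to the training set. This is also a way to prevent overfitting, but one that costs nearly as much as it gains: random branching makes the tree deeper and larger because it carries more unnecessary information. It is therefore better to use the pruning parameters above to prevent overfitting, and this parameter is usually left alone.

6) min_weight_fraction_leaf (minimum sum of sample weights at a leaf node): this value sets a lower bound on the sum of the weights of all samples in a leaf node; a leaf below that bound is pruned. The default is 0, i.e. weights are not considered. Generally, sample weights are introduced when many samples have missing values or when the class distribution of a classification tree is heavily skewed; only then does this value deserve some attention.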

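7) 'ccp_alpha' (complexity parameter): this parameter is likewise used to keep the tree from overfitting. A penalty term, comparable to the penalty term in logistic regression, is added to the cost of the tree so that the resulting tree does not become overly redundant. In the form used by sklearn's minimal cost-complexity pruning, the formula is:

$$R_\alpha(T) = R(T) + \alpha |T|$$

Here $R(T)$ is the total impurity (or misclassification cost) of the leaves of tree $T$, $|T|$ is the number of leaf (terminal) nodes of $T$, and $\alpha$ is the parameter we set. As the tree grows more nodes, the penalized loss becomes larger. The default is 0, i.e. no penalty is added.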
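The effect of the penalty can be observed by fitting the same tree with increasing ccp_alpha and counting the leaves. A minimal sketch on noisy synthetic data; the data and the four alpha values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, flip_y=0.1, random_state=0)

alphas = [0.0, 0.005, 0.02, 0.05]     # illustrative values, from no pruning to strong pruning
leaf_counts = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(int(clf.get_n_leaves()))

for alpha, n_leaves in zip(alphas, leaf_counts):
    print(f"ccp_alpha={alpha}: {n_leaves} leaves")
```

A larger alpha makes every additional leaf more expensive, so the pruned tree can only get smaller as alpha grows.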

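That covers the 12 parameters of the classification decision tree.

With the classification tree covered, the regression tree follows naturally. It differs in its parameters in only two places:

1) 'criterion': measures the quality of a node split, analogous to entropy for the classification tree. Three values are available: 'mse' (default), the mean squared error; 'friedman_mse', an approximation of the mean squared error, minimizing the L2 loss; 'mae', the mean absolute error, minimizing the L1 loss. (These are the names used by older sklearn releases; recent releases call 'mse' and 'mae' 'squared_error' and 'absolute_error'.)

2) 'class_weight' is absent: a regression problem naturally has no classes to weight differently.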

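3. Start tuning

Now that we know which parameters matter, tuning can begin. The steps are:

1) Decide on the criterion parameter (the split criterion): a simple comparison is enough here.

2) Plot a score curve to narrow the search range of max_depth (maximum depth of the tree) and obtain a tentative max_depth.

max_depth is tuned first because the model score generally rises with max_depth at first and then levels off.

3) With the tentative max_depth, plot a curve of the score against min_samples_split (minimum number of samples required to split an internal node) to determine the approximate range of min_samples_split.

As min_samples_split grows, the model tends to become simpler. So if the model is overfitting, the score will first rise and then fall as min_samples_split increases, and we pick a value near the peak. If the model is underfitting, the score will keep falling as min_samples_split increases, and later tuning can simply start from the default value 2.

4) With the tentative max_depth and min_samples_split, plot a curve of the score against min_samples_leaf (minimum number of samples required at a leaf node) to determine the approximate range of min_samples_leaf. The range is determined the same way as above.

5) Use grid search to tune max_depth, min_samples_split and min_samples_leaf jointly within a small range and settle on the final parameters.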
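Steps 2) to 4) all follow the same pattern: vary one parameter, keep the others fixed, and score every value on the same splits. That pattern can be sketched as a small helper. scan_parameter is a hypothetical function written for this illustration, and the synthetic data stands in for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def scan_parameter(X, y, name, values, fixed=None, cv=None):
    """Score each candidate value of one tree parameter on identical CV splits.

    Returns (best_value, best_score, all_scores).
    """
    fixed = dict(fixed or {})
    if cv is None:
        # A fixed random_state keeps the splits identical for every candidate
        cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
    scores = []
    for value in values:
        clf = DecisionTreeClassifier(random_state=42, **fixed, **{name: value})
        scores.append(cross_val_score(clf, X, y, cv=cv).mean())
    best = int(np.argmax(scores))
    return values[best], scores[best], scores

# Synthetic stand-in data (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=6, flip_y=0.1, random_state=0)

depths = list(range(2, 12))
best_depth, best_score, depth_scores = scan_parameter(
    X, y, "max_depth", depths, fixed={"criterion": "entropy"}
)
print("tentative max_depth:", best_depth, "score:", round(float(best_score), 4))
```

The tentative value found this way can then be passed in through fixed when scanning the next parameter.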

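4. Code walkthrough

Code for each of the steps in "3. Start tuning", one by one.

A point worth stressing about random_state: always set it to a fixed integer, otherwise the data splits differ between evaluations and comparing parameter settings becomes meaningless. The particular value does not matter.

Import the required libraries:

from sklearn import tree
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
import matplotlib.pyplot as plt
import numpy as np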

1) Decide on the criterion parameter (the split criterion): a simple comparison is enough here.

# the same splitter is used for both criteria, so the two scores are comparable
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=42)
score = cross_val_score(clf, x, y, cv=cv).mean()
print(score)
clf = tree.DecisionTreeClassifier(criterion='gini', random_state=42)
score = cross_val_score(clf, x, y, cv=cv).mean()
print(score)

Output:

0.764375
0.7543749999999999

Some say gini works a bit better, but on my data entropy came out slightly ahead.

2) Plot a score curve to narrow the search range of max_depth (maximum depth of the tree) and obtain a tentative max_depth.

max_depth is tuned first because the model score generally rises with max_depth at first and then levels off.

ScoreAll = []
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
for i in range(5, 100, 5):
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=i, random_state=42)
    score = cross_val_score(clf, x, y, cv=cv).mean()
    ScoreAll.append([i, score])
ScoreAll = np.array(ScoreAll)
max_score = np.argmax(ScoreAll[:, 1])  # index of the row with the highest score
print("Best parameter and highest score:", ScoreAll[max_score])
plt.figure(figsize=[20, 5])
plt.plot(ScoreAll[:, 0], ScoreAll[:, 1])
plt.show()

Output:

Best parameter and highest score: [10. 0.78375]

Do not forget that we still need a finer analysis around 10.

ScoreAll = []
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
for i in range(5, 15):
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=i, random_state=42)
    score = cross_val_score(clf, x, y, cv=cv).mean()
    ScoreAll.append([i, score])
ScoreAll = np.array(ScoreAll)
max_score = np.argmax(ScoreAll[:, 1])  # index of the row with the highest score
print("Best parameter and highest score:", ScoreAll[max_score])
plt.figure(figsize=[20, 5])
plt.plot(ScoreAll[:, 0], ScoreAll[:, 1])
plt.show()

Output:

Best parameter and highest score: [7. 0.791875]

We tentatively set the tree depth to 7, which reaches a score of 0.791875.

3) With the tentative max_depth, plot a curve of the score against min_samples_split (minimum number of samples required to split an internal node) to determine the approximate range of min_samples_split.

ScoreAll = []
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
for i in range(5, 15):
    # max_depth=9 is kept as in the original run, although the text above settled on 7
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=9, min_samples_split=i, random_state=42)
    score = cross_val_score(clf, x, y, cv=cv).mean()
    ScoreAll.append([i, score])
ScoreAll = np.array(ScoreAll)
max_score = np.argmax(ScoreAll[:, 1])  # index of the row with the highest score
print("Best parameter and highest score:", ScoreAll[max_score])
plt.figure(figsize=[20, 5])
plt.plot(ScoreAll[:, 0], ScoreAll[:, 1])
plt.show()

Output:

Best parameter and highest score: [5. 0.79375]

So min_samples_split is set to 5.

4) With the tentative max_depth and min_samples_split, plot a curve of the score against min_samples_leaf (minimum number of samples required at a leaf node) to determine the approximate range of min_samples_leaf. The range is determined the same way as above.

ScoreAll = []
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
# the optimum reported below (2) lies outside range(5, 15), so the original run presumably scanned from a smaller start value
for i in range(5, 15):
    clf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_leaf=i, max_depth=7, min_samples_split=5, random_state=42)
    score = cross_val_score(clf, x, y, cv=cv).mean()
    ScoreAll.append([i, score])
ScoreAll = np.array(ScoreAll)
max_score = np.argmax(ScoreAll[:, 1])  # index of the row with the highest score
print("Best parameter and highest score:", ScoreAll[max_score])
plt.figure(figsize=[20, 5])
plt.plot(ScoreAll[:, 0], ScoreAll[:, 1])
plt.show()

Output:

Best parameter and highest score: [2. 0.795]
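Step 5), the joint grid search, can be sketched as follows. This is not the author's code: it runs on synthetic stand-in data, and the small grid around the tentative values (7, 5 and 2) is an assumption about what "a small range" might look like.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x and y (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=6, flip_y=0.1, random_state=0)

# Same splitter settings as in the steps above, so scores stay comparable
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

# A small grid centred on the tentative values from the single-parameter scans
param_grid = {
    "max_depth": [6, 7, 8],
    "min_samples_split": [4, 5, 6],
    "min_samples_leaf": [1, 2, 3],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=42),
    param_grid,
    cv=cv,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV score:", round(float(search.best_score_), 4))
```

GridSearchCV evaluates every combination in the grid (27 here) on the same ten splits and, by default, refits the best one on the full data as search.best_estimator_.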


Article link: https://www.wtabcd.cn/fanwen/fan/82/1102751.html
