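Python examples: SVM, SVR, CV, kernel functions, LinearSVR, RBFSampler, SGDRe...
Two SVM examples on two datasets. The cancer data is a small sample with a categorical target, so it is handled with SVC and is fairly simple. The house price data is a large sample with a continuous target, so it uses support vector regression (SVR): RBFSampler and Nystroem first perform the kernel mapping, and SGDRegressor then fits the support vector regression. All three of these functions are well suited to large samples.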
I. Preparation
In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats, integrate
sns.set_style("darkgrid")
sns.set(color_codes=True)
warnings.filterwarnings("ignore")  # suppress warnings
II. cancer data
In [2]:
cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\3.NB&DT\\Week3_CancerDataset.xlsx')
2. Standardize the data
In [3]:
cancer_scaled = cancer.iloc[:,1:-1].apply(lambda x: (x - np.mean(x)) / (np.std(x)))
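The lambda above is a column-wise z-score that divides by the population standard deviation, which is also what scikit-learn's StandardScaler computes. The following is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical, not the cancer data) that checks the two approaches agree:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the feature columns.
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0],
                    "f2": [10.0, 0.0, 5.0, 25.0]})

# Column-wise z-score, written the same way as in the cell above.
manual = toy.apply(lambda x: (x - np.mean(x)) / np.std(x))

# StandardScaler also centres each column and divides by the
# population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(toy)

same = np.allclose(manual.values, scaled)
print(same)
```

A fitted StandardScaler can also be reused to transform new data with the training means and standard deviations, which the lambda version does not offer directly.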
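3. Train the SVM models
Train an SVM model for each of the three common kernel functions and tune its parameters.
(1) Linear kernel
The linear kernel does not map the data into a higher-dimensional space. Because this is a soft-margin SVM, the only parameter to tune is the penalty coefficient C applied to misclassified samples; C is chosen from the cross-validation results.
In [4]: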
from sklearn.svm import SVC, SVR, LinearSVR
from sklearn.model_selection import cross_val_score
C_range = range(1, 31)
cv_scores = []
for c in C_range:
    clf = SVC(C=c, kernel='linear')
    # mean 5-fold cross-validated accuracy for this value of C
    scores = cross_val_score(clf, cancer_scaled, cancer.iloc[:, -1], cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
print('When the penalty coefficient C is {}, the score is highest: {}'.format(C_range[cv_scores.index(max(cv_scores))], max(cv_scores)))
plt.figure(figsize=(10,4))
plt.plot(C_range, cv_scores)
plt.xlabel("C")
plt.ylabel("score")
plt.title("Score as a function of the penalty coefficient C")
When the penalty coefficient C is 6, the score is highest: 0.973728357060408
Out[4]:
Text(0.5, 1.0, 'Score as a function of the penalty coefficient C')
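To make the role of C in a soft-margin SVM more concrete, here is a small sketch on synthetic data (generated with make_classification as a stand-in, not the cancer data). It fits a linear SVC for a few values of C and counts the support vectors. A small C tolerates more margin violations, so it typically yields a wider margin and more support vectors; the exact counts depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data; an assumption made for illustration only.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

n_support = {}
for c in (0.01, 1, 100):
    clf = SVC(C=c, kernel="linear").fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[c] = int(clf.n_support_.sum())

print(n_support)
```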
(2) Gaussian kernel
Because there are two parameters, the penalty coefficient C and gamma, GridSearchCV is used to run a grid search.
In [5]:
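from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
# F-beta score with beta=2, which weights recall more heavily than precision
ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid1 = GridSearchCV(SVC(kernel='rbf'), param_grid={'C': [1,10,20], 'gamma': [0.09,0.009,0.0009,0.00009]},
                     scoring=ftwo_scorer, cv=5)
grid1.fit(cancer_scaled, cancer.iloc[:, -1])
print('After the GridSearchCV grid search, the best penalty coefficient C is {}, the best gamma is {}, and the model score is {}'.format(
    grid1.best_estimator_.get_params()['C'], grid1.best_estimator_.get_params()['gamma'], grid1.best_score_))
After the GridSearchCV grid search, the best penalty coefficient C is 10, the best gamma is 0.009, and the model score is 0.9882421118919705
(3) Polynomial kernel
There are two parameters, the penalty coefficient C and the degree, so GridSearchCV is used again.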
In [6]:
grid2 = GridSearchCV(SVC(kernel='poly'), param_grid={'C': [1,5,10,15,20], 'degree': [1,2,3,4,5,6,7,8,9]},
                     scoring=ftwo_scorer, cv=5)
grid2.fit(cancer_scaled, cancer.iloc[:, -1])
print('After the GridSearchCV grid search, the best penalty coefficient C is {}, the best degree is {}, and the model score is {}'.format(
    grid2.best_estimator_.get_params()['C'], grid2.best_estimator_.get_params()['degree'], grid2.best_score_))
# The colour argument was truncated in the source; colouring by score is an assumed reconstruction.
plt.scatter(x=grid2.cv_results_['param_C'].data, y=grid2.cv_results_['param_degree'].data,
            s=1000*(grid2.cv_results_['mean_test_score']-min(grid2.cv_results_['mean_test_score'])+0.1),
            c=grid2.cv_results_['mean_test_score'], cmap=plt.get_cmap('RdYlBu'))
plt.xlabel("C")
plt.ylabel("degree")
plt.title("Scatter plot of score against the penalty coefficient C and degree")
plt.annotate('Circle size indicates the score', xy=(1,1), xytext=(1, -1), color='b', size=10)
plt.annotate('Best point', xy=(1,1), xytext=(grid2.best_estimator_.get_params()['C']+1, grid2.best_estimator_.get_params()['degree']), color='r')
After the GridSearchCV grid search, the best penalty coefficient C is 20, the best degree is 3, and the model score is 0.9867086284245016
Out[6]:
Text(21, 3, 'Best point')
Curiously, the score alternates between odd and even values of degree.
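One possible explanation for the odd/even alternation, offered as an assumption rather than something verified on the cancer data: SVC's polynomial kernel is (gamma * <x, x'> + coef0)^degree, and coef0 defaults to 0. With coef0 = 0 and an even degree, K(-x, x') = K(x, x'), so the decision function cannot distinguish a point from its mirror image through the origin, which can hurt on standardized (centred) data. The sketch below checks this symmetry on random data using sklearn.metrics.pairwise.polynomial_kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # random points, for illustration only

even_same = {}
for degree in (2, 3):
    # coef0=0.0 mirrors the default of SVC(kernel='poly')
    k_pos = polynomial_kernel(X, X, degree=degree, gamma=1.0, coef0=0.0)
    k_neg = polynomial_kernel(-X, X, degree=degree, gamma=1.0, coef0=0.0)
    # True when flipping the sign of x leaves the kernel unchanged
    even_same[degree] = bool(np.allclose(k_pos, k_neg))

print(even_same)  # {2: True, 3: False}
```

If this is indeed the cause, adding coef0 (for example 'coef0': [0, 1]) to the parameter grid would be a natural thing to try; that is a suggestion, not a result from this notebook.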
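4. Write out the hyperplane
After some experimentation: only the linear-kernel SVM, which does not map the data into a higher-dimensional space, exposes the hyperplane expression (scikit-learn provides coef_ only for the linear kernel). Let the hyperplane be h(x) = wx + b.
(1) Direction w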
In [7]:
clf = SVC(kernel = 'linear')
clf.fit(cancer_scaled, cancer.iloc[:, -1])
print(clf.coef_)
[[-8.80113094e-02 -3.59306843e-01 -3.35013827e-01 -8.39562339e-04
   6.46330866e-01 -7.45104189e-01 -9.27479275e-01 -7.01658929e-02
   3.81723561e-01 -8.64975132e-01  3.24937431e-01 -2.37246961e-01
  -8.85892419e-01 -3.70854979e-01  3.81626684e-01  3.71302098e-01
  -4.33843461e-01  8.46263549e-02  8.92709596e-01 -6.49610348e-01
  -1.00930557e+00 -3.76030776e-01 -7.55483286e-01 -3.94390435e-01
   1.53334795e-01 -1.02528284e+00 -1.18490959e-01 -4.30099332e-01
  -8.64181126e-01]]
This is a vector with the same dimension as the number of features in the sample, 31 dimensions.
(2) Intercept b
In [8]:
print(clf.intercept_)
[0.00091257]
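With w and b in hand, h(x) = wx + b can be evaluated by hand and compared with what the fitted model reports. The following sketch uses synthetic data from make_classification (an assumption for illustration, not the cancer data) and checks that the manual hyperplane value matches decision_function, and that its sign reproduces predict:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for the standardized features.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_        # shape (1, n_features)
b = clf.intercept_   # shape (1,)

# Evaluate h(x) = w.x + b for every sample.
h = X @ w.ravel() + b[0]

# The manual value should agree with the model's decision function.
matches = np.allclose(h, clf.decision_function(X))

# A positive h(x) maps to the second class label, a negative one to the first.
pred = np.where(h > 0, clf.classes_[1], clf.classes_[0])
print(matches, np.array_equal(pred, clf.predict(X)))
```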
III. House price
1. read data
In [9]:
cal_housing = pd.read_csv('C:\\Users\\91333\\Documents\\semester6\\data science\\5.SVM\\房价预测\\cal_housing.data',
                          header=None, names=['longitude','latitude','Age','Rooms','Bedrooms','population','houholds','Income','HouValue'])
2. Explore
1. First 5 rows
In [10]:
cal_housing.head(5)
Out[10]:
   longitude  latitude   Age   Rooms  Bedrooms  population  houholds  Income  HouValue
0    -122.23     37.88  41.0   880.0     129.0       322.0     126.0  8.3252  452600.0
1    -122.22     37.86  21.0  7099.0    1106.0      2401.0    1138.0  8.3014  358500.0
2    -122.24     37.85  52.0  1467.0     190.0       496.0     177.0  7.2574  352100.0
3    -122.25     37.85  52.0  1274.0     235.0       558.0     219.0  5.6431  341300.0
4    -122.25     37.85  52.0  1627.0     280.0       565.0     259.0  3.8462  342200.0
2. Visualization
(1) Scatter plots of longitude and latitude against house value
In [11]:
plt.figure(figsize=[17,7])
plt.subplot(1,2,1)
plt.scatter(cal_housing['longitude'].values,cal_housing['HouValue'].values,alpha=0.05)
plt.xlabel("longitude")
plt.ylabel("MedianHouValue")
plt.title("scatterplot of longitude and MedianHouValue",fontsize="x-large")
plt.subplot(1,2,2)
plt.scatter(cal_housing['latitude'].values,cal_housing['HouValue'].values,alpha=0.05)
plt.xlabel("latitude")
plt.ylabel("MedianHouValue")
plt.title("scatterplot of latitude and MedianHouValue",fontsize="x-large")
Out[11]:
Text(0.5, 1.0, 'scatterplot of latitude and MedianHouValue')
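For both longitude and latitude there are two main kinds of interval. In the first, house values are concentrated at the low end, with many below 200000. In the second, houses at every price level appear, with mid-priced houses the most common: the points are denser in the middle and sparser at both extremes. This pattern roughly matches reality: urban centres have houses at every price, while remote rural areas mostly have low-priced houses.
(2) Kernel density plot of income and house value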
In [12]:
sns.jointplot(x='Income',y='HouValue',data=cal_housing,kind='kde')
Out[12]:
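<seaborn.axisgrid.JointGrid at 0x267bd149c88>
Income and house value are positively correlated, which matches everyday experience. The bivariate plots of the other variables against house value were not informative, so they are not shown.
3. Data preprocessing
(1) Standardization
In [13]:
cal_housing_scaled = cal_housing.apply(lambda x: (x - np.mean(x)) / (np.std(x)))
(2) Principal component analysis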