Titanic Survival Prediction with a Decision Tree (Machine Learning, sklearn) """These are primarily my own study notes, but I'd be happy to discuss them with anyone."""
Intro: this code uses train_test_split and cross_val_score to check how well the model generalizes, and GridSearchCV to search for the model's best parameters.
Step 1: preprocess the data, e.g. fill in or drop missing values. Also, a decision tree cannot work with text, so we need to convert Sex and the port of embarkation (Embarked) into numbers such as 0, 1, 2 (the numbers themselves carry no meaning).
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import numpy as np
import graphviz
"""Read and inspect the data"""
data = pd.read_csv("data【瑞客论坛 】.csv")
data.info()
data.head()
"""Preprocess the data"""
##Drop columns with too many missing values, plus columns that (by inspection) have no bearing on the target y
# inplace=True modifies data in place; axis=1 means we are dropping columns
data.drop(["Cabin","Name","Ticket"],inplace=True,axis=1)
data.info()
#Fill missing ages with the mean age of the whole column
data["Age"] = data["Age"].fillna(data["Age"].mean())
data.info()
#Drop any remaining rows with missing values. The default axis is 0, meaning rows; 1 would mean columns
data = data.dropna()
data.info()
#tolist converts the result to a Python list; unique lists the distinct categories in the column
label_1 = data["Embarked"].unique().tolist()
#Replace each category with 0, 1, 2. We do this because DecisionTreeClassifier cannot operate on text
data["Embarked"] = data["Embarked"].apply(lambda x: label_1.index(x))
#A shortcut for binary columns: comparing against "male" yields True/False,
#which the code below casts straight to integers
data["Sex"] = (data["Sex"]== "male").astype("int")
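The two encoding tricks above can be seen end to end on a toy DataFrame (the column values here are made up for illustration, standing in for the Titanic data):

```python
import pandas as pd

# Toy stand-in for the two Titanic columns being encoded
df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Embarked": ["S", "C", "S"]})

# unique() preserves order of first appearance, so "S" -> 0, "C" -> 1 here
labels = df["Embarked"].unique().tolist()
df["Embarked"] = df["Embarked"].apply(lambda v: labels.index(v))

# Comparing to "male" gives True/False, which casts straight to 1/0
df["Sex"] = (df["Sex"] == "male").astype("int")

print(df["Embarked"].tolist())  # [0, 1, 0]
print(df["Sex"].tolist())       # [1, 0, 1]
```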
Step 2: Build the model.
"""Use grid search to tune the parameters and find the decision tree with the highest score. This is not guaranteed to be the most accurate model: grid search only picks the best score among the parameter combinations listed below, and e.g. dropping min_samples_leaf from the grid entirely might yield a higher score."""
gini_thresholds = np.linspace(0,0.5,20)
parameters = {'splitter':('best','random')
,'criterion':("gini","entropy")
,"max_depth":[*range(1,10)]
,'min_samples_leaf':[*range(1,50,5)]
,'min_impurity_decrease':[*np.linspace(0,0.5,20)]
}
#Separate the features x from the target y, then split off a training set for the search
x = data.iloc[:, data.columns != "Survived"]
y = data["Survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=25)
GS = GridSearchCV(clf, parameters, cv=10)
GS.fit(x_train, y_train)
GS.best_params_
GS.best_score_
#Rebuild the model with the parameters found above and check the real performance
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,splitter="best"
                                  ,max_depth=6
                                  ,min_samples_leaf=1
                                  ,min_samples_split=10
                                  ,min_impurity_decrease=0
                                  )
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
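cross_val_score is imported at the top but never actually called. As a minimal sketch of how it complements the single train/test split above (synthetic data stands in for the Titanic features, since the CSV isn't bundled here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=300, n_features=6, random_state=25)

clf = DecisionTreeClassifier(random_state=25)
# 10-fold cross-validation: fits the model 10 times and returns one accuracy per fold
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean(), scores.std())
```

Averaging over 10 folds is less sensitive to a lucky or unlucky single split than one call to clf.score.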
Step 5: Having found the optimal model parameters in Step 4, we can now visualize the decision tree with graphviz.
"""Take the best-scoring model and visualize it with graphviz"""
feature_name = x.columns.tolist() #feature names must match the columns of x
dot_data = tree.export_graphviz(clf
                                , feature_names = feature_name
                                , class_names = ["died","survived"] #class 0 = died, class 1 = survived
                                , filled = True #fill each node with a class color
                                , rounded = True #draw nodes with rounded corners
                                )
graphviz.Source(dot_data)
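An alternative that avoids installing graphviz entirely is sklearn's own tree.plot_tree, which draws the fitted tree with matplotlib alone. A minimal sketch on synthetic data (the figure size and tree depth here are arbitrary choices for illustration):

```python
import matplotlib
matplotlib.use("Agg")          # headless backend; render without a display
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=100, n_features=6, random_state=25)
clf = DecisionTreeClassifier(max_depth=3, random_state=25).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
# plot_tree draws the fitted tree using matplotlib only -- no graphviz needed
annotations = tree.plot_tree(clf, filled=True, rounded=True, ax=ax)
fig.savefig("tree.png")        # or plt.show() in a notebook
```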
The final result looks like this:
I'm not especially familiar with the code for this visualization step, so better code and explanations are very welcome, thanks! The full code is below:
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import numpy as np
import graphviz
"""Read and inspect the data"""
data = pd.read_csv("data【瑞客论坛 】.csv")
data.info()
data.head()
"""Preprocess the data"""
##Drop columns with too many missing values, plus columns that (by inspection) have no bearing on the target y
# inplace=True modifies data in place; axis=1 means we are dropping columns
data.drop(["Cabin","Name","Ticket"],inplace=True,axis=1)
data.info()