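Principles and Code Implementations of Several Risk-Control Algorithms
I. Base Algorithms
1. Decision Tree
(1) Principle: A decision tree recursively partitions the sample dataset by feature values until every feature has been used for a split or all samples in a subset share the same class label.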
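To make the partitioning idea concrete, here is a minimal sketch of how a single split can be chosen with the Gini criterion. The toy data and the helper names `gini` and `best_split` are illustrative assumptions, not part of scikit-learn, and a real tree repeats this search recursively over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the subset is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one feature, two classes.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(petal_length, species)
print(threshold, impurity)
```

Both subsets are pure after this split, so the stopping condition described above (all samples in a subset share one label) is met immediately.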
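(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn import tree  # tree-based models
import graphviz  # visualization module
import pydotplus  # renders DOT data to image/PDF files
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data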
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# min_impurity_split and presort are omitted: recent scikit-learn releases no longer accept them.
clf = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=10, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None)
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
sklearn2pmml(pipeline, r"C:/Users/马青坡/Desktop/data/dt_iris.pmml")
# 5. Generate a local PDF file
dot_data = tree.export_graphviz(clf, out_file=None)  # export the fitted tree as DOT source
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("C:/Users/Desktop/data/decisiontree.pdf")  # save the PDF locally
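2. Logistic Regression
(1) Principle: Logistic regression is a linear classifier. It passes a linear combination of the features through the logistic function to obtain a probability, which is then used to decide the class of the input.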
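The mapping from features to a probability can be sketched in a few lines. The weights and bias below are hypothetical values chosen for illustration, not coefficients fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients, not fitted to any dataset.
weights = [0.8, -0.5]
bias = -0.2
sample = [1.0, 2.0]

p = predict_proba(weights, bias, sample)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual decision threshold
print(round(p, 4), label)
```

The linear score here is -0.4, so the probability falls below 0.5 and the sample is assigned to class 0.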
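(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression as lr  # logistic regression model
from sklearn.model_selection import train_test_split  # train/validation split
# 2. Load the data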
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# multi_class is omitted: it is deprecated in recent scikit-learn releases, and 'auto' was the default.
clf = lr(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
clf.fit(X_train, y_train)
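print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
3. Support Vector Machine (SVM)
(1) Principle: An SVM looks for the separating hyperplane that classifies the training data correctly and has the largest geometric margin.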
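The following sketch shows what "geometric margin" means by comparing two hyperplanes that both separate the same toy data. The data and the two candidate hyperplanes are made up for illustration; an actual SVM solves an optimization problem instead of comparing candidates by hand.

```python
import math

def geometric_margin(w, b, x, y):
    """Signed distance from point x (label y in {-1, +1}) to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def dataset_margin(w, b, points, labels):
    """Smallest margin over the dataset; positive only if every point is classified correctly."""
    return min(geometric_margin(w, b, x, y) for x, y in zip(points, labels))

# Hypothetical linearly separable toy data.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

# Two candidate hyperplanes that both separate the data.
centered = dataset_margin((1.0, 0.0), -1.5, points, labels)    # the line x1 = 1.5
off_center = dataset_margin((1.0, 0.0), -0.5, points, labels)  # the line x1 = 0.5
print(centered, off_center)
```

Both lines classify every point correctly, but the centered one leaves a margin of 1.5 against 0.5, so it is the one a maximum-margin criterion prefers.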
(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn import svm  # SVM models
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# The model is named clf so that it does not shadow the imported svm module.
clf = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
sklearn2pmml(pipeline, r"C:/Users/Desktop/data/svm_iris.pmml")
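II. Ensemble Algorithms
4. Random Forest
(1) Principle: A random forest uses CART trees as weak classifiers and combines many different decision trees, which reduces the one-sidedness and misjudgments that a single decision tree may introduce.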
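Two ingredients of this combination can be sketched without any library: each tree is trained on a bootstrap sample, and the forest answers by majority vote. The helper names and the vote list below are hypothetical, and the trees themselves are left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine the class predictions of several trees into one answer."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(40)
rows = list(range(10))  # stand-in for the indices of 10 training samples
sample = bootstrap_sample(rows, rng)

# Hypothetical votes from five trees for a single input.
votes = [1, 0, 1, 1, 2]
forest_prediction = majority_vote(votes)
print(sample, forest_prediction)
```

Because every tree sees a different resample, their individual errors tend to differ, and the vote averages them out.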
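(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier  # random forest model
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data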
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# max_features='sqrt' replaces the former 'auto' (the same behaviour for classifiers); min_impurity_split is omitted because recent scikit-learn releases no longer accept it.
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
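sklearn2pmml(pipeline, r"C:/Users/Desktop/data/rf_iris.pmml")
5. AdaBoost (Adaptive Boosting)
(1) Principle: In AdaBoost, the samples misclassified by the previous base classifier receive larger weights, and the reweighted full sample set is then used to train the next base classifier. A new weak classifier is added in each round until a predefined, sufficiently small error rate or a preset maximum number of iterations is reached.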
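One round of the reweighting can be sketched as follows. This is the classic binary (discrete) AdaBoost update, written as an illustration; it is not necessarily the exact variant that scikit-learn's AdaBoostClassifier runs, and the function name `adaboost_round` is an assumption.

```python
import math

def adaboost_round(weights, correct):
    """One round of discrete (binary) AdaBoost reweighting.

    weights: current sample weights, summing to 1
    correct: booleans, True where the weak classifier was right
    Returns (alpha, new_weights), where alpha is the classifier's vote weight.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - err) / err)  # assumes 0 < err < 0.5
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

weights = [0.2] * 5                          # uniform start
correct = [True, True, True, True, False]    # the last sample was misclassified
alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.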
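(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier  # AdaBoost model
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data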
iris = load_iris()
X = iris.data
y = iris.target
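X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# base_estimator and algorithm are left at their defaults: later scikit-learn releases renamed the former to estimator and deprecated the 'SAMME.R' option.
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=None)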
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
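sklearn2pmml(pipeline, r"C:/Users/Desktop/data/ada_iris.pmml")
6. GBDT (Gradient Boosting Decision Tree)
(1) Principle: Each time GBDT builds a new learner, it fits that learner along the gradient-descent direction of the loss of the model built so far. The core idea is that every tree learns the residual left by the sum of all previous trees, where the residual is the difference between the true value and the predicted value (for squared loss, this residual is exactly the negative gradient). To work with residuals, all trees in GBDT are regression trees; classification trees are not used because subtracting class labels is meaningless.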
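The residual-fitting loop can be sketched with depth-1 regression trees (stumps) and squared loss. Everything here, including the toy data, the helper names and the learning rate of 0.5, is an illustrative assumption; library implementations use deeper trees and other loss functions for classification.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # keep at least one point on the right side
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbdt_fit(xs, ys, n_trees=20, learning_rate=0.5):
    """Boosting with squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # true value minus prediction
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

# Hypothetical one-dimensional regression data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.0, 1.1, 3.9, 4.1, 4.0]
model = gbdt_fit(xs, ys)

mean_y = sum(ys) / len(ys)
mse_before = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_after = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_before, mse_after)
```

Each round shrinks the remaining residuals, which is why the training error after boosting is far below that of the constant initial prediction.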
(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier  # GBDT model
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# loss is left at its default (the former 'deviance' name is no longer accepted); min_impurity_split and presort are omitted for the same reason.
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
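sklearn2pmml(pipeline, r"C:/Users/Desktop/data/gbdt_iris.pmml")
7. XGBoost (eXtreme Gradient Boosting)
(1) Principle: XGBoost works on essentially the same principle as GBDT but optimizes it. Compared with GBDT, XGB has two main improvements: (a) it supports parallel computation and is fast; (b) it adds a regularization term to the loss function to prevent overfitting.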
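The effect of the regularization term can be seen in the value assigned to a single leaf. The sketch below uses the well-known second-order formula w = -G / (H + lambda), where G and H are the sums of the gradients and Hessians of the samples in the leaf; the toy targets and the function name `leaf_weight` are assumptions for illustration.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Leaf value under a second-order loss approximation with an L2 penalty."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)

# Squared loss 1/2 * (pred - y)^2 with every prediction starting at 0:
# gradient = pred - y, hessian = 1.
ys = [2.0, 4.0, 6.0]
gradients = [0.0 - y for y in ys]
hessians = [1.0] * len(ys)

unregularized = leaf_weight(gradients, hessians, reg_lambda=0.0)  # plain mean of ys
regularized = leaf_weight(gradients, hessians, reg_lambda=1.0)    # shrunk toward 0
print(unregularized, regularized)
```

With lambda = 0 the leaf simply predicts the mean of its targets; a positive lambda pulls the value toward zero, which is how the penalty limits overfitting.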
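(2) Code implementation:
# 1. Import packages and methods
from sklearn.datasets import load_iris
from xgboost import XGBClassifier  # XGBoost model
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data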
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
# objective is 'multi:softprob' because iris has three classes; the deprecated aliases silent, nthread and seed are omitted, and missing keeps its default.
clf = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, objective='multi:softprob', booster='gbtree', n_jobs=1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                    colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0)
clf.fit(X_train, y_train)
print("Training set:", clf.score(X_train, y_train))  # accuracy on the training set
print("Validation set:", clf.score(X_test, y_test))  # accuracy on the validation set
# 4. Generate the PMML file
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(X_train, y_train)
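sklearn2pmml(pipeline, r"C:/Users/Desktop/data/xgb_iris.pmml")
8. LightGBM (Light Gradient Boosting Machine)
(1) Principle: LightGBM is an optimization built on XGB. Its main improvements are faster training, lower memory usage, higher accuracy, and support for categorical variables.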
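One commonly cited source of the speed and memory gains is histogram-based split finding: feature values are bucketed into a small number of bins, and gradient statistics are accumulated per bin instead of per distinct value. The text above does not spell this out, so the sketch below is an assumption-labeled illustration; it uses equal-width bins for simplicity, which is not how LightGBM chooses its bin boundaries.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate gradient sums and sample counts per equal-width bin."""
    width = (hi - lo) / n_bins
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count

# Hypothetical feature values and per-sample gradients.
values = [0.1, 0.4, 0.5, 1.2, 2.5, 2.9, 3.3, 3.9]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=4.0)
print(grad_sum, count)
```

A split search then only has to scan the 4 bin boundaries instead of all 8 distinct values, and the saving grows with the size of the dataset.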
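(2) Code implementation:
# 1. Import packages and methods
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris  # iris dataset used below
import lightgbm as lgb  # LightGBM models
from sklearn.model_selection import train_test_split  # train/validation split
from sklearn2pmml import sklearn2pmml, PMMLPipeline  # PMML export
# 2. Load the data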
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)  # split into training and validation sets
# 3. Train and validate the model
gbm = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.05, n_estimators=100, subsample_for_bin=200000, objective='regression', class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, importance_type='split')
# Early stopping is passed as a callback; newer LightGBM releases no longer accept early_stopping_rounds in fit().
# The model is fitted only once, so that best_iteration_ reflects the early-stopped run.
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='l1', callbacks=[lgb.early_stopping(stopping_rounds=10)])
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Model evaluation
print("The rmse of prediction is: ", mean_squared_error(y_test, y_pred) ** 0.5)
# Feature importances
print("Feature importances: ", list(gbm.feature_importances_))
# Tune hyperparameters with grid search
estimator = lgb.LGBMRegressor(num_leaves = 31)
param_grid = {"learning_rate": [0.01, 0.1, 1], "n_estimators": [20, 40]}
gbm = GridSearchCV(estimator, param_grid)
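# Train the model
gbm.fit(X_train, y_train)
# Print the best parameters
print("Best parameters found by grid search are: ", gbm.best_params_)
# Scores on the training and validation sets (R^2, since the estimator is a regressor)
print("Training set:", gbm.score(X_train, y_train))  # training set
print("Validation set:", gbm.score(X_test, y_test))  # validation set
# 4. Generate the PMML file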
pipeline = PMMLPipeline([("classifier", gbm)])
pipeline.fit(X_train, y_train)
sklearn2pmml(pipeline, r"C:/Users/Desktop/data/gbm_iris.pmml")