Attribution Analysis Notes 6: SHAP Code Notes
Table of Contents
Python package:
Package documentation:
SHAP (SHapley Additive exPlanations) is an attribution method: a global interpretation technique that describes how features affect the model's average behavior. It is built on Shapley values, a local explanation method for individual predictions, and obtains global explanations by combining the Shapley values.
For an introduction to the SHAP package, see:
For SHAP multi-class classification, see:
Installation
pip install shap
or
conda install -c conda-forge shap
activate Liver
pip install shap
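To confirm the install worked, a quick sanity check (a minimal sketch):

import shap
print(shap.__version__)  # prints the installed shap version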
Usage example
This is an implementation of Kernel SHAP. Kernel SHAP is a model-agnostic method for estimating the SHAP values of any model. Because it makes no assumptions about the model type, KernelExplainer is slower than the model-specific algorithms.
This example explains a multi-class SVM on the iris dataset.
Full notebook code (explaining 6 kinds of scikit-learn models):
import sklearn
import shap
from sklearn.model_selection import train_test_split
# print the JS visualization code to the notebook
shap.initjs()
# train a SVM classifier
X_train,X_test,Y_train,Y_test = train_test_split(*shap.datasets.iris(), test_size=0.2, random_state=0)
svm = sklearn.svm.SVC(kernel='rbf', probability=True)
svm.fit(X_train, Y_train)
# use Kernel SHAP to explain the predictions on the test set
explainer = shap.KernelExplainer(svm.predict_proba, X_train, link="logit")
shap_values = explainer.shap_values(X_test, nsamples=100)
Here, nsamples=100 in explainer.shap_values(X_test, nsamples=100) is the number of times the model is re-evaluated when explaining each prediction (i.e., each individual test sample; see below).
The contributions of the 4 features are explained separately, pushing the prediction from the average output of 0.3206 toward 0.01.
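That plot can be reproduced with force_plot; a minimal sketch using the variables from the example above (the visualization section below shows the same call):

shap.initjs()
# explain the first test sample for the first class output; values are in logit space
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")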
shap_values()
Let's step into the shap_values function for a closer look:
Purpose: estimate the SHAP values for a set of samples.
Parameters
X : numpy.array or pandas.DataFrame or any scipy.sparse matrix
A matrix of samples used to explain the model's output.
nsamples : "auto" or int
Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The "auto" setting uses `nsamples = 2 * X.shape[1] + 2048`.
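For the iris example above (4 features), "auto" therefore works out to `nsamples = 2 * 4 + 2048 = 2056`.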
Returns
array or list
For models with a single output this returns a matrix of SHAP values (# samples x # features). Each row sums to the difference between the model output for that sample and the expected value of the model output (which is stored as expected_value attribute of the explainer). For models with vector outputs this returns a list of such matrices, one for each output.
(This becomes clear after looking at the examples in the documentation.) In the shap_values matrix, each row is one sample:
shap_values is commonly used to draw the global summary;
shap_values[0] is used to plot a single instance (replacing the 0 with any other index works the same way).
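The row-sum property above can be checked directly. A minimal sketch, assuming the iris SVM example and its list-of-matrices shap_values (the explainer was built with link="logit", so the sums live in log-odds space):

import numpy as np
from scipy.special import logit

# one (n_samples x n_features) matrix per class output in the shap_values list
for k in range(len(shap_values)):
    # each row plus the class's expected value should reconstruct logit(P(class k))
    reconstructed = shap_values[k].sum(axis=1) + explainer.expected_value[k]
    print(np.allclose(reconstructed, logit(svm.predict_proba(X_test)[:, k]), atol=1e-4))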
Using the return value of KernelExplainer
Example reference:
Set the explainer using the Kernel Explainer (model-agnostic explainer method from SHAP)
explainer = shap.KernelExplainer(model = model.predict, data = X.head(50), link = "identity")
Get the Shapley values for a single example (i.e., feed in just one sample to explain):
# Set the index of the specific example to explain
X_idx = 0
shap_value_single = explainer.shap_values(X = X.iloc[X_idx:X_idx+1,:], nsamples = 100)
Show the details (input values) of the single sample:
X.iloc[X_idx:X_idx+1,:]
Force plot for a single sample and a single label:
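The plot below relies on list_of_labels and current_label from the original notebook. A minimal stand-in, assuming ipywidgets and that current_label.value holds the integer index of the class to show (the label names here are placeholders):

import ipywidgets as widgets

# hypothetical class names; adapt these to your own data
list_of_labels = ['class 0', 'class 1', 'class 2']

# dropdown widget whose .value is the integer index of the selected label
current_label = widgets.Dropdown(options=[(name, i) for i, name in enumerate(list_of_labels)],
                                 value=0, description='Label:')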
# print the JS visualization code to the notebook
shap.initjs()
print(f'Current label Shown: {list_of_labels[current_label.value]}')
shap.force_plot(base_value = explainer.expected_value[current_label.value],
                shap_values = shap_value_single[current_label.value],
                features = X.iloc[X_idx:X_idx+1,:]
)
As you can see, this force plot directly accepts the shap_values returned by the Kernel explainer as its input.
Create a summary plot for a specific output/label/target:
# Note: limited to the first 50 training samples because computing all of them takes too long
shap_values = explainer.shap_values(X = X.iloc[0:50,:], nsamples = 100)
# print the JS visualization code to the notebook
shap.initjs()
print(f'Current Label Shown: {list_of_labels[current_label.value]}\n')
shap.summary_plot(shap_values = shap_values[current_label.value],
features = X.iloc[0:50,:]
)
shap.initjs()
shap.force_plot(base_value = explainer.expected_value[current_label.value],
shap_values = shap_values[current_label.value],
features = X.iloc[0:50,:]
)
As you can see, summary_plot, just like force_plot, accepts the Kernel Explainer's shap_values as an argument.
Based on the summary plot above, we can see that features 01, 03, and 07 have no impact on the model and can be removed.
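One way to make that judgment quantitatively is to rank features by mean absolute SHAP value. A sketch, assuming the list-of-arrays shap_values and the DataFrame X from above:

import numpy as np

# average |SHAP| over all samples and all class outputs, giving one score per feature
mean_abs = np.abs(np.stack(shap_values)).mean(axis=(0, 1))
for name, score in sorted(zip(X.columns, mean_abs), key=lambda t: -t[1]):
    print(f'{name}: {score:.4f}')  # near-zero scores mark candidates for removal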
KernelExplainer
"""Us the Kernel SHAP method to explain the output of any function.
Kernel SHAP is a method that us a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficents from a local linear regression.
Parameters
----------
model : function or iml.Model
User supplied function that takes a matrix of samples (# samples x # features) and computes the output of the model for those samples.
The output can be a vector (# samples) or a matrix (# samples x # model outputs).
data : numpy.array or pandas.DataFrame or DenseData or any scipy.sparse matrix
The background dataset to use for integrating out features. To determine the impact of a feature, that feature is set to "missing" and the change in the model output is observed. Since most models aren't designed to handle arbitrary missing data at test time, we simulate "missing" by replacing the feature with the values it takes in the background dataset. So if the background dataset is a simple sample of all zeros, then we would approximate a feature being missing by setting it to zero. For small problems this background dataset can be the whole training set, but for larger problems consider using a single reference value or using the kmeans function to summarize the dataset.
Note: for the sparse case we accept any sparse matrix but convert to lil format for performance.
link : "identity" or "logit"
A generalized linear model link to connect the feature importance values to the model output. Since the feature importance values, phi, sum up to the model output, it often makes sense to connect them to the output with a link function where link(output) = sum(phi).
If the model output is a probability then the LogitLink link function makes the feature importance values have log-odds units.
Examples
--------
See :ref:`Kernel Explainer Examples <kernel_explainer_examples>`
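For larger background datasets, the docstring above suggests summarizing rather than passing every row; shap.kmeans is the helper it refers to. A sketch on the iris example:

# summarize the background data into 10 weighted centroids instead of passing all rows
background = shap.kmeans(X_train, 10)
explainer = shap.KernelExplainer(svm.predict_proba, background, link="logit")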
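What the logit link means in practice can be seen on the stored expected value, which lives in log-odds space. A sketch, assuming the iris explainer above:

from scipy.special import expit

print(explainer.expected_value[0])         # average model output for class 0, in log-odds
print(expit(explainer.expected_value[0]))  # mapped back to a probability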
Visualization
See this example for reference; there are many kinds of plots, all likewise built around the explainer.
Visualize the explanation of the first prediction. If you don't want to use JS, pass matplotlib=True.
# plot the SHAP values for the Setosa output of the first instance
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")
Stacked force plot
force_plot
Stack the explanations for every sample in the whole dataset (test set):
# plot the SHAP values for the Setosa output of all instances
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test, link="logit")
We can also take just the mean absolute value of the SHAP values for each feature to get a standard bar plot (it produces stacked bars for multi-class outputs):
shap.plots.bar(shap_values)
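Note that shap.plots.bar expects the newer Explanation objects; with the legacy list-of-arrays output used throughout these notes, an equivalent call is the following sketch:

# stacked bar chart of mean |SHAP| per feature, with one segment per class output
shap.summary_plot(shap_values, X_test, plot_type="bar")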