Python-matplotlib-决策曲线分析(DecisionCurveAnalysis )
⽂章⽬录
⼀、决策曲线分析概念
预测模型(predictive models)被⼴泛地应⽤于诊断(diagnosis)或预后预测(prognosis)。通常,这些模型的价值是通过统计学指标如敏感性、特异性、ROC曲线下⾯积、校准度来评估的,⽽这些指标⽆法考虑特定模型的临床实⽤性(clinical utility)。决策曲线分析(decision curve analysis, DCA)是衡量临床实⽤性的⼀种⼴泛使⽤的⽅法。
1. 阈值概率
⼀个预测模型的输出通常为介于0到1之间的⼀个值(pi),根据事前确定的阈值概率(cutoff value, probability threshold, pt),当pi >pt时,判断为阳性;当pi < pt时,判断为阴性。因此,患者被分成了预测阳性⽽施加⼲预和预测阴性⽽不施加⼲预的两组。在预测阳性组中,存在着真阳性病⼈(TP)和假阳性病⼈(FP)。显然,治疗真阳性病⼈会带来受益(benefits),⽽治疗假阳性病⼈会造成伤害(harms)。选择不同的阈值概率,会改变TP和FP的⽐值,从⽽受益和伤害的改变。
2. 净获益
为了同时考虑受益和伤害,决策曲线分析中,将模型的临床效⽤量化为净获益(net benefit)。
对于⼀个总样本量为 n , 阈值为pt的诊断试验,可以画出四格表:
⾦标准(+)
⾦标准(-)模型(+)
TP FP 模型(-)FN TN
阳性组的净获益为:
阴性组的净获益为:
决策曲线定义了这样⼀种关系:
因此,可以计算得到treat all策略(即⽆论预测模型结果如何,所以病⼈都进⾏⼲预)的净获益为:
对于treat none策略,所有病⼈⽆论模型结果如果,都不进⾏⼲预,其净获益恒为0。
net benefit treated =−n TP ∗n FP安卓机顶盒
()
1−p t p t net benefit untreated =−n TN ∗n FN
()
p t 1−p t =()1−p t p t net benefit treated −net benefit treat all net benefit untreated
岜夯鸡net benefit treat all =−n TP +FN ∗n
美智子
TN +FP
()1−p t p t
所谓决策曲线,即是以不同的probability threshold为横坐标,其所对应的net benefit为纵坐标,画出的曲线。
理论成⽴,实践开始!
⼆、matplotlib实现
绘制模型的决策曲线,我们只需要模型输出的 每个样本的预测概率(y_pred_score) 和 每个样本真实的分类(y_label) 。1. 计算模型带来的净获益
模型带来的获益即是模型预测出阳性的部分,因为只有预测阳性的部分会施加和原本不同的⼲预,因此net benefit treated即为net benefit of model:
def calculate_net_benefit_model(thresh_group, y_pred_score, y_label):
net_benefit_model = np.array([])
for thresh in thresh_group:女人胡子
y_pred_label = y_pred_score > thresh
tn, fp, fn, tp = confusion_matrix(y_label, y_pred_label).ravel()
n =len(y_label)
net_benefit =(tp / n)-(fp / n)*(thresh /(1- thresh))
net_benefit_model = np.append(net_benefit_model, net_benefit)
return net_benefit_model
2. 计算treat all策略带来的净获益
def calculate_net_benefit_all(thresh_group, y_label):
net_benefit_all = np.array([])
tn, fp, fn, tp = confusion_matrix(y_label, y_label).ravel()
total = tp + tn
for thresh in thresh_group:
net_benefit =(tp / total)-(tn / total)*(thresh /(1- thresh))
net_benefit_all = np.append(net_benefit_all, net_benefit)
return net_benefit_all
3. 绘制决策曲线
def plot_DCA(ax, thresh_group, net_benefit_model, net_benefit_all):
农业项目招商#Plot
ax.plot(thresh_group, net_benefit_model, color ='crimson', label ='Model')
ax.plot(thresh_group, net_benefit_all, color ='black',label ='Treat all')
ax.plot((0,1),(0,0), color ='black', linestyle =':', label ='Treat none')
#Fill,显⽰出模型较于treat all和treat none好的部分
y2 = np.maximum(net_benefit_all,0)
y1 = np.maximum(net_benefit_model, y2)
ax.fill_between(thresh_group, y1, y2, color ='crimson', alpha =0.2)
#Figure Configuration,美化⼀下细节
ax.t_xlim(0,1)
ax.t_ylim(net_benefit_model.min()-0.15, net_benefit_model.max()+0.15)#adjustify the y axis limitation
ax.t_xlabel(
xlabel ='Threshold Probability',
fontdict={'family':'Times New Roman','fontsize':15}
)
ax.t_ylabel(
ylabel ='Net Benefit',
fontdict={'family':'Times New Roman','fontsize':15}
)
ax.spines['right'].t_color((0.8,0.8,0.8))
ax.spines['top'].t_color((0.8,0.8,0.8))
ax.legend(loc ='upper right')
return ax
三、完整代码
import numpy as np
import matplotlib.pyplot as plt
ics import confusion_matrix
def calculate_net_benefit_model(thresh_group, y_pred_score, y_label):
net_benefit_model = np.array([])
for thresh in thresh_group:
y_pred_label = y_pred_score > thresh
tn, fp, fn, tp = confusion_matrix(y_label, y_pred_label).ravel()
n =len(y_label)
net_benefit =(tp / n)-(fp / n)*(thresh /(1- thresh))
net_benefit_model = np.append(net_benefit_model, net_benefit)
return net_benefit_model
def calculate_net_benefit_all(thresh_group, y_label):
net_benefit_all = np.array([])
tn, fp, fn, tp = confusion_matrix(y_label, y_label).ravel()
total = tp + tn
for thresh in thresh_group:
net_benefit =(tp / total)-(tn / total)*(thresh /(1- thresh))
net_benefit_all = np.append(net_benefit_all, net_benefit)
return net_benefit_all
def plot_DCA(ax, thresh_group, net_benefit_model, net_benefit_all):
#Plot
ax.plot(thresh_group, net_benefit_model, color ='crimson', label ='Model')
ax.plot(thresh_group, net_benefit_all, color ='black',label ='Treat all')
ax.plot((0,1),(0,0), color ='black', linestyle =':', label ='Treat none')
#Fill,显⽰出模型较于treat all和treat none好的部分
y2 = np.maximum(net_benefit_all,0)
y1 = np.maximum(net_benefit_model, y2)
ax.fill_between(thresh_group, y1, y2, color ='crimson', alpha =0.2)
#Figure Configuration,美化⼀下细节
ax.t_xlim(0,1)
ax.t_ylim(net_benefit_model.min()-0.15, net_benefit_model.max()+0.15)#adjustify the y axis limitation
ax.t_xlabel(
xlabel ='Threshold Probability',
fontdict={'family':'Times New Roman','fontsize':15}
)
ax.t_ylabel(
ylabel ='Net Benefit',
fontdict={'family':'Times New Roman','fontsize':15}
)
字幕英文
ax.spines['right'].t_color((0.8,0.8,0.8))
ax.spines['top'].t_color((0.8,0.8,0.8))
ax.legend(loc ='upper right')
return ax
if __name__ =='__main__':
#构造⼀个分类效果不是很好的模型
y_pred_score = np.arange(0,1,0.001)
y_label = np.array([1]*25+[0]*25+[0]*450+[1]*25+[0]*25+[1]*25+[0]*25+[1]*25+[0]*25+[1]*25+[0]*25+[1]*25+[0]*25+[1]*25+[0]*25+[1]*25+ [0]*50+[1]*125)
thresh_group = np.arange(0,1,0.01)
thresh_group = np.arange(0,1,0.01)
net_benefit_model = decision_curve_analysis(thresh_group, y_pred_score, y_label)
net_benefit_all = calculate_net_benefit_all(thresh_group, y_label)
fig, ax = plt.subplots()
ax = plot_DCA(ax, thresh_group, net_benefit_model, net_benefit_all)
# fig.savefig('fig1.png', dpi = 300)
plt.show()
四、拓展
由于存在抽样误差,单次建模的结果可能存在偏倚。通常情况下, 可以采⽤bootstrapping或者k折交叉验证的⽅法来对净获益进⾏校正。同时,还可以⽤这种⽅法获得净获益的置信区间。
1. bootstrapping法校正净获益
1. ⽤原始数据进⾏拟合建模,获得⼀组净获益(未校正的净获益)
2. 从原始数据中采取有放回的随机抽样,得到⼀组⼦数据,⽤这批数据拟合建模
一周计划3. 计算步骤2的模型在步骤2数据集中不同阈值概率的净获益
4. 计算步骤2的模型在原始数据集中不同阈值概率的净获益
5. 计算步骤3和步骤4中两组净获益的差值
6. 重复步骤2-5 n次(通常n = 1000),并可计算n次净效益的差值的平均值
7. ⽤步骤1中未校正的净效益减去步骤6中获得的平均差,即为修正净获益
2. k折交叉验证法校正净获益
1. ⽤原始数据进⾏拟合建模,获得⼀组净获益(未校正的净获益)
2. 将原始样本随机分成k个(例如,k = 5)⼤⼩相等的⼦集
3. 取出⼀个⼦集作为验证集,其他⼦集(k-1个)作为训练集
4. 使⽤训练集拟合建模,再⽤模型来预测验证集,得到预测概率pi
水煮耗儿鱼
5. 重复步骤2-4 k次,所有样本都获得了pi,再计算整个原始数据的净获益
6. 重复分组n次,重复步骤2-5,可计算n次净获益的平均值,即为修正净获益
3. 计算净获益的置信区间
原理:根据bootstrapping法或k折交叉验证法得到的净获益结果,可以根据中⼼极限定理通过正态近似的⽅法求得置信区间