rsq: an R package for computing R-squared
Theory
Linear models
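R-squared (ranging from 0 to 1) describes how much of the variation in the output variable is explained by the input variables. In single-variable linear regression, a larger R-squared indicates a better fit. Its mathematical expression is:
$$R^2 = \frac{SSR}{TSS} = 1 - \frac{RSS}{TSS}$$
where:
TSS is the total variation inherent in the response variable before the regression is carried out;
RSS is the residual sum of squares (the variation that the regression model cannot explain);
SSR is the variation that the regression model can explain.
However, whenever more variables are added, R-squared either stays the same or increases, regardless of whether the added variables are related to the output variable.
This is why adjusted R-squared (range $(-\infty, 1]$) is needed: it adds a penalty for added variables that do not improve the model.
Its mathematical expression is:
$$R^2_{adj} = 1 - (1 - R^2) \times \frac{n - 1}{n - p - 1}$$
(p is the number of variables and n is the number of samples)
An equivalent expression is:
$$R^2_{adj} = 1 - \frac{RSS / (n - p - 1)}{TSS / (n - 1)}$$
In conclusion, use R-squared to evaluate a single-variable linear regression and adjusted R-squared to evaluate a multiple-variable regression.
In single-variable linear regression, R-squared and adjusted R-squared agree in how they rank fits: with p fixed at 1, adjusted R-squared is an increasing function of R-squared, although the two values themselves differ slightly.
The rsq package in R can compute several forms of R-squared for linear models and generalized linear models.
Generalized linear models
For generalized linear models, R-squared is usually defined as the proportion of uncertainty explained (the proportionate reduction in uncertainty, measured by KL divergence, due to the inclusion of regressors).
One KL-divergence-based definition of R-squared is:
$$R^2_{KL} = 1 - \frac{K(y, \hat{\mu})}{K(y, \hat{\mu}_0)}$$
where
$$K(y, \hat{\mu}) = \sum_{i=1}^{n} \left[ \log f_{y_i}(y_i) - \log f_{\hat{\mu}_i}(y_i) \right]$$
Here $\hat{\mu}_i = \exp(x_i \hat{\beta})$ denotes the mean estimated by the model, and $\hat{\mu}_0$ is the maximum likelihood estimate of $\mu$ obtained directly, without regressors, which is usually the mean of $y$.
Reference:
A. C. Cameron, F. A. G. Windmeijer. An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics. 1997, 77: 329-342.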
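The definitions in this section can be sketched in base R, without the rsq package. The code below computes R-squared and adjusted R-squared from TSS and RSS for a linear model, and the KL-divergence-based R-squared for a Poisson GLM. The built-in data sets mtcars and warpbreaks, the chosen formulas, and all object names are illustrative assumptions, not part of the original article.

```r
# Linear model: R-squared and adjusted R-squared from their definitions.
# mtcars is a built-in data set, used here only for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg
n   <- length(y)
p   <- length(coef(fit)) - 1          # number of predictors (intercept excluded)

tss <- sum((y - mean(y))^2)           # total sum of squares
rss <- sum(residuals(fit)^2)          # residual sum of squares
ssr <- tss - rss                      # variation explained by the regression

r2      <- 1 - rss / tss
r2_adj  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
r2_adj2 <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))   # equivalent form

# Poisson GLM: KL-divergence-based R-squared.
# kl() sums log f_{y_i}(y_i) - log f_{mu_i}(y_i) over the observations.
pfit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
yy   <- warpbreaks$breaks
kl   <- function(y, mu) sum(dpois(y, y, log = TRUE) - dpois(y, mu, log = TRUE))
r2_kl <- 1 - kl(yy, fitted(pfit)) / kl(yy, mean(yy))

print(c(r2 = r2, r2_adj = r2_adj, r2_kl = r2_kl))
```

The linear-model values can be checked against summary(fit)$r.squared and summary(fit)$adj.r.squared. For the Poisson family the KL ratio coincides with the ratio of residual deviance to null deviance, so r2_kl can be checked against 1 - pfit$deviance / pfit$null.deviance.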
Functions
rsq()
rsq(fitObj,adj=FALSE,type=c('v','kl','s','lr','n'))
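Arguments:
fitObj: an object of class "lm", "glm", "merMod", "lmerMod", or "lme"; usually the result of a call to lm, glm, glm.nb, lmer or glmer (from lme4), or lme (from nlme).
adj: logical; if TRUE, the adjusted R^2 is computed.
type: the type of R-squared (only applicable to generalized linear models):
'v' (default) – variance-function-based (Zhang, 2017), calling rsq.v;
'kl' – KL-divergence-based (Cameron and Windmeijer, 1997), calling rsq.kl;
's' – SSE-based (Efron, 1978), calling rsq.s;
'lr' – likelihood-ratio-based (Maddala, 1983; Cox and Snell, 1989; Magee, 1990), calling rsq.lr;
'n' – corrected version of 'lr' (Nagelkerke, 1991), calling rsq.n.
Value:
In addition to the R-squared value, the function returns the following for (generalized) linear mixed models:
R_M^2: the proportion of variation explained by the model in total, including both fixed effects and random effects.
R_F^2: the proportion of variation explained by the fixed-effects factors.
R_R^2: the proportion of variation explained by the random-effects factors.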
Example
library(rsq)
data(hcrabs)
attach(hcrabs)
y <- ifelse(num.satellites>0,1,0)
bnfit <- glm(y~color+spine+width+weight,family=binomial)
rsq(bnfit)
# [1] 0.2171238
rsq(bnfit,adj=TRUE)
# [1] 0.1839109
quasibn <- glm(y~color+spine+width+weight,family=quasibinomial)
rsq(quasibn)
# [1] 0.2171238
rsq(quasibn,adj=TRUE)
# [1] 0.1839109
psfit <- glm(num.satellites~color+spine+width+weight,family=poisson)
rsq(psfit)
# [1] 0.1172267
rsq(psfit,adj=TRUE)
# [1] 0.07977572
quasips <- glm(num.satellites~color+spine+width+weight,family=quasipoisson)
rsq(quasips)
# [1] 0.1172267
rsq(quasips,adj=TRUE)
# [1] 0.07977572
# Linear mixed models
require(lme4)
lmm1 <- lmer(Reaction~Days+(Days|Subject),data=sleepstudy)
rsq(lmm1)
# $model
# [1] 0.8003832
#
# $fixed
# [1] 0.2864714
#
# $random
# [1] 0.5139119
rsq.lmm(lmm1)
# $model
# [1] 0.8003832
#
# $fixed
# [1] 0.2864714
#
# $random
# [1] 0.5139119
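To make the 'lr' and 'n' types more concrete, here is a minimal base R sketch of the textbook likelihood-ratio-based R-squared and Nagelkerke's correction for a binary logistic regression. It reimplements the standard formulas and does not call rsq, so agreement with rsq.lr and rsq.n digit for digit is an assumption that has not been verified here. The mtcars data set and the model am ~ wt are illustrative choices.

```r
# Binary logistic regression on the built-in mtcars data (illustrative only).
fit  <- glm(am ~ wt, family = binomial, data = mtcars)
null <- glm(am ~ 1,  family = binomial, data = mtcars)   # intercept-only model
n    <- nrow(mtcars)

ll1 <- as.numeric(logLik(fit))    # log-likelihood of the fitted model
ll0 <- as.numeric(logLik(null))   # log-likelihood of the null model

# Likelihood-ratio-based R-squared (Maddala; Cox and Snell; Magee).
r2_lr <- 1 - exp(-2 / n * (ll1 - ll0))

# Nagelkerke's correction: divide by the largest value r2_lr can attain,
# so that the corrected measure can reach 1.
r2_max <- 1 - exp(2 / n * ll0)
r2_n   <- r2_lr / r2_max

print(c(lr = r2_lr, max = r2_max, n = r2_n))
```

Because r2_max is below 1 for discrete responses, the corrected value r2_n is always at least as large as r2_lr.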
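rsq.partial()
This function computes the partial R^2 (aka the coefficient of partial determination) for linear models and generalized linear models.
rsq.partial(objF,objR=NULL,adj=FALSE,type=c('v','kl','s','lr','n'))
Arguments:
objF: an object of class "lm" or "glm", the result of a call to lm, glm, or glm.nb that fits the full model.
objR: an object of class "lm" or "glm", the result of a call to lm, glm, or glm.nb that fits the reduced model.
adj: logical; if TRUE, calculate the adjusted partial R^2.
type: the type of R-squared:
'v' (default) – variance-function-based (Zhang, 2017), calling rsq.v;
'kl' – KL-divergence-based (Cameron and Windmeijer, 1997), calling rsq.kl;
's' – SSE-based (Efron, 1978), calling rsq.s;
'lr' – likelihood-ratio-based (Maddala, 1983; Cox and Snell, 1989; Magee, 1990), calling rsq.lr;
'n' – corrected version of 'lr' (Nagelkerke, 1991), calling rsq.n.
Value:
The returned value contains adjustment and partial.rsq. When objR is not NULL, variable.full and variable.reduced are also returned; otherwise variable is returned.
adjustment: logical; if TRUE, the adjusted partial R^2 is calculated.
variable.full: all covariates in the full model.
variable.reduced: all covariates in the reduced model.
variable: all covariates in the full model.
partial.rsq: the partial R^2 or the adjusted partial R^2.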
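For a linear model, the partial R^2 of a set of covariates can be sketched in base R as the proportional reduction in residual sum of squares when moving from the reduced model to the full model. This is the standard textbook definition, assumed here rather than taken from the rsq source code. The mtcars data set and the chosen formulas are illustrative, and the object names objF and objR simply mirror the argument names above.

```r
# Full and reduced linear models on the built-in mtcars data (illustrative only).
objF <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # full model
objR <- lm(mpg ~ wt, data = mtcars)               # reduced model (drops hp and qsec)

rssF <- sum(residuals(objF)^2)   # residual sum of squares, full model
rssR <- sum(residuals(objR)^2)   # residual sum of squares, reduced model

# Partial R^2: share of the reduced model's unexplained variation
# that the extra covariates account for.
partial_rsq <- (rssR - rssF) / rssR

# The same quantity written with the two ordinary R^2 values.
r2F <- summary(objF)$r.squared
r2R <- summary(objR)$r.squared
partial_rsq2 <- 1 - (1 - r2F) / (1 - r2R)

print(partial_rsq)
```

With the rsq package loaded, the corresponding call for these two fits would be rsq.partial(objF, objR).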
References
1. Cameron, A. C. and Windmeijer, F. A. G. (1997) An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77: 329-342.
2. Cox, D. R. and Snell, E. J. (1989) The Analysis of Binary Data, 2nd ed. London: Chapman and Hall.
3. Efron, B. (1978) Regression and ANOVA with zero-one data: measures of residual variation. Journal of the American Statistical Association, 73: 113-121.
4. Maddala, G. S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press.
5. Magee, L. (1990) R^2 measures based on Wald and likelihood ratio joint significance tests. The American Statistician, 44: 250-253.
6. Nagelkerke, N. J. D. (1991) A note on a general definition of the coefficient of determination. Biometrika, 78: 691-692.
7. Zhang, D. (2017) A coefficient of determination for generalized linear models. The American Statistician, 71(4): 310-316.
pcor()
This function computes partial correlation coefficients for linear models and generalized linear models.
pcor(objF,objR=NULL,adj=FALSE,type=c('v','kl','s','lr','n'))
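Arguments:
objF: an object of class "lm" or "glm", the result of a call to lm, glm, or glm.nb that fits the full model.
objR: an object of class "lm" or "glm", the result of a call to lm, glm, or glm.nb that fits the reduced model.
adj: logical; if TRUE, calculate the adjusted partial R^2.
type: the type of R-squared:
'v' (default) – variance-function-based (Zhang, 2017), calling rsq.v;
'kl' – KL-divergence-based (Cameron and Windmeijer, 1997), calling rsq.kl;
's' – SSE-based (Efron, 1978), calling rsq.s;
'lr' – likelihood-ratio-based (Maddala, 1983; Cox and Snell, 1989; Magee, 1990), calling rsq.lr;
'n' – corrected version of 'lr' (Nagelkerke, 1991), calling rsq.n.
Note:
When the fitted object for the reduced model is missing, the partial correlation coefficient is computed for every variable (except factors with more than two levels).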
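For linear models, the partial correlation between the response and one covariate can be sketched in base R as the correlation between two sets of residuals, each obtained by regressing on the remaining covariates. This is a standard result for linear regression; whether pcor() takes this route internally is not stated in the article, so the code is only an illustration. The mtcars data set and the variable choices are assumptions.

```r
# Partial correlation between mpg and hp, controlling for wt (mtcars, illustrative).
res_y <- residuals(lm(mpg ~ wt, data = mtcars))   # mpg with the effect of wt removed
res_x <- residuals(lm(hp  ~ wt, data = mtcars))   # hp with the effect of wt removed
pcor_hp <- cor(res_y, res_x)

# Its square equals the partial R^2 of hp in the full model.
objF <- lm(mpg ~ wt + hp, data = mtcars)   # full model
objR <- lm(mpg ~ wt, data = mtcars)        # reduced model without hp
partial_rsq <- 1 - sum(residuals(objF)^2) / sum(residuals(objR)^2)

# The sign of the partial correlation matches the sign of hp's coefficient.
coef_hp <- unname(coef(objF)["hp"])
print(c(pcor = pcor_hp, partial_rsq = partial_rsq, coef_hp = coef_hp))
```

The partial correlation is therefore the signed square root of the partial R^2, with the sign taken from the fitted coefficient.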
vresidual()
This function computes the variance-function-based residuals, which are used to compute the variance-function-based R-squared.
vresidual(y,yfit,family=binomial(),variance=NULL)
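Arguments:
y: a vector of observed values.
yfit: a vector of fitted values.
family: the distribution family.
variance: the variance function (specified by family by default).
Note:
The calculation of the residuals depends only on the variance function, so the residuals are well defined for quasi models as well. When the variance function is constant or linear, this yields the classical residuals. Note that only the variance function needs to be specified, through either "family" or "variance".
Example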
data(hcrabs)
attach(hcrabs)
y <- ifelse(num.satellites>0,1,0)
bnfit <- glm(y~color+spine+width+weight,family="binomial")
vresidual(y,bnfit$fitted.values,family="binomial")
# Effectiveness of Bicycle Safety Helmets in Thompson et al. (1989)
y <- matrix(c(17,218,233,758),2,2)
x <- factor(c("yes","no"))
tbn <- glm(y~x,family="binomial")
yfit <- cbind(tbn$fitted.values,1-tbn$fitted.values)
vr0 <- vresidual(matrix(0,2,1),yfit[,1],family="binomial")
vr1 <- vresidual(matrix(1,2,1),yfit[,2],family="binomial")
y[,1]*vr0+y[,2]*vr1