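Normalization methods in DESeq2: vst
First, vst is also a normalization method based on the negative binomial distribution.
Why do we need the vst transformation for large-sample data? The starting point is that the plain logarithmic transformation has several drawbacks: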
1. It is a one-size-fits-all solution, ignoring the measurement noise characteristics associated with each instrument and each run.
2. Negative values that frequently result from background correction of low-intensity signals have to be reset before taking the logarithm, and thus they are artificially truncated.
3. Logarithmic transformation inflates variances when the intensities are close to zero, although it stabilizes the variances at higher intensities.
4. A 2-fold difference can be very significant when the intensities are high; however, when the intensities are close to the background level, a 2-fold difference can be within the expected measurement error.
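In summary, the rlog transformation is slow for large sample sizes and is rather sensitive to the count values, so the vst transformation is introduced to obtain a matrix of values that is approximately homoskedastic.
The difference between rlog and vst is also described on Biostars: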
This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. The transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
variance-stabilizing transformation
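The variance-stabilizing transformation is a normalization approach published in NAR in 2008; for details see: Model-based variance-stabilizing transformation for Illumina microarray data
This paper mainly describes the model behind the vst transformation:
1. First, for each gene, the model computes the mean and the variance across samples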
[Image: Equation 1, per-gene mean and variance]
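The expression above computes, for each gene, the mean and the variance over the different samples
2. Estimate the functional relationship between the mean and the variance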
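As a minimal sketch of this step, the snippet below computes a mean and a variance for each gene across samples. The gene names and values in `toy_counts` are hypothetical, and using the sample variance (dividing by N - 1) is an assumption, since the original equation image is not available.

```python
from statistics import mean, variance


def gene_mean_variance(counts):
    """Return {gene: (mean, sample variance)} for a dict of gene -> per-sample values."""
    return {gene: (mean(values), variance(values)) for gene, values in counts.items()}


# Hypothetical toy data: 3 genes measured in 4 samples.
toy_counts = {
    "geneA": [10.0, 12.0, 8.0, 10.0],
    "geneB": [100.0, 140.0, 60.0, 100.0],
    "geneC": [1000.0, 1400.0, 600.0, 1000.0],
}

stats = gene_mean_variance(toy_counts)
for gene, (u, v) in stats.items():
    # The variance grows with the mean, which is what the later steps model.
    print(gene, u, v)
```

In this toy example the variance rises sharply with the mean, which is the dependence that the next step tries to capture as a function v(u).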
[Image: Equation 2, the variance as a function of the mean, v(u)]
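where v denotes the variance and u denotes the mean; v(u) expresses that the variance v is a function of the mean u
Then, by mathematical derivation, the function h(y) is obtained as: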
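The equation itself did not survive as text. Judging from the constants c1, c2 and c3 used in the later steps, and from the remark that rearranging it gives a linear model, Equation 2 plausibly has the quadratic form used in the cited NAR paper. The following is a reconstruction under that assumption, not a copy of the original figure:

```latex
v(u) = (c_1 u + c_2)^2 + c_3
```

Under this assumed form, c3 plays the role of a background variance that remains when the signal is near zero, while the squared term describes noise that scales with the signal.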
[Image: Equation 3, the transformation function h(y)]
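Equation 3 is derived from asymptotic theory (the delta method) together with Equation 2. After the transformation we obtain a matrix in which the variances of the genes are approximately equal (the transformation is in effect a further compression, so the differences in variance between genes are reduced)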
[Image: Taylor (delta method) approximation of h(y)]
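The role of the delta method is to use the distribution of the data before applying h(y) to approximate the distribution of the data after applying h(y)
By Taylor's formula: let y1 and y2 denote the expression levels of two genes, so the difference before the transformation is y1 - y2 and the difference after the transformation is h(y1) - h(y2). From the expression above, the difference between expression levels is rescaled by the factor h'(), which shrinks it, and this is how the variances of the genes become approximately equal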
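The argument can be written out as a short derivation. This is the standard first-order delta-method reasoning, given here as a sketch because the original equation images are missing; the notation h, u and v(u) follows the text, and the normalization of the stabilized variance to 1 is a conventional choice, not something stated in the source:

```latex
\begin{aligned}
h(y) &\approx h(u) + h'(u)\,(y - u)
  && \text{(first-order Taylor expansion around the mean } u\text{)} \\
h(y_1) - h(y_2) &\approx h'(u)\,(y_1 - y_2)
  && \text{(differences are rescaled by } h'(u)\text{)} \\
\operatorname{Var}\bigl(h(Y)\bigr) &\approx h'(u)^2\, v(u)
  && \text{(delta method)} \\
h'(u) = \frac{1}{\sqrt{v(u)}}
  &\;\Longrightarrow\; \operatorname{Var}\bigl(h(Y)\bigr) \approx 1,
  \qquad h(y) = \int^{y} \frac{du}{\sqrt{v(u)}}
\end{aligned}
```

Because v(u) grows with u, the factor h'(u) becomes smaller at high expression, which is the shrinking effect described above.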
Reference: Estimating Transformations for Regression via Additivity and Variance Stabilization
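3. Find a suitable transformation function that converts the matrix before normalization (Y) into the normalized matrix (Ỹ). We rearrange Equation 2 as an identity: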
[Image: Equation 2 rearranged so that the right-hand side is linear in u]
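The right-hand side of this equation is a linear model, so we can fit a linear model to the variance and mean of each gene and thereby obtain c1, c2 and c3
4. Substitute c1, c2 and c3 back into Equation 3 and solve the integral: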
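A minimal sketch of such a fit is shown below. It assumes the quadratic form v(u) = (c1*u + c2)^2 + c3, so that sqrt(v - c3) = c1*u + c2 is linear in u, and it further assumes that c3 is already known (for example, estimated separately from background-level measurements). The function name `fit_mean_variance_line` and the test values are hypothetical.

```python
import math


def fit_mean_variance_line(means, variances, c3):
    """Ordinary least-squares fit of sqrt(v - c3) = c1 * u + c2; returns (c1, c2)."""
    # Keep only points where v - c3 is positive, so the square root is defined.
    points = [(u, math.sqrt(v - c3)) for u, v in zip(means, variances) if v > c3]
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in points)
    sxy = sum((u - mean_u) * (s - mean_s) for u, s in points)
    c1 = sxy / sxx
    c2 = mean_s - c1 * mean_u
    return c1, c2


# Noise-free synthetic (mean, variance) pairs generated from known constants.
true_c1, true_c2, true_c3 = 0.1, -5.0, 100.0
means = [100.0, 500.0, 1000.0, 5000.0, 10000.0]
variances = [(true_c1 * u + true_c2) ** 2 + true_c3 for u in means]

c1_hat, c2_hat = fit_mean_variance_line(means, variances, true_c3)
print(c1_hat, c2_hat)
```

With real data the points scatter around the line, and the fit is only meaningful where c1*u + c2 is non-negative, since the square root discards the sign.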
[Image: closed form of h(y) after integration]
This gives us the transformation formula h(y), which can now be used to carry out the normalization
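To illustrate what the transformation achieves, here is a small simulation sketch. It assumes the quadratic form v(u) = (c1*u + c2)^2 + c3 with c3 > 0, for which the integral of 1/sqrt(v(u)) has the closed form h(y) = asinh((c2 + c1*y) / sqrt(c3)) / c1. The data-generating model (background `alpha`, multiplicative noise `s_eta`, additive noise `s_eps`) and all parameter values are assumptions made for the example, not taken from the original post or from DESeq2.

```python
import math
import random
from statistics import variance


def vst_h(y, c1, c2, c3):
    """Closed-form h(y) for the assumed model v(u) = (c1*u + c2)^2 + c3, with c3 > 0."""
    return math.asinh((c2 + c1 * y) / math.sqrt(c3)) / c1


random.seed(0)

# Assumed noise model: y = alpha + mu * exp(eta) + eps,
# with eta ~ N(0, s_eta^2) and eps ~ N(0, s_eps^2).
alpha, s_eta, s_eps = 50.0, 0.1, 10.0
m_eta = math.exp(s_eta ** 2 / 2)  # E[exp(eta)]
sd_exp_eta = math.sqrt((math.exp(s_eta ** 2) - 1) * math.exp(s_eta ** 2))  # sd of exp(eta)

# Constants of v(u) implied by this noise model.
c1 = sd_exp_eta / m_eta
c2 = -alpha * c1
c3 = s_eps ** 2


def simulate(mu, n=5000):
    """Draw n noisy measurements of a gene with true signal mu."""
    return [
        alpha + mu * math.exp(random.gauss(0, s_eta)) + random.gauss(0, s_eps)
        for _ in range(n)
    ]


results = {}
for mu in (0.0, 100.0, 10000.0):
    raw = simulate(mu)
    transformed = [vst_h(y, c1, c2, c3) for y in raw]
    results[mu] = (variance(raw), variance(transformed))
    print(mu, results[mu])
```

The raw variance differs by several orders of magnitude between the low and high signal levels, whereas the variance after h(y) should stay close to 1 at every level, up to simulation noise and the first-order approximation.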
The vst code in detail
In DESeq2, the code for vst is as follows: