Using edgeR
1) Introduction
Reposted from: /djx571/p/9647011.html
edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries. The counts represent the total number of reads aligning to each gene (or other genomic locus). edgeR is concerned with differential expression analysis rather than with the quantification of expression levels: it is concerned with relative changes in expression levels between conditions, but not directly with estimating absolute expression levels.
Note that edgeR is designed to work with actual read counts. We do not recommend that predicted transcript abundances are input to edgeR in place of actual counts.
Why normalization is needed:
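Technical factors that affect differential expression analysis:
1) Sequencing depth: differences in sequencing depth between samples (i.e. library size);
2) RNA composition: a few exceptionally highly expressed genes cause the remaining genes to be under-sampled;
3) GC content: sample-specific effects for GC content can be detected;
4) Gene length: sample-specific effects for gene length have been detected.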
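The RNA composition effect in item 2 can be illustrated with a small base R sketch. The gene names, counts, and sequencing depth are invented for illustration: four genes are truly unchanged between libraries A and B, while a fifth gene is strongly over-expressed in B and takes up most of B's reads.

```r
# Toy expression levels: g1..g4 are unchanged, g5 is much higher in B.
counts <- matrix(c(100,  100,
                   200,  200,
                   300,  300,
                   400,  400,
                   1000, 9000),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("g", 1:5), c("A", "B")))

# Assume both libraries are sequenced to the same depth (2000 reads),
# so reads are shared out in proportion to expression within each library.
depth <- 2000
sampled <- round(sweep(counts, 2, colSums(counts), "/") * depth)

# Naive scaling by library size only (counts per million).
cpm_naive <- sweep(sampled, 2, colSums(sampled), "/") * 1e6

# log2 fold change B vs A for the four unchanged genes.
lfc <- log2(cpm_naive[1:4, "B"] / cpm_naive[1:4, "A"])
print(round(lfc, 2))
```

Although g1 to g4 do not change, scaling by library size alone makes each of them look about five-fold lower in B, because g5 consumes most of B's reads.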
Note: the input to edgeR must be raw counts, not values that have already been corrected, such as RPKM.
Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputting them to edgeR. For example, users should not enter RPKM or FPKM values to edgeR in place of read counts. Such quantities will prevent edgeR from correctly estimating the mean-variance relationship in the data, which is crucial to the statistical strategies underlying edgeR. Similarly, users should not add artificial values to the counts before inputting them to edgeR.
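A quick sanity check before building the edgeR object can catch input that has already been transformed. The helper below is a hypothetical illustration in base R, not an edgeR function, and the two matrices are invented: it only tests whether a matrix contains non-negative whole numbers.

```r
# Hypothetical helper (not part of edgeR): TRUE if x looks like raw read counts.
looks_like_raw_counts <- function(x) {
  x <- as.matrix(x)
  is.numeric(x) && !anyNA(x) && all(x >= 0) && all(x == round(x))
}

raw  <- matrix(c(10, 0, 250, 31), nrow = 2)  # integer-valued counts
rpkm <- raw / 3.7                            # stand-in for length/depth-scaled values

looks_like_raw_counts(raw)   # TRUE
looks_like_raw_counts(rpkm)  # FALSE
```

A check like this cannot prove that the values are genuine counts, but fractional values are a strong hint that RPKM/FPKM-style scaling has already been applied.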
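2) Installation
3) Building the matrices and differential expression analysis
Two matrices need to be built: 1. the expression matrix; 2. the grouping matrix (the experimental design).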
------------------------------------------------------- Expression matrix -----------------------------------------
3.1) Reading in the data
3.2) Building the DGEList object
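Because a count file of raw data is already available, the DGEList() function can be used directly; otherwise use readDGE(), which assembles the counts from one file per sample.
A DGEList object has three main components:
1. counts, a matrix containing the integer counts;
2. samples, a data frame containing information about the libraries (samples). It has a lib.size column for the library size (sequencing depth) of each sample; if not specified, the library sizes are computed from the column sums of the counts. It also has a group column that specifies the group each sample belongs to;
3. genes, an optional data frame containing annotation for the genes.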
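The three components can be mimicked with a plain list in base R. This is only a sketch of the layout with invented gene and sample names; it is not the real DGEList class, which DGEList() constructs.

```r
# Toy counts: 3 genes x 3 samples (invented values).
counts <- matrix(c(10,  20,  30,
                    5,   0,  15,
                   100, 80, 120),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2", "s3")))

# Sample information: group labels plus library sizes taken from the
# column sums of the counts, which is the default described above.
samples <- data.frame(group        = factor(c("N", "T", "T")),
                      lib.size     = colSums(counts),
                      norm.factors = 1,
                      row.names    = colnames(counts))

# Optional gene annotation, one row per gene.
genes <- data.frame(Symbol = rownames(counts), row.names = rownames(counts))

toy <- list(counts = counts, samples = samples, genes = genes)
print(toy$samples)
```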
3.3) Annotation
The data in this study are several years old, so the transcripts need to be filtered and the gene symbols updated.
1) The study was undertaken a few years ago, so not all of the RefSeq IDs provided match RefSeq IDs currently in use. We retain only those transcripts with IDs in the current NCBI annotation, which is provided by the db package.
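# Installation: install edgeR if missing (the script path is truncated in the source; current R versions use BiocManager::install("edgeR") instead of biocLite)
if (!("edgeR" %in% rownames(installed.packages()))) {
  source("bioconductor/biocLite.R"); biocLite("edgeR")
}
suppressMessages(library(edgeR))
ls("package:edgeR")  # list the objects exported by edgeR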
# Reading in the data
rawdata <- read.delim("E:/software/R/R-3.5.0/library/edgeR/", check.names=FALSE, stringsAsFactors=FALSE)
head(rawdata)
# Build the DGEList object: columns 4-9 hold the counts, columns 1-3 the gene annotation
y <- DGEList(counts=rawdata[,4:9], genes=rawdata[,1:3])
2) Because this dataset is keyed by NCBI RefSeq IDs, each RefSeq ID is first mapped to its Entrez Gene ID, and the Entrez Gene ID is then used to update the gene symbol.
3.4) Filtering and normalization
Filter 1: Different RefSeq transcripts for the same gene symbol count predominantly the same reads. So we keep one transcript for each gene symbol. We choose the transcript with the highest overall count:
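As a worked toy example of this filter in base R (the transcript IDs, symbols, and counts are made up), sorting by total count and then dropping repeated symbols keeps the most highly counted transcript of each gene:

```r
# Invented annotation: two transcripts each for GENEA and GENEC.
genes <- data.frame(RefSeqID = c("NM_A1", "NM_A2", "NM_B1", "NM_C1", "NM_C2"),
                    Symbol   = c("GENEA", "GENEA", "GENEB", "GENEC", "GENEC"),
                    stringsAsFactors = FALSE)
counts <- matrix(c( 50,  60,
                   500, 400,
                    10,  20,
                     5,   5,
                    70,  90),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(genes$RefSeqID, c("s1", "s2")))

# Sort transcripts by total count, highest first.
o <- order(rowSums(counts), decreasing = TRUE)
counts <- counts[o, ]
genes  <- genes[o, ]

# duplicated() flags every occurrence after the first, so the first
# (highest-count) transcript of each symbol is the one that is kept.
keep <- !duplicated(genes$Symbol)
counts <- counts[keep, ]
genes  <- genes[keep, ]

genes$RefSeqID  # "NM_A2" "NM_C2" "NM_B1"
```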
Filter 2: Normally we would also filter lowly expressed genes. For this data, all transcripts already have at least 50 reads for all samples of at least one of the tissue types.
Normalization: TMM normalization is applied to this dataset to account for compositional differences between the libraries.
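The idea behind TMM (trimmed mean of M-values) can be sketched in base R. This is a deliberately simplified version on invented counts: it takes an unweighted trimmed mean of the per-gene log ratios, whereas edgeR's calcNormFactors() additionally weights the genes, trims on average abundance, and rescales the factors. The 30% trim is an assumption made for this toy example.

```r
# Two libraries of equal size; g1..g4 are truly unchanged, but g5
# dominates library B and squeezes the counts of the other genes.
sampled <- matrix(c( 100,   20,
                     200,   40,
                     300,   60,
                     400,   80,
                    1000, 1800),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(paste0("g", 1:5), c("A", "B")))
lib <- colSums(sampled)

# M-values: per-gene log2 ratio of library-size-scaled counts, B vs A.
M <- log2((sampled[, "B"] / lib[["B"]]) / (sampled[, "A"] / lib[["A"]]))

# Trimmed mean of M (drop the most extreme values at each end),
# converted back to a scaling factor for library B.
f_B <- 2^mean(M, trim = 0.3)
eff_lib_B <- lib[["B"]] * f_B   # effective library size of B

# CPM using the effective library size for B.
cpm_A <- sampled[, "A"] / lib[["A"]] * 1e6
cpm_B <- sampled[, "B"] / eff_lib_B * 1e6
lfc_norm <- log2(cpm_B[1:4] / cpm_A[1:4])
print(round(lfc_norm, 3))
```

With the trimmed-mean factor applied, the four unchanged genes have a log fold change of essentially zero, instead of appearing down-regulated because of g5.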
3.5) Data exploration
Relationships between samples (examine the samples for outliers and for other relationships)
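In the MDS plot, the first dimension (PC1) separates the tumor and normal groups, and the second dimension (PC2) roughly corresponds to patient. This also indirectly reflects the heterogeneity of the tumor group.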
-------------------------- Grouping matrix (based on the experimental design and aims) --------------------------------
Here we want to test for differential expression between tumour and normal tissues within patients, i.e. adjusting for differences between patients.
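What such a paired design encodes can be seen in a minimal base R sketch. The patient labels P1 to P3 are placeholders chosen for illustration. With an additive formula, the patient columns absorb baseline differences between patients, and the last column carries the tumour-versus-normal effect.

```r
# Three hypothetical patients, each with a normal (N) and tumour (T) sample.
Patient <- factor(c("P1", "P1", "P2", "P2", "P3", "P3"))
Tissue  <- factor(c("N", "T", "N", "T", "N", "T"))

design <- model.matrix(~ Patient + Tissue)
colnames(design)
# "(Intercept)" "PatientP2" "PatientP3" "TissueT"

# The TissueT column is 1 for tumour samples and 0 for normal samples.
design[, "TissueT"]
```

glmLRT() tests the last coefficient of the design matrix by default, so with this column order the test is tumour versus normal, adjusted for patient.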
# Annotation: retain only those transcripts with IDs in the current NCBI annotation
# (assumption: the package prefix was stripped from the source; org.Hs.eg.db is assumed here because the data are human)
library(org.Hs.eg.db)
idfound <- y$genes$RefSeqID %in% mappedRkeys(org.Hs.egREFSEQ)
y <- y[idfound,]
dim(y)  ## 15550 6
# Add Entrez Gene IDs to the annotation
egREFSEQ <- toTable(org.Hs.egREFSEQ)
m <- match(y$genes$RefSeqID, egREFSEQ$accession)
y$genes$EntrezGene <- egREFSEQ$gene_id[m]
# Use the Entrez Gene IDs to update the gene symbols
egSYMBOL <- toTable(org.Hs.egSYMBOL)
m <- match(y$genes$EntrezGene, egSYMBOL$gene_id)
y$genes$Symbol <- egSYMBOL$symbol[m]
head(y$genes)
# Filtering: keep one transcript per gene symbol, the one with the highest overall count
o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes$Symbol)
y <- y[!d,]
nrow(y)
# Recompute the library sizes after filtering
y$samples$lib.size <- colSums(y$counts)
# Use Entrez Gene IDs as row names
rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
y$genes$EntrezGene <- NULL
# Normalization: TMM
y <- calcNormFactors(y)
y$samples
# Data exploration: MDS plot of the samples
plotMDS(y)
3.6) Estimating the dispersion (estimate the NB dispersion for the dataset)
----------------------------------- Differential expression -----------------------------------------
3.7) Differential expression
------------------------------- Gene ontology analysis ----------------------------------------
BP (biological process) analysis of the up-regulated genes
# Design matrix: tumour vs normal within patients
Patient <- factor(c(8,8,33,33,51,51))
Tissue <- factor(c("N","T","N","T","N","T"))
data.frame(Sample=colnames(y), Patient, Tissue)
design <- model.matrix(~Patient+Tissue)
rownames(design) <- colnames(y)
design
y <- estimateDisp(y, design, robust=TRUE)
y$common.dispersion  # 0.1594505
plotBCV(y)
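To read the common dispersion, it helps to convert it to the biological coefficient of variation (BCV), which is the square root of the dispersion, and to compare the negative binomial variance with the Poisson variance. The sketch below uses base R only; the mean count of 100 is an arbitrary value chosen for illustration.

```r
common_dispersion <- 0.1594505   # value reported in the source

# BCV is the square root of the NB dispersion.
bcv <- sqrt(common_dispersion)
print(round(bcv, 3))

# NB variance for a gene with mean count mu: var = mu + dispersion * mu^2.
mu <- 100                        # arbitrary illustrative mean count
nb_var   <- mu + common_dispersion * mu^2
pois_var <- mu                   # Poisson variance equals the mean
print(c(poisson = pois_var, negative_binomial = nb_var))
```

A BCV of roughly 0.4 means that true expression levels vary by about 40% between replicates of the same condition, on top of the Poisson counting noise.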
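# Differential expression: fit the GLM and test the last coefficient of the design (tumour vs normal)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
topTags(lrt)
summary(decideTests(lrt))
plotMD(lrt)
abline(h=c(-1, 1), col="blue")  # lines at log2 fold change of -1 and 1, i.e. 2-fold changes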
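decideTests() by default calls genes significant using Benjamini-Hochberg adjusted p-values at a 5% threshold. The effect of that adjustment can be shown with base R's p.adjust() on a handful of made-up p-values:

```r
# Invented raw p-values for five genes.
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment (controls the false discovery rate).
fdr <- p.adjust(p, method = "BH")
print(fdr)

sum(p   < 0.05)  # 4 genes pass on raw p-values
sum(fdr < 0.05)  # 2 genes pass after BH adjustment

# The blue lines drawn at log2 fold changes of -1 and 1 correspond to
# halving and doubling of expression.
2^c(-1, 1)       # 0.5 2.0
```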
go <- goana(lrt)
topGO(go, ont="BP", sort="Up", n=30)