BIOINFORMATICS APPLICATIONS NOTE
Vol. 18 no. 4 2002
Pages 631–633
CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences
Loïc Ponger∗ and Dominique Mouchiroud
Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558 - Université Claude Bernard, 43, Bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France
Received on September 7, 2001; revised on October 23, 2001; accepted on November 7, 2001
ABSTRACT
Results: CpGProD is an application for identifying mammalian promoter regions associated with CpG islands in large genomic sequences. Although it is strictly dedicated to this particular promoter class, which corresponds to ≈50% of the genes, CpGProD exhibits a higher sensitivity and specificity than other tools used for promoter prediction. Notably, CpGProD uses different parameters according to the species studied (human, mouse). Moreover, CpGProD predicts the promoter orientation on the DNA strand.
Availability: pbil.univ-lyon1.fr/software/cpgprod.html
Supplementary information: pbil.univ-lyon1.fr/software/cpgprod.html
Contact: ponger@biomrv.univ-lyon1.fr
INTRODUCTION
A number of promoter detection programs exist that attempt to recognize functional sequences (TATA, CAAT, transcription factor binding sites, ...) or to identify the oligonucleotide frequencies specific to promoters (for a review see Fickett and Hatzigeorgiou, 1997). However, with the exception of the recently developed programs PromoterInspector and CpG promoter (Scherf et al., 2000; Ioshikhes and Zhang, 2000), their specificity is often too low for them to be used for the annotation of large genomic sequences.
In vertebrates, there is a particular class of promoters colocalized with an atypical structure, the CpG islands (CGIs). In vertebrate genomes, the CpG dinucleotide is often methylated and is depleted to 25% of its expected frequency. CGIs are stretches of DNA that escape methylation and are characterized by a high G+C content and a high frequency of CpG dinucleotides relative to the bulk DNA (Bird, 1986). Some 50–60% of human genes exhibit a CGI over the transcription start site (TSS), but not all CGIs are associated with promoter regions (Larsen et al., 1992). The CGIs associated with promoters (start CGIs) can, a priori, be identified from their structural characteristics (greater size, higher G+C content and higher CpGo/e ratio than no-start CGIs; Ioshikhes and Zhang, 2000; Ponger et al., 2001).
This paper presents CpGProD, a mammalian-specific program that identifies the TSSs associated with CGIs.
∗To whom correspondence should be addressed.
METHODS
The CpGProD method can be divided into two steps. First, CpGProD searches for all CGIs located in the submitted sequences. Second, CpGProD identifies the start CGIs and predicts the orientation of the potential promoters. CpGProD was trained and tested using a human and a mouse dataset composed of genes with a known TSS.
Datasets
The human and mouse protein-coding sequences were extracted from HOVERGEN (release 114, October 1999; Duret et al., 1994). HOVERGEN corresponds to GenBank sequences from all vertebrate species, with some additional data allowing extraction of non-coding sequences. The TSS annotations were obtained from the mRNA descriptions available in the features (partial mRNAs were not considered). For each gene, we extracted a sequence composed of the 5′ non-coding region, the exons, the introns and the 3′ non-coding region. Sequences with less than 500 nt (the minimum CGI length) upstream and downstream of the TSS were excluded. The sequence dataset is composed of 755 human and 147 mouse genes with a known TSS (32.8 and 2.4 Mb for the human and mouse datasets, respectively). CpGProD was used to find the CGIs in these sequences. Partial CGIs, that is, CGIs overlapping one extremity of a sequence, were excluded. CGIs located over the TSS were classified as start CGIs, whereas the other CGIs were classified as no-start CGIs. The CGI dataset is composed of 818 human CGIs and 163 mouse CGIs. The CGI datasets were divided into two halves: the first half of each dataset was used to train CpGProD to identify start CGIs and the second half was used to test CpGProD. Moreover, the sequences and the CGIs used in the datasets of Scherf et al. (2000) and Ioshikhes and Zhang (2000) were excluded from the training part of the datasets.
© Oxford University Press 2002
CpG island search
To enhance the specificity, the sequences must first be processed by RepeatMasker (Smit and Green, unpublished) to exclude potential noise from repeat elements that exhibit a structure similar to CGIs although they are often methylated (Ponger et al., 2001). Moreover, to eliminate small CGIs, which generally correspond to no-start CGIs, CpGProD uses a CGI definition more stringent than that proposed by Gardiner-Garden and Frommer (1987). CGIs are defined as DNA regions longer than 500 nucleotides (instead of 200 bp), with a moving average G+C frequency above 0.5 and a moving average CpG observed/expected (CpGo/e) ratio greater than 0.6. Moving average values for the G+C frequency and the CpGo/e ratio are calculated for each sequence using a 500-nucleotide window moving along the sequence in steps of 1 nt. Overlapping windows with a G+C frequency greater than 0.5 and a CpGo/e ratio greater than 0.6 are grouped to form the CGIs. With these parameters, 56% of the human genes and 52% of the rodent genes in the sequence dataset exhibit a start CGI. The percentage observed for human genes is similar to the result of Larsen et al. (1992), who used a threshold of 200 bp, indicating that the sensitivity is not decreased.
Start CpG island identification
A first score, corresponding to the probability of being over the TSS (start-p), is calculated from the length, the G+C content and the CpGo/e ratio of each CGI. A second score is calculated from the AT-skew and GC-skew values, two parameters that quantify a compositional bias between the plus and the minus DNA strands (Lobry, 1996) and that take different values according to the strand of the corresponding gene (L. Ponger, unpublished data). A strand (plus or minus) and a probability of being on this predicted strand (strand-p) are determined from this second score. The two relations were determined using a generalized linear model (McCullagh and Nelder, 1989) fitted to the first half of the CGI dataset. Since the CGI structure seems to be conserved in all studied mammals (pig, bovine, human) except mouse and rat (Cuadrado et al., 2001; Matsuo et al., 1993), we used two datasets, one composed of human CGIs and one composed of rodent CGIs.
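The island search and the per-island features described in the Methods can be illustrated with a minimal Python sketch. It is not the CpGProD source (which is written in C). The function names find_cgis and island_features are hypothetical, the CpGo/e ratio is assumed to follow the usual Gardiner-Garden and Frommer formula (number of CpG × N / (number of C × number of G)), and the skew definitions (A−T)/(A+T) and (G−C)/(G+C) are assumptions, since the text does not spell them out. Repeat masking and the fitted start-p and strand-p models are not reproduced.

```python
def find_cgis(seq, window=500, min_gc=0.5, min_oe=0.6):
    """Return half-open (start, end) intervals of candidate CpG islands.

    A window passes when its G+C frequency exceeds min_gc and its
    CpG observed/expected ratio exceeds min_oe; overlapping passing
    windows are merged, so every island is at least `window` long.
    """
    seq = seq.upper()
    n = len(seq)
    if n < window:
        return []
    # Prefix sums let each window be evaluated in constant time.
    c_pre = [0] * (n + 1)
    g_pre = [0] * (n + 1)
    cpg_pre = [0] * (n + 1)  # cpg_pre[i]: CpG dinucleotides starting before i
    for i, base in enumerate(seq):
        c_pre[i + 1] = c_pre[i] + (base == "C")
        g_pre[i + 1] = g_pre[i] + (base == "G")
        cpg_pre[i + 1] = cpg_pre[i] + (seq[i:i + 2] == "CG")
    islands = []
    for start in range(n - window + 1):  # window moves in steps of 1 nt
        end = start + window
        c = c_pre[end] - c_pre[start]
        g = g_pre[end] - g_pre[start]
        cpg = cpg_pre[end - 1] - cpg_pre[start]  # CpG fully inside the window
        gc = (c + g) / window
        oe = cpg * window / (c * g) if c and g else 0.0
        if gc > min_gc and oe > min_oe:
            if islands and start < islands[-1][1]:
                islands[-1][1] = end          # overlaps the previous island
            else:
                islands.append([start, end])  # opens a new island
    return [tuple(island) for island in islands]


def island_features(seq, start, end):
    """Length, G+C frequency, CpGo/e and strand skews of one island."""
    sub = seq[start:end].upper()
    a, t = sub.count("A"), sub.count("T")
    c, g = sub.count("C"), sub.count("G")
    cpg = sub.count("CG")
    length = len(sub)
    return {
        "length": length,
        "gc": (c + g) / length if length else 0.0,
        "cpg_oe": cpg * length / (c * g) if c and g else 0.0,
        # Assumed skew definitions; the source does not give them.
        "at_skew": (a - t) / (a + t) if a + t else 0.0,
        "gc_skew": (g - c) / (g + c) if g + c else 0.0,
    }
```

For example, find_cgis(sequence) followed by island_features(sequence, start, end) for each returned interval yields the quantities that the two fitted models take as input.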
IMPLEMENTATION
CpGProD is implemented in C. It is available either via a web server, useful for small datasets, or as a standalone application for larger datasets (for Solaris, Windows, Linux, SGI and MacOS). The output gives the structural characteristics (length, G+C frequency and CpGo/e ratio), the start-p value, the strand and the strand-p value of each detected CGI. Moreover, a graph representing the CGIs over the sequences is drawn.
RESULTS AND DISCUSSION
The main result of CpGProD is a start-p value corresponding to the predicted probability of being a start CGI. The sensitivity and the specificity of CpGProD depend on the minimal start-p threshold chosen to predict a promoter. CpGProD was tested using the second part of the CGI datasets, which was not used during the training step (cf. web site, Table 1). In the human dataset, if all the detected CGIs are considered as promoters, CpGProD finds a CGI over 56% of the TSSs with a specificity of about 0.39. If we consider as promoters only the CGIs with a start-p value greater than 0.3, the sensitivity decreases to 27% whereas the specificity increases to 0.51. For both species, the sensitivity decreases and the specificity increases as the threshold value increases, indicating that the start-p value is correlated with the probability of being a start CGI. Concerning the orientation of the promoters, 70% of the human and 73% of the rodent predictions are correct. These percentages increase with the start-p threshold (cf. web site, Table 1).
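The trade-off between sensitivity and specificity as the start-p threshold rises can be made concrete with a short Python sketch. The source does not spell out its definitions, so this assumes the convention common in promoter prediction: sensitivity is the fraction of TSSs covered by a retained CGI, and specificity is the fraction of retained CGIs that are start CGIs. The function name evaluate and the toy data are hypothetical.

```python
def evaluate(cgis, n_tss, threshold=None):
    """Sensitivity and specificity of start-CGI prediction.

    cgis: list of (start_p, is_start) pairs, one per detected CGI.
    n_tss: number of genes with a known TSS in the test set.
    threshold: keep only CGIs with start_p greater than this value;
               None keeps every detected CGI.
    """
    kept = [is_start for start_p, is_start in cgis
            if threshold is None or start_p > threshold]
    true_pos = sum(kept)
    sensitivity = true_pos / n_tss
    # Assumed definition: fraction of retained predictions that are correct.
    specificity = true_pos / len(kept) if kept else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: five detected CGIs for six annotated TSSs.
toy = [(0.9, True), (0.6, True), (0.4, False), (0.2, True), (0.1, False)]
all_sens, all_spec = evaluate(toy, n_tss=6)
cut_sens, cut_spec = evaluate(toy, n_tss=6, threshold=0.3)
```

On this toy set, raising the threshold lowers the sensitivity and raises the specificity, which is the behaviour reported for both species.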
CpGProD was compared with CpG promoter and PromoterInspector using three different datasets, since these programs cannot be used on our data: the former needs a commercial license (for Splus) and online access to the latter is strongly restricted. Thus, CpGProD was tested on a dataset composed of 19 human genes with a start CGI that had already been used to test CpG promoter (cf. web site, Table 1). The results show that CpGProD exhibits a higher sensitivity (0.74 versus 0.62) and a higher specificity (0.87 versus 0.62) than CpG promoter (cf. web site, Table 1). Another test was made using two datasets previously used for PromoterInspector. The first is composed of 35 human and mouse genes with TSS annotations (cf. web site, Table 1), whereas the second is composed of 545 genes located on chromosome 22 (cf. web site, Table 2; Dunham et al., 1999). For this latter dataset we used the same method as that used by Scherf et al. (2001) with PromoterInspector: all the predictions located in the range −2000:+500 around the 5′ extremity of a known gene or in the range −6000:+500 around the 5′ extremity of a predicted gene were considered as true positive promoter regions. The results show that CpGProD exhibits a higher sensitivity (0.38 versus 0.33 for chromosome 22) and a higher specificity (0.62 versus 0.40 for chromosome 22) than PromoterInspector (cf. web site, Tables 1 and 2).
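The chromosome 22 scoring rule can be sketched in Python as follows. Two points are assumptions rather than statements from the text: a prediction is counted as soon as it overlaps the allowed range, and the range is mirrored for genes on the minus strand. The function name is_true_positive is hypothetical.

```python
def is_true_positive(pred_start, pred_end, gene_5prime, strand, known=True):
    """Check a predicted promoter region against one gene's 5' extremity.

    The allowed range is -2000:+500 for a known gene and -6000:+500
    for a predicted gene, measured along the gene's own strand.
    """
    upstream = 2000 if known else 6000
    downstream = 500
    if strand == "+":
        low, high = gene_5prime - upstream, gene_5prime + downstream
    else:
        # Upstream of a minus-strand gene lies at higher coordinates.
        low, high = gene_5prime - downstream, gene_5prime + upstream
    # Assumption: any overlap with the range counts as a hit.
    return pred_start <= high and pred_end >= low
```

A prediction spanning 4500 to 5200 therefore counts for a predicted plus-strand gene starting at 10000 but not for a known one.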
The differences observed between CpG promoter and CpGProD can be explained by the method used to search for the CGIs. With CpGProD, repeated sequences and small CGIs are not considered, which increases the specificity of start CGI detection. Unlike PromoterInspector, CpGProD is strictly dedicated to CGI-associated promoters and is more efficient for this class of promoters. This difference between PromoterInspector and CpGProD confirms the results of Hannenhalli and Levy (2001) showing that CGIs are the best signal to detect promoter regions. CpGProD was also applied to the Human Genome Project data (cf. web site, Table 2). The results indicate that 27% of the gene starts are localized in a CGI exhibiting a start-p value greater than 0.3. We observe a difference in sensitivity between the known and the predicted genes (41 and 23%, respectively), probably due to inaccuracy in the location of the 5′ extremity of predicted genes. It could be useful for gene annotation to determine whether all the CGIs with a start-p value greater than 0.3 can be associated with a gene.
To date, although relatively simple, CpGProD is the most efficient tool dedicated to the detection of CGI-associated promoters in mammalian sequences. In sequence annotation, CpGProD should be used as a first step, before using other promoter prediction software that exhibits a lower specificity but is able to localize the core promoter and the TSS more accurately.
REFERENCES
Bird,A.P. (1986) CpG rich islands and the function of DNA methylation. Nature, 321, 209–213.
Cuadrado,M., Sacristan,M. and Antequera,F. (2001) Species-specific organization of CpG island promoter at mammalian homologous genes. EMBO Rep., 2, 586–592.
Dunham,I. et al. (1999) The DNA sequence of human chromosome 22. Nature, 402, 489–495.
Duret,L., Mouchiroud,D. and Gouy,M. (1994) HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res., 22, 2360–2365.
Fickett,J.W. and Hatzigeorgiou,A.G. (1997) Eukaryotic promoter recognition. Genome Res., 7, 861–878.
Gardiner-Garden,M. and Frommer,M. (1987) CpG islands in vertebrate genomes. J. Mol. Biol., 196, 261–282.
Hannenhalli,S. and Levy,S. (2001) Promoter prediction in the human genome. Bioinformatics, 17, S90–S96.
Ioshikhes,I.P. and Zhang,M.Q. (2000) Large-scale human promoter mapping using CpG islands. Nature Genet., 26, 61–63.
Larsen,F., Gundersen,G., Lopez,R. and Prydz,H. (1992) CpG islands as gene markers in the human genome. Genomics, 13, 1095–1107.
Lobry,J.R. (1996) Asymmetric substitution patterns in two DNA strands of bacteria. Mol. Biol. Evol., 13, 660–665.
Matsuo,K., Clay,O., Takahashi,T., Silke,J. and Schaffner,W. (1993) Evidence for erosion of mouse CpG islands during mammalian evolution. Somat. Cell Mol. Genet., 19, 543–555.
McCullagh,P. and Nelder,J.A. (1989) Generalized Linear Models. Chapman and Hall, London.
Ponger,L., Duret,L. and Mouchiroud,D. (2001) Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res., 11, 1854–1860.
Scherf,M., Klingenhoff,A. and Werner,T. (2000) Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol., 297, 599–606.
Scherf,M., Klingenhoff,A., Frech,K., Quandt,K., Schneider,R., Grote,K., Frisch,M., Gailus-Durner,V., Seidel,A., Brack-Werner,R. and Werner,T. (2001) First pass annotation of promoters on human chromosome 22. Genome Res., 11, 333–340.