Nucleic Acids Research, 2014
doi:10.1093/nar/gku1039
SNP-Seek database of SNPs derived from 3000 rice genomes
Nickolai Alexandrov 1,*,†, Shuaishuai Tai 2,†, Wensheng Wang 3,†, Locedie Mansueto 1, Kevin Palis 1, Roven Rommel Fuentes 1, Victor Jun Ulat 1, Dmytro Chebotarov 1, Gengyun Zhang 2,*, Zhikang Li 3,*, Ramil Mauleon 1, Ruaraidh Sackville Hamilton 1 and Kenneth L. McNally 1
1 T.T. Chang Genetic Resources Center, IRRI, Los Baños, Laguna 4031, Philippines, 2 BGI, Shenzhen 518083, China and 3 CAAS, Beijing 100081, China
Received September 08, 2014; Revised October 10, 2014; Accepted October 10, 2014
ABSTRACT
We have identified about 20 million rice SNPs by aligning reads from the 3000 rice genomes project with the Nipponbare genome. The SNPs and allele information are organized into a SNP-Seek system (asnp/iric-portal/), which consists of an Oracle database having a total number of rows with SNP genotypes close to 60 billion (20M SNPs × 3K rice lines) and a web interface for convenient querying. The database allows quick retrieval of SNP alleles for all varieties in a given genome region, finding alleles that differ between predefined varieties and querying basic passport and morphological phenotypic information about sequenced rice lines. SNPs can be visualized together with the gene structures in the JBrowse genome browser. Evolutionary relationships between rice varieties can be explored using phylogenetic trees or multidimensional scaling plots.
INTRODUCTION
The current rate of increasing rice yield by traditional breeding is insufficient to feed the growing population in the near future (1). The observed trends in climate change and air pollution create even bigger threats to the global food supply (2). A promising solution to this problem can be the application of modern molecular breeding technologies to ongoing rice breeding programs. This approach has been utilized to increase disease resistance, drought tolerance and other agronomically important traits (3–5). Understanding the differences in genome structures, combined with phenotyping observations, gene expression and other information, is an important step toward establishing gene-trait associations, building predictive models and applying these models in the breeding process.
The 3000 rice genomes project (6) produced millions of genomic reads for a diverse set of rice varieties. The SNP-Seek database is designed to provide user-friendly access to the single nucleotide polymorphisms, or SNPs, identified from this data. Short, 83 bp paired-end Illumina reads were aligned using the BWA program (7) to the Nipponbare temperate japonica genome assembly (8), resulting in an average of 14× coverage of the rice genome across all the varieties. SNP calls were made using the GATK pipeline (9) as described in (6).
SNP DATA
For the SNP-Seek database we have considered only SNPs, ignoring indels. A union of all SNPs extracted from 3000 vcf files consists of 23M SNPs. To eliminate potentially false SNPs, we have collected only SNPs that have the minor allele in at least two different varieties. The number of such SNPs is 20M. All the genotype calls at these positions were combined into one file of ∼20M × 3K SNP calls, and the data were loaded into an Oracle schema using three main tables: STOCK, SNP and SNP_GENOTYPE (Figure 1). Some varieties lack reads mapping to the SNP position, and for them no SNP calls were recorded. Distribution of the SNP coverage is shown in Figure 2. About 90% of all SNP calls have a number of supporting reads greater than or equal to four. Out of them, 98% have a major allele frequency >90% and are considered to be homozygous, 1.1% have two alleles with frequencies between 40 and 60% and are considered to be heterozygous, and the remaining 0.9% represent other cases in which the SNP could be classified as neither heterozygous nor homozygous. More than 98% of SNPs have exactly two different allelic variants in the 3000 varieties, 1.7% of SNPs have three variants and 0.02% of SNPs have all four nucleotides in different genomes mapped to that SNP position. There are 2.3 times as many transitions as transversions in our database (Table 1).
Not all SNPs have been called in all varieties. Actually, the distribution of the called SNPs among varieties is bimodal, with one mode at about 18M SNP calls corresponding to japonica varieties, which are close to the reference genome, and the second peak at about 14M corresponding to the other varieties (Figure 3).
* To whom correspondence should be addressed. Tel: +63 (2) 580-5600; Fax: +63 (2) 580-5699; Email: n.alexandrov@irri. Correspondence may also be addressed to: zhanggengyun@genomics and lizhikang@caas
† The authors wish it to be known that, in their opinion, the first 3 authors should be regarded as joint First Authors.
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research Advance Access published November 27, 2014
Figure 1. Basic schema of the SNP-Seek database
Figure 2. Distribution of SNP coverage
Figure 3. SNP distribution by varieties. The major peak shows that about 14M SNPs have been called in most varieties. The bimodal plot indicates that a fraction of SNPs are missing in some varieties, likely due to lack of mapped reads in variable regions.
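The filtering and classification rules stated in the SNP DATA section (at least four supporting reads, major allele frequency >90% for homozygous calls, two alleles between 40 and 60% for heterozygous calls, and a minor allele present in at least two varieties) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which relied on GATK. The function names, the input layouts and the reading of "minor allele" as any allele other than the most common one are assumptions made for illustration.

```python
from collections import Counter

MIN_READS = 4              # calls with >= 4 supporting reads, as in the text
HOM_THRESHOLD = 0.90       # major allele frequency > 90% -> homozygous
HET_RANGE = (0.40, 0.60)   # two alleles each between 40% and 60% -> heterozygous


def classify_call(read_alleles):
    """Classify one SNP call from the alleles seen in its supporting reads.

    read_alleles is a sequence of bases, one per read (hypothetical layout).
    """
    total = len(read_alleles)
    if total < MIN_READS:
        return "low_coverage"
    counts = Counter(read_alleles).most_common()
    major_freq = counts[0][1] / total
    if major_freq > HOM_THRESHOLD:
        return "homozygous"
    if len(counts) >= 2:
        second_freq = counts[1][1] / total
        low, high = HET_RANGE
        if low <= major_freq <= high and low <= second_freq <= high:
            return "heterozygous"
    return "other"


def passes_minor_allele_filter(calls_by_variety):
    """Keep a SNP only if a minor allele occurs in at least two varieties.

    calls_by_variety maps a variety name to its called allele (hypothetical
    layout). Every allele except the most common one is treated as minor,
    which is an assumption about how the filter was applied.
    """
    counts = Counter(calls_by_variety.values()).most_common()
    return any(n >= 2 for _, n in counts[1:])
```

For example, `classify_call("AAAAAGGGGG")` falls in the heterozygous class, while a position where only one variety carries the alternative allele is rejected by `passes_minor_allele_filter`.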
Table 1. Types of allele variants and their frequencies in rice SNPs
Allele variants    Frequency, %
A/G + C/T          70
A/C + G/T          15
A/T                9
C/G                6
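As a quick check of the transition-to-transversion ratio quoted in the text, the frequencies in Table 1 can be summed by class. The short Python sketch below does only that arithmetic. The dictionary layout and the helper names are illustrative assumptions, and the percentages are copied from Table 1.

```python
# Frequencies (%) copied from Table 1.
TABLE_1 = {
    "A/G + C/T": 70,
    "A/C + G/T": 15,
    "A/T": 9,
    "C/G": 6,
}

PURINES = {"A", "G"}


def is_transition(a, b):
    """True when both bases are purines or both are pyrimidines."""
    return (a in PURINES) == (b in PURINES)


def ti_tv_ratio(table):
    """Sum the frequencies of transitions and transversions and return Ti/Tv."""
    ti = tv = 0
    for label, freq in table.items():
        # A label such as "A/G + C/T" lists allele pairs of the same class,
        # so inspecting the first pair is enough.
        first_pair = label.split("+")[0].strip()
        a, b = first_pair.split("/")
        if is_transition(a, b):
            ti += freq
        else:
            tv += freq
    return ti / tv


ratio = ti_tv_ratio(TABLE_1)
print(f"Ti/Tv = {ratio:.2f}")  # prints "Ti/Tv = 2.33"
```

Transitions account for 70% and the three transversion classes for 15 + 9 + 6 = 30%, which gives the ratio of about 2.3 reported in the text.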
Figure 4. Multidimensional scaling plot of the 3000 rice varieties. Ind1, ind2 and ind3 are three groups of indica rice, indx corresponds to other indica varieties, temp is temperate japonica, trop is tropical japonica, temp/trop and trop/temp are admixed temperate and tropical japonica varieties, japx is other japonica varieties, aus is aus, inax is admixed aus and indica, aro is aromatic and admix is all other unassigned varieties.
GENOME ANNOTATION DATA
We used the CHADO database schema (10) to store the Nipponbare reference genome and gene annotation, downloaded from the MSU rice web site (rice.plantbiology.msu.edu/) (8). To browse and visualize genes and SNPs in the rice genome, we integrated the JBrowse genome browser (11) as a feature of our site.
PASSPORT AND MORPHOLOGICAL DATA
Most of the 3000 varieties (and eventually all) are conserved in the International Rice Genebank housed at IRRI (12). Passport and basic morphological data from the source accession for the purified genetic stock are accessible via SNP-Seek.
INTERFACES
We deployed interfaces to facilitate the following major types of queries: (i) for two varieties, find all SNPs from a gene or genomic region that differentiate them; (ii) for a gene or genome region, show all SNP calls for all varieties (Supplementary Figure S1); (iii) find all sequenced varieties from a certain country or a subpopulation, which can be viewed as a phylogenetic tree, built using the TreeConstructor class from BioJava (13) and rendered using the jsPhyloSVG JavaScript library (14) (Supplementary Figure S2), or as a multidimensional scaling plot (Figure 4). The results of a SNP search can be viewed as a table, exported to text files, or visualized in JBrowse.
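Query types (i) and (ii) map naturally onto the three-table layout named in the SNP DATA section (STOCK, SNP and the genotype table). The sketch below illustrates the idea with Python's built-in sqlite3 module standing in for the Oracle database. All column names, the index, the variety names and the toy rows are assumptions invented for illustration. Only the three table names come from the text, and the actual SNP-Seek schema may differ.

```python
import sqlite3

# SQLite in memory stands in for the Oracle database described in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, name TEXT, subpopulation TEXT);
CREATE TABLE snp (snp_id INTEGER PRIMARY KEY, chromosome INTEGER,
                  position INTEGER, ref_allele TEXT);
CREATE TABLE snp_genotype (
    snp_id   INTEGER REFERENCES snp(snp_id),
    stock_id INTEGER REFERENCES stock(stock_id),
    allele   TEXT,
    PRIMARY KEY (snp_id, stock_id)   -- one row per (SNP, variety) call
);
CREATE INDEX snp_region ON snp (chromosome, position);
""")

# Toy data, invented for illustration only.
conn.executemany("INSERT INTO stock VALUES (?, ?, ?)",
                 [(1, "VAR_A", "temp"), (2, "VAR_B", "ind1"), (3, "VAR_C", "aus")])
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                 [(10, 1, 1000, "A"), (11, 1, 1500, "C"), (12, 1, 9000, "G")])
conn.executemany("INSERT INTO snp_genotype VALUES (?, ?, ?)",
                 [(10, 1, "A"), (10, 2, "G"), (10, 3, "G"),
                  (11, 1, "C"), (11, 2, "C"),   # VAR_C has no call at 1500
                  (12, 1, "G"), (12, 2, "T"), (12, 3, "G")])


def calls_in_region(chromosome, start, end):
    """Query type (ii): all SNP calls for all varieties in a region."""
    return conn.execute(
        """SELECT s.position, st.name, g.allele
           FROM snp s
           JOIN snp_genotype g ON g.snp_id = s.snp_id
           JOIN stock st ON st.stock_id = g.stock_id
           WHERE s.chromosome = ? AND s.position BETWEEN ? AND ?
           ORDER BY s.position, st.name""",
        (chromosome, start, end)).fetchall()


def differing_snps(name_a, name_b, chromosome, start, end):
    """Query type (i): SNPs in a region where two varieties differ."""
    return conn.execute(
        """SELECT s.position, ga.allele, gb.allele
           FROM snp s
           JOIN snp_genotype ga ON ga.snp_id = s.snp_id
           JOIN stock sa ON sa.stock_id = ga.stock_id AND sa.name = ?
           JOIN snp_genotype gb ON gb.snp_id = s.snp_id
           JOIN stock sb ON sb.stock_id = gb.stock_id AND sb.name = ?
           WHERE s.chromosome = ? AND s.position BETWEEN ? AND ?
             AND ga.allele <> gb.allele
           ORDER BY s.position""",
        (name_a, name_b, chromosome, start, end)).fetchall()


print(calls_in_region(1, 900, 2000))
print(differing_snps("VAR_A", "VAR_B", 1, 1, 10000))
```

Storing one row per variety and SNP matches the row count of roughly 20M × 3K quoted in the abstract. In this sketch a variety without a call simply has no row, so positions missing in either variety drop out of the pairwise comparison.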
USE CASE EXAMPLE FOR QUERYING A REGION OF INTEREST
We used the Rice SNP-Seek database to quickly examine the diversity of the entire panel at a particular region of interest. We chose the sd-1 gene as a test case due to its scientific importance in rice breeding. This semi-dwarf locus, causing a semi-dwarf stature of rice, was discovered by three different research groups to be a spontaneous mutation of GA 20-oxidase (formally named the sd-1 gene), originating from the Taiwanese indica variety Deo-woo-gen. Its incorporation into IR8 and other varieties by rice breeding programs spurred the First Green Revolution in rice production in the late 1960s (15). Sd-1 is annotated in the Nipponbare genome by Michigan State University’s Rice Genome Annotation Project as LOC_Os01g66100, on chromosome 1 from position 38382382 to 38385504 base pairs. On the home page of SNP-Seek, the <Genotype> module was opened and the coordinates of sd-1 were used to define the region to retrieve all SNPs, with <All Varieties> checked to select from all the varieties. Clicking on the <Search> button resulted in the identification of 80 SNP positions (Supplementary Figure S1). An overall view of the SNP positions in the polymorphic panel shows at least eight distinct SNP blocks (Figure 5). In this particular panel group of mostly temperate japonica, two distinct SNP blocks can be seen as shared (Figure 5). Variety information can be obtained by typing the name of the varieties you see on the genome browser into the <Variety name> field of the Variety module. This use case is one of the examples detailed in the <Help> module.
Figure 5. JBrowse view of the SNP genotypes within the sd-1 gene (each variety is one row). Red blocks indicate polymorphism of the variety against Nipponbare. Shared SNP blocks are seen as vertical columns in red. The blue rectangle box in the bottom contains varieties that do not have these blocks.
CONCLUSION
We have organized the largest collection of rice SNPs into database data structures for convenient querying and provided user-friendly interfaces to find SNPs in certain genome regions. We have demonstrated that about 60 billion data points can be loaded into an Oracle database and queried with reasonably quick response times. Most of the varieties in the SNP-Seek database have passport and basic phenotypic data inherited from their source accession, enabling genome-wide or gene-specific tests of association. The database is quickly developing and will be expanding in the near future to include short indels, larger structural variations and SNP calls using other rice reference genomes.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We would like to thank the IRRI ITS team (especially Rogelio Alvarez and Denis Diaz) and Rolando Santos Jr for the support in operation and administration of the database and web application servers, and Frances Borja for her help in interface design.
FUNDING
The database is being supported by the Global Rice Science Partnership (GRiSP), the Bill and Melinda Gates Foundation (GD1393), International S&T Cooperation Program of China (2012DFB32280) and the Peacock Team Award to ZLI from the Shenzhen Municipal government.
Conflict of interest statement. None declared.
REFERENCES
1. Ray, D.K., Mueller, N.D., West, P.C. and Foley, J.A. (2013) Yield trends are insufficient to double global crop production by 2050. PloS One, 8, e66428.
2. Tai, A.P.K., Martin, M.V. and Heald, C.L. (2014) Threat to future global food security from climate change and ozone air pollution. Nat. Clim. Change, 4, 817–821.
3. Fahad, S., Nie, L., Khan, F.A., Chen, Y., Hussain, S., Wu, C., Xiong, D., Jing, W., Saud, S., Khan, et al. (2014) Disease resistance in rice and the role of molecular breeding in protecting rice crops against diseases. Biotechnol. Lett., 36, 1407–1420.
4. Hu, H. and Xiong, L. (2014) Genetic engineering and breeding of drought-resistant crops. Ann. Rev. Plant Biol., 65, 715–741.
5. Gao, Z.Y., Zhao, S.C., He, W.M., Guo, L.B., Peng, Y.L., Wang, J.J., Guo, X.S., Zhang, X.M., Rao, Y.C., et al. (2013) Dissecting yield-associated loci in super hybrid rice by resequencing recombinant inbred lines and improving parental genome sequences. Proc. Natl. Acad. Sci. U.S.A., 110, 14492–14497.
6. 3K R.G.P. (2014) The 3,000 rice genomes project. Gigascience, 3, 7.
7. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
8. Kawahara, Y., de la Bastide, M., Hamilton, J.P., Kanamori, H., McCombie, W.R., Ouyang, S., Schwartz, D.C., Tanaka, T., Wu, J., et al. (2013) Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice, 6, 4.
9. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
10. Mungall, C.J., Emmert, D.B. and FlyBase, C. (2007) A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics, 23, i337–i346.
11. Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J. and Holmes, I.H. (2009) JBrowse: a next-generation genome browser. Genome Res., 19, 1630–1638.
12. Jackson, M.T. (1997) Conservation of rice genetic resources: the role of the International Rice Genebank at IRRI. Plant Mol. Biol., 35, 61–67.
13. Prlic, A., Yates, A., Bliven, S.E., Rose, P.W., Jacobsen, J., Troshin, P.V., Chapman, M., Gao, J., Koh, C.H., et al. (2012) BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28, 2693–2695.
14. Smits, S.A. and Ouverney, C.C. (2010) jsPhyloSVG: a javascript library for visualizing interactive and vector-based phylogenetic trees on the web. PloS One, 5, e12267.
15. Hedden, P. (2003) The genes of the Green Revolution. Trends Genet., 19, 5–9.