/cgi/content/full/326/5956/1112/DC1
Supporting Online Material for
The B73 Maize Genome: Complexity, Diversity, and Dynamics
Patrick S. Schnable, Doreen Ware, Robert S. Fulton, Joshua C. Stein, Fusheng Wei, Shiran Pasternak, Chengzhi Liang, Jianwei Zhang, Lucinda Fulton, Tina A. Graves, Patrick Minx, Amy Denise Reily, Laura Courtney, Scott S. Kruchowski, Chad Tomlinson, Cindy Strong, Kim Delehaunty, Catrina Fronick, Bill Courtney, Susan M Rock, Eddie Belter, Feiyu Du, Kyung Kim, Rachel M. Abbott, Marc Cotton, Andy Levy, Pamela Marchetto, Kerri Ochoa, Stephanie M. Jackson, Barbara Gillam, Weizu Chen, Le Yan, Jamey Higginbotham, Marco Cardenas, Jason Waligorski, Elizabeth Applebaum, Lindy Phelps, Jason Falcone, Krishna Kanchi, Thynn Thane, Adam Scimone, Nay Thane, Jessica Henke, Tom Wang, Jessica Ruppert, Neha Shah, Kelsi Rotter, Jennifer Hodges, Elizabeth Ingenthron, Matt Cordes, Sara Kohlberg, Jennifer Sgro, Brandon Delgado, Kelly Mead, Asif Chinwalla, Shawn Leonard, Kevin Crouse, Kristi Collura, Dave Kudrna, Jennifer Currie, Ruifeng He, Angelina Angelova, Shanmugam Rajasekar, Teri Mueller, Rene Lomeli, Gabriel Scara, Ara Ko, Krista Delaney, Marina Wissotski, Georgina Lopez, David Campos, Michele Braidotti, Elizabeth Ashley, Wolfgang Golser, HyeRan Kim, SeungHee Lee, Jinke Lin, Zeljko Dujmic, Woojin Kim, Jayson Talag, Andrea Zuccolo, Chuanzhu Fan, Aswathy Sebastian, Melissa Kramer, Lori Spiegel, Lidia Nascimento, Theresa Zutavern, Beth Miller, Claude Ambroise, Stephanie Muller, Will Spooner, Apurva Narechania, Liya Ren, Sharon Wei, Sunita Kumari, Ben Faga, Michael Levy, Linda McMahan, Peter Van Buren, Matthew W. Vaughn, Kai Ying, Cheng-Ting Yeh, Scott J. Emrich, Yi Jia, Ananth Kalyanaraman, An-Ping Hsia, W. Brad Barbazuk, Regina S. Baucom, Thomas P. Brutnell, Nicholas C. Carpita, Cristian Chaparro, Jer-Ming Chia, Jean-Marc Deragon, James C. Estill, Yan Fu, Jeffrey A. Jeddeloh, Yujun Han, Hyeran Lee, Pinghua Li, Damon R Lisch, Sanzhen Liu, Zhijie Liu, Dawn Holligan Nagel, Maureen C. McCann, Phillip San Miguel, Alan M. Myers, Dan Nettleton, John Nguyen, Bryan W. Penning, Lalit Ponnala, Kevin L. Schneider, David C. Schwartz, Anupma Sharma, Carol Soderlund, Nathan M. Springer, Qi Sun, Hao Wang, Michael Waterman, Richard Westerman, Thomas K. Wolfgruber, Lixing Yang, Yeisoo Yu, Lifang Zhang, Shiguo Zhou, Qihui Zhu, Jeffrey L. Bennetzen, R. Kelly Dawe, Jiming Jiang, Ning Jiang, Gernot G. Presting, Susan R. Wessler, Srinivas Aluru, Robert A. Martienssen, Sandra W. Clifton, W. Richard McCombie, Rod A. Wing, Richard K. Wilson
*To whom correspondence should be addressed. E-mail: rwilson@wustl.edu
Published 20 November 2009, Science 326, 1112 (2009)
DOI: 10.1126/science.1173462
This PDF file includes
Materials and Methods
SOM Text
Figs. S1 to S18
Tables S1 to S18
References
ONLINE SUPPORTING MATERIAL
SEQUENCING, ASSEMBLY AND SEQUENCE IMPROVEMENT
TE SEARCH APPROACHES, DEFINITIONS OF FAMILIES, INTACT ELEMENTS AND FRAGMENTED ELEMENTS, AND GENE FRAGMENTS CAPTURED BY TES
  Construction of a Library of Repetitive Sequences
  CACTA, Tc1/Mariner, hAT and PIF/Harbinger Elements
    Coding Elements
    Non-Coding Elements
    Exemplars
    Estimation of copy numbers and genome coverage
    Estimation of the number of intact elements carrying gene fragments
  Helitrons
  LINEs
  LTR Retrotransposons
  MULEs
  SINEs
GENE ANNOTATION
  Methods
    Gene prediction and selection
  Results
ADDITIONAL ESTIMATIONS OF THE NATURE AND ERROR RATE OF GENE CONTENT PREDICTIONS IN B73 REFGEN_
    Estimation of the number of genes in the maize genome
    Stringent estimation
    Non-stringent estimation
FUNCTIONAL ANNOTATION OF THE FILTERED GENE SET
  Methods
  Results
ORTHOLOG AND PARALOG DETERMINATION
  Note on Lineage-Specific Gene Families
RNA-SEQ
    Leaf transcriptomics
    Transcriptomics of shoot apical meristems (SAMs) and seedling from reciprocal hybrids
  Results
CENTROMERE METHODS
  Identification and Draft Sequencing of Centromeric BACs
  Enrichment and Representation Calculations
  Mapping ChIP Sequences to BACs or B73 RefGen_
  Identification and Mapping of Centromeric EST and cDNA Sequences
SYNTENY ANALYSIS AND IDENTIFICATION OF THE MAIZE LINEAGE WHOLE GENOME
  Methods
  Results
  Preferential Retention of Gene Functional Classes
  Preferential Gene Loss in Ancestral Homoeologous Chromosomes
  Identification of Paralog Clusters
MAIZE
  Rules for Determining Regions to be Finished in the Maize Genome
  Guidelines for Improvement of the Maize Genome
SUPPORTING FIGURES
SUPPORTING TABLES
SUPPLEMENTARY REFERENCES
Sequencing, Assembly and Sequence Improvement
The maize genome was sequenced via a BAC-by-BAC approach with a minimum tiling path (MTP) of 16,848 BACs derived from an integrated genetic and physical map (S1). Clones were picked as described (S2). Sheared DNA from each clone was ligated into the pSMART (Lucigen, Middleton, WI) plasmid vector. Each BAC library received 2 x 384-well paired-end sequences, resulting in ~4-6X coverage. Data were assembled, confirmed by BAC End Sequence, checked for minimum coverage standards (submitted to GenBank as HTGS_FULLTOP), and sent for automated sequence improvement. Prior to sequence improvement, fosmid end sequences, from a repository composed of fosmids prepared and sequenced at the Washington University Genome Center, were added to the assemblies. The fosmid clones were chosen by running a script against the shotgun assembly that uses BLAST to compare 600 bp segments of the assembly against the maize sequence read repository database. If at least 99% identity was noted, the program retrieved the fosmid (and fosmid mate pairs, if available) and incorporated it into the assembly to enhance order and orientation. Consensus sequences were evaluated by a K-mer analysis to identify repetitive regions (S3). Automated improvement involved directed sequencing across gaps and low quality sequences within non-repetitive regions only (submitted to GenBank as HTGS_PREFIN). Predicated on the average number of bases finished per submitted clone (26,968 bp), as of February 1, 2009, a total of 379,598,676 base pairs of finished unique sequence were submitted to GenBank.
Following automated sequence improvement, additional data downloaded from GenBank, such as cDNA sequences and sequences from subtractive libraries with methyl-filtered DNA and high C0t techniques, were incorporated into the assemblies (submitted as HTGS_ACTIVEFIN). Manual improvement was performed on non-repetitive regions only, with guidelines established by the MGSC (see Supplemental Note: Maize Finishing Guidelines). Improved sequences were submitted to GenBank as phase-I improved (HTGS_IMPROVED).
The B73 RefGen_v1 also includes published sequences from 55 B73 BACs that were not generated by the maize genome sequencing project. GenBank accession numbers: AC147602, AC147791, AC148152, AC148167, AC148479, AC149475, AC149478, AC149633, AC149640, AC149810, AC149818, AC149828, AC149829, AC150739, AC152495, AC155352, AC155363, AC155376, AC155377, AC155383, AC155397, AC155417, AC155434, AC155496, AC155507, AC155537, AC155610, AC155622, AC155624, AC159612, AC166636, AF466202, AF546187, AF546188, AY211534, AY211535, AY530952, AY542798, EF517600, EF517600, EF562447 (S4-S12). Note that some accession numbers reference more than one BAC.
In total, the B73 RefGen_v1 contains 2,048 Mb in 125,325 sequence contigs (N50 of 40 kb), forming 61,161 scaffolds (N50 of 76 kb) of the maize genome, which consists of an estimated 2.3 Gb (S13). We thus estimate that ~250 Mb (~10.8%) of the genome is missing from B73 RefGen_v1. The ~7% of the genome (~170 Mb) that is not contained within the maize physical map accounts for ~70% of the missing sequence. Some of the remaining missing sequence can be attributed to tandem repeats, as illustrated in Table S13, which shows that 90% of the estimated 30 Mb of knob 180 repeat are missing in the assembly, as well as 80% of the 2 Mb knob 350 repeat, 45% of the 3 Mb of CentC and 86% of the estimated 35 Mb of 45S rDNA, among others. Thus the analyzed repeats alone account for a total of 60 Mb of missing DNA (27 + 1.6 + 1.35 + 30 Mb). Because each BAC was sequenced at 4-6x coverage, it is possible that some sequences were missed in DNA sequencing. We also cannot exclude missing DNA resulting from an assembly-based collapse of two highly similar LTRs of a recently inserted retrotransposon. Comparison of B73 RefGen_v1 to ten highly curated gene sequences showed that the maximum sequencing error rate was 0.025%, as summarized in Table S1.
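The missing-sequence arithmetic in the preceding paragraph can be reproduced with a short calculation. The following is a minimal sketch that uses only the figures quoted in the text; the variable and function names are illustrative and are not part of the original analysis. Because the text rounds intermediate values (for example, 86% of 35 Mb is quoted as 30 Mb, and the total missing sequence as ~250 Mb), the unrounded results differ slightly from the quoted ones.

```python
# Estimated repeat size (Mb) and fraction missing from the assembly,
# as quoted in the text with reference to Table S13.
REPEATS = {
    "knob 180": (30.0, 0.90),
    "knob 350": (2.0, 0.80),
    "CentC": (3.0, 0.45),
    "45S rDNA": (35.0, 0.86),
}

GENOME_MB = 2300.0     # estimated genome size (2.3 Gb)
ASSEMBLED_MB = 2048.0  # sequence contained in B73 RefGen_v1


def missing_per_repeat(repeats):
    """Return the missing Mb for each repeat class (size x fraction missing)."""
    return {name: size * frac for name, (size, frac) in repeats.items()}


missing = missing_per_repeat(REPEATS)
total_repeat_missing = sum(missing.values())

total_missing = GENOME_MB - ASSEMBLED_MB
missing_fraction = total_missing / GENOME_MB

for name, mb in missing.items():
    print(f"{name}: {mb:.2f} Mb missing")
print(f"analyzed repeats: {total_repeat_missing:.2f} Mb missing")
print(f"genome-wide: {total_missing:.0f} Mb missing ({missing_fraction:.1%})")
```
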
TE Search Approaches, Definitions of Families, Intact Elements and Fragmented Elements, and Gene Fragments Captured by TEs
The 2045 Mb of maize (B73) genomic sequence (excluding gaps) analyzed in this study (B73 RefGen_v1) were downloaded from The Maize Genome Sequence Browser and TEs were annotated with a variety of approaches.
Construction of a library of repetitive sequences
To systematically identify repetitive sequences, the maize genomic sequence was downloaded and clustered with RECON, a program for de novo identification of repetitive sequences (S3). The cutoff for consideration as a repetitive sequence was 10 or more copies in the final sequence set. The resulting library (containing 33,201 repetitive sequences) is referred to as the RECON library below. It is used for the identification of Mutator-like elements (MULEs) and for other DNA elements that were absent from other collections (see below).
CACTA, Tc1/Mariner, hAT and PIF/Harbinger elements
Coding Elements
Coding regions of the DNA TE superfamilies PIF/Harbinger, CACTA, Tc1/Mariner and hAT were identified with consensus sequences derived from the most conserved (catalytic) region of each superfamily, and TBLASTN searches were performed with a pipeline called TARGeT (S14). Retrieved sequences were aligned with MUSCLE (S15) and phylogenetic trees generated for each superfamily with MEGA4 and PAUP version 4.0b8 (S16, S17). In addition, the conserved sequences of previously identified maize DNA TEs (Ac, En, Doppia4, PIF and Bergamo) were also included as queries.
Next, full-length coding elements [including terminal inverted repeats (TIRs) and target site duplications (TSDs)] were determined with a newly developed TIR/TSD structure-based method that includes a statistical method to filter false positives. Full-length copies were further confirmed with BLAST to identify homologs in the genome.
Non-Coding Elements
Identification of full-length coding elements is a prerequisite for detecting non-coding elements that are deletion derivatives. Therefore, full-length coding copies were used to survey the maize genomic sequences for related non-coding elements including MITEs via BLAST and RepeatMasker. As with the coding elements, non-coding elements previously identified in maize [including Dotted (rDt), Ds, Ds1, and the MITEs Hbr (Heartbreaker), Irma/mPIF, TouristZm1, StowawayZm1] were also included in this screen.
Exemplars
To reduce the redundancy of recovered sequences and to hasten future annotation of maize genomic DNA, we generated a collection of exemplars (representative TE sequences) using the following procedure. All element sequences from the same superfamily were compared with BLASTN. The element with the most matches (cutoff at 90% identity in 90% of the element length) was considered as the first exemplar. Thereafter, this element and its matches were excluded from the group and a second round BLASTN search was conducted with the remainder of the elements, leading to the generation of the second exemplar. This process was repeated until all elements were excluded. For coding elements in the four superfamilies, a phylogenetic tree was generated for each family. On the basis of visual examination of the phylogenetic tree, a full-length element was chosen from each clade as the exemplar. The exemplars for both coding and non-coding elements then were used to mask the RECON library (with RepeatMasker) and the unmasked sequences were examined for elements with features of CACTA, hAT, Tc1/Stowaway, and PIF/Tourist superfamilies. This led to the identification of additional elements that were not included by the exemplars. Each exemplar and each additional element identified from the RECON library were considered a family.
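The iterative exemplar selection described above is a greedy procedure, and it can be sketched as follows. This is a minimal illustration, not the pipeline that was used: it assumes the BLASTN comparisons have already been reduced to a table of which elements match each other at the 90% identity / 90% length cutoff, and the function name, the input format and the alphabetical tie-breaking rule are all assumptions introduced here.

```python
def pick_exemplars(matches):
    """Greedy exemplar selection.

    matches: dict mapping each element ID to the set of element IDs it
    matches at the chosen cutoff (assumed to be precomputed from BLASTN).
    Returns the exemplars in the order in which they were selected.
    """
    remaining = set(matches)
    exemplars = []
    while remaining:
        def pool_matches(element):
            # Only matches still in the pool count in later rounds.
            return (matches[element] & remaining) - {element}

        # Ties are broken alphabetically (an assumption; the text does not say).
        best = max(sorted(remaining), key=lambda e: len(pool_matches(e)))
        exemplars.append(best)
        # Exclude the exemplar and everything it matches, then repeat.
        remaining -= pool_matches(best) | {best}
    return exemplars


example = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"e"},
    "e": {"d"},
    "f": set(),
}
print(pick_exemplars(example))
```

In this toy input, element "a" has the most matches and becomes the first exemplar; "d" and "f" follow once "a", "b" and "c" have been excluded.
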
Estimation of copy numbers and genome coverage
The exemplars and sequences derived from the RECON library were used to mask B73 RefGen_v1 and the output of RepeatMasker was used for estimation of copy number and genomic coverage of each superfamily. The redundant matches in the output were eliminated by excluding the shorter match (for copy number calculation) if two elements matched the same region and the overlapped part was 90% or longer of the shorter match. If an element in the genomic sequence matched an exemplar over the entire sequence, or if the truncation was less than 20 bp on each end, this element was considered to be an intact element. Otherwise it was considered as a truncated element or half of a copy. Fragmented elements that lack both ends (truncated more than 20 bp on both ends) were not included in copy number estimation. The genome coverage of TEs was estimated as the total sequence masked by each superfamily.
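The counting rules above can be expressed compactly in code. The sketch below is illustrative only: the `Match` record and its field names are hypothetical stand-ins for parsed RepeatMasker output, a truncation of exactly 20 bp is treated as a missing end (the text leaves this boundary case open), and redundancy is resolved by processing matches from longest to shortest, which is one reasonable reading of "excluding the shorter match".

```python
from dataclasses import dataclass

MAX_END_TRUNCATION = 20      # bp; an end truncated by less than this counts as present
MIN_OVERLAP_FRACTION = 0.90  # of the shorter match


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one genomic match to an exemplar."""
    chrom: str
    start: int     # genomic start, 1-based inclusive
    end: int       # genomic end, inclusive
    ex_start: int  # first exemplar position covered (1-based)
    ex_end: int    # last exemplar position covered
    ex_len: int    # full exemplar length

    @property
    def length(self):
        return self.end - self.start + 1


def drop_redundant(matches):
    """Keep the longer of any two matches whose overlap covers >= 90% of the shorter."""
    kept = []
    for m in sorted(matches, key=lambda x: x.length, reverse=True):
        redundant = False
        for k in kept:
            if k.chrom != m.chrom:
                continue
            overlap = min(k.end, m.end) - max(k.start, m.start) + 1
            if overlap >= MIN_OVERLAP_FRACTION * m.length:
                redundant = True
                break
        if not redundant:
            kept.append(m)
    return kept


def classify(m):
    """Return 'intact', 'truncated' (one end missing) or 'fragment' (both missing)."""
    left_ok = (m.ex_start - 1) < MAX_END_TRUNCATION
    right_ok = (m.ex_len - m.ex_end) < MAX_END_TRUNCATION
    if left_ok and right_ok:
        return "intact"
    if left_ok or right_ok:
        return "truncated"
    return "fragment"


def copy_number(matches):
    """Intact elements count as 1, truncated as 0.5, fragments are not counted."""
    weights = {"intact": 1.0, "truncated": 0.5, "fragment": 0.0}
    return sum(weights[classify(m)] for m in drop_redundant(matches))


hits = [
    Match("chr1", 1000, 1999, 1, 1000, 1000),  # full-length copy
    Match("chr1", 1050, 1949, 1, 900, 900),    # nested in the first match
    Match("chr1", 5000, 5499, 1, 500, 1000),   # 3' end missing
    Match("chr2", 100, 399, 301, 600, 1000),   # both ends missing
]
print(copy_number(hits))
```
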
Estimation of the number of intact elements carrying gene fragments
To identify elements with gene sequences in their internal regions, the sequences of exemplars were used to mask B73 RefGen_v1. Candidate elements were retrieved if they possessed terminal sequences of exemplars from the same superfamily at both ends, were less than 15 kb in length, contained non-exemplar sequences, and were present in a minimum of two copies in the genome. The sequences of candidate elements were used to search against the coding region of the filtered gene set of B73 RefGen_v1, and the genes that aligned with the elements over at least 100 bp were considered candidate parental genes. To minimize the effect of gene annotation artifacts, captured gene fragments were excluded if they matched exemplar sequences (>= 30 bp) of any type of TE at the nucleotide level, or matched any known transposases (> 50 amino acids) at the protein level, or were flanked by TE terminal sequences. The remaining gene sequences were used to search against plant proteins in NCBI and a maize EST database, downloaded on Sep. 6, 2009. Only those genes matching known plant proteins (E < 10^-5) and matching a maize EST sequence (at least 100 bp) were considered as final parental genes. The elements with one or more corresponding final parental genes were considered as TEs carrying gene fragments.
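The selection criteria in this paragraph amount to two sets of threshold filters, which can be written as predicates. The following is a sketch under stated assumptions: the record types and all field names are hypothetical, the search results are assumed to have been summarized into these fields beforehand, and "less than 15 kb" is taken as fewer than 15,000 bp.

```python
from dataclasses import dataclass


@dataclass
class CandidateElement:
    """Hypothetical summary of one element recovered after masking."""
    length_bp: int
    has_exemplar_termini_both_ends: bool
    has_non_exemplar_sequence: bool
    genome_copies: int


@dataclass
class CapturedFragment:
    """Hypothetical summary of one gene alignment inside a candidate element."""
    gene_alignment_bp: int
    longest_te_exemplar_match_bp: int
    longest_transposase_match_aa: int
    flanked_by_te_termini: bool
    best_plant_protein_evalue: float
    longest_maize_est_match_bp: int


def is_candidate_element(e):
    """Termini at both ends, < 15 kb, internal non-exemplar sequence, >= 2 copies."""
    return (
        e.has_exemplar_termini_both_ends
        and e.length_bp < 15_000
        and e.has_non_exemplar_sequence
        and e.genome_copies >= 2
    )


def is_final_parental_gene(f):
    """Apply the alignment, TE-artifact and supporting-evidence thresholds in turn."""
    if f.gene_alignment_bp < 100:
        return False  # not a candidate parental gene
    if f.longest_te_exemplar_match_bp >= 30:
        return False  # matches a TE exemplar at the nucleotide level
    if f.longest_transposase_match_aa > 50:
        return False  # matches a known transposase at the protein level
    if f.flanked_by_te_termini:
        return False
    return (
        f.best_plant_protein_evalue < 1e-5
        and f.longest_maize_est_match_bp >= 100
    )


element = CandidateElement(4200, True, True, 3)
good = CapturedFragment(350, 0, 0, False, 1e-20, 240)
te_like = CapturedFragment(350, 45, 0, False, 1e-20, 240)
print(is_candidate_element(element), is_final_parental_gene(good),
      is_final_parental_gene(te_like))
```
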
Helitrons
Helitrons were sought in the B73 RefGen_v1 by searching for the canonical 3’ and 5’ ends associated with intact elements, and requiring at least two independent elements with the exact ends to confirm that they are members of a unique Helitron family, as described (S18). Because many Helitron families in some plant genomes are predicted to be present with zero or one intact members (S18), this approach provides a minimum estimate of total Helitron copy number. A family is defined as all Helitrons with the same 3’ end (>80% identity over the terminal 30 bp). Intact members of the same family have both 3’ and 5’ ends, while fragmented elements are defined as all non-intact elements with at least 100 contiguous bp of >80% homology to an intact Helitron (S19). A minimum total number of Helitrons (fragmented plus intact) in B73 RefGen_v1 was calculated by counting the number of 3’ ends (~21,000) and 5’ ends (~22,000) for the eight identified families. The ratio of apparently fragmented to intact elements in the B73 RefGen_v1 is greater than ten to one, but this fragment excess is at least partly an annotation artifact caused by the great number of gaps, incorrectly ordered scaffolds and improper assemblies in the repeat-rich regions of the sequenced BACs of B73 RefGen_v1 (S19).
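The family definition above (the same 3’ end at >80% identity over the terminal 30 bp) can be illustrated with a small grouping routine. This is a simplified sketch and not the method of S18 or S19: identity is computed here by ungapped, position-by-position comparison, and each sequence is assigned to the first family whose founding member it matches. Both choices are assumptions made for illustration.

```python
def terminal_identity(seq_a, seq_b, n=30):
    """Ungapped fraction of identical bases over the terminal n bp of two sequences."""
    tail_a, tail_b = seq_a[-n:], seq_b[-n:]
    if len(tail_a) < n or len(tail_b) < n:
        raise ValueError(f"both sequences must be at least {n} bp long")
    same = sum(a == b for a, b in zip(tail_a, tail_b))
    return same / n


def group_by_3prime_end(sequences, n=30, min_identity=0.80):
    """Assign each sequence to the first family whose founder shares its 3' end."""
    families = []  # each family is a list; its first entry is the founder
    for seq in sequences:
        for family in families:
            if terminal_identity(family[0], seq, n) > min_identity:
                family.append(seq)
                break
        else:
            families.append([seq])
    return families


tail = "ACGT" * 7 + "AC"           # a 30 bp terminal sequence
first = "GGGG" + tail
variant = "TTTT" + "TTT" + tail[3:]  # 3 mismatches in the terminal 30 bp
unrelated = "CCCC" + "T" * 30

families = group_by_3prime_end([first, variant, unrelated])
print([len(f) for f in families])
```
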
LINEs
Candidate Long Interspersed Nuclear Elements (LINEs) were identified in the maize genome sequence with a Perl script that searched for host site duplications of a given size range flanking a block of sequence of appropriate length that terminated on one end with a simple sequence repeat, usually poly A (this Perl script is available upon request from Phillip SanMiguel, Purdue University). This large pool of sequences was filtered by