Nucleic Acids Research, 2014
doi:10.1093/nar/gku917
Multiplex sequencing of pooled mitochondrial genomes: a crucial step toward biodiversity analysis using mito-metagenomics
Min Tang 1, Meihua Tan 1,2, Guanliang Meng 1,3, Shenzhou Yang 1, Xu Su 1, Shanlin Liu 1, Wenhui Song 1, Yiyuan Li 1, Qiong Wu 1, Aibing Zhang 4 and Xin Zhou 1,*
1 China National GeneBank, BGI-Shenzhen, Beishan Road, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province 518083, China, 2 University of Chinese Academy of Sciences, 19A Yuquan Road, Shijingshan District, Beijing 100094, China, 3 China University of Geosciences, 388 Lumo Road, Wuhan 430074, China and 4 Capital Normal University, Beijing 100094, China
Received June 10, 2014; Revised September 16, 2014; Accepted September 22, 2014
ABSTRACT
The advent of high-throughput sequencing (HTS) technologies has revolutionized conventional biodiversity research by enabling parallel capture of DNA sequences possessing species-level diagnosis. However, polymerase chain reaction (PCR)-based implementation is biased by the efficiency of primer binding across lineages of organisms. A PCR-free HTS approach will alleviate this artefact and significantly improve upon the multi-locus method utilizing full mitogenomes. Here we developed a novel multiplex sequencing and assembly pipeline allowing for simultaneous acquisition of full mitogenomes from pooled animals without DNA enrichment or amplification. By concatenating assemblies from three de novo assemblers, we obtained high-quality mitogenomes for all 49 pooled taxa, with 36 species >15 kb and the remaining >10 kb, including 20 complete mitogenomes and nearly all protein coding genes (99.6%). The assembly quality was carefully validated with Sanger sequences, reference genomes and conservativeness of protein coding genes across taxa. The new method was effective even for closely related taxa, e.g. three Drosophila spp., demonstrating its broad utility for biodiversity research and mito-phylogenomics. Finally, the in silico simulation showed that by recruiting multiple mito-loci, taxon detection was improved at a fixed sequencing depth. Combined, these results demonstrate the plausibility of a multi-locus mito-metagenomics approach as the next phase of the current single-locus metabarcoding method.
INTRODUCTION
Over the past few years, DNA metabarcoding, i.e. identifying mixed taxa using short DNA markers via high-throughput sequencing (HTS), has emerged as a fast and effective approach to characterizing bulk environmental samples (1). To date, most published works have relied on polymerase chain reaction (PCR) amplification of a single (typically standard) DNA marker, e.g. the CO1 'barcode' fragment for animals (2). While enriching targeted gene fragments, PCR amplifications can introduce taxonomic bias (1,3,4) and chimeric sequences (5,6) due to varied primer binding efficiencies across taxa. When the target bulk sample contains organisms from a wide range of lineages, as is typical of many biodiversity surveys, such artefacts would cause systematic bias in the subsequent diversity analysis. For example, our recent work based on HTS of PCR amplicons of CO1 barcodes (7) showed a significantly higher failure rate in hymenopterans (wasps and bees, 32%) relative to other mixed insects, even though the overall taxonomic recovery rate was improved from the previous method (8). Zhou et al. (9) demonstrated some success in identifying mixed species without PCR amplifications. However, in that study, a large proportion of potentially informative sequences (non-CO1 mitochondrial gene fragments) were ignored for species recovery because only CO1 barcodes were available as the reference.
A multi-locus identification approach has not only been promoted as a standard barcoding method for difficult groups, such as plants (10,11), but has also improved barcoding efficiencies in insects (12) and fungi (13), where a single-locus approach has been predominantly applied. A multi-locus system has been argued to deliver better taxonomic resolution in general (14–17). In addition to improving taxonomic delineation, a multi-locus approach can also alleviate false negatives caused by random missing of a given target gene due to insufficient sequencing or DNA degradation. Therefore, the acquisition of reference sequences for non-standard-barcode genes will greatly facilitate the expansion of the current single-gene barcoding approach by taking advantage of additional informative markers. Recent studies (15,17) have also discussed mechanisms for utilizing multiple markers in biodiversity analysis.

*To whom correspondence should be addressed. Tel: +86075525273620; Fax: +86075525273620; Email: xinzhou@genomics
Present Address: Yiyuan Li, Department of Biological Sciences, Galvin Life Science Center, University of Notre Dame, IN 46556, USA.
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.
Nucleic Acids Research Advance Access published October 7, 2014
The main impediment to a wide application of a multi-locus identification system does not lie in the lack of scientific motivation but rather in challenges in practical logistics (cost, technical difficulties, etc.). While significant progress has been made toward constructing DNA reference libraries useful for taxonomic identification, e.g. the International Barcode of Life (iBOL, ibol) project, such endeavors are primarily focused on the carefully selected CO1 barcode for animals, leaving other taxonomically informative genes aside. Although other mitochondrial genes (e.g. CYTB, ND1) have been demonstrated to be effective in both species delineation (18,19) and phylogenetic reconstruction (20,21), Sanger-sequencing-based reference construction for each of the additional MT genes will require similar global investment as the DNA barcoding initiative. Alternatively, whole mitochondrial genome sequencing can produce a full set of references in one shot, including protein coding genes, ribosomal DNA (16S and 12S), tRNA genes and the hyper-variable control region. Conventional methods for obtaining whole mitogenome sequences include primer-walking and long-range PCR coupled with Sanger or next-generation sequencing (NGS) (22,23). These time-consuming pipelines typically also require high-quality mitochondrial DNA to ensure the success of targeted PCR amplifications, which rules out the use of many preserved specimens in less optimal condition. Furthermore, for taxa containing high variability in gene sequences and gene orders, primer optimization is difficult (24). A whole-genome shotgun approach employing second generation sequencing technologies has been successfully applied in assembling full mitogenomes. However, most previous work has only dealt with a single taxon at a time (24–26) or a limited number of pooled taxa (27,28), resulting in high analytical cost for each genome. An in silico test containing 100 species (29) showed that the pooling strategy might enable simultaneous reconstruction of mitogenomes from many distantly related taxa, but this has not been demonstrated with real data.
The main motivation for multiplex sequencing and reconstructing mitogenomes from pooled taxa is to reduce the analytical cost of the individual library construction required for HTS. In principle, the more taxa that can be pooled and sequenced together, the lower the average cost for each species, to the point where the cost per taxon is mainly determined by its sequencing volume and the associated computational cost. In practice, a number of factors must be balanced: total number of pooled taxa, phylogenetic distance among taxa, DNA quality and quantity, and total sequencing volume. Empirical analysis will also need to consider specific features associated with the employed sequencing technology and assembly programs. In this study, we seek to answer these questions and develop a new pipeline for rapid and accurate reconstruction of multiplex mitogenomes from pooled taxa without relying on any DNA enrichment or amplification. In addition, we explore the plausibility of a multi-locus identification approach that integrates full mitogenome sequences, or 'mito-metagenomics'.
MATERIALS AND METHODS
Raw data (SRA174290), Sanger sequences (KM207019–KM207147) and assembled mitogenomes (KM244654–KM244713) are available on GenBank.
Taxon selection
The schematic analytical pipeline is illustrated in Figure 1. The level of phylogenetic distance among pooled taxa can potentially impact both shotgun-read assembly and subsequent taxonomic assignments for assembled scaffolds. A total of 49 animal species (primarily insects, Table 1 and Supplementary Table S1 in Appendix 1) were selected from 47 genera and 42 families, with most taxa representing a single family while a number of them were chosen from the same family (e.g. Cheilomenes sexmaculata and Propylea japonica, Lethe confusa and Mycalesis mineus) or genus (e.g. three Drosophila spp.). This sampling strategy enables us to understand the influence of pooling closely related species on mitogenome assembly. Samples used in this work include recently collected specimens and preserved tissues (collected in 2009 and 2010, see Supplementary Table S1 in Appendix 1 for details).
DNA extraction and sequencing
Genomic DNA of each individual specimen was extracted separately following Ivanova et al. (30). All genomic DNA extracts were quantified using Qubit 2.0 (Invitrogen, Life Technologies). DNA quality was categorized as levels A, B, C and D based on quantity and level of degradation (see notes in Supplementary Table S1 in Appendix 1). A total of 100 ng of each DNA was then pooled and used for HiSeq DNA library construction with an insert size of 250 bp following the manufacturer's instructions. The library was sequenced on an Illumina HiSeq 2000 with the strategy of 150 bp paired-end (PE) reads at BGI-Shenzhen, China.
De novo assembly and taxonomic assignments of mitochondrial scaffolds
Scripts and shell command lines are provided in 'Appendix 2 Supplementary notes.pdf'. All relevant script files and sequence alignments are available at: sourceforge/projects/mt10k/files/?source=navbar.
Data filtering and parallel assembling using multiple assemblers. Pre-analysis data filtering includes: (i) Reads with adapter contamination and poly-Ns (≥5) and PE reads with >10 bases of low quality scores (<20) were removed from raw data following Zhou et al. (9); (ii) Clean data were then compared with reference mitogenomes downloaded from GenBank (716 RefSeq genomes, including 699 arthropods, seven starfish and 10 cyprinid fish, accessed on 10 March 2014) to identify candidate mitochondrial reads using relaxed criteria: BLAST identity >30% and E-value ≤10^-5;
Table 1. List of taxa analyzed and corresponding assembly results
Figure 1. Schematic illustration of the pipeline.
(iii) A 51-mer set was then generated from the candidate mito-reads and used as the reference for a second round of data filtering of the reads discarded in step (ii); (iv) The combined clean reads from steps (ii) and (iii) were used for de novo assembling.
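The filtering steps can be sketched roughly as follows. This is a minimal illustration rather than the authors' scripts (which are in Appendix 2): the function names are hypothetical, "poly-Ns (≥5)" is assumed here to mean a total count of five or more Ns per read, adapter screening and the BLAST step (ii) are omitted, and reverse-complement k-mers are ignored for brevity.

```python
from typing import Iterable, List, Set


def passes_quality_filter(seq: str, quals: List[int],
                          max_n: int = 4, max_low_q: int = 10,
                          q_cutoff: int = 20) -> bool:
    """Step (i): reject reads with >=5 Ns or >10 bases below Q20.

    Assumption: Ns are counted over the whole read, not as one run.
    """
    if seq.upper().count("N") > max_n:
        return False
    low_quality_bases = sum(1 for q in quals if q < q_cutoff)
    return low_quality_bases <= max_low_q


def build_kmer_set(reads: Iterable[str], k: int = 51) -> Set[str]:
    """Step (iii): collect every k-mer present in the candidate mito-reads."""
    kmers: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def shares_kmer(read: str, kmers: Set[str], k: int = 51) -> bool:
    """Second-round screen: keep a discarded read if it shares any k-mer."""
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))
```

In this sketch, reads rejected by the BLAST screen would be passed through `shares_kmer` and, if retained, merged with the step (ii) reads before assembly.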
The clean reads were assembled by SOAPdenovo, SOAPdenovo-Trans (33) (-K 71, …) and IDBA-UD (34) (kMaxShortSequence = …), respectively. The three sets of assemblies were annotated separately using a Perl script described by Zhou et al. (9) and a mitogenome reference database containing full mitogenomes (RefSeq) from 604 arthropod species, two asteriid starfish and the zebrafish, downloaded from GenBank on 13 June 2013. Only scaffolds of mitochondrial origin were kept for subsequent taxonomic assignments.
Scaffold concatenation. All scaffolds containing mitochondrial proteins (mito-scaffolds) were clustered and re-assembled […] scaffolds missed by […] TGICL. Concatenated scaffolds were annotated again to identify regions of protein coding genes.

Taxonomic assignments for protein coding genes and scaffolds. The taxonomic assignment pipeline is summarized in Supplementary Figure S1 (Appendix 2). Briefly, all protein coding genes were aligned by 'megablast' to a mitochondrial protein coding gene reference database containing 886 010 sequences downloaded from GenBank on 25 February 2014, including all arthropods, starfish and the zebrafish. For a given protein coding gene, the best BLAST match (top hit) was selected for subsequent taxonomic assignment: if the best-matched species was listed in our input taxa table, a species-level assignment was made for the protein coding gene; otherwise the associated higher taxonomic ranks of the best-matched species (i.e. Genus, Subfamily, Family, Order) were compared against the input taxon list until a match was achieved. Unassigned CO1 sequences were also compared with the Barcode of Life Data Systems (BOLD, boldsystems) for further taxonomic assignments. Taxonomic assignment of scaffolds was made primarily based on CO1 (when available) and confirmed by other protein coding genes assembled on the same scaffold on a majority consensus basis (section S2 in Appendix 2).
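The rank-by-rank matching and the scaffold-level consensus described above can be sketched as below. This is only an illustration under stated assumptions: the function names, the dictionary layout of a lineage and the exact tie handling are hypothetical, and the actual procedure is documented in section S2 of Appendix 2.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Ranks are tried from the most specific to the most general.
RANKS = ["species", "genus", "subfamily", "family", "order"]


def assign_taxon(best_hit: Dict[str, str],
                 input_taxa: List[Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (rank, name) at the lowest rank where the top BLAST hit
    matches any taxon in the input list, or None if nothing matches."""
    for rank in RANKS:
        name = best_hit.get(rank)
        if name is None:
            continue
        if any(taxon.get(rank) == name for taxon in input_taxa):
            return rank, name
    return None


def scaffold_consensus(gene_to_taxon: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Assign a scaffold from CO1 when present and report whether the
    majority of the other genes on the scaffold agree (assumed rule)."""
    others = [t for g, t in gene_to_taxon.items() if g != "CO1"]
    majority = Counter(others).most_common(1)[0][0] if others else None
    primary = gene_to_taxon.get("CO1", majority)
    confirmed = majority is None or majority == primary
    return primary, confirmed
```

For instance, a top hit to Drosophila simulans would be assigned at genus level if the input list contains only Drosophila melanogaster.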
Finally, the remaining unassigned scaffolds were subjected to Sanger sequence verification. Consulting results from missing taxa (i.e. species without any associated mito-scaffolds after the above protein coding gene and scaffold taxon assignments) and missing protein coding genes, we amplified and Sanger sequenced three sets of markers: CO1, ND1 and ND5. These optional genes were selected to obtain an even coverage of the mitochondrial genome based on the general arthropod mitochondrial structure. Primers used in this study are listed in Supplementary Table S2 (Appendix 3). The Sanger sequences were then used to identify mito-scaffolds missed by the previous taxonomic assignment procedures. Finally, all mito-scaffolds that were assigned to the input taxa were used to construct the super-scaffold for each of the pooled species.
Alignment and validation of mitogenome sequences

Because none of the current de novo genome assemblers was designed for handling circular genomes, complete linear mitogenomes usually contained repetitive overlaps at the termini. If the termini of a linear super-scaffold contained overlapping sequences of >25 bp, the corresponding assembly was considered a complete circular genome. Sometimes TGICL produces unusually long (e.g. >20 kb) scaffolds due to its inability to recognize the real ends of the genome. Thus repetitive sequences need to be removed from the final genome assemblies. Automated Perl scripts (section S1 in Appendix 2) were developed to identify unusually long scaffolds containing identical overlaps. A manual inspection using 'Geneious' (36) followed for the remaining scaffolds of >15 kb after the automated step to identify overlapping terminal regions with Ns or mismatched nucleotides (artefacts produced in IDBA-UD assembling or concatenation of different assemblies by TGICL). Mito-scaffolds with redundant sequences removed from the overlapping termini were used for subsequent analysis.
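The automated check for identical terminal overlaps can be illustrated with a short sketch. It is not the authors' Perl script: the function names are hypothetical, only exact matches are detected (overlaps with Ns or mismatches were inspected manually in the study), and the search is deliberately naive.

```python
from typing import Tuple


def terminal_overlap(seq: str, min_len: int = 26) -> int:
    """Length of the longest prefix that is identical to the suffix of the
    same length, if it is at least min_len (i.e. >25 bp); otherwise 0."""
    for n in range(len(seq) - 1, min_len - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def circularize(seq: str, min_len: int = 26) -> Tuple[str, bool]:
    """Trim the redundant terminal copy and flag the scaffold as circular."""
    n = terminal_overlap(seq, min_len)
    if n:
        return seq[:-n], True
    return seq, False
```

Trimming the suffix by the longest exact overlap also collapses scaffolds that wrap around the genome more than once, which is one way the unusually long scaffolds mentioned above could be reduced to a single copy.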
Each of the 13 MT protein coding genes extracted from the mito-scaffolds was aligned individually across all assembled taxa by 'ClustalW' (37) using reference protein coding gene sequences from six model organisms (Drosophila melanogaster, Drosophila simulans, Drosophila pseudoobscura, Aedes aegypti, Danaus plexippus and Tribolium castaneum) and by MEGA (38) to ensure correct translation frames for amino acids. Indels created by assembly programs based on HiSeq reads' paired-end information were validated by the global alignment results: redundant Ns were removed and alignment gaps were inserted. Stop codons were also examined as a hint of erroneous assemblies. Aligned protein coding genes were then placed back into the mitogenomes. Finally, a manual procedure was undertaken to assure assembly quality: the three assembly versions were compared to the final corresponding mitogenome and filtered mito-reads were mapped to the genome to examine uneven sequence depth (section S1 in Appendix 2). When a particular region was assembled by only one of the three assemblers with low read coverage (e.g. close to 0), we considered it a false assembly and corrected it according to the other two programs. As a final step, reads were mapped to the mitogenomes using BWA (39) to identify regions with exceptionally low coverage relative to adjacent regions. The problematic regions were examined using SAMtools (40) to investigate potential conflicting allelic variations (including both natural polymorphic alleles and artefacts). The nucleotide suggested by SAMtools was subsequently chosen as the consensus base for the final assembly.

Reference mitogenomes of six species (D. melanogaster, Drosophila erecta, D. pseudoobscura, T. castaneum, Bactrocera dorsalis and Danio rerio) were downloaded from GenBank and compared to our final assemblies. Both nucleotide and amino acid sequences were examined with the recognized possibility of intraspecific variation in the mitogenomes. Finally, Sanger sequences from fragments of genes CO1, ND1 and ND5 were obtained for validating the final assembly quality. Primers were designed using Primer 5.0 (Supplementary Table S2 in Appendix 3). With the combined evidence, we examined all mitogenome scaffolds for hints of local assembly errors and chimeras. The assemblies of all input taxa were also annotated for protein coding genes, tRNA, rRNA genes and the control regions using 'Geneious'.
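The read-depth check described above, flagging regions whose coverage is exceptionally low relative to adjacent regions, could be approximated as follows once per-base depth has been exported from the read mapping. The window size and the 0.2 ratio are illustrative assumptions, not values from the study, and the function name is hypothetical.

```python
from typing import List, Tuple


def low_coverage_windows(depth: List[int], window: int = 100,
                         ratio: float = 0.2) -> List[Tuple[int, int]]:
    """Return (start, end) windows whose mean depth is below `ratio` times
    the mean depth of the neighbouring windows (0-based, end exclusive)."""
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))

    flagged = []
    for j, mean_depth in enumerate(means):
        neighbours = [means[k] for k in (j - 1, j + 1) if 0 <= k < len(means)]
        if not neighbours:
            continue  # a single window has nothing to be compared with
        reference = sum(neighbours) / len(neighbours)
        if reference > 0 and mean_depth < ratio * reference:
            flagged.append((j * window, min((j + 1) * window, len(depth))))
    return flagged
```

Flagged windows would then be the candidates for closer inspection of conflicting alleles.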
In silico simulation for multi-locus mitochondrial metagenomics
To evaluate whether and how multiple mitochondrial loci could improve biodiversity recovery for mixed animal samples, we conducted an in silico analysis using portions of the Illumina data generated in this study. A series of data volumes (2, 5 and 8 Gb) were randomly selected from the total high-quality reads to simulate varied sequencing depths for the given animal sample (containing 49 species of varied phylogenetic relatedness). Reads and the corresponding scaffolds/contigs that were assembled from these reads using only SOAPdenovo 2.0 were aligned against the aligned protein coding genes derived from the above assembly pipeline using 'BWA' and 'BLAST', respectively. Criteria for a successful taxon recovery were defined as: 100% sequence identity and ≥90% coverage for at least one protein coding gene marker. We first calculated taxonomic recovery rates at varied sequencing depths for the standard animal CO1 barcode region, then expanded the analysis to include the full CO1 gene, CO2, CO3 and eventually all 13 MT protein coding genes.
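The recovery criteria lend themselves to a small sketch showing why adding markers can only keep or raise the recovery rate at a fixed depth. The data layout and function names are hypothetical, and coverage is assumed to be expressed as a percentage of the marker length.

```python
from typing import Dict, Iterable, List


def taxon_recovered(hits: List[dict], markers: Iterable[str],
                    min_identity: float = 100.0,
                    min_coverage: float = 90.0) -> bool:
    """A taxon counts as recovered if at least one allowed marker has
    100% identity and >=90% coverage."""
    allowed = set(markers)
    return any(h["marker"] in allowed
               and h["identity"] >= min_identity
               and h["coverage"] >= min_coverage
               for h in hits)


def recovery_rate(hits_by_taxon: Dict[str, List[dict]],
                  markers: Iterable[str]) -> float:
    """Fraction of input taxa recovered with the given marker set."""
    allowed = set(markers)
    recovered = sum(taxon_recovered(hits, allowed)
                    for hits in hits_by_taxon.values())
    return recovered / len(hits_by_taxon)
```

Calling `recovery_rate` first with `{"CO1"}` and then with progressively larger marker sets mirrors the stepwise expansion from the CO1 barcode to all 13 protein coding genes.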
RESULTS
Construction of mitogenomes
As shown in Figure 1, a total of 230 million raw PE reads were produced on a whole HiSeq 2000 lane (ca. 35 Gb raw data, SRA174290), while 22 million high-quality PE reads (3.3 Gb, containing candidate mitochondrial reads) were retained after removal of adaptors, low-quality reads and most non-mitochondrial sequences. The clean reads were used for the subsequent assembling. A total of 884 000, 208 000 and 270 000 scaffolds were obtained using SOAPdenovo, SOAPdenovo-Trans and IDBA-UD, producing 691, 383 and 416 mito-scaffolds, respectively. The three sets of mito-scaffolds were clustered and further assembled into 658 scaffolds using TGICL. A total of 649 scaffolds were retained for further analysis after manual examination.
All protein coding genes annotated from the 649 mito-scaffolds (including 118 CO1-scaffolds and 531 non-CO1-scaffolds) were compared against the NCBI MT protein coding gene reference library using 'megablast'. The first round of taxonomic assignment identified 47 scaffolds containing protein coding genes readily assigned to 38 input taxa, which were retained for the final mitogenome construction. An additional four CO1-scaffolds were further identified by Barcode of Life Data (BOLD) via the CO1 barcode regions and kept. After these two steps, seven (all mayflies) of the 49 input taxa were not yet associated with any mito-scaffolds.