Electronic PCR Technology


A web server for performing electronic PCR

Kirill Rotmistrovsky, Wonhee Jang and Gregory D. Schuler*

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Received February 27, 2004; Revised and Accepted April 21, 2004

ABSTRACT

‘Electronic PCR’ (e-PCR) refers to a computational procedure that is used to search DNA sequences for sequence tagged sites (STSs), each of which is defined by a pair of primer sequences and an expected PCR product size. To gain speed, our implementation extracts short ‘words’ from the 3′ end of each primer and stores them in a sorted hash table that can be accessed efficiently during the search. One recent improvement is the use of overlapping discontiguous words to allow matches to be found despite the presence of a mismatch. Moreover, it is possible to allow gaps in the alignment between the primer and the sequence. The effect of these changes is to improve sensitivity without significantly affecting specificity. The new software provides a search mode using a query STS against a sequence database to augment the previously available mode using a query sequence against an STS database. Finally, e-PCR may now be used through a web service, with search results linked to other web resources such as the UniSTS database and the MapViewer genome browser. The e-PCR web server may be found at bi.v/sutils/e-pcr.

INTRODUCTION

A major milestone in the history of genome map construction was the notion of a sequence tagged site (STS), which is defined by a pair of oligonucleotide primers that can be used in a PCR to amplify a unique site within the genome (1). STS markers have formed the basis for virtually all physical and genetic maps constructed over the last decade, rapidly replacing the earlier generation of cloned DNA segment markers. PCR primer pairs can also be used to probe the transcriptome, yielding large-scale profiles of gene expression. In an era when the large-scale sequencing of genomes and transcriptomes is routinely undertaken, there is significant utility in being able to cross-reference large collections of PCR primer pairs and sequences.

We have previously described the concept of ‘electronic PCR’ (e-PCR) as a computational procedure for finding sequence tagged sites within DNA sequences and provided an efficient implementation of this procedure (2). To gain speed we employed the commonly used strategy of hashing, in which the bases from a window of size W (a ‘word’) are used as an index into a hash table (for an overview of program parameters, see Table 1). Each time a matching word is found, a portion of the sequence is checked for an alignment to the corresponding primer. Finally, a match is reported if both primers are found in the correct orientation and imply a product size that is within M bases of the expected size (Figure 1). Increasing the value of W accelerates the search by reducing the background of word matches that must be investigated. In the original implementation of the program, only one word was hashed per primer and the requirement that its W contiguous bases match exactly led to a loss of sensitivity. Even though mismatches were allowed in the primer alignment step, their presence within the hashed word was sufficient to deny a match.
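The hashing strategy described above can be illustrated with a minimal Python sketch. It is not the e-PCR source code: the function names, the dictionary layout of an STS record and the simplification of hashing only the left primer (the right primer is checked directly at each candidate product size) are all assumptions made for illustration. W, N and M have the meanings given in Table 1.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_hash(sts_list, W):
    """Map the W bases at the 3' end of each left primer to its STS records."""
    table = defaultdict(list)
    for sts in sts_list:
        table[sts["left"][-W:]].append(sts)
    return table


def forward_epcr(seq, sts_list, W=7, N=0, M=50):
    """Return (name, start, product_size) for every STS found in seq."""
    table = build_hash(sts_list, W)
    hits = []
    for end in range(W, len(seq) + 1):
        # Look up the W-base word that ends at position `end`.
        for sts in table.get(seq[end - W:end], ()):
            left, right_rc = sts["left"], revcomp(sts["right"])
            start = end - len(left)
            if start < 0 or mismatches(seq[start:end], left) > N:
                continue
            # Accept any product size within M bases of the expected size.
            for size in range(max(1, sts["size"] - M), sts["size"] + M + 1):
                stop = start + size
                if stop > len(seq):
                    break
                rstart = stop - len(right_rc)
                if rstart < end:
                    continue  # the two primers would overlap
                if mismatches(seq[rstart:stop], right_rc) <= N:
                    hits.append((sts["name"], start, size))
    return hits


# Hypothetical STS and target sequence, invented for this example.
sts = {"name": "demo", "left": "ACGTACGGTC", "right": "TTGACCAGTA", "size": 55}
target = "GG" + sts["left"] + "A" * 30 + revcomp(sts["right"]) + "CC"
print(forward_epcr(target, [sts], W=7, N=0, M=10))  # [('demo', 2, 50)]
```

In this sketch a substitution inside the last W bases of the left primer removes the hit even when N = 1, because the word lookup fails before any alignment is attempted. This is the loss of sensitivity that the paragraph above attributes to the original implementation.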

In previous reports, we have described applications of e-PCR for binding genomic and transcribed sequences to map positions (2) and for using STS maps to assess the quality and completeness of a genomic sequence (3). Here we describe algorithmic changes that result in improved sensitivity, a search mode in which a query STS can be compared to a sequence database, and a web server for performing e-PCR.

*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: schuler@ncbi.v

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. © 2004, the authors

Table 1. Electronic PCR program parameters

Parameter   Meaning
W           Number of bases used as a word for hashing
F           Number of discontiguous words hashed (F = 0 for contiguous)
N           Number of mismatches allowed in primer alignment
G           Number of gaps allowed in primer alignment
M           Number of bases the STS size may differ from expected size

Additional options may be added in the future. Invoking the program with -h as the argument produces a list of all available options.

Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved

W108–W112 Nucleic Acids Research, 2004, Vol. 32, Web Server issue DOI: 10.1093/nar/gkh450

SEARCH SENSITIVITY

The original e-PCR implementation was fairly rigid in its match criteria, but there are a number of reasons why a more ‘fuzzy’ matching strategy might be desired. For example, when searching against single-pass (low-quality) sequences, such as expressed sequence tags and clone end sequences, a certain rate of sequencing error is expected. Alternatively, the sequence may be free of errors and the goal is to find near-matches that may cause confounding signals. In any event, depending on the particular bases involved, the PCR biochemical reaction may tolerate some mispairing within the primers. To improve search sensitivity, we have modified both the hashing and the primer alignment steps of e-PCR.

To reduce the likelihood that a true STS will be missed due to mismatches, we have changed the way in which hash table values are generated. Instead of using a single exact word, multiple, discontiguous words are used, each of which has groups of significant positions separated by ‘wildcard’ positions that are not required to match. Such words have been variously called ‘templates’, ‘patterns’ and ‘motifs’ and have been used previously in DNA database searching (4–6) and multiple protein sequence alignment (7). In e-PCR, the F parameter specifies the number of words generated as well as the spacing of the wildcard positions. For example, using F = 3 (as in Figure 1b), the wildcards occur every third position. By having this template successively shifted by one position in each of the three words, every base corresponds to a wildcard in some word. Thus, a word match is guaranteed for any case where there is just one mismatch. Two or more mismatches will still pose a problem (except in the fortuitous case where their spacing is a multiple of F). However, as will be shown later, allowing more than one mismatch greatly increases the number of false positives.
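The guarantee for a single mismatch can be checked with a short Python sketch. The key layout (a shift number paired with the significant bases) and all function names are assumptions chosen for illustration, not the data structures used by e-PCR itself.

```python
def discontiguous_words(window, F):
    """Return the hash keys for one W-base window.

    With F == 0 a single contiguous word is produced. Otherwise F words are
    produced; word s treats every position i with i % F == s as a wildcard.
    """
    if F == 0:
        return [(0, window)]
    return [
        (s, "".join(base for i, base in enumerate(window) if i % F != s))
        for s in range(F)
    ]


def shares_word(a, b, F):
    """True if two windows have at least one hash key in common."""
    return bool(set(discontiguous_words(a, F)) & set(discontiguous_words(b, F)))


def mutate(window, positions):
    """Substitute a different base at each of the given positions."""
    bases = list(window)
    for p in positions:
        bases[p] = "A" if bases[p] != "A" else "C"
    return "".join(bases)


primer_word = "ACGTTGCAGTCA"  # hypothetical 12-base window (W = 12)

# Every single substitution still shares a key when F = 3 ...
print(all(shares_word(primer_word, mutate(primer_word, [p]), 3) for p in range(12)))
# ... two adjacent substitutions do not ...
print(shares_word(primer_word, mutate(primer_word, [0, 1]), 3))
# ... but two substitutions spaced by a multiple of F do.
print(shares_word(primer_word, mutate(primer_word, [0, 3]), 3))
```

With W = 12 and F = 3 each key holds eight significant bases, because four positions per word are wildcards.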

Figure 1. Electronic PCR concepts. (a) An STS is defined by a pair of primers which anneal to the target DNA in opposite orientations. Each primer is extended on its 3′ end in the direction of the other primer by Taq polymerase. Multiple cycles of annealing and extension lead to a substantial amplification of the STS sequence, also known as the ‘amplicon’. (b) By default, a single contiguous word of W bases (W = 9 in this example) is extracted from the 3′ end of each primer. Any mismatch or gap within this region is sufficient to eliminate a match. With discontiguous words enabled (F = 3), three words are indexed, each with a ‘wildcard’ position every third base. Staggering of the wildcard positions ensures that no single mismatch in this region will deny a match.

The primer alignment step that is invoked following each word hit has been modified to allow gaps (insertions or deletions) in the alignment. This feature is enabled by the G parameter, whose value specifies the maximum number of gaps allowed in each primer. Although the algorithm does not place any constraints on where the gap may be, it is important to note that gaps within the W bases used for hashing will generally prevent getting a word hit. Allowing gaps is very useful when searching low-quality sequences or when using primers designed from low-quality data. However, this option will also increase the running time and may generate false positives.

Given the modifications for improving sensitivity, it is of interest to see how effective they are and to what extent they affect specificity. To test this, we chose a set of 584 microsatellite STSs that were used as reference markers for the human transcript map (8,9). They have all been mapped with high confidence, and in a consistent order, by meiotic linkage mapping (10) and radiation hybrid (RH) mapping using two different RH panels, the GeneBridge 4 panel (11) and the Stanford G3 panel (12). Thus, we may be fairly certain that they represent unique sites in the genome. However, one caveat is that if a site appears multiple times within a window smaller than the map resolution, it would have appeared as a unique site in the mapping studies. Of the three mapping resources, Stanford G3 is the most precise, with an average resolution of 500 kb (12). In fact, we found two instances of STSs reacting with multiple sites that were <500 kb apart and we treated each pair as a single site. If an STS was found only once in the genome, and on the expected chromosome, it was assumed to be a true positive. For those that hit multiple times, one was counted as a true positive (if on the correct chromosome) and all the rest were counted as false positives. Of course, any STS not found was regarded as a false negative.

We compared the test STS set to the complete sequence of the human genome using various e-PCR parameters to see how specificity and sensitivity would be affected. Table 2 provides the numbers of true and false matches, the number of STSs not found and calculations of sensitivity (fraction of STSs reporting a match) and specificity (fraction of matches that are true). Not surprisingly, a specificity of 1.0 is obtained when primers are required to match exactly (N = 0, G = 0), but 80 of the markers were not found, yielding a sensitivity of only 0.863. Although the human genomic sequence is known to contain gaps, a recent analysis suggests that it includes 99% of the euchromatin (The International Human Genome Sequencing Consortium, manuscript submitted), suggesting an upper bound of 0.990 on sensitivity. Using discontiguous words and allowing one mismatch and one gap per primer gives the best balance of sensitivity and specificity (0.983 and 0.991). Allowing two gaps resulted in only one additional STS being found but more than doubled the number of false positives. The most drastic loss of specificity is seen when two mismatches are allowed. The five markers that were not found under any conditions were investigated further and found to have alignment gaps within the hashed W bases. In these comparisons, the word size was kept constant (W = 12). It should be noted that for tests using discontiguous words, the ‘effective word size’ is eight (excluding four wildcard positions), which causes the program to run more slowly but does not significantly change the results shown in Table 2. Note also that the tests were performed using a set of well-mapped markers, which will surely have different properties from those chosen at random.
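The sensitivity and specificity values quoted in this section follow directly from the counts of true positives, false positives and false negatives. The short Python sketch below recomputes three rows of Table 2 using the definitions given in its legend, Sn = TP/(TP + FN) and Sp = TP/(TP + FP). The function names are illustrative only. Note that the quantity called specificity here is what is often called precision or positive predictive value elsewhere.

```python
def sensitivity(tp, fn):
    """Fraction of STSs that reported a true match: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of reported matches that are true: TP / (TP + FP)."""
    return tp / (tp + fp)


# (label, true positives, false positives, false negatives) from Table 2.
rows = [
    ("Contiguous, N=0, G=0", 504, 0, 80),
    ("Discontiguous, N=1, G=1", 574, 5, 10),
    ("Discontiguous, N=2, G=2", 579, 874, 5),
]

for label, tp, fp, fn in rows:
    print(f"{label}: Sn={sensitivity(tp, fn):.3f} Sp={specificity(tp, fp):.3f}")
```

In every row the true positives and false negatives sum to the 584 test markers, which is a quick consistency check on the counts.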

REVERSE SEARCHING

Although the original e-PCR program constructed a hash table from the STS database, there are situations in which it is more desirable to hash the sequence database. We have implemented this strategy and refer to it as ‘reverse e-PCR’. Conversely, the previous use of a sequence query against an STS database would be ‘forward e-PCR’. The main motivation for implementing reverse e-PCR was to make it feasible to search the human genome sequence (and other large genomes) in an interactive web service.

Before performing a reverse e-PCR search, the sequence database must be processed using a specific word size and discontiguous word option. Sequences are scanned, examining each word in turn, and ultimately a data structure is created in which each possible word has an associated list of all sequence coordinates (pairs of sequence identifier and base position) at which it is found. This step is time-consuming, but only needs to be done once (unless the underlying sequence changes). Thereafter, an STS can be compared against the genome by extracting a few words from each primer, retrieving lists of positions to examine and reading only the necessary portions of the sequence into memory to test for primer alignments. However, the index (actually a set of several files organized for efficient memory mapping) requires storage that is 10–15× the size of the original sequence database. In other words, space is traded to gain time.
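The index described above can be pictured as a mapping from each word to its list of sequence coordinates. The following Python sketch builds such a mapping in memory for contiguous words only. The real index is a set of memory-mapped files, so the dictionary, the function names and the toy sequences here are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_index(sequences, W):
    """Map every W-base word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - W + 1):
            index[seq[pos:pos + W]].append((seq_id, pos))
    return index


def candidate_sites(index, primer, W):
    """Coordinates at which the 3'-terminal word of a primer is found."""
    return index.get(primer[-W:], [])


# Hypothetical two-entry sequence database.
database = {"chrA": "ACGTACGTTT", "chrB": "TTACGT"}
index = build_index(database, W=4)
print(candidate_sites(index, "GGACGT", W=4))
# [('chrA', 0), ('chrA', 4), ('chrB', 2)]
```

Only the listed coordinates need to be read and aligned against the primer, which is how a reverse search avoids scanning the whole database.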

It should be noted that speed will degrade significantly when a primer contains a highly repetitive word. This is due to the fact that its list of sequence coordinates will be large, and following up on each one requires reading a segment of sequence data from the storage device. In the initial indexing of the database, it is possible to identify words that occur too frequently and simply mark them as repetitive rather than storing all of their positions. This seems reasonable because, as a general rule, users will only want to know that a candidate marker is repetitive (so that it can be eliminated from further consideration) and not see a full account of its positions. Eliminating repetitive word coordinates from the index both increases speed and decreases storage requirements. However, it also may result in occasionally missing a true STS because it is possible that the W bases used for the lookup are repetitive, even if the primer as a whole is not.
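One way to picture the marking of repetitive words is a two-pass build: count every word first, then store positions only for words below a frequency cut-off. The Python sketch below follows that idea. The sentinel value, the `max_hits` parameter and the function name are hypothetical, and the text does not say how the actual threshold is chosen.

```python
from collections import Counter

REPETITIVE = None  # sentinel stored in place of a position list


def build_masked_index(sequences, W, max_hits):
    """Index W-base words, flagging those that occur more than max_hits times."""
    counts = Counter(
        seq[p:p + W]
        for seq in sequences.values()
        for p in range(len(seq) - W + 1)
    )
    index = {}
    for seq_id, seq in sequences.items():
        for p in range(len(seq) - W + 1):
            word = seq[p:p + W]
            if counts[word] > max_hits:
                index[word] = REPETITIVE  # remember only that it is repetitive
            else:
                index.setdefault(word, []).append((seq_id, p))
    return index


# Hypothetical sequence with a short poly-A run.
masked = build_masked_index({"s1": "AAAAAAACGT"}, W=3, max_hits=2)
print(masked["AAA"])  # None
print(masked["CGT"])  # [('s1', 7)]
```

A lookup that returns the sentinel tells the caller that the primer word is repetitive without listing its positions, which is the trade-off described above.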

To provide a better sense of which search strategy is most appropriate for certain situations, we have devised a series of test cases using sets of sequences and STSs of different sizes. The large sequence database consists of all of the sequence contigs from the 2.86 Gb human genome sequence. For the small sequence, we chose a single 270 kb sequence entry (AB026898) corresponding to a region of human chromosome 3. This represents 1/10 000 of the genome and falls at the high end of a typical size distribution for large-insert clones. A set of 132 648 non-repetitive human markers from UniSTS constitutes the large STS database. Of these, a set of 13 (again, 1/10 000 of the large set) markers (all of which fall within AB026898) was chosen as the small STS database. In addition, the program was run using just a single STS (marker D3S3333). The time and disk space required for each situation are shown in Table 3.

Overall, reverse e-PCR is faster when using small numbers of STSs, while forward e-PCR is better with larger STS sets. Comparing a single STS to the human genome required 38 s with reverse e-PCR, about a 40-fold increase in speed compared to the equivalent search with forward e-PCR. However, with the small (13 marker) set, the advantage is closer to 10-fold, and with the large set, the reverse search takes too long to be feasible. The basis for any performance benefit with reverse e-PCR lies in avoiding a scan of the entire genome. As the number of STSs increases, we rapidly approach the situation where most of the database must be examined. Furthermore, there are more data involved and they are retrieved in a random-access fashion rather than sequentially. Consequently, reverse e-PCR is best suited for its intended use in interactive searching of large sequences. It should be appreciated that the actual search times that may be expected with the web service may vary significantly due to system load, network latency and properties of the data.

Table 2. e-PCR sensitivity and specificity with different search options

Search parameters                          Search results
Word type       Mismatches  Gaps     True       False      False      Sensitivity  Specificity
                allowed     allowed  positives  positives  negatives
Contiguous      0           0        504        0          80         0.863        1.000
Contiguous      1           0        543        0          41         0.930        1.000
Contiguous      1           1        556        3          28         0.952        0.995
Discontiguous   1           0        560        0          24         0.959        1.000
Discontiguous   1           1        574        5          10         0.983        0.991
Discontiguous   1           2        575        14         9          0.985        0.976
Discontiguous   2           1        578        172        6          0.990        0.771
Discontiguous   2           2        579        874        5          0.991        0.398

Tests were conducted with a set of 584 microsatellite STSs with a consistent order among three maps. They were compared to the complete human genome (build 34; July 2003) using W = 12 and M = 200. Discontiguous words were activated with F = 3 and the number of mismatches and gaps allowed were varied using the N and G parameters (Table 1). Sensitivity (Sn) is defined as Sn = TP/(TP + FN), where TP is true positives and FN is false negatives. Specificity (Sp) is defined as Sp = TP/(TP + FP), where FP is false positives.

Table 3. Relative running times and storage requirements for forward and reverse e-PCR

                            Forward e-PCR           Reverse e-PCR
Datasets                    Time (s)   Space (MB)   Time (s)   Space (MB)
Small sequence (270 kb)
  Versus single STS         4          <1           3          6
  Versus small STS set      4          <1           16         6
  Versus large STS set      11         12           78         17
Large sequence (2865 Mb)
  Versus single STS         1178       2906         38         35837
  Versus small STS set      1161       2906         155        35837
  Versus large STS set      54540      2917         n.d.       35844

The large sequence dataset is the human genome and the small sequence is GenBank entry AB026898. The large STS database consists of all 132 648 non-repetitive human markers from UniSTS and the small set is a group of 13 markers found within AB026898. All tests were conducted using discontiguous words. n.d.: the time for searching a large sequence against a large STS set using reverse e-PCR was not determined exactly, but is estimated to take 10 days.

THE e-PCR WEB SERVER

Although users may download the e-PCR software and apply it to any dataset they wish, there are a number of advantages to having a centralized web server dedicated to this task. Though not particularly difficult, downloading, installation and maintenance of the software are chores that occasional users will probably want to avoid. A more significant issue is the maintenance of the STS and sequence databases, which, as we have seen, may require substantial disk resources. We have developed an e-PCR web server on the NCBI site, which provides a comprehensive STS database, UniSTS (bi.nlm.nih.gov/genome/unists), and DNA sequence datasets for the genomes and transcriptomes of several well-studied organisms. Another advantage of a web-based implementation is that results can be linked to related resources and more sophisticated views can be easily provided. As described in more detail below, e-PCR results may be linked to UniSTS, the NCBI MapViewer [(13) available from ncbi.v/entrez/query.fcgi?db=Books] and UniGene [(14) available from ncbi.nlm.n

In preparation for a forward e-PCR search, the user must specify one or more query sequences, either by pasting the actual sequence data (FASTA format) or by entering GenBank accession numbers. The only STS database provided is UniSTS, which is a comprehensive collection covering all species, with data inputs from both GenBank STS sequence entries and published STS maps (as of May 2004, the database contained 265 380 distinct primer pairs). There are provisions for changing any of the parameters shown in Table 1 as well as an option to exclude STSs from the database that have been flagged as too repetitive. Once the search is complete, results are presented in tabular form. For each STS found, the position within the query sequence, the marker name, the chromosome (if known) and the species of origin are given. Each marker has a hypertext link to the corresponding UniSTS entry, which provides the primer sequences and expected product sizes, alternate names by which the marker may be known, mapping results and additional pre-computed e-PCR results.

Setting up a reverse e-PCR search requires entering one or more STSs and selecting a sequence database. It is mandatory that a species be selected, together with a choice of either genome or transcriptome. It is possible to change the values for some of the parameters, but the choices are limited for W and F because they are fixed at the time the sequence database is hashed. Several interfaces are provided for entering STS information, using either separate input fields or a single text area into which formatted information can be pasted. Once the search is complete, the results are summarized in a tabular format, giving the number of hits for each marker. The software also performs a lookup of the primers in UniSTS to determine if any of them correspond to markers that have already been developed. When searching a genome sequence, each hit has a link to a graphical display in the NCBI Map Viewer, where it is possible to see where the STS is found relative to other annotated features. When the transcriptome option is used, each hit is linked to a Gene or UniGene database entry.

DISCUSSION

With the genomic sequence in hand, a major application of e-PCR is the integration of legacy maps with the sequence. By doing so, all STSs, whether they come from a high-resolution clone-based map of a disease susceptibility locus or a radiation hybrid map of the whole genome, can be placed in a common coordinate system. Indeed, the STS track presented in the NCBI Map Viewer is generated using e-PCR to compare all UniSTS entries to the genomic sequence. Furthermore, comparison of this computationally generated STS map with experimentally determined genetic and physical maps provides a certain level of validation of the genome assembly. However, it should be noted that only gross rearrangements are likely to be found given the resolution of the maps.

Integration of genetic linkage maps with the genomic sequence has the added benefit of allowing regions of particularly high and low rates of meiotic recombination to be identified. This is of interest because recombination can have a profound effect on the evolution of chromosomal segments. In a previous study (15), e-PCR was used to localize polymorphic STSs from a human linkage map (16) within an older (‘working draft’) version of the human genome (17). By looking at the ratio between genetic distances measured in centiMorgans (cM, defined as 1% recombination) and physical base-pair distances, several recombination ‘deserts’ (low) and ‘jungles’ (high) were identified. It was noted that regions of linkage disequilibrium extended for greater distances in the deserts than in the jungles.
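The ratio of genetic to physical distance mentioned above is simple to compute once each marker has both a linkage-map position and a sequence coordinate. The Python sketch below shows the arithmetic for adjacent marker pairs. The marker names, positions and the cut-offs used to label an interval a ‘desert’ or a ‘jungle’ are hypothetical values chosen for illustration; they are not taken from the study cited above.

```python
def recombination_rates(markers):
    """Return (marker1, marker2, cM per Mb) for each adjacent pair.

    markers: list of (name, genetic position in cM, physical position in bp),
    sorted by physical position, with distinct physical positions.
    """
    rates = []
    for (n1, cm1, bp1), (n2, cm2, bp2) in zip(markers, markers[1:]):
        megabases = (bp2 - bp1) / 1_000_000
        rates.append((n1, n2, (cm2 - cm1) / megabases))
    return rates


def classify(rate, low=0.3, high=3.0):
    """Label an interval using hypothetical cM/Mb cut-offs."""
    if rate < low:
        return "desert"
    if rate > high:
        return "jungle"
    return "typical"


# Hypothetical markers placed on the sequence by e-PCR.
markers = [("m1", 0.0, 1_000_000), ("m2", 0.25, 2_000_000), ("m3", 4.25, 3_000_000)]
for left, right, rate in recombination_rates(markers):
    print(left, right, rate, classify(rate))
```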

The e-PCR program has increasingly important applications to the process of designing new PCR primer pairs. Primers are usually chosen using software, such as Primer3 (18), that selects DNA oligos with a desired melting temperature and applies various heuristics to avoid problems such as low-complexity sequences and self-annealing primer pairs. A useful adjunct to this process is to use e-PCR to compare the chosen primers to the genomic sequence. Primers that match multiple locations in the genome can be discarded before expending resources synthesizing the oligos and using them in an experiment. This is particularly important given the trend toward construction of large arrays containing tens of thousands of PCR products, which are commonly used to study gene expression patterns or to identify DNA-binding factors.

SOFTWARE AVAILABILITY

The e-PCR software is in the public domain and source code is freely available by FTP from ftp://bi.v/repository/e-PCR/. The code is compatible with, but does not require, the NCBI C++ Software Toolkit [(19), available from ncbi.v/entrez/query.fcgi?db=Books]. Entry to the web services is at bi.nlm.nih.gov/sutils/e-pcr/.

ACKNOWLEDGEMENTS

We would like to thank the reviewers and the many users of e-PCR who have reported bugs and have made helpful suggestions for improvements.

REFERENCES

1. Olson,M., Hood,L., Cantor,C. and Botstein,D. (1989) A common language for physical mapping of the human genome. Science, 245, 1434–1435.

2. Schuler,G.D. (1997) Sequence mapping by electronic PCR. Genome Res., 7, 541–550.

3. Schuler,G.D. (1998) Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol., 16, 456–459.

4. Califano,A. and Rigoutsos,I. (1993) FLASH: a fast look-up algorithm for string homology. Proc. Int. Conf. Intell. Syst. Mol. Biol., 1, 56–64.

5. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

6. Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Hardison,R.C., Haussler,D., Miller,W., Ma,B. et al. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.

7. Posfai,J., Bhagwat,A.S., Posfai,G. and Roberts,R.J. (1989) Predictive motifs derived from cytosine methyltransferases. Nucleic Acids Res., 17, 2421–2435.

8. Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tomé,P., Aggarwal,A. et al. (1996) A gene map of the human genome. Science, 274, 540–546.

9. Deloukas,P., Schuler,G.D., Gyapay,G., Beasley,E.M., Soderlund,C., Rodriguez-Tomé,P., Hui,L., Matise,T.C., McKusick,K.B., Beckmann et al. (1998) A physical map of 30,000 human genes. Science, 282, 744–746.

10. Dib,C., Faure,S., Fizames,C., Samson,D., Drouot,N., Vignal,A., Millasseau,P., Marc,S., Hazan,J. et al. (1996) A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380, 152–154.

11. Gyapay,G., Schmitt,K., Fizames,C., Jones,H., Vega-Czarny,N., Spillett,D., Mulet,D., Prud’homme,J.F., Dib,C. et al. (1996) A radiation hybrid map of the human genome. Hum. Mol. Genet., 5, 339–346.

12. Stewart,E.A., McKusick,K.B., Aggarwal,A., Bajorek,E., Brady,S., Chu,A., Fang,N., Hadley,D., Harris,M. et al. (1997) An STS-based radiation hybrid map of the human genome. Genome Res., 7, 422–433.

13. Dombrowski,S.M. and Maglott,D. (2002) Using the Map Viewer to explore genomes. In McEntyre,J. (ed.), The NCBI handbook [Internet]. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda (MD).

14. Pontius,J.U., Wagner,L. and Schuler,G.D. (2002) UniGene: a unified view of the transcriptome. In McEntyre,J. (ed.), The NCBI handbook [Internet]. National Library of Medicine (US), Bethesda (MD).

15. Yu,A., Zhao,C., Fan,Y., Jang,W., Mungall,A.J., Deloukas,P., Olsen,A., Doggett,N.A., Ghebranious,N., Broman et al. (2001) Comparison of human genetic and sequence-based physical maps. Nature, 409, 951–953.

16. Broman,K.W., Murray,J.C., Sheffield,V.C., White,R.L. and Weber,J.L. (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet., 63, 861–869.

17. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

18. Rozen,S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365–386.

19. NCBI (2003) The NCBI C++ Toolkit [Internet]. National Library of Medicine (NLM), Bethesda, MD.
