Electronic PCR Technology


A web server for performing electronic PCR
Kirill Rotmistrovsky, Wonhee Jang and Gregory D. Schuler*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20984, USA
Received February 27, 2004; Revised and Accepted April 21, 2004
ABSTRACT
‘Electronic PCR’ (e-PCR) refers to a computational procedure that is used to search DNA sequences for sequence tagged sites (STSs), each of which is defined by a pair of primer sequences and an expected PCR product size. To gain speed, our implementation extracts short ‘words’ from the 3′ end of each primer and stores them in a sorted hash table that can be accessed efficiently during the search. One recent improvement is the use of overlapping discontinuous words to allow matches to be found despite the presence of a mismatch. Moreover, it is possible to allow gaps in the alignment between the primer and the sequence. The effect of these changes is to improve sensitivity without significantly affecting specificity. The new software provides a search mode using a query STS against a sequence database to augment the previously available mode using a query sequence against an STS database. Finally, e-PCR may now be used through a web service, with search results linked to other web resources such as the UniSTS database and the MapViewer genome browser. The e-PCR web server may be found at bi.v/sutils/e-pcr.
INTRODUCTION
A major milestone in the history of genome map construction was the notion of a sequence tagged site (STS), which is defined by a pair of oligonucleotide primers that can be used in a PCR to amplify a unique site within the genome (1). STS markers have formed the basis for virtually all physical and genetic maps constructed over the last decade, rapidly replacing the earlier generation of cloned DNA segment markers. PCR primer pairs can also be used to probe the transcriptome, yielding large-scale profiles of gene expression. In an era when the large-scale sequencing of genomes and transcriptomes is routinely undertaken, there is significant utility in being able to cross-reference large collections of PCR primer pairs and sequences.
We have previously described the concept of ‘electronic PCR’ (e-PCR) as a computational procedure for finding sequence tagged sites within DNA sequences and provided an efficient implementation of this procedure (2). To gain speed we employed the commonly used strategy of hashing, in which the bases from a window of size W (a ‘word’) are used as an index into a hash table (for an overview of program parameters, see Table 1). Each time a matching word is found, a portion of the sequence is checked for an alignment to the corresponding primer. Finally, a match is reported if both primers are found in the correct orientation and imply a product size that is within M bases of the expected size (Figure 1). Increasing the value of W accelerates the search by reducing the background of word matches that must be investigated. In the original implementation of the program, only one word was hashed per primer and the requirement that its W contiguous bases match exactly led to a loss of sensitivity. Even though mismatches were allowed in the primer alignment step, their presence within the hashed word was sufficient to deny a match.
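The hashing strategy described above can be illustrated with a minimal Python sketch. It is not the e-PCR source code: the function names (`build_hash`, `epcr`), the tuple layout of an STS and the small default parameter values are assumptions made for illustration. For brevity it hashes only the word at the 3′ end of the left primer, scans a single strand, assumes every primer is at least W bases long, and counts mismatches only (no gaps).

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_hash(sts_list, W):
    """Index each STS by the W bases at the 3' end of its left primer."""
    table = defaultdict(list)
    for sts in sts_list:
        _name, left, _right, _size = sts
        table[left[-W:]].append(sts)
    return table

def epcr(seq, sts_list, W=7, N=0, M=50):
    """Return (name, start, end, product_size) for each STS found in seq."""
    table = build_hash(sts_list, W)
    hits = []
    for i in range(len(seq) - W + 1):
        # A word hit only nominates a site; the primers are then checked in full.
        for name, left, right, size in table.get(seq[i:i + W], ()):
            start = i + W - len(left)      # where the left primer would begin
            if start < 0 or mismatches(seq[start:i + W], left) > N:
                continue
            right_rc = revcomp(right)      # right primer as it appears on this strand
            for product in range(size - M, size + M + 1):
                end = start + product
                if product < len(left) + len(right) or end > len(seq):
                    continue
                if mismatches(seq[end - len(right):end], right_rc) <= N:
                    hits.append((name, start, end, product))
    return hits

# Hypothetical STS: two 10-base primers, expected product size 48.
left, right = "GATTACAGGC", "CCATGGTTAA"
target = "TTTT" + left + "A" * 30 + revcomp(right) + "CCCC"
print(epcr(target, [("demo", left, right, 48)], W=7, N=0, M=5))
# [('demo', 4, 54, 50)]
```

In this sketch a substitution inside the hashed word removes the hit even when N = 1, because the word lookup never nominates the site; this is the loss of sensitivity that the original single-word scheme suffered from.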
In previous reports, we have described applications of e-PCR for binding genomic and transcribed sequences to map positions (2) and for using STS maps to assess the quality and completeness of a genomic sequence (3). Here we describe algorithmic changes that result in improved sensitivity, a search mode in which a query STS can be compared to a sequence database, and a web server for performing e-PCR.
*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: schuler@ncbi.v
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. © 2004, the authors
Table 1. Electronic PCR program parameters

Parameter  Meaning
W          Number of bases used as a word for hashing
F          Number of discontiguous words hashed (F = 0 for contiguous)
N          Number of mismatches allowed in primer alignment
G          Number of gaps allowed in primer alignment
M          Number of bases the STS size may differ from expected size

Additional options may be added in the future. Invoking the program with -h as the argument produces a list of all available options.
Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved
Nucleic Acids Research, 2004, Vol. 32, Web Server issue, W108–W112. DOI: 10.1093/nar/gkh450
SEARCH SENSITIVITY
The original e-PCR implementation was fairly rigid in its match criteria, but there are a number of reasons why a more ‘fuzzy’ matching strategy might be desired. For example, when searching against single-pass (low-quality) sequences, such as expressed sequence tags and clone end sequences, a certain rate of sequencing error is expected. Alternatively, the sequence may be free of errors and the goal to find near-matches that may cause confounding signals. In any event, depending on the particular bases involved, the PCR biochemical reaction may tolerate some mispairing within the primers. To improve search sensitivity, we have modified both the hashing and the primer alignment steps of e-PCR.
To reduce the likelihood that a true STS will be missed due to mismatches, we have changed the way in which hash table values are generated. Instead of using a single exact word, multiple, discontiguous words are used, each of which has groups of significant positions separated by ‘wildcard’ positions that are not required to match. They have been variously called ‘templates’, ‘patterns’ and ‘motifs’ and have been used previously in DNA database searching (4–6) and multiple protein sequence alignment (7). In e-PCR, the F parameter specifies the number of words generated as well as the spacing of the wildcard positions. For example, using F = 3 (as in Figure 1b), the wildcards occur every third position. By having this template successively shifted by one position in each of the three words, every base corresponds to a wildcard in some word. Thus, a word match is guaranteed for any case where there is just one mismatch. Two or more mismatches will still pose a problem (except in the fortuitous case where their spacing is a multiple of F). However, as will be shown later, allowing more than one mismatch greatly increases the number of false positives.
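The single-mismatch guarantee can be checked with a short Python sketch. The helper names (`discontiguous_keys`, `shares_key`, `substitute`) and the use of `.` for a wildcard are assumptions for illustration, as is the exact placement of the wildcards relative to the 3′ end; only the idea of F templates, each shifted by one position, is taken from the text.

```python
def discontiguous_keys(word, F):
    """Return the hash keys for a word.

    With F == 0 the word is used as a single contiguous key. Otherwise F keys
    are produced, and the k-th key masks every position p with p % F == k.
    """
    if F == 0:
        return [(0, word)]
    return [
        (k, "".join("." if p % F == k else base for p, base in enumerate(word)))
        for k in range(F)
    ]

def shares_key(a, b, F):
    """True if words a and b agree on at least one key, i.e. produce a word hit."""
    return bool(set(discontiguous_keys(a, F)) & set(discontiguous_keys(b, F)))

def substitute(word, pos):
    """Return word with the base at pos replaced by a different base."""
    new_base = "C" if word[pos] == "A" else "A"
    return word[:pos] + new_base + word[pos + 1:]

word = "ACGTTGCAA"  # W = 9, as in Figure 1b
for k, key in discontiguous_keys(word, 3):
    print(k, key)
# 0 .CG.TG.AA
# 1 A.GT.GC.A
# 2 AC.TT.CA.

# Every single substitution still shares a key when F = 3 ...
print(all(shares_key(word, substitute(word, p), 3) for p in range(len(word))))
# ... but never when a single contiguous word is used (F = 0).
print(any(shares_key(word, substitute(word, p), 0) for p in range(len(word))))
```

Two substitutions at adjacent positions defeat all three templates, whereas two substitutions three positions apart fall under the same template and still give a hit, which is the fortuitous case mentioned above.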
Figure 1. Electronic PCR concepts. (a) An STS is defined by a pair of primers which anneal to the target DNA in opposite orientations. Each primer is extended on its 3′ end in the direction of the other primer by Taq polymerase. Multiple cycles of annealing and extension lead to a substantial amplification of the STS sequence, also known as the ‘amplicon’. (b) By default, a single contiguous word of W bases (W = 9 in this example) is extracted from the 3′ end of each primer. Any mismatch or gap within this region is sufficient to eliminate a match. With discontiguous words enabled (F = 3), three words are indexed, each with a ‘wildcard’ position every third base. Staggering of the wildcard positions ensures that no single mismatch in this region will deny a match.

The primer alignment step that is invoked following each word hit has been modified to allow gaps (insertions or deletions) in the alignment. This feature is enabled by the G parameter, whose value specifies the maximum number of gaps allowed in each primer. Although the algorithm does not place any constraints on where the gap may be, it is important to note that gaps within the W bases used for hashing will generally prevent getting a word hit. Allowing gaps is very useful when searching low-quality sequences or when using primers designed from low-quality data. However, this option will also increase the running time and may generate false positives.

Given these modifications for improving sensitivity, it is of interest to see how effective they are and to what extent they affect specificity. To test this, we chose a set of 584 microsatellite STSs that were used as reference markers for the human transcript map (8,9). They have all been mapped with high confidence, and in a consistent order, by meiotic linkage mapping (10) and radiation hybrid (RH) mapping using two different RH panels, the GeneBridge 4 panel (11) and the Stanford G3 panel (12). Thus, we may be fairly certain that they represent unique sites in the genome. However, one caveat is that if a site appears multiple times within a window smaller than the map resolution, it would have appeared as a unique site in the mapping studies. Of the three mapping resources, Stanford G3 is the most precise, with an average resolution of ~500 kb (12). In fact, we found two instances of STSs reacting with multiple sites that were <500 kb apart and we treated each pair as a single site. If an STS was found only once in the genome, and on the expected chromosome, it was assumed to be a true positive. For those that hit multiple times, one was counted as a true positive (if on the correct chromosome) and all the rest were counted as false positives. Of course, any STS not found was regarded as a false negative.

We compared the test STS set to the complete sequence of the human genome using various e-PCR parameters to see how specificity and sensitivity would be affected. Table 2 provides the numbers of true and false matches, the number of STSs not found and calculations of sensitivity (fraction of STSs reporting a match) and specificity (fraction of matches that are true). Not surprisingly, a specificity of 1.0 is obtained when primers are required to match exactly (N = 0, G = 0), but 80 of the markers were not found, yielding a sensitivity of only 0.863. Although the human genomic sequence is known to contain gaps, a recent analysis suggests that it includes 99% of the euchromatin (The International Human Genome Sequencing Consortium, manuscript submitted), suggesting an upper bound of ~0.990 on sensitivity. Using discontiguous words and allowing one mismatch and one gap per primer gives the best balance of sensitivity and specificity (0.983 and 0.991). Allowing two gaps resulted in only one additional STS being found but more than doubled the number of false positives. The most drastic loss of specificity is seen when two mismatches are allowed. The five markers that were not found under any conditions were investigated further and found to have alignment gaps within the hashed W bases. In these comparisons, the word size was kept constant (W = 12). It should be noted that for tests using discontiguous words, the ‘effective word size’ is eight (excluding four wildcard positions), which causes the program to run more slowly but does not significantly change the results shown in Table 2. Note also that these tests were performed using a set of well-mapped markers, which will surely have different properties from those chosen at random.
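As a worked check of the figures quoted above, the following Python sketch recomputes sensitivity and specificity from the true-positive, false-positive and false-negative counts listed in Table 2, using the definitions given with that table: Sn = TP/(TP + FN) and Sp = TP/(TP + FP). The function and variable names are illustrative only. The quantity called specificity here is the fraction of reported matches that are true, which is often called precision elsewhere.

```python
def sensitivity(tp, fn):
    """Fraction of STSs that report a match: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of reported matches that are true: TP / (TP + FP)."""
    return tp / (tp + fp)

# (word type, mismatches N, gaps G, TP, FP, FN), taken from Table 2
TABLE2 = [
    ("Contiguous",    0, 0, 504,   0, 80),
    ("Contiguous",    1, 0, 543,   0, 41),
    ("Contiguous",    1, 1, 556,   3, 28),
    ("Discontiguous", 1, 0, 560,   0, 24),
    ("Discontiguous", 1, 1, 574,   5, 10),
    ("Discontiguous", 1, 2, 575,  14,  9),
    ("Discontiguous", 2, 1, 578, 172,  6),
    ("Discontiguous", 2, 2, 579, 874,  5),
]

for word_type, n, g, tp, fp, fn in TABLE2:
    assert tp + fn == 584  # every test STS is either found or missed
    print(f"{word_type:14s} N={n} G={g}  "
          f"Sn={sensitivity(tp, fn):.3f}  Sp={specificity(tp, fp):.3f}")
```

Because TP + FN is always the 584 test markers, sensitivity can only rise as the match criteria are relaxed, while specificity falls as soon as the extra matches are mostly false.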
REVERSE SEARCHING
Although the original e-PCR program constructed a hash table from the STS database, there are situations in which it is more desirable to hash the sequence database. We have implemented this strategy and refer to it as ‘reverse e-PCR’. Conversely, the previous use of a sequence query against an STS database would be ‘forward e-PCR’. The main motivation for implementing reverse e-PCR was to make it feasible to search the human genome sequence (and other large genomes) in an interactive web service. Before performing a reverse e-PCR search, the sequence database must be processed using a specific word size and discontiguous word option. Sequences are scanned, examining each word in turn, and ultimately a data structure is created in which each possible word has an associated list of all sequence coordinates (pairs of sequence identifier and base position) at which it is found. This step is time-consuming, but only needs to be done once (unless the underlying sequence changes). Thereafter, an STS can be compared against the genome by extracting a few words from each primer, retrieving lists of positions to examine and reading only the necessary portions of the sequence into memory to test for primer alignments. However, the index (actually a set of several files organized for efficient memory mapping) requires storage that is 10–15× the size of the original sequence database. In other words, space is traded to gain time.
It should be noted that speed will degrade significantly when a primer contains a highly repetitive word. This is due to the fact that its list of sequence coordinates will be large, and following up on each one requires reading a segment of sequence data from the storage device. In the initial indexing of the database, it is possible to identify words that occur too frequently and simply mark them as repetitive rather than storing all of their positions. This seems reasonable because, as a general rule, users will only want to know that a candidate marker is repetitive (so that it can be eliminated from further consideration) and not see a full account of its positions. Eliminating repetitive word coordinates from the index both increases speed and decreases storage requirements. However, it also may result in occasionally missing a true STS because it is possible that the W bases used for the lookup are repetitive, even if the primer as a whole is not.
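The index used for reverse searching can be sketched in Python as a mapping from each word to its list of (sequence identifier, base position) pairs, with over-represented words replaced by a marker. This is a simplified in-memory illustration, not the on-disk, memory-mapped format described above; the names `build_index`, `candidate_sites`, `REPETITIVE` and the `max_occurrences` threshold are assumptions.

```python
from collections import defaultdict

REPETITIVE = object()  # marker stored in place of a coordinate list

def build_index(sequences, W, max_occurrences):
    """Map every W-base word to the (sequence id, position) pairs where it occurs.

    Words seen more than max_occurrences times are only flagged as repetitive,
    which keeps the index smaller at the cost of losing their positions.
    """
    positions = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - W + 1):
            positions[seq[pos:pos + W]].append((seq_id, pos))
    return {
        word: (coords if len(coords) <= max_occurrences else REPETITIVE)
        for word, coords in positions.items()
    }

def candidate_sites(index, primer, W):
    """Look up the word at the 3' end of a primer.

    Returns ("repetitive", []) for flagged words, otherwise ("ok", coords),
    where coords lists the sites at which a primer alignment should be tested.
    """
    coords = index.get(primer[-W:], [])
    if coords is REPETITIVE:
        return "repetitive", []
    return "ok", coords

# Hypothetical toy database with one ordinary and one low-complexity sequence.
database = {"chrA": "ACGTACGGATTACAGGCTT", "chrB": "AAAAAAAAAAAA"}
index = build_index(database, W=4, max_occurrences=3)

print(candidate_sites(index, "CCGATTACA", 4))  # word TACA occurs once in chrA
print(candidate_sites(index, "GGAAAA", 4))     # word AAAA is flagged as repetitive
```

Only the coordinates returned by the lookup need to be read and aligned, which is where the time saving comes from; the second lookup also shows how a primer can be reported as repetitive on the strength of its hashed word alone.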
To provide a better sense of which search strategy is most appropriate for certain situations, we have devised a series of test cases using sets of sequences and STSs of different sizes. The large sequence database consists of all of the sequence contigs from the 2.86 Gb human genome sequence. For the small sequence, we chose a single 270 kb sequence entry (AB026898) corresponding to a region of human chromosome 3. This represents 1/10000 of the genome and falls at the high end of a typical size distribution for large-insert clones. A set of 132648 non-repetitive human markers from UniSTS constitutes the large STS database. Of these, a set of 13 (again, 1/10000 of the large set) markers (all of which fall within AB026898) was chosen as the small STS database. In addition, the program was run using just a single STS (marker D3S3333). The time and disk space required for each situation are shown in Table 3. Overall, reverse e-PCR is faster when using small numbers of STSs,
Table 2. e-PCR sensitivity and specificity with different search options

Search parameters                         Search results
Word type      Mismatches  Gaps      True       False      False      Sensitivity  Specificity
               allowed     allowed   positives  positives  negatives
Contiguous     0           0         504        0          80         0.863        1.000
Contiguous     1           0         543        0          41         0.930        1.000
Contiguous     1           1         556        3          28         0.952        0.995
Discontiguous  1           0         560        0          24         0.959        1.000
Discontiguous  1           1         574        5          10         0.983        0.991
Discontiguous  1           2         575        14         9          0.985        0.976
Discontiguous  2           1         578        172        6          0.990        0.771
Discontiguous  2           2         579        874        5          0.991        0.398

Tests were conducted with a set of 584 microsatellite STSs with a consistent order among three maps. They were compared to the complete human genome (build 34; July 2003) using W = 12 and M = 200. Discontiguous words were activated with F = 3 and the number of mismatches and gaps allowed were varied using the N and G parameters (Table 1). Sensitivity (Sn) is defined as Sn = TP/(TP + FN), where TP is true positives and FN is false negatives. Specificity (Sp) is defined as Sp = TP/(TP + FP), where FP is false positives.
Table 3. Relative running times and storage requirements for forward and reverse e-PCR

                           Forward e-PCR          Reverse e-PCR
Datasets                   Time (s)  Space (MB)   Time (s)  Space (MB)
Small sequence (270 kb)
  Versus single STS        4         <1           3         6
  Versus small STS set     4         <1           16        6
  Versus large STS set     11        12           78        17
Large sequence (2865 Mb)
  Versus single STS        1178      2906         38        35837
  Versus small STS set     1161      2906         155       35837
  Versus large STS set     54540     2917         n.d.      35844

The large sequence dataset is the human genome and the small sequence is GenBank entry AB026898. The large STS database consists of all 132648 non-repetitive human markers from UniSTS and the small set is a group of 13 markers found within AB026898. All tests were conducted using discontiguous words. n.d.: the time for searching a large sequence against a large STS set using reverse e-PCR was not determined exactly, but is estimated to take 10 days.
while forward e-PCR is better with larger STS sets. Comparing a single STS to the human genome required 38 s with reverse e-PCR, about a 40-fold increase in speed compared to the equivalent search with forward e-PCR. However, with the small (13 marker) set, the advantage is closer to 10-fold, and with the large set, the reverse search takes too long to be feasible. The basis for any performance benefit with reverse e-PCR lies in avoiding a scan of the entire genome. As the number of STSs increases, we rapidly approach the situation where most of the database must be examined. Furthermore, there are more data involved and they are retrieved in a random-access fashion rather than sequentially. Consequently, reverse e-PCR is best suited for its intended use in interactive searching of large sequences. It should be appreciated that the actual search times that may be expected with the web service may vary significantly due to system load, network latency and properties of the data.
THE e-PCR WEB SERVER
Although users may download the e-PCR software and apply it to any dataset they wish, there are a number of advantages to having a centralized web server dedicated to this task. Though not particularly difficult, downloading, installation and maintenance of the software are chores that occasional users will probably want to avoid. A more significant issue is the maintenance of the STS and sequence databases, which, as we have seen, may require substantial disk resources. We have developed an e-PCR web server on the NCBI site, which provides a comprehensive STS database, UniSTS (bi.nlm.nih.gov/genome/unists), and DNA sequence datasets for the genomes and transcriptomes of several well-studied organisms. Another advantage of a web-based implementation is that results can be linked to related resources and more sophisticated views can be easily provided. As described in more detail below, e-PCR results may be linked to UniSTS, the NCBI MapViewer [(13) available from ncbi.v/entrez/query.fcgi?db=Books] and UniGene [(14) available from ncbi.nlm.n
In preparation for a forward e-PCR search, the user must specify one or more query sequences, either by pasting the actual sequence data (FASTA format) or by entering GenBank accession numbers. The only STS database provided is UniSTS, which is a comprehensive collection covering all species, with data inputs from both GenBank STS sequence entries and published STS maps (as of May 2004, the database contained 265380 distinct primer pairs). There are provisions for changing any of the parameters shown in Table 1 as well as an option to exclude from the database STSs that have been flagged as too repetitive. Once the search is complete, results are presented in tabular form. For each STS found, the position within the query sequence, the marker name, the chromosome (if known) and the species of origin are given. Each marker has a hypertext link to the corresponding UniSTS entry, which provides the primer sequences and expected product sizes, alternate names by which the marker may be known, mapping results and additional pre-computed e-PCR results.
Setting up a reverse e-PCR search requires entering one or more STSs and selecting a sequence database. It is mandatory that a species be selected, together with a choice of either genome or transcriptome. It is possible to change the values for some of the parameters, but the choices are limited for W and F because they are fixed at the time the sequence database is hashed. Several interfaces are provided for entering STS information, using either separate input fields or a single text area into which formatted information can be pasted. Once the search is complete, the results are summarized in a tabular format, giving the number of hits for each marker. The software also performs a lookup of the primers in UniSTS to determine if any of them correspond to markers that have already been developed. When searching a genome sequence, each hit has a link to a graphical display in the NCBI Map Viewer, where it is possible to see where the STS is found relative to other annotated features. When the transcriptome option is used, each hit is linked to a Gene or UniGene database entry.
DISCUSSION
With the genomic sequence in hand, a major application of e-PCR is the integration of legacy maps with the sequence. By doing so, all STSs, whether they come from a high-resolution clone-based map of a disease susceptibility locus or radiation hybrid map of the whole genome, can be placed in a common coordinate system. Indeed, the STS track presented in the NCBI Map Viewer is generated using e-PCR to compare all UniSTS entries to the genomic sequence. Furthermore, comparison of this computationally generated STS map with experimentally determined genetic and physical maps provides a certain level of validation of the genome assembly. However, it should be noted that only gross rearrangements are likely to be found given the resolution of the maps.

Integration of genetic linkage maps with the genomic sequence has the added benefit of allowing regions of particularly high and low rates of meiotic recombination to be identified. This is of interest because recombination can have a profound effect on the evolution of chromosomal segments. In a previous study (15), e-PCR was used to localize polymorphic STSs from a human linkage map (16) within an older (‘working draft’) version of the human genome (17). By looking at the ratio between genetic distances measured in centiMorgans (cM, defined as 1% recombination) and physical base-pair distances, several recombination ‘deserts’ (low) and ‘jungles’ (high) were identified. It was noted that regions of linkage disequilibrium extended for greater distances in the deserts than in the jungles.
The e-PCR program has increasingly important applications to the process of designing new PCR primer pairs. Primers are usually chosen using software, such as Primer3 (18), that selects DNA oligos with a desired melting temperature and applies various heuristics to avoid problems such as low-complexity sequences and self-annealing primer pairs. A useful adjunct to this process is to use e-PCR to compare the chosen primers to the genomic sequence. Primers that match multiple locations in the genome can be discarded before expending resources synthesizing the oligos and using them in an experiment. This is particularly important given the trend toward construction of large arrays containing tens of thousands of PCR products, which are commonly used to study gene expression patterns or to identify DNA-binding factors.
SOFTWARE AVAILABILITY
The e-PCR software is in the public domain and source code is freely available by FTP from ftp://bi.v/repository/e-PCR/. The code is compatible with, but does not require, the NCBI C++ Software Toolkit [(19), available from ncbi.v/entrez/query.fcgi?db=Books]. Entry to the web services is at bi.nlm.nih.gov/sutils/e-pcr/.
ACKNOWLEDGEMENTS
We would like to thank the reviewers and the many users of e-PCR who have reported bugs and have made helpful suggestions for improvements.
REFERENCES
1. Olson, M., Hood, L., Cantor, C. and Botstein, D. (1989) A common language for physical mapping of the human genome. Science, 245, 1434–1435.
2. Schuler, G.D. (1997) Sequence mapping by electronic PCR. Genome Res., 7, 541–550.
3. Schuler, G.D. (1998) Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol., 16, 456–459.
4. Califano, A. and Rigoutsos, I. (1993) FLASH: a fast look-up algorithm for string homology. Proc. Int. Conf. Intell. Syst. Mol. Biol., 1, 56–64.
5. Ma, B., Tromp, J. and Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.
6. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., Miller, W., Ma, B., et al. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.
7. Posfai, J., Bhagwat, A.S., Posfai, G. and Roberts, R.J. (1989) Predictive motifs derived from cytosine methyltransferases. Nucleic Acids Res., 17, 2421–2435.
8. Schuler, G.D., Boguski, M.S., Stewart, E.A., Stein, L.D., Gyapay, G., Rice, K., White, R.E., Rodriguez-Tomé, P., Aggarwal, A., et al. (1996) A gene map of the human genome. Science, 274, 540–546.
9. Deloukas, P., Schuler, G.D., Gyapay, G., Beasley, E.M., Soderlund, C., Rodriguez-Tomé, P., Hui, L., Matise, T.C., McKusick, K.B., Beckmann, et al. (1998) A physical map of 30,000 human genes. Science, 282, 744–746.
10. Dib, C., Faure, S., Fizames, C., Samson, D., Drouot, N., Vignal, A., Millasseau, P., Marc, S., Hazan, J., et al. (1996) A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380, 152–154.
11. Gyapay, G., Schmitt, K., Fizames, C., Jones, H., Vega-Czarny, N., Spillett, D., Mulet, D., Prud’homme, J.F., Dib, C., et al. (1996) A radiation hybrid map of the human genome. Hum. Mol. Genet., 5, 339–346.
12. Stewart, E.A., McKusick, K.B., Aggarwal, A., Bajorek, E., Brady, S., Chu, A., Fang, N., Hadley, D., Harris, M., et al. (1997) An STS-based radiation hybrid map of the human genome. Genome Res., 7, 422–433.
13. Dombrowski, S.M. and Maglott, D. (2002) Using the Map Viewer to explore genomes. In McEntyre, J. (ed.), The NCBI handbook [Internet]. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda (MD).
14. Pontius, J.U., Wagner, L. and Schuler, G.D. (2002) UniGene: a unified view of the transcriptome. In McEntyre, J. (ed.), The NCBI handbook [Internet]. National Library of Medicine (US), Bethesda (MD).
15. Yu, A., Zhao, C., Fan, Y., Jang, W., Mungall, A.J., Deloukas, P., Olsen, A., Doggett, N.A., Ghebranious, N., Broman, et al. (2001) Comparison of human genetic and sequence-based physical maps. Nature, 409, 951–953.
16. Broman, K.W., Murray, J.C., Sheffield, V.C., White, R.L. and Weber, J.L. (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet., 63, 861–869.
17. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
18. Rozen, S. and Skaletsky, H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365–386.
19. NCBI (2003) The NCBI C++ Toolkit [Internet]. National Library of Medicine (NLM), Bethesda, MD.