Computability of Models for Sequence Assembly
Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno
University of Toronto, Canada
Janelia Farms, Howard Hughes Medical Institute, USA
pashadag,cgeorg,o.edu,myersg@
Abstract. Graph-theoretic models have come to the forefront as some of the most powerful and practical methods for sequence assembly. Simultaneously, the computational hardness of the underlying graph algorithms has remained open. Here we present two theoretical results about the complexity of these models for sequence assembly. In the first part, we show sequence assembly to be NP-hard under two different models: string graphs and de Bruijn graphs. Together with an earlier result on the NP-hardness of overlap graphs, this demonstrates that all of the popular graph-theoretic sequence assembly paradigms are NP-hard. In our second result, we give the first, to our knowledge, optimal polynomial time algorithm for genome assembly that explicitly models the double-strandedness of DNA. We solve the Chinese Postman Problem on bidirected graphs using bidirected flow techniques and show how to use it to find the shortest double-stranded DNA sequence which contains a given set of $k$-long words. This algorithm has applications to sequencing by hybridization and short read assembly.
1 Introduction
Most current technologies for sequencing genomes rely on the shotgun method: the genome (or a portion of it) is broken into many small segments (reads) whose sequence is then determined. The problem of combining the reads to reconstruct the source genome is known as sequence (or genome) assembly, and is one of the fundamental algorithmic problems within bioinformatics. One basic assumption made by assembly algorithms is that every read in the input must be present in the original genome. This follows from the fact that it was read from the genome. Motivated by parsimony, some methods made another, less justifiable assumption: that the original genome should be the shortest sequence that contains every read as a substring. This assumption led to the casting of the genome assembly problem as the Shortest Common Superstring (SCS) problem, which is known to be NP-hard [4].
The problem with modeling genome assembly as the SCS problem is that most genomes have repeats (multiple identical, or nearly identical, stretches of DNA), while the SCS solution would represent each of these repeats only once in the assembled genome. This problem is known as over-collapsing the repeats. One way of solving this problem is to build representative strings or structures for each repeat, and allow the assembly algorithm to use these multiple times. Pevzner et al. [12] had the insight that by dividing the reads into shorter $k$-long stretches (called $k$-mers), all of the instances of a repeat collapse into a single set of vertices. They represent each read as a walk on a de Bruijn graph (defined below), and the assembly could then be represented as a superwalk: a walk that includes all of the input walks. In this formulation every edge of the de Bruijn graph has to be present in any solution and can be used multiple times. The solution to the assembly problem is formulated as a variation on finding an Eulerian tour, and because the Eulerian tour problem is solvable in polynomial time this led to the hope of a polynomial algorithm for sequence assembly. This approach was later expanded to A-Bruijn graphs [13], where the initial subdivision into $k$-mers is not necessary, but the basic algorithmic problem of searching for a superwalk remains.
Fig. 1. A. An example of double stranded DNA. The sequence read from this DNA can be either ATTGCC or GGCAAT. B. Three possible types of overlaps between two reads: each read can be in either of two orientations, but two of the cases (both to the left and both to the right) are symmetric. C. The three corresponding types of bidirected edges. The left node corresponds to the lower read. Note that the arrow points into a node if and only if the overlap covers the start (5’) of the read.
Myers [10] provides an alternative model of sequence assembly, using a string graph. Instead of dividing the reads into $k$-mers, he builds an overlap graph: a graph where nodes correspond to reads and edges correspond to overlaps (the prefix of one read is the suffix of the other). Through the process of removing redundant edges he is able to classify all edges as either required or optional, and the goal of the assembly is to find the shortest walk which includes all of the required edges. The main algorithmic difference between the de Bruijn / A-Bruijn and the string graph models for sequence assembly is that while in the latter some edges are required and others are optional, in the former all edges are required, but walks have been pre-specified and must be included in the solution. In our first result, we show that sequence assembly with both string graphs and de Bruijn graphs is NP-hard by reduction from Hamiltonian Cycle and Shortest Common Superstring, respectively. Together, the two proofs demonstrate that both of the popular graph-theoretic sequence assembly paradigms are unsolvable by optimal polynomial-time algorithms unless P = NP.
Another algorithmic problem faced by assembly algorithms is the treatment of double-stranded DNA (see Figure 1A). A DNA molecule consists of two strands which are reverse complements of each other. The start (called 5’) of one strand complements the end (called 3’) of the other. Whenever DNA is sequenced, the molecule is always read in the same direction, from 5’ to 3’, but it is impossible to know from which of the two strands the sequence is read. Many sequence assembly algorithms use heuristics to determine the strand for each read. The EULER method [12] uses both the reads and their reverse-complements to build the de Bruijn graph and searches heuristically for two "complementary" paths. In the work of Kececioglu and Myers [6], strand selection for a read is formulated as the NP-hard maximum weight cut problem.
In 1992, Kececioglu [8] introduced an elegant method for dealing with double-strandedness by modeling overlaps between DNA molecules using a bidirected graph. Each read is represented by a single node, and each overlap (edge) has an orientation at both endpoints. The three types of bidirected edges correspond to the three possible ways in which the overlap can occur (see Figure 1B & C). Bidirected graphs were further used for sequence assembly in [9,10] and to model breakpoint graphs in [7]. Remarkably, however, bidirected graphs were already studied within graph theory in the 1960s, when Edmonds formulated the problem of bidirected flow (a generalization of network flow to bidirected graphs) and showed it equivalent to perfect b-matchings [1]. Edmonds’ work was later extended by Gabow [3], who gave the fastest to-date algorithm for bidirected flow. In our second result, we extend Gabow’s and Edmonds’ work to give a polynomial time algorithm for solving the Chinese Postman Problem in bidirected graphs. By combining this algorithm with Pevzner’s work on de Bruijn graphs [11,12] and Kececioglu’s work on modeling strandedness with bidirected graphs [8], we show how it can be used to find the shortest (double-stranded) DNA sequence with a given set of $k$-long DNA fragments. To the best of our knowledge, this is the first optimal polynomial time assembly algorithm which explicitly deals with the double-stranded nature of DNA.
2 Preliminaries
In this section, we give the background and definitions needed for the rest of this paper.
2.1 Strings, Overlap Graphs, de Bruijn Graphs, and Molecules
Let $v$ and $w$ be two strings over the alphabet $\Sigma$. The concatenation of these strings is denoted as $v \cdot w$. The length of $v$ is denoted by $|v|$. The $i$th character of $v$ is denoted by $v[i]$. If $1 \le i \le j \le |v|$, then $v[i..j]$ is the substring beginning at the $i$th position and ending at the $j$th position, inclusive. If there exist $i, j$ such that $v = w[i..j]$, then we say $v$ is a substring of $w$. For $k \ge 0$, $v^k$ is $v$ concatenated with itself $k$ times if $k > 0$, and the empty string $\epsilon$ otherwise. A string of length $k$ is called a $k$-mer. The $k$-spectrum of $v$ is the set of all $k$-mers that are substrings of $v$. A $k$-molecule is a pair of $k$-mers which are reverse complements of each other. We say a $k$-molecule corresponds to each of its two constitutive $k$-mers. The $k$-molecule-spectrum of a DNA molecule is the set of all $k$-molecules corresponding to the $k$-mers of the $k$-spectrum of either of the DNA strands.
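The definitions of the $k$-spectrum and the $k$-molecule-spectrum can be illustrated with a minimal Python sketch. The function names and the choice to represent a $k$-molecule as a `frozenset` of its two $k$-mers are assumptions made here for illustration; they do not come from the paper.

```python
# Illustrative sketch only: names and data representation are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def k_spectrum(s, k):
    """Set of all k-mers that are substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def k_molecule(kmer):
    """A k-molecule: the unordered pair {kmer, reverse complement of kmer}."""
    return frozenset((kmer, reverse_complement(kmer)))


def k_molecule_spectrum(strand, k):
    """All k-molecules corresponding to k-mers of either strand."""
    both = k_spectrum(strand, k) | k_spectrum(reverse_complement(strand), k)
    return {k_molecule(m) for m in both}
```

For the molecule of Figure 1A, `reverse_complement("ATTGCC")` gives `"GGCAAT"`, and the eight 3-mers of the two strands pair up into four 3-molecules.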
We say $v$ overlaps $w$ if there exists a maximal length non-empty string $z$ which is a prefix of $w$ and a suffix of $v$ (notice this definition is not symmetric). The length of the overlap is $ov(v, w) = |z|$. If $v$ does not overlap $w$ then $ov(v, w) = 0$. Let $S = \{s_1, \ldots, s_n\}$ be a set of non-empty strings over an alphabet $\Sigma$. An overlap graph of $S$ is a complete weighted directed graph where each string in $S$ is a vertex and the length of the edge $x \to y$ is $|y| - ov(x, y)$.
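As a hedged illustration of the overlap definition, the sketch below assumes that "$v$ overlaps $w$" means a suffix of $v$ equals a prefix of $w$, and that the edge $x \to y$ of the overlap graph has length $|y| - ov(x, y)$. Both conventions, and the names `ov` and `overlap_graph`, are assumptions of this example.

```python
# Illustrative sketch only: orientation and edge-length conventions are assumed.
def ov(v, w):
    """Length of the longest non-empty suffix of v that is a prefix of w, else 0."""
    for length in range(min(len(v), len(w)), 0, -1):
        if v.endswith(w[:length]):
            return length
    return 0


def overlap_graph(strings):
    """Complete weighted directed graph as a dict {(x, y): edge length}."""
    return {(x, y): len(y) - ov(x, y)
            for x in strings for y in strings if x != y}
```

The asymmetry is visible on a small input: `ov("ATTGC", "GCCA")` is 2, while `ov("GCCA", "ATTGC")` is 1.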
We say $s$ is a superstring of $S$ if every $s_i \in S$ is a substring of $s$. The Shortest Common Superstring (SCS) problem is to find the shortest superstring of $S$. It was proven to be NP-hard for $|\Sigma| \ge 2$ [4,5]. We define the de Bruijn graph $B^k(S) = (V, E)$ as a directed graph, using a positive integer parameter $k$. The vertices $V$ are the $k$-mers $d$ such that $d$ is a substring of some $s_i \in S$. We abuse notation here by referring to a vertex in $V$ by the $k$-mer associated with it. The edges $E$ are the pairs $d[1..k] \to d[2..k+1]$ such that the $(k+1)$-mer $d$ is a substring of some $s_i \in S$.
Fig. 2. An example of a bidirected graph and its incidence matrix. We draw an edge that is positive-incident to a vertex using an arrow that is pointing out of the vertex, but this choice of graphical representation is arbitrary.
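The de Bruijn graph construction can be sketched in a few lines of Python. This sketch assumes that vertices are the $k$-mers occurring in the input strings and that each $(k+1)$-mer occurring in some input string contributes one edge from its length-$k$ prefix to its length-$k$ suffix; the function name is hypothetical.

```python
# Illustrative sketch only: assumes vertices are k-mers and edges are (k+1)-mers.
def de_bruijn_graph(strings, k):
    """Return (vertices, edges) of the de Bruijn graph of the given strings."""
    vertices, edges = set(), set()
    for s in strings:
        # Every k-mer that occurs in s is a vertex.
        for i in range(len(s) - k + 1):
            vertices.add(s[i:i + k])
        # Every (k+1)-mer d that occurs in s is an edge d[:k] -> d[1:].
        for i in range(len(s) - k):
            d = s[i:i + k + 1]
            edges.add((d[:k], d[1:]))
    return vertices, edges
```

Because vertices and edges are stored as sets, repeated $k$-mers collapse onto a single vertex, which is the property that motivates the model.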
2.2 Bidirected Graphs and Flow
Consider an undirected (multi)graph $G$ with a set of vertices $V$ and a set of edges $E$. The multiplicity of an edge $e$ is the number of edges in $E$ whose endpoints are the same as $e$’s. If the endpoints are distinct, the edge is called a link, otherwise it is a loop. Additionally, we assign orientations to the edges. Every link has two orientations, one with respect to each of its endpoints, while every loop has one orientation. There are two kinds of orientations, positive and negative, and thus we can say an edge is positive-incident or negative-incident to an endpoint. When taken together with the orientations of its edges, $G$ is called a bidirected graph. If there is additionally a weight function $w$ associated with the edges, we say the graph is weighted. The weight of a graph is the sum of the weights of its edges. A bidirected graph is connected if its underlying undirected graph is connected.
The orientations of the edges can be represented by an incidence matrix $I_G \in \{-2, -1, 0, 1, 2\}^{|V| \times |E|}$ (we omit $G$ when it is obvious from the context). If an edge $e$ is not incident to a vertex $v$ then $I(v, e) = 0$. For a link $e$ and a vertex $v$, $I(v, e) = +1$ if $e$ is positive-incident to $v$, and $I(v, e) = -1$ if $e$ is negative-incident to $v$. For a loop $e$ and a vertex $v$, $I(v, e)$ has the value of $+2$ if $e$ is positive-incident to $v$, and the value of $-2$ if $e$ is negative-incident to $v$. See Figure 2 for an example of a bidirected graph and its incidence matrix. The in-degree of a vertex $v$ in graph $G$ is defined as $d^-_G(v) = -\sum_{e : I(v,e) < 0} I(v, e)$. Similarly, the out-degree is defined as $d^+_G(v) = \sum_{e : I(v,e) > 0} I(v, e)$. Let $\mathrm{bal}_G(v) = d^-_G(v) - d^+_G(v)$ be the balance at each vertex. $G$ is balanced if the balance of each vertex is 0. A $(v_0, v_k)$-walk is a sequence $v_0, e_1, v_1, \ldots, e_k, v_k$ where $e_i$ is an edge incident to $v_{i-1}$ and $v_i$, and for all $1 \le i \le k-1$, $e_i$ and $e_{i+1}$ have opposite orientations at $v_i$. Since the specification of vertices is redundant, we may omit them sometimes and specify a walk as just a sequence of edges. A walk is said to be cyclical if its endpoints are the same and $e_1$ and $e_k$ have opposite orientations at $v_0$. A bidirected graph is strongly connected if it is connected and for every edge there is a cyclical walk containing it.
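To make the walk condition concrete, here is a minimal Python sketch. It stores the incidence matrix as a dictionary mapping (vertex, edge) pairs to nonzero entries, with +1 or -1 for links and +2 or -2 for loops, and takes the out-degree to be the sum of the positive entries of a vertex and the in-degree to be the absolute sum of its negative entries. The representation, the degree convention, the function names, and the small example graph are all assumptions of this sketch; the graph is not the one in Figure 2.

```python
# Illustrative sketch only: the dict-based incidence matrix is an assumption.
def sign(x):
    return (x > 0) - (x < 0)


def out_degree(I, v):
    """Sum of the positive incidence entries at v."""
    return sum(x for (u, _e), x in I.items() if u == v and x > 0)


def in_degree(I, v):
    """Absolute sum of the negative incidence entries at v."""
    return -sum(x for (u, _e), x in I.items() if u == v and x < 0)


def is_walk(I, vertices, edges):
    """Check that v0, e1, v1, ..., ek, vk is a walk in the bidirected graph."""
    if len(vertices) != len(edges) + 1:
        return False
    # Each edge must be incident to the vertices before and after it.
    for i, e in enumerate(edges):
        if I.get((vertices[i], e), 0) == 0 or I.get((vertices[i + 1], e), 0) == 0:
            return False
    # Consecutive edges must have opposite orientations at the shared vertex.
    for i in range(1, len(edges)):
        v = vertices[i]
        if sign(I[(v, edges[i - 1])]) == sign(I[(v, edges[i])]):
            return False
    return True


# Hypothetical example: e1 leaves a and enters b; e2 leaves both b and c;
# e3 enters both b and c.
I = {("a", "e1"): 1, ("b", "e1"): -1,
     ("b", "e2"): 1, ("c", "e2"): 1,
     ("b", "e3"): -1, ("c", "e3"): -1}
```

In this example `is_walk(I, ["a", "b", "c"], ["e1", "e2"])` holds, since e1 enters b and e2 leaves it, whereas replacing e2 by e3 fails because both edges enter b.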
Note that we can view a loopless directed graph as a special kind of bidirected graph, where every edge is positive-incident to one of its endpoints and negative-incident to the other one. In this case, the definition of a walk reduces to its usual meaning in directed graphs. However, there are some caveats. For example, in a bidirected graph it is possible for the shortest walk between two vertices to repeat a vertex, something that is not possible in a directed graph. Figure 2 contains a pair of vertices for which no walk between them avoids repeating a vertex.
A Chinese walk is a cyclical walk that traverses every edge at least once. Given a weighted bidirected graph, the Chinese Postman Problem (CPP) is to find a minimum weight Chinese walk (called a Chinese Postman Tour), or report that one doesn’t exist. An Eulerian tour of a graph is a cyclical walk that contains every edge of the graph exactly once, and a graph which contains an Eulerian tour is called Eulerian. The following is a generalization of a well-known fact for directed graphs; its proof is almost identical to the directed case and is therefore omitted.
Lemma 1. A bidirected graph contains an Eulerian tour if and only if it is connected and balanced.
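The criterion of Lemma 1 translates directly into a linear-time test. The following Python sketch assumes the incidence matrix is stored as a dictionary from (vertex, edge) pairs to nonzero entries, and that a vertex is balanced exactly when its incidence entries sum to zero (in-degree equals out-degree). The representation and the name `is_eulerian` are assumptions of this example.

```python
# Illustrative sketch only: tests "connected and balanced" as in Lemma 1.
from collections import defaultdict


def is_eulerian(vertices, I):
    """vertices: iterable of vertex names; I: dict {(vertex, edge): entry}."""
    vertices = set(vertices)
    # Balanced: positive and negative incidences cancel at every vertex.
    total = defaultdict(int)
    for (v, _e), x in I.items():
        total[v] += x
    if any(total[v] != 0 for v in vertices):
        return False
    # Connected: depth-first search on the underlying undirected graph.
    endpoints = defaultdict(set)
    for (v, e), x in I.items():
        if x != 0:
            endpoints[e].add(v)
    adj = defaultdict(set)
    for vs in endpoints.values():
        for a in vs:
            adj[a] |= vs
    if not vertices:
        return True
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == vertices
```

For instance, two vertices joined by one edge that is positive-incident to both and one edge that is negative-incident to both pass the test, and the tour that uses the first edge and then the second is indeed cyclical.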
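Given a bidirected graph $G = (V, E)$, and vectors $l, u, c \in \mathbb{Z}^{|E|}$ and $b \in \mathbb{Z}^{|V|}$, a minimum cost bidirected flow problem [14] is an integer linear program where the goal is to find $x \in \mathbb{Z}^{|E|}$ that minimizes $c \cdot x$ subject to the constraints that $l \le x \le u$ and $I_G x = b$. Here, $\cdot$ refers to the inner product between two vectors, and $\le$ is a component-wise comparison operator.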
3 The String Graph Framework
In [10], Myers introduces a string graph framework for sequence assembly. A string graph is built from an overlap graph through the process of transitively inferable edge reduction: whenever $x$ and $y$ overlap $z$, and $x$ overlaps $y$, the overlap of $x$ to $z$ is said to be inferable from the other two overlaps, and is removed from the graph. Myers demonstrates a fast algorithm for removing transitively inferable edges from the graph, which, in combination with statistical methods, associates a "selection" constraint with each edge. The selection constraint states that the edge must appear in the target genome either at least once (it is required), exactly once (it is exact), or any number of times (it is optional). The key property of string graphs is that any cyclical walk that respects the selection constraints represents a valid assembly of the genome, and the weight of the walk is the length of the assembled genome. After building the string graph, the algorithmic problem is to find a cyclical walk that visits each edge in accordance with its selection constraint. Appealing to parsimony, the goal is to find a walk with minimum weight. In this section, we show that this problem is NP-hard.
Formally, a selection function $s$ is a function that classifies each edge into one of three categories: optional, required, exact. We call a walk which contains all the required edges at least once, all the exact edges exactly once, and all the optional edges any number of times an $s$-walk. The Minimum $s$-Walk Problem (MSWP) for a weighted directed graph $G$ and a selection function $s$ is the problem of finding a minimum weight cyclical $s$-walk of $G$, or reporting that one doesn’t exist.
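The selection constraints are simple to check for a given edge sequence. The sketch below is an illustration under assumptions: the selection function is a dictionary from edges to the strings "optional", "required", or "exact", and the input is assumed to already be a valid walk, so only the edge multiplicities are verified. The name `is_s_walk` is hypothetical.

```python
# Illustrative sketch only: checks multiplicities, not walk validity.
from collections import Counter


def is_s_walk(walk_edges, selection):
    """True if the edge multiset of the walk respects every selection constraint."""
    # A walk may only use edges of the graph.
    if any(e not in selection for e in walk_edges):
        return False
    counts = Counter(walk_edges)
    for e, kind in selection.items():
        if kind == "required" and counts[e] < 1:
            return False
        if kind == "exact" and counts[e] != 1:
            return False
    return True
```

Finding a minimum weight cyclical $s$-walk is the hard part of the problem; verifying a candidate, as done here, takes time linear in the length of the walk plus the number of edges.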
Theorem 1. The Minimum $s$-Walk Problem is NP-hard.
The proof works by reducing the Hamiltonian Cycle problem in directed graphs to MSWP. A cycle is Hamiltonian if it visits every vertex exactly once. The reduction works by splitting each vertex into ’in’ and ’out’ counterparts and adding a required edge between them, while making all other edges optional. Having optional edges is essential for the reduction; if they are not present, the problem can be efficiently solved using a variant of the algorithm of Section 5.1. Also note that in [10] the edges of the string graph are bidirected in order to reflect the double strandedness of DNA. Since directed graphs are a special type of bidirected graphs, Theorem 1 holds for bidirected graphs as well.
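Proof. Let $G = (V, E)$ be a directed graph, with $n$ vertices, for which we wish to find a Hamiltonian cycle. Let $G'$ be a directed graph with vertex set $V' = \{v^{in}, v^{out} : v \in V\}$ and edge set $E' = E_1 \cup E_2$, where $E_1 = \{(u^{out}, v^{in}) : (u, v) \in E\}$ and $E_2 = \{(v^{in}, v^{out}) : v \in V\}$. The weight of each edge is 1. Let $s$ be a selection function on $E'$ that labels all the $E_1$ edges as optional and all the $E_2$ edges as required. We show that $G$ has a Hamiltonian cycle if and only if $G'$ has a cyclical $s$-walk of weight at most $2n$.
First, suppose $v_1, v_2, \ldots, v_n$ is a Hamiltonian cycle of $G$. Then $v_1^{in}, v_1^{out}, v_2^{in}, v_2^{out}, \ldots, v_n^{in}, v_n^{out}$ is a cyclical $s$-walk in $G'$ of weight $2n$. For the other direction, let $W$ be a cyclical $s$-walk in $G'$ of length at most $2n$. Because the $E_2$ edges form a matching and all of them must be in $W$, the edges of $W$ must alternate between $E_1$ and $E_2$ edges, and thus have a total of $n$ edges of each kind. If we remove all the $E_2$ edges from $W$ and map all the vertices of $G'$ to their counterparts in $G$, we get a Hamiltonian cycle of $G$.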
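The vertex-splitting construction used in the proof can be sketched as follows. The tuple encoding of the 'in' and 'out' copies, the function name, and the returned bound of twice the number of vertices (one required and one optional unit-weight edge per vertex of a Hamiltonian cycle) are assumptions made for this illustration.

```python
# Illustrative sketch only: builds the MSWP instance from a directed graph.
def hamiltonian_to_mswp(vertices, edges):
    """Return (new_vertices, weights, selection, bound) for the reduction."""
    new_vertices = [(v, side) for v in vertices for side in ("in", "out")]
    selection = {}
    # One required edge from the 'in' copy to the 'out' copy of each vertex.
    for v in vertices:
        selection[((v, "in"), (v, "out"))] = "required"
    # Each original edge (u, v) becomes an optional edge u_out -> v_in.
    for (u, v) in edges:
        selection[((u, "out"), (v, "in"))] = "optional"
    weights = {e: 1 for e in selection}
    bound = 2 * len(vertices)
    return new_vertices, weights, selection, bound
```

The construction doubles the vertex count and adds one edge per vertex, so it runs in time linear in the size of the input graph.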
4 The de Bruijn Graph Framework
One of the original graph-theoretic frameworks for sequence assembly was proposed by Pevzner, Tang, and Waterman in [12]. They note that by tiling every read by $(k+1)$-mers they can view the read as a walk in a de Bruijn graph, where the vertices are $k$-mers and edges are $(k+1)$-mers. Thus, any walk that contains all the reads as sub-walks represents a valid assembly. Consequently, they formulate the assembly problem as finding the shortest superwalk, a problem closely related to the polynomial time Eulerian tour problem (which was previously used to solve the problem of sequencing by hybridization [11]). What we show in this section is that the de Bruijn graph framework does not make the problem of read assembly more tractable.
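The view of a read as a walk can be sketched in Python as below. The sketch assumes a walk is written as the list of consecutive $k$-mers (vertices) of the string and that a subwalk is a contiguous sublist; it treats walks as linear and ignores wrap-around in cyclical walks. All function names are hypothetical.

```python
# Illustrative sketch only: linear walks, given as lists of k-mer vertices.
def read_to_walk(read, k):
    """The walk spelled by a read: its consecutive k-mers, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def contains_subwalk(walk, sub):
    """True if sub occurs as a contiguous block of walk."""
    n, m = len(walk), len(sub)
    return any(walk[i:i + m] == sub for i in range(n - m + 1))


def is_superwalk(walk, reads, k):
    """True if the walk contains the walk of every read as a subwalk."""
    return all(contains_subwalk(walk, read_to_walk(r, k)) for r in reads)
```

For example, with $k = 3$ the walk of ATTGCC contains the walks of the reads ATTG and TGCC, so it is a superwalk for those two reads.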
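Let $S = \{s_1, \ldots, s_n\}$ be a set of strings over an alphabet $\Sigma$ and let $B^k(S)$ be the de Bruijn graph of $S$ for some $k$. The strings $s_i$ correspond to walks in $B^k(S)$ via the function $w(s) = s[1..k] \to s[2..k+1] \to \cdots \to s[|s|-k+1..|s|]$. A walk is called a superwalk of $B^k(S)$ if, for all $i$, it contains $w(s_i)$ as a subwalk. Thus, a superwalk represents a valid assembly of the reads into a genome. Within this framework, the goal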