nature biotechnology volume 26 number 8 august 2008 897
don’t know the coin used for each set of tosses. However, if we had some way of completing the data (in our case, guessing correctly which coin was used in each of the five sets), then we could reduce parameter estimation for this problem with incomplete data to maximum likelihood estimation with complete data.
One iterative scheme for obtaining completions could work as follows: starting from some initial parameters, θ̂(t) = (θ̂A(t), θ̂B(t)), determine for each of the five sets whether coin A or coin B was more likely to have generated the observed flips (using the current parameter estimates). Then, assume these completions (that is, guessed coin assignments) to be correct, and apply the regular maximum likelihood estimation procedure to get θ̂(t+1). Finally, repeat these two steps until convergence. As the estimated model improves, so too will the quality of the resulting completions.
The expectation maximization algorithm is a refinement on this basic idea. Rather than picking the single most likely completion of the missing coin assignments on each iteration, the expectation maximization algorithm computes probabilities for each possible completion of the missing data, using the current parameters θ̂(t). These probabilities are used to create a weighted training set consisting of all possible completions of the data. Finally, a modified version of maximum likelihood estimation that deals with weighted training examples provides new parameter estimates, θ̂(t+1). By using weighted training examples rather than choosing the single best completion, the expectation maximization algorithm accounts for the confidence of the model in each completion of the data (Fig. 1b).
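The alternation just described can be sketched in a few lines of Python. The function below is our own minimal implementation for the two-coin model, not code from the article; the head counts and starting parameters are taken from Figure 1b. Binomial coefficients cancel in the posterior, so they are omitted.

```python
def em_two_coins(data, theta_A, theta_B, iters=10):
    """EM for the two-coin example. `data` holds (heads, flips) per set;
    the two coins are assumed to be picked with equal probability."""
    for _ in range(iters):
        # E-step: posterior probability that each set came from coin A,
        # accumulated as expected (weighted) head/tail counts per coin.
        hA = tA = hB = tB = 0.0
        for h, n in data:
            like_A = theta_A**h * (1 - theta_A) ** (n - h)
            like_B = theta_B**h * (1 - theta_B) ** (n - h)
            p_A = like_A / (like_A + like_B)
            hA += p_A * h
            tA += p_A * (n - h)
            hB += (1 - p_A) * h
            tB += (1 - p_A) * (n - h)
        # M-step: maximum likelihood re-estimation on the weighted counts.
        theta_A = hA / (hA + tA)
        theta_B = hB / (hB + tB)
    return theta_A, theta_B

# Head counts from Figure 1b (10 tosses per set), starting from the
# figure's initial guess; ten iterations give roughly (0.80, 0.52).
data = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
print(em_two_coins(data, 0.60, 0.50))
```

A single iteration reproduces the figure's intermediate estimates (θ̂A(1) ≈ 0.71, θ̂B(1) ≈ 0.58), illustrating how the weighted counts drive the update.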
In summary, the expectation maximization algorithm alternates between the steps
z = (z1, z2, …, z5), where xi ∈ {0, 1, …, 10} is the number of heads observed during the ith set of tosses, and zi ∈ {A, B} is the identity of the coin used during the ith set of tosses. Parameter estimation in this setting is known as the complete data case in that the values of all relevant random variables in our model (that is, the result of each coin flip and the type of coin used for each flip) are known.
Here, a simple way to estimate θA and θB is to return the observed proportions of heads for each coin:

θ̂A = (# of heads using coin A) / (total # of flips using coin A)    (1)

and

θ̂B = (# of heads using coin B) / (total # of flips using coin B)
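In code, equation (1) is simply a pair of ratios. The sketch below (our own toy complete-data example, with made-up flip sequences rather than the article's figure data) computes them:

```python
def mle_bias(tosses, coins, which):
    """Observed proportion of heads among flips made with one coin --
    the complete-data maximum likelihood estimate of equation (1)."""
    heads = sum(t.count("H") for t, z in zip(tosses, coins) if z == which)
    flips = sum(len(t) for t, z in zip(tosses, coins) if z == which)
    return heads / flips

# Toy complete data: each toss sequence is labeled with the coin used.
tosses = ["HHHHHTHHHH", "HTTTHHTHTH", "HHHHHHHTHH"]
coins = ["A", "B", "A"]
print(mle_bias(tosses, coins, "A"))  # 18/20 = 0.9
print(mle_bias(tosses, coins, "B"))  # 5/10 = 0.5
```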
This intuitive guess is, in fact, known in the statistical literature as maximum likelihood estimation (roughly speaking, the maximum likelihood method assesses the quality of a statistical model based on the probability it assigns to the observed data). If log P(x, z; θ) is the logarithm of the joint probability (or log-likelihood) of obtaining any particular vector of observed head counts x and coin types z, then the formulas in (1) solve for the parameters θ̂ = (θ̂A, θ̂B) that maximize log P(x, z; θ).

Now consider a more challenging variant of the parameter estimation problem in which we are given the recorded head counts x but not the identities z of the coins used for each set of tosses. We refer to z as hidden variables or latent factors. Parameter estimation in this new setting is known as the incomplete data case. This time, computing proportions of heads for each coin is no longer possible, because we
Probabilistic models, such as hidden Markov models or Bayesian networks, are commonly used to model biological data. Much of their popularity can be attributed to the existence of efficient and robust procedures for learning parameters from observations. Often, however, the only data available for training a probabilistic model are incomplete. Missing values can occur, for example, in medical diagnosis, where patient histories generally include results from a limited battery of tests. Alternatively, in gene expression clustering, incomplete data arise from the intentional omission of gene-to-cluster assignments in the probabilistic model. The expectation maximization algorithm enables parameter estimation in probabilistic models with incomplete data.

A coin-flipping experiment
As an example, consider a simple coin-flipping experiment in which we are given a pair of coins A and B of unknown biases, θA and θB, respectively (that is, on any given flip, coin A will land on heads with probability θA and tails with probability 1 – θA, and similarly for coin B). Our goal is to estimate θ = (θA, θB) by repeating the following procedure five times: randomly choose one of the two coins (with equal probability), and perform ten independent coin tosses with the selected coin. Thus, the entire procedure involves a total of 50 coin tosses (Fig. 1a).
During our experiment, suppose that we keep track of two vectors x = (x1, x2, …, x5) and
What is the expectation maximization algorithm?
Chuong B Do & Serafim Batzoglou
The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work?
Chuong B. Do and Serafim Batzoglou are in the Computer Science Department, Stanford University, 318 Campus Drive, Stanford, California 94305-5428, USA. e-mail: chuong@cs.stanford.edu
Primer
©2008 Nature Publishing Group http://www.nature.com/naturebiotechnology
log probability log P(x; θ) of the observed data. Generally speaking, the optimization problem addressed by the expectation maximization algorithm is more difficult than the optimization used in maximum likelihood estimation. In the complete data case, the objective function log P(x, z; θ) has a single global optimum, which can often be found in closed form (e.g., equation 1). In contrast, in the incomplete data case the function log P(x; θ) has multiple local maxima and no closed form solution.
To deal with this, the expectation maximization algorithm reduces the difficult task of optimizing log P(x; θ) into a sequence of simpler optimization subproblems, whose objective functions have unique global maxima that can often be computed in closed form. These subproblems are chosen in a way that guarantees that their corresponding solutions θ̂(1), θ̂(2), … will converge to a local optimum of log P(x; θ).

More specifically, the expectation maximization algorithm alternates between two phases. During the E-step, expectation maximization chooses a function gt that lower bounds log P(x; θ) everywhere, and for which gt(θ̂(t)) = log P(x; θ̂(t)). During the M-step, the expectation maximization algorithm moves to a new parameter set θ̂(t+1) that maximizes gt. As the value of the lower bound gt matches the objective function at θ̂(t), it follows that log P(x; θ̂(t)) = gt(θ̂(t)) ≤ gt(θ̂(t+1)) ≤ log P(x; θ̂(t+1)), so the objective function monotonically increases during each iteration of expectation maximization! A graphical illustration of this argument is provided in Supplementary Figure 1 online, and a concise mathematical derivation of the expectation maximization algorithm is given in Supplementary Note 1 online.
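For completeness, the standard construction of such a lower bound (the derivation sketched in Supplementary Note 1; the notation here is ours) chooses gt from the posterior over completions and applies Jensen's inequality:

```latex
% Choose Q_t(z) = P(z \mid x; \hat\theta^{(t)}), the posterior over
% completions under the current parameters, and define
g_t(\theta) \;=\; \sum_z Q_t(z) \log \frac{P(x, z; \theta)}{Q_t(z)}.
% By Jensen's inequality (log is concave),
g_t(\theta) \;\le\; \log \sum_z Q_t(z)\,\frac{P(x, z; \theta)}{Q_t(z)}
            \;=\; \log P(x; \theta),
% with equality at \theta = \hat\theta^{(t)}, where the ratio
% P(x, z; \hat\theta^{(t)}) / Q_t(z) = P(x; \hat\theta^{(t)})
% is constant in z, so g_t(\hat\theta^{(t)}) = \log P(x; \hat\theta^{(t)}).
```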
As with most optimization methods for nonconcave functions, the expectation maximization algorithm comes with guarantees only of convergence to a local maximum of the objective function (except in degenerate cases). Running the procedure using multiple initial starting parameters is often helpful; similarly, initializing parameters in a way that breaks symmetry in models is also important. With this limited set of tricks, the expectation maximization algorithm provides a simple and robust tool for parameter estimation in models with incomplete data. In theory, other numerical optimization techniques, such as gradient descent or Newton-Raphson, could be used instead of expectation maximization; in practice, however, expectation maximization has the advantage of being simple, robust and easy to implement.
Applications
Many probabilistic models in computational biology include latent variables. In some
was analyzed more generally by Hartley2 and by Baum et al.3 in the context of hidden Markov models, where it is commonly known as the Baum-Welch algorithm. The standard reference on the expectation maximization algorithm and its convergence is Dempster et al.4.

Mathematical foundations
How does the expectation maximization algorithm work? More importantly, why is it even necessary?
The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete data case. In particular, expectation maximization attempts to find the parameters θ̂ that maximize the
of guessing a probability distribution over completions of missing data given the current model (known as the E-step) and then re-estimating the model parameters using these completions (known as the M-step). The name ‘E-step’ comes from the fact that one does not usually need to form the probability distribution over completions explicitly, but rather need only compute ‘expected’ sufficient statistics over these completions. Similarly, the name ‘M-step’ comes from the fact that model re-estimation can be thought of as ‘maximization’ of the expected log-likelihood of the data.
Introduced as early as 1950 by Ceppellini et al.1 in the context of gene frequency estimation, the expectation maximization algorithm
[Figure 1 (image not reproduced). Panel a: five sets of ten coin tosses with known coin identities; pooling the counts gives 24 H, 6 T for coin A and 9 H, 11 T for coin B, so maximum likelihood yields θ̂A = 24/(24 + 6) = 0.80 and θ̂B = 9/(9 + 11) = 0.45. Panel b: the same head counts (5 H, 5 T; 9 H, 1 T; 8 H, 2 T; 4 H, 6 T; 7 H, 3 T) with coin identities hidden. Starting from θ̂A(0) = 0.60 and θ̂B(0) = 0.50, the E-step assigns each set a posterior probability of coming from coin A (0.45, 0.80, 0.73, 0.35, 0.65), giving expected counts ≈ 21.3 H, 8.6 T for coin A and ≈ 11.7 H, 8.4 T for coin B; the M-step then yields θ̂A(1) ≈ 0.71 and θ̂B(1) ≈ 0.58, and after ten iterations the estimates converge to θ̂A(10) ≈ 0.80 and θ̂B(10) ≈ 0.52.]
Figure 1 Parameter estimation for complete and incomplete data. (a) Maximum likelihood estimation. For each set of ten tosses, the maximum likelihood procedure accumulates the counts of heads and tails for coins A and B separately. These counts are then used to estimate the coin biases. (b) Expectation maximization. 1. EM starts with an initial guess of the parameters. 2. In the E-step, a probability distribution over possible completions is computed using the current parameters. The counts shown in the table are the expected numbers of heads and tails according to this distribution. 3. In the M-step, new parameters are determined using the current completions. 4. After several repetitions of the E-step and M-step, the algorithm converges.
transcriptional modules10, tests of linkage disequilibrium11, protein identification12 and medical imaging13.
In each case, expectation maximization provides a simple, easy-to-implement and efficient tool for learning the parameters of a model; once the parameters are known, we can use probabilistic inference to ask interesting queries about the model. For example, what cluster does a particular gene most likely belong to? What is the most likely starting location of a motif in a particular sequence? What are the most likely haplotype blocks making up the genotype of a specific individual? By providing a straightforward mechanism for parameter learning in all of these models, expectation maximization provides a mechanism for building and training rich probabilistic models for biological applications.
Note: Supplementary information is available on the Nature Biotechnology website.
ACKNOWLEDGMENTS
C.B.D. is supported in part by a National Science Foundation (NSF) Graduate Fellowship. S.B. wishes to acknowledge support from an NSF CAREER Award. We thank four anonymous referees for helpful suggestions.
1. Ceppellini, R., Siniscalco, M. & Smith, C.A. Ann. Hum. Genet. 20, 97–115 (1955).
2. Hartley, H. Biometrics 14, 174–194 (1958).
3. Baum, L.E., Petrie, T., Soules, G. & Weiss, N. Ann. Math. Stat. 41, 164–171 (1970).
4. Dempster, A.P., Laird, N.M. & Rubin, D.B. J. R. Stat. Soc. Ser. B 39, 1–38 (1977).
5. D'haeseleer, P. Nat. Biotechnol. 23, 1499–1501 (2005).
6. Lawrence, C.E. & Reilly, A.A. Proteins 7, 41–51 (1990).
7. Excoffier, L. & Slatkin, M. Mol. Biol. Evol. 12, 921–927 (1995).
8. Krogh, A., Brown, M., Mian, I.S., Sjölander, K. & Haussler, D. J. Mol. Biol. 235, 1501–1543 (1994).
9. Eddy, S.R. & Durbin, R. Nucleic Acids Res. 22, 2079–2088 (1994).
10. Segal, E., Yelensky, R. & Koller, D. Bioinformatics 19, i273–i282 (2003).
11. Slatkin, M. & Excoffier, L. Heredity 76, 377–383 (1996).
12. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646–4658 (2003).
13. De Pierro, A.R. IEEE Trans. Med. Imaging 14, 132–137 (1995).
and the remaining letters in each sequence as coming from some fixed background distribution. The observed data x consist of the letters of the sequences, the unobserved latent factors z include the starting position of the motif in each sequence and the parameters θ describe the position-specific letter frequencies for the motif. Here, the expectation maximization algorithm involves computing the probability distribution over motif start positions for each sequence (E-step) and updating the motif letter frequencies based on the expected letter counts for each position in the motif (M-step).
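A heavily simplified sketch of this E-step/M-step loop is given below. It is our own toy illustration, not the algorithm of reference 6: it assumes a uniform background, exactly one motif occurrence per sequence, and uses pseudocounts plus a symmetry-breaking seed of our own choosing.

```python
import math

def em_motif(seqs, W, iters=50, pseudo=0.5):
    """Toy EM motif finder over the DNA alphabet. Assumes a uniform
    background and one motif occurrence of length W per sequence."""
    alphabet = "ACGT"
    bg = 0.25  # uniform background letter frequency
    # Symmetry-breaking initialization: seed the position-specific
    # frequencies from the first window of the first sequence.
    freqs = []
    for j in range(W):
        col = {a: pseudo for a in alphabet}
        col[seqs[0][j]] += 1.0
        total = sum(col.values())
        freqs.append({a: c / total for a, c in col.items()})
    for _ in range(iters):
        counts = [{a: pseudo for a in alphabet} for _ in range(W)]
        for s in seqs:
            # E-step: posterior over motif start positions in s,
            # via per-column likelihood ratios against the background.
            weights = [
                math.prod(freqs[j][s[i + j]] / bg for j in range(W))
                for i in range(len(s) - W + 1)
            ]
            z = sum(weights)
            # Accumulate expected letter counts for each motif column.
            for i, w in enumerate(weights):
                for j in range(W):
                    counts[j][s[i + j]] += w / z
        # M-step: renormalize expected counts into frequencies.
        freqs = [
            {a: c / sum(col.values()) for a, c in col.items()}
            for col in counts
        ]
    return freqs

seqs = ["TTACGTTT", "GACGTGGG", "CCCACGTA"]
freqs = em_motif(seqs, W=4)
consensus = "".join(max(col, key=col.get) for col in freqs)
```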
In the haplotype inference problem7, we are given the unphased genotypes of individuals from some population, where each unphased genotype consists of unordered pairs of single-nucleotide polymorphisms (SNPs) taken from homologous chromosomes of the individual. Contiguous blocks of SNPs inherited from a single chromosome are called haplotypes. Assuming for simplicity that each individual's genotype is a combination of two haplotypes (one maternal and one paternal), the goal of haplotype inference is to determine a small set of haplotypes that best explain all of the unphased genotypes observed in the population. Here, the observed data x are the known unphased genotypes for each individual, the unobserved latent factors z are putative assignments of unphased genotypes to pairs of haplotypes and the parameters θ describe the frequencies of each haplotype in the population. The expectation maximization algorithm alternates between using the current haplotype frequencies to estimate probability distributions over phasing assignments for each unphased genotype (E-step) and using the expected phasing assignments to refine estimates of haplotype frequencies (M-step).

Other problems in which the expectation maximization algorithm plays a prominent role include learning profiles of protein domains8 and RNA families9, discovery of
cases, these latent variables are present due to missing or corrupted data; in most applications of expectation maximization to computational biology, however, the latent factors are intentionally included, and parameter learning itself provides a mechanism for knowledge discovery.
For instance, in gene expression clustering5, we are given microarray gene expression measurements for thousands of genes under varying conditions, and our goal is to group the observed expression vectors into distinct clusters of related genes. One approach is to model the vector of expression measurements for each gene as being sampled from a multivariate Gaussian distribution (a generalization of a standard Gaussian distribution to multiple correlated variables) associated with that gene's cluster. In this case, the observed data x correspond to microarray measurements, the unobserved latent factors z are the assignments of genes to clusters, and the parameters θ include the means and covariance matrices of the multivariate Gaussian distributions representing the expression patterns for each cluster. For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariances based on the set of genes predominantly belonging to that cluster (M-step). This can be thought of as a ‘soft’ variant of the popular k-means clustering algorithm, in which one alternates between ‘hard’ assignments of genes to clusters and re-estimation of cluster means based on their assigned genes.
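The ‘soft k-means’ alternation can be illustrated with a deliberately stripped-down version of this model: one dimension, two clusters, equal mixing weights and a fixed shared variance (all simplifications and data are ours, not the article's):

```python
import math

def em_soft_kmeans(xs, mu1, mu2, sigma=1.0, iters=50):
    """EM for a 1-D mixture of two equal-weight Gaussians with fixed,
    shared variance -- a 'soft' analog of two-means clustering."""
    for _ in range(iters):
        # E-step: responsibility of cluster 1 for each observation
        # (normalization constants of the Gaussians cancel).
        r = []
        for x in xs:
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma**2))
            p2 = math.exp(-((x - mu2) ** 2) / (2 * sigma**2))
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted cluster means.
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2

# Two well-separated groups of toy "expression levels".
xs = [-0.3, 0.1, 0.2, -0.1, 4.8, 5.1, 5.3, 4.9]
print(em_soft_kmeans(xs, 1.0, 4.0))  # ≈ (-0.025, 5.025)
```

Because the groups are well separated, the responsibilities saturate near 0 or 1 and the result essentially matches hard k-means; with overlapping clusters the two methods diverge.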
In motif finding6, we are given a set of unaligned DNA sequences and asked to identify a pattern of length W that is present (though possibly with minor variations) in every sequence from the set. To apply the expectation maximization algorithm, we model the instance of the motif in each sequence as having each letter sampled independently from a position-specific distribution over letters,