The Expectation-Maximization (EM) algorithm is a special case of MLE where the observations (data samples $\mathbf{x}^{(i)}$) are inherently related to some hidden variables ($z$). First of all, we need to review the basics of MLE.
Maximum Likelihood Estimation (MLE)
Let $X = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}\}$ be a set of independent and identically distributed observations, and let $\theta$ be the parameters of the data distribution, which are unknown to us. The maximum likelihood estimate of the parameters is the $\theta$ that maximizes the joint distribution:
$$\hat{\theta} = \arg\max_{\theta} P(X;\theta) = \arg\max_{\theta} \prod_{i=1}^{n} P(\mathbf{x}^{(i)};\theta)$$
More commonly, we choose to maximize the joint log-likelihood:
$$\hat{\theta} = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P(\mathbf{x}^{(i)};\theta)$$
We use an example to illustrate how it works (adapted from a detailed explanation of the EM algorithm on Zhihu).
Suppose that we have a coin A whose probability of heads is $\theta$. We denote one observation as tossing coin A 5 times and recording heads (1) or tails (0) for each toss. For example, $\mathbf{x}^{(i)}$ can be 01001, 01110, 10010, etc. The likelihood of one observation is:
$$P(\mathbf{x}^{(i)};\theta) = \prod_{j=1}^{5} \theta^{x_{i,j}} (1-\theta)^{1-x_{i,j}}$$
Therefore, the log-likelihood of the joint distribution of $n$ observations is:
$$l(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{5} \Big[ x_{i,j}\log\theta + (1-x_{i,j})\log(1-\theta) \Big]$$
The MLE of $\theta$ is
$$\hat{\theta} = \arg\max_{\theta} l(\theta)$$
To get $\hat{\theta}$ we can solve the equation $\frac{\partial l(\theta)}{\partial \theta} = 0$. Therefore, we have
$$\hat{\theta} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{5} x_{i,j}}{5n}$$
This is actually equivalent to computing the average value of all tossing results. For example, if we have 10 observations as below:
01011  01110  
01111  01110  
11011  11011  
00011  00100  
01010  01001 
The sum of all tosses is 28, and the total number of tosses is 50, so the MLE of $\theta$ is $\hat{\theta} = 28/50 = 0.56$.
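As a quick sanity check, this estimate can be computed directly from the ten observations above (a small illustrative script, not part of the original derivation):

```python
import numpy as np

# The 10 observations from the table above, each 5 tosses of coin A
obs = ["01011", "01110", "01111", "01110", "11011",
       "11011", "00011", "00100", "01010", "01001"]
X = np.array([[int(c) for c in o] for o in obs])

# MLE of theta: total number of heads / total number of tosses
theta_hat = X.sum() / X.size
print(theta_hat)  # 28 heads out of 50 tosses -> 0.56
```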
MLE with hidden variables
Now things become more complicated. Suppose we have two coins, A and B, whose probabilities of heads are $\theta_A$ and $\theta_B$ respectively. We want to find the MLE of $\theta = \{\theta_A, \theta_B\}$ using observations $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$. Each observation has the same form as above. The challenging part is that for each observation $\mathbf{x}^{(i)}$, we don't know which coin it comes from. For example, with $n = 10$, the observation set is the same as the table above. In this case, how do we find the MLE of $\theta_A$ and $\theta_B$?
This is a simple example where our observations are closely related to some hidden (unknown) variables. In other words, the information of the data is incomplete. The Expectation-Maximization algorithm can be used to solve such problems.
Expectation-Maximization (EM) Algorithm
Before introducing the EM algorithm, we need to know an important inequality: Jensen's inequality.
Jensen's Inequality
If a function $f$ is strictly convex (its Hessian matrix is positive definite) and $X$ is a random variable, we have
$$E[f(X)] \geq f(E[X])$$
The equality holds if and only if $X = E[X]$ with probability 1 (i.e., $X$ is a constant). Note that if $f$ is strictly concave, the direction of the inequality is reversed.
We can use an example to illustrate Jensen's inequality more intuitively (this is not a proof). The example is adapted from Andrew Ng's lecture notes on EM. As shown in this figure, the random variable $X$ has only two possible values, $a$ and $b$, each with probability 0.5. Therefore, $E[X] = \frac{a+b}{2}$ and $E[f(X)] = \frac{f(a)+f(b)}{2}$. According to the convexity of the function $f$, we have $\frac{f(a)+f(b)}{2} \geq f\big(\frac{a+b}{2}\big)$, and the equality holds if and only if $a = b$, which means $X$ is a constant.
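A tiny numeric check of this two-point example, taking the strictly convex function $f(x)=x^2$ (the values of $a$ and $b$ are arbitrary choices for illustration):

```python
# f(x) = x**2 is strictly convex; X takes values a and b with prob. 0.5 each
f = lambda x: x ** 2
a, b = 1.0, 3.0
E_X = 0.5 * a + 0.5 * b          # E[X] = 2.0
E_fX = 0.5 * f(a) + 0.5 * f(b)   # E[f(X)] = 5.0
assert E_fX >= f(E_X)            # Jensen: E[f(X)] >= f(E[X]), here 5 >= 4
```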
Now we have a powerful tool, and we will use it to deduce the EM algorithm.
EM algorithm
Recall the MLE problem:
$$\hat{\theta} = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P(\mathbf{x}^{(i)};\theta)$$
If $\mathbf{x}$ is related with a latent variable $z$, we write $P(\mathbf{x}^{(i)};\theta)$ as the marginal of the joint distribution:
$$P(\mathbf{x}^{(i)};\theta) = \sum_{z} P(\mathbf{x}^{(i)}, z;\theta)$$
Now our log-likelihood function becomes:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z} P(\mathbf{x}^{(i)}, z;\theta)$$
The explicit expression of the joint distribution is not known to us. To solve this maximization problem we introduce a distribution $Q^{(i)}(z)$ over $z$, and rewrite the log-likelihood function as:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z} Q^{(i)}(z) \frac{P(\mathbf{x}^{(i)}, z;\theta)}{Q^{(i)}(z)}$$
We know that the function $\log(\cdot)$ is strictly concave, so according to Jensen's inequality, we have the following inequality:
$$l(\theta) \geq \sum_{i=1}^{n} \sum_{z} Q^{(i)}(z) \log \frac{P(\mathbf{x}^{(i)}, z;\theta)}{Q^{(i)}(z)} \triangleq L(\theta, Q^{(i)})$$
The equality holds if and only if $\frac{P(\mathbf{x}^{(i)}, z;\theta)}{Q^{(i)}(z)}$ is a constant (with respect to $z$). To achieve this we can set
$$Q^{(i)}(z) = \frac{P(\mathbf{x}^{(i)}, z;\theta)}{\sum_{z} P(\mathbf{x}^{(i)}, z;\theta)} = P(z \vert \mathbf{x}^{(i)};\theta)$$
This means $Q^{(i)}(z)$ is the posterior of $z$ given $\mathbf{x}^{(i)}$.
One may ask: why do we try to find a proper $Q^{(i)}(z)$ to make the equality hold?
Our initial goal is to find the MLE of $\theta$ w.r.t. $l(\theta)$. However, the original expression of $l(\theta)$ is not explicit, so we take advantage of Jensen's inequality: by selecting a proper distribution $Q^{(i)}(z)$, we make $l(\theta) = L(\theta, Q^{(i)})$. Then we can instead maximize $L(\theta, Q^{(i)})$ w.r.t. $\theta$ to find the MLE in an iterative way.
Now, we can summarize the EM algorithm:
- Heuristically (or randomly) initialize the parameters $\theta_0$.
- Repeat for $l = 1, 2, \dots$:
  - Expectation (E) step: based on the current parameters $\theta_{l-1}$ and $X$, set $Q^{(i)}_{l}(z) = P(z \vert \mathbf{x}^{(i)}; \theta_{l-1})$.
  - Maximization (M) step: update $\theta_{l} = \arg\max_{\theta} L(\theta, Q^{(i)}_{l})$.
- Until: $l(\theta)$ converges.
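The loop above can be sketched generically as follows; this is a hedged sketch in which `e_step`, `m_step`, and `log_likelihood` are placeholder callables for the problem-specific computations, not functions from the original post:

```python
def em(theta, e_step, m_step, log_likelihood, tol=1e-8, max_iter=1000):
    """Generic EM loop: alternate E and M steps until the
    log-likelihood stops improving by more than tol."""
    ll = log_likelihood(theta)
    for _ in range(max_iter):
        Q = e_step(theta)        # E step: posterior of z under current theta
        theta = m_step(Q)        # M step: maximize L(theta, Q) over theta
        new_ll = log_likelihood(theta)
        if abs(new_ll - ll) < tol:
            break
        ll = new_ll
    return theta
```

A concrete E/M pair for the two-coin problem is implemented later in this post.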
In Andrew Ng's lecture notes, it is proven that choosing $Q^{(i)}_{l}(z) = P(z \vert \mathbf{x}^{(i)}; \theta_{l-1})$ guarantees that $l(\theta)$ steadily increases. Suppose that after $l$ iterations we have the log-likelihood $l(\theta_l)$. At the $(l+1)$-th iteration, after the E step, we have $l(\theta_l) = L(\theta_l, Q^{(i)}_{l+1})$; after the M step, the updated $\theta_{l+1}$ is selected such that $L(\theta_{l+1}, Q^{(i)}_{l+1}) \geq L(\theta_l, Q^{(i)}_{l+1})$. Moreover, by Jensen's inequality, $l(\theta_{l+1}) \geq L(\theta_{l+1}, Q^{(i)}_{l+1})$ holds for any choice of $Q$. Therefore, we have
$$l(\theta_{l+1}) \geq L(\theta_{l+1}, Q^{(i)}_{l+1}) \geq L(\theta_l, Q^{(i)}_{l+1}) = l(\theta_l)$$
So we have $l(\theta_{l+1}) \geq l(\theta_l)$. This guarantees that the overall log-likelihood can only keep increasing or stay unchanged, but never decrease.
This deduction shows that the EM algorithm heads in the right direction. However, this direction may not be the ideal one. If $l(\theta)$ is globally concave, the EM algorithm always converges to the global optimum. If $l(\theta)$ is not globally concave, the property $l(\theta_{l+1}) \geq l(\theta_l)$ guarantees that the EM algorithm will converge to some point (assuming $l(\theta)$ is bounded above), but the convergence point may not be the global optimum.
Moreover, the EM algorithm is sensitive to initialization. Different initializations may result in quite different convergence points, as shown in the figure below: starting from one initial point, EM converges to the global optimum, while starting from another it only converges to a local optimum.
So how can we make the EM algorithm less sensitive to initialization and more likely to find the global optimum? One simple, straightforward, but effective way is to randomly initialize the parameters, run the EM algorithm multiple times, and choose the parameters with the largest converged log-likelihood (objective function).
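A minimal sketch of this restart strategy; here `run_em` is a hypothetical callable (not part of the original code) that runs one EM trial from a random initialization and returns the estimated parameters together with the converged log-likelihood:

```python
def best_of_restarts(run_em, n_restarts=10):
    """Run EM several times and keep the trial with the
    largest converged log-likelihood."""
    best_params, best_ll = None, float("-inf")
    for _ in range(n_restarts):
        params, ll = run_em()  # one trial: (params, converged log-likelihood)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll
```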
Apply EM algorithm to practical questions
Tossing two coins with different heads probability
Let's recall the question raised in the first section:
Suppose we have two coins, A and B, whose probabilities of heads are $\theta_A$ and $\theta_B$ respectively. We want to find the MLE of $\theta = \{\theta_A, \theta_B\}$ using observations $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$. Each observation has dimension $d$, which means $d$ tosses per observation. The challenging part is that for each observation $\mathbf{x}^{(i)}$, we don't know which coin it comes from. In this case, how do we find the MLE of $\theta_A$ and $\theta_B$?
In this case, $\mathbf{x}$ is related with a hidden variable $z$. $z$ can only take 2 values: $z = A$ for coin A and $z = B$ for coin B. We want to apply the EM algorithm to this case.
- Randomly initialize $\theta_{A,0}$ and $\theta_{B,0}$, and initialize the prior distribution of $z$ as $P(z=A) = 0.5$, $P(z=B) = 0.5$.
Note that the choice of the prior distribution of $z$ strongly influences the final learned parameters. If the chosen prior is quite different from the real prior, the estimated parameters will be inaccurate. To solve this problem, we can update the prior by setting the prior of the current iteration to the posterior of the previous iteration, averaged over all observations. This is commonly used when data comes as a sequence.
- Repeat, at the $l$-th iteration:
  - E step: we need to compute the posterior:
$$Q^{(i)}_l(z) = P(z \vert \mathbf{x}^{(i)}; \theta_{l-1}) = \frac{P(\mathbf{x}^{(i)} \vert z; \theta_{l-1})\, P(z)}{P(\mathbf{x}^{(i)})}$$
Here $\theta_{l-1}$ represents the parameter set $\{\theta_{A,l-1}, \theta_{B,l-1}\}$. Furthermore, we know that $P(\mathbf{x}^{(i)} \vert z; \theta_{l-1}) = \prod_{j=1}^{d} \theta_{z,l-1}^{x_{i,j}} (1-\theta_{z,l-1})^{1-x_{i,j}}$. Moreover, we have the priors $P(z=A) = 0.5$ and $P(z=B) = 0.5$. Therefore, we have
$$\begin{align} & Q^{(i)}_{A,l} = Q^{(i)}_{l}(z=A)\\ =& \frac{P(\mathbf{x}^{(i)} \vert z=A; \theta_{l-1})\, P(z=A)}{P(\mathbf{x}^{(i)} \vert z=A; \theta_{l-1})\, P(z=A) + P(\mathbf{x}^{(i)} \vert z=B; \theta_{l-1})\, P(z=B)}\\ =& \frac{P(\mathbf{x}^{(i)} \vert z=A; \theta_{l-1})}{P(\mathbf{x}^{(i)} \vert z=A; \theta_{l-1}) + P(\mathbf{x}^{(i)} \vert z=B; \theta_{l-1})} \end{align}$$
and
$$\begin{align} & Q^{(i)}_{B,l} = Q^{(i)}_{l}(z=B)\\ =& \frac{P(\mathbf{x}^{(i)} \vert z=B; \theta_{l-1})}{P(\mathbf{x}^{(i)} \vert z=A; \theta_{l-1}) + P(\mathbf{x}^{(i)} \vert z=B; \theta_{l-1})} \end{align}$$
  - M step: the objective function is
$$\begin{align} & L(\theta, Q^{(i)}_l)\\ =& \sum_{i=1}^{n} \sum_z Q^{(i)}_l(z) \log \frac{P(\mathbf{x}^{(i)}, z;\theta)}{Q^{(i)}_{l}(z)}\\ =& \sum_{i=1}^{n}\Big( Q_l^{(i)}(z=A)\log \frac{P(\mathbf{x}^{(i)} \vert z=A; \theta)\, P(z=A)}{Q_l^{(i)}(z=A)} + Q_l^{(i)}(z=B)\log \frac{P(\mathbf{x}^{(i)} \vert z=B; \theta)\, P(z=B)}{Q_l^{(i)}(z=B)}\Big)\\ =& \sum_{i=1}^{n} \sum_{j=1}^{d}\Big[ Q_{A,l}^{(i)}\Big(x_{i,j}\log \theta_{A} + (1-x_{i,j})\log (1-\theta_{A})\Big) + Q_{B,l}^{(i)}\Big(x_{i,j}\log \theta_{B} + (1-x_{i,j})\log (1-\theta_{B})\Big)\Big] + C \end{align}$$
where $C$ is a term which is not related to $\theta_A$ or $\theta_B$. To update $\theta_A$ and $\theta_B$, we compute the partial derivatives and set them to 0:
$$\begin{align} \frac{\partial L(\theta, Q_l^{(i)})}{\partial \theta_{A}} =& \frac{\sum_{i=1}^{n}\sum_{j=1}^{d} Q_{A,l}^{(i)} x_{i,j}}{\theta_{A}} + \frac{\sum_{i=1}^{n}\sum_{j=1}^{d} Q_{A,l}^{(i)} (x_{i,j}-1)}{1-\theta_{A}} = 0\\ \frac{\partial L(\theta, Q_l^{(i)})}{\partial \theta_{B}} =& \frac{\sum_{i=1}^{n}\sum_{j=1}^{d} Q_{B,l}^{(i)} x_{i,j}}{\theta_{B}} + \frac{\sum_{i=1}^{n}\sum_{j=1}^{d} Q_{B,l}^{(i)} (x_{i,j}-1)}{1-\theta_{B}} = 0 \end{align}$$
By solving these two equations, we get the update rule:
$$\theta_{A,l} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{d} Q_{A,l}^{(i)} x_{i,j}}{\sum_{i=1}^{n} Q_{A,l}^{(i)} d}, \qquad \theta_{B,l} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{d} Q_{B,l}^{(i)} x_{i,j}}{\sum_{i=1}^{n} Q_{B,l}^{(i)} d}$$
  - Update prior (if we choose to dynamically update it): $P(z=A) \leftarrow \frac{1}{n}\sum_{i=1}^{n} Q^{(i)}_{A,l}$, $P(z=B) \leftarrow \frac{1}{n}\sum_{i=1}^{n} Q^{(i)}_{B,l}$.
- Until $l(\theta)$ converges.
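To make the E step concrete, here is a single-observation posterior computed in a few lines; the observation and the current parameter values are arbitrary choices for illustration:

```python
import numpy as np

# One observation of d = 5 tosses, plus assumed current parameters
x = np.array([0, 1, 0, 0, 1])
theta_A, theta_B = 0.2, 0.7
prior_A = prior_B = 0.5

# Likelihood of x under each coin: theta^(#heads) * (1-theta)^(#tails)
p_x_A = theta_A ** x.sum() * (1 - theta_A) ** (x.size - x.sum())
p_x_B = theta_B ** x.sum() * (1 - theta_B) ** (x.size - x.sum())

# Posterior via Bayes' rule; the equal priors cancel as in the derivation above
Q_A = p_x_A * prior_A / (p_x_A * prior_A + p_x_B * prior_B)
Q_B = 1 - Q_A
print(Q_A)  # the 2-heads observation is more likely to come from coin A
```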
Implementation and Analysis
To test the effectiveness of the EM algorithm, I wrote a small demo for the coin tossing problem:
```python
import numpy as np

## Define a tossing function to generate our observations.
## theta is the heads probability; num is the number of tosses for a single observation.
def tossing(theta, num):
    return (np.random.uniform(size=num) < theta).astype(np.int32)

## load_data is used to generate a set of observations.
## prior_coin_A is the prior of the hidden variable z;
## theta_A, theta_B are the heads probabilities of coins A and B respectively.
## This method returns a dataset X, without any explicit information about
## prior_coin_A, theta_A, or theta_B.
def load_data(num_samples, prior_coin_A=0.8, theta_A=0.2, theta_B=0.7, num_tossing_per_sample=5):
    X = []
    for _ in range(num_samples):
        random_v = np.random.uniform()
        if random_v < prior_coin_A:
            ## generate a tossing observation using coin A
            X.append(tossing(theta_A, num_tossing_per_sample))
        else:
            ## generate a tossing observation using coin B
            X.append(tossing(theta_B, num_tossing_per_sample))
    X = np.asarray(X)
    return X

## The task of EM is to find the MLE of theta_A, theta_B using only the observations X.
def EM(X, epsilon=1e-8, update_prior=True, is_return_prior_list=False):
    ## initialization
    prior_coin_A = 0.5
    prior_coin_B = 1 - prior_coin_A
    theta_A = np.random.uniform()
    theta_B = np.random.uniform()
    prior_coin_A_list = [prior_coin_A]
    prev_theta_A = theta_A
    prev_theta_B = theta_B
    while True:
        ## E step: posterior Q(z) for each observation
        P_X_with_z_eq_A = theta_A**(np.sum(X, axis=1)) * (1 - theta_A)**(np.sum(1 - X, axis=1))
        P_X_with_z_eq_B = theta_B**(np.sum(X, axis=1)) * (1 - theta_B)**(np.sum(1 - X, axis=1))
        Q_A = P_X_with_z_eq_A * prior_coin_A / (P_X_with_z_eq_A * prior_coin_A + P_X_with_z_eq_B * prior_coin_B)
        Q_B = P_X_with_z_eq_B * prior_coin_B / (P_X_with_z_eq_A * prior_coin_A + P_X_with_z_eq_B * prior_coin_B)
        ## M step: closed-form update of theta_A, theta_B
        theta_A = np.sum(Q_A * np.sum(X, axis=1)) / np.sum(X.shape[1] * Q_A)
        theta_B = np.sum(Q_B * np.sum(X, axis=1)) / np.sum(X.shape[1] * Q_B)
        if abs(theta_A - prev_theta_A) + abs(theta_B - prev_theta_B) < epsilon:
            break
        prev_theta_A = theta_A
        prev_theta_B = theta_B
        ## update the prior with the averaged posterior
        if update_prior:
            prior_coin_A = np.mean(Q_A)
            prior_coin_B = np.mean(Q_B)
            prior_coin_A_list.append(prior_coin_A)
    if is_return_prior_list:
        return theta_A, theta_B, {"prior_coin_A_list": np.array(prior_coin_A_list),
                                  "prior_coin_B_list": 1 - np.array(prior_coin_A_list)}
    else:
        return theta_A, theta_B
```
First, let's load the coin tossing data. The true prior distribution of $z$ is $P(z=A)=0.7$ and $P(z=B)=0.3$. For coin A, the true heads probability is 0.2; for coin B, the true heads probability is 0.7. Each observation consists of 10 tossing results.
```python
true_prior_coin_A = 0.7
true_theta_A = 0.2
true_theta_B = 0.7
X = load_data(1000, prior_coin_A=true_prior_coin_A, theta_A=true_theta_A, theta_B=true_theta_B, num_tossing_per_sample=10)
```
We can have a look at the loaded data (the first 10 observations):

```python
print(X[:10])
```

```
[[0 1 1 1 1 1 1 1 1 0]
 [1 1 1 1 1 1 1 1 0 1]
 [0 1 1 0 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 1 1 0 0 1 1 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 1 0 0 1 0 0 1 1 1]
 [1 1 0 1 0 1 0 0 0 1]
 [0 0 0 0 0 0 0 1 0 1]]
```
The influence of dynamically updating the prior (or not)
EM algorithm with dynamically updated prior distribution of $z$
```python
estimated_theta_A, estimated_theta_B, params = EM(X, update_prior=True, is_return_prior_list=True)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f" % (estimated_theta_A, estimated_theta_B))
```

```
Estimated theta_A: 0.1923, Estimated theta_B: 0.7053
```
Wow, the estimated theta_A is almost equal to the true theta_A (0.2), and the same holds for the estimated theta_B. Note that the EM output may sometimes be "Estimated theta_A: 0.7, Estimated theta_B: 0.2". This is OK, because EM doesn't know that estimated theta_A literally corresponds to coin A. It only knows that there are two coins, one with heads probability 0.7 and another one with 0.2.
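If a canonical ordering is wanted, one simple way (a sketch, not part of the original code) is to sort the two estimates so the smaller heads probability is always reported first:

```python
# Resolve the label switching by sorting the two estimated probabilities
estimated_theta_A, estimated_theta_B = 0.7053, 0.1923  # a "swapped" EM outcome
theta_small, theta_large = sorted([estimated_theta_A, estimated_theta_B])
print(theta_small, theta_large)  # 0.1923 0.7053
```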
EM algorithm with fixed prior distribution of $z$: $P(z=A)=0.5$ and $P(z=B)=0.5$.
```python
estimated_theta_A, estimated_theta_B = EM(X, update_prior=False)
print("Estimated theta_A: %.4f, Estimated theta_B: %.4f" % (estimated_theta_A, estimated_theta_B))
```

```
Estimated theta_A: 0.6621, Estimated theta_B: 0.1762
```
From this result it's obvious that if we use a fixed prior distribution of $z$ that is quite different from the true prior, the final estimate of the model parameters will be less accurate.
In fact, if we choose to dynamically update the prior and check how the prior distribution changes, we will see that it gradually approaches the true prior. This can be shown by plotting the prior_coin_A_list variable:
```python
import matplotlib.pyplot as plt
plt.plot(params["prior_coin_A_list"])
plt.plot(np.ones_like(params["prior_coin_A_list"]) * true_prior_coin_A)
plt.legend(["dynamically updated prior", "true prior"])
plt.ylabel("$p(z=A)$")
plt.xlabel("iteration")
plt.show()
```
We can see that the prior gradually approaches the true prior, as we expected. However, we also notice that some gap always remains. This might be explained analytically; I will try to think about it in the future.
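The prior-update rule used above (averaging the posteriors over all observations) can be sketched in isolation; the posterior values below are made up for illustration:

```python
import numpy as np

# Hypothetical posteriors Q_A[i] = P(z=A | x^(i)) for four observations
Q_A = np.array([0.9, 0.8, 0.1, 0.95])

# New prior: the posterior averaged over all observations
prior_A = Q_A.mean()
prior_B = 1 - prior_A
print(prior_A)  # approximately 0.6875
```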
Conclusion
- The EM algorithm does work on this example.
- To better estimate the parameters, it's advisable to dynamically update the prior distribution of the hidden variables.
