Theory - EM

The source for most of the theory here is Machine Learning: A Probabilistic Perspective by Kevin P. Murphy [Mur12].

Let us represent all the parameters generally as \(\boldsymbol \theta\).

\[ \boldsymbol \theta = \{\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma\} \]

The Need for EM

To compute the Maximum Likelihood Estimate (MLE) or the Maximum A Posteriori (MAP) estimate, one approach is to use a generic gradient-based optimizer to find a local minimum of the Negative Log Likelihood:

\[ \text{NLL}(\boldsymbol \theta) = - \frac{1}{N} \log P(\boldsymbol X | \boldsymbol \theta) \]
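As an illustration, here is a minimal NumPy/SciPy sketch of evaluating this NLL for a Gaussian mixture; the function name gmm_nll and the argument shapes are assumptions made for this example, not something defined in [Mur12].

```python
# Minimal sketch: average negative log likelihood of data X (N, D) under a GMM
# with weights pi (K,), means mu (K, D) and covariances Sigma (K, D, D).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_nll(X, pi, mu, Sigma):
    K = len(pi)
    # log_weighted[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_weighted = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    # log P(x_n | theta) = logsumexp over components; average over n and negate
    return -np.mean(logsumexp(log_weighted, axis=1))
```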

However, there are constraints that need to be enforced, such as:

  • Covariance matrices must be positive semi-definite

  • Proportions (\(\pi_k\)) must sum to one

Enforcing these constraints can be tricky, so it is often simpler to use an iterative algorithm like Expectation Maximization (EM).

EM alternates between inferring the missing values given the parameters (the E step) and optimizing the parameters given the “filled in” data (the M step).

EM in GMM

Initialization

We first initialize our parameters. We can use the results of K-Means Clustering as a starting point.
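One possible way to do this is sketched below using scikit-learn's KMeans; the helper name init_from_kmeans and the returned array shapes are my own choices for this illustration.

```python
# Sketch: initialize (pi, mu, Sigma) from a K-Means clustering of X (N, D).
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pi = np.array([np.mean(labels == k) for k in range(K)])           # cluster proportions
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])    # cluster centroids
    # per-cluster covariance, with a small ridge to keep it positive definite
    Sigma = np.array([np.cov(X[labels == k], rowvar=False) + 1e-6 * np.eye(X.shape[1])
                      for k in range(K)])
    return pi, mu, Sigma
```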

The E Step

Consider the auxiliary function, defined as the expected complete-data log likelihood, where the expectation is taken with respect to the posterior over the latent variables under the current parameters:

\[ Q(\boldsymbol \theta^*, \boldsymbol \theta) = \mathbb{E}_{\boldsymbol Z | \boldsymbol X, \boldsymbol \theta}[\log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*)] = \sum_{\boldsymbol Z} P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta) \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) \]

In this expression, the posterior \(P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)\) is given by the responsibilities \(\gamma(z_{nk})\), as seen in Equation (6).

For the complete-data likelihood itself, we use Equations (3) and (4):

\[\begin{split} \begin{aligned} P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= P(\boldsymbol X | \boldsymbol Z, \boldsymbol \theta^*) P(\boldsymbol Z | \boldsymbol \theta^*) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N} (\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*)^{z_{nk}} (\pi_k^*)^{z_{nk}} \\ \\ \implies \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[ \log \pi_k^* + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*) \right] \end{aligned} \end{split}\]

Since each \(z_{nk}\) is a binary indicator, taking the expectation with respect to \(P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)\) simply replaces \(z_{nk}\) with its expected value, the responsibility \(\gamma(z_{nk})\). Our final auxiliary function therefore becomes:

(7)\[Q(\boldsymbol \theta^*, \boldsymbol \theta) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \left[ \log \pi_k^* + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*) \right]\]
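For concreteness, the following is a small sketch of evaluating Equation (7) for given responsibilities and candidate parameters; the function name auxiliary_q and the array shapes are assumptions. It can be used to check that an M step does not decrease \(Q\) for fixed responsibilities.

```python
# Sketch: evaluate the auxiliary function Q of Equation (7) for
# responsibilities gamma (N, K) and candidate parameters (pi, mu, Sigma).
import numpy as np
from scipy.stats import multivariate_normal

def auxiliary_q(X, gamma, pi, mu, Sigma):
    K = len(pi)
    # log_terms[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    return np.sum(gamma * log_terms)
```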

Conceptually, the E step fills in the missing values.

In practice, however, we only need to compute the responsibilities \(\gamma(z_{nk})\) in the E step; the auxiliary function in its full form is only needed to derive the update equations used in the M step.
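A minimal E-step sketch along these lines (the function name e_step and the array shapes are assumed as before) computes the responsibilities in log space for numerical stability:

```python
# Sketch of the E step: responsibilities gamma(z_nk) under the current parameters.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pi, mu, Sigma):
    K = len(pi)
    # unnormalized log posterior: log pi_k + log N(x_n | mu_k, Sigma_k)
    log_weighted = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    # normalize over components k so that each row sums to one
    return np.exp(log_weighted - logsumexp(log_weighted, axis=1, keepdims=True))
```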

The M Step

The M step uses the data “filled in” during the E step, together with the existing parameters \(\boldsymbol \theta\), to compute revised parameters \(\boldsymbol \theta^*\) such that:

\[ \boldsymbol \theta^* = \underset{\boldsymbol \theta'}{\text{argmax}}\ Q(\boldsymbol \theta', \boldsymbol \theta) \]

For \(\boldsymbol \pi\), maximizing \(Q\) subject to the constraint \(\sum_k \pi_k = 1\) (via a Lagrange multiplier) gives:

(8)\[\pi_k^* = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})\]

For the revised means and covariances, we compute the partial derivatives of \(Q\) with respect to these parameters and set them to zero. One can show that the new parameter estimates are given by:

(9)\[\boldsymbol \mu_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) \boldsymbol x_n}{\sum_{n=1}^N \gamma(z_{nk})}\]
(10)\[\boldsymbol \Sigma_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) (\boldsymbol x_n - \boldsymbol \mu_k^*) (\boldsymbol x_n - \boldsymbol \mu_k^*)^\intercal}{\sum_{n=1}^N \gamma(z_{nk})}\]

Computing these revised parameters constitutes the M step.
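A matching M-step sketch that implements Equations (8)-(10) directly follows; the function name m_step and the einsum-based covariance update are my own choices for this illustration.

```python
# Sketch of the M step: revised parameters from responsibilities gamma (N, K).
import numpy as np

def m_step(X, gamma):
    N, D = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per component
    pi = Nk / N                               # Equation (8)
    mu = (gamma.T @ X) / Nk[:, None]          # Equation (9)
    diff = X[None, :, :] - mu[:, None, :]     # (K, N, D): x_n - mu_k*
    Sigma = np.einsum('nk,knd,kne->kde', gamma, diff, diff) / Nk[:, None, None]  # Equation (10)
    return pi, mu, Sigma
```

Alternating e_step and m_step until the average log likelihood stops improving completes the algorithm; the names and shapes above are illustrative rather than part of the original text.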


References

Mur12

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. URL: http://probml.ai/.