Theory - EM

The source for most of the theory here is Machine Learning: A Probabilistic Perspective by Kevin P. Murphy [Mur12].

Let us represent all the parameters generally as \(\boldsymbol \theta\).

\[ \boldsymbol \theta = \{\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma\} \]

The Need for EM

To compute the Maximum Likelihood Estimate (MLE) or the Maximum A Posteriori (MAP) estimate, one approach is to use a generic gradient-based optimizer to find a local minimum of the Negative Log Likelihood:

\[ \text{NLL}(\boldsymbol \theta) = - \frac{1}{N} \log P(\boldsymbol X | \boldsymbol \theta) \]
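As an illustration, here is a minimal NumPy/SciPy sketch of evaluating this NLL for a Gaussian mixture; the function name gmm_nll and the argument shapes are assumptions made for this example, not something defined in [Mur12].

```python
# Minimal sketch: average negative log likelihood of data X (N, D) under a GMM
# with weights pi (K,), means mu (K, D) and covariances Sigma (K, D, D).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_nll(X, pi, mu, Sigma):
    K = len(pi)
    # log_weighted[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_weighted = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    # log P(x_n | theta) = logsumexp over components; average over n and negate
    return -np.mean(logsumexp(log_weighted, axis=1))
```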

However, there are constraints that need to be enforced, such as:

  • Covariance matrices must be positive semi-definite

  • Proportions (\(\pi_k\)) must sum to one

Enforcing these constraints can be tricky, so it is often simpler to use an iterative algorithm like Expectation Maximization (EM).

EM alternates between inferring the missing values given the parameters (the E step) and optimizing the parameters given the “filled in” data (the M step).

EM in GMM

Initialization

We first initialize our parameters. We can use the results of K-Means Clustering as a starting point.
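One possible way to do this is sketched below using scikit-learn's KMeans; the helper name init_from_kmeans and the returned array shapes are my own choices for this illustration.

```python
# Sketch: initialize (pi, mu, Sigma) from a K-Means clustering of X (N, D).
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pi = np.array([np.mean(labels == k) for k in range(K)])           # cluster proportions
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])    # cluster centroids
    # per-cluster covariance, with a small ridge to keep it positive definite
    Sigma = np.array([np.cov(X[labels == k], rowvar=False) + 1e-6 * np.eye(X.shape[1])
                      for k in range(K)])
    return pi, mu, Sigma
```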

The E Step

Consider the auxiliary function, defined as the expected complete-data log likelihood, where the expectation is taken with respect to the posterior over the latent variables under the current parameters:

\[ Q(\boldsymbol \theta^*, \boldsymbol \theta) = \mathbb{E}_{\boldsymbol Z | \boldsymbol X, \boldsymbol \theta}[\log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*)] = \sum_{\boldsymbol Z} P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta) \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) \]

In this expression, the posterior \(P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)\) is given by the responsibilities \(\gamma(z_{nk})\), as seen in Equation (6).

For the complete-data likelihood itself, we use Equations (3) and (4):

\[\begin{split} \begin{aligned} P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= P(\boldsymbol X | \boldsymbol Z, \boldsymbol \theta^*) P(\boldsymbol Z | \boldsymbol \theta^*) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N} (\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*)^{z_{nk}} (\pi_k^*)^{z_{nk}} \\ \\ \implies \log P(\boldsymbol X, \boldsymbol Z | \boldsymbol \theta^*) &= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[ \log \pi_k^* + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*) \right] \end{aligned} \end{split}\]

Since each \(z_{nk}\) is a binary indicator, taking the expectation with respect to \(P(\boldsymbol Z | \boldsymbol X, \boldsymbol \theta)\) simply replaces \(z_{nk}\) with its expected value, the responsibility \(\gamma(z_{nk})\). Our final auxiliary function therefore becomes:

(7)\[Q(\boldsymbol \theta^*, \boldsymbol \theta) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \left[ \log \pi_k^* + \log \mathcal{N}(\boldsymbol x_n | \boldsymbol \mu_k^*, \boldsymbol \Sigma_k^*) \right]\]
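For concreteness, the following is a small sketch of evaluating Equation (7) for given responsibilities and candidate parameters; the function name auxiliary_q and the array shapes are assumptions. It can be used to check that an M step does not decrease \(Q\) for fixed responsibilities.

```python
# Sketch: evaluate the auxiliary function Q of Equation (7) for
# responsibilities gamma (N, K) and candidate parameters (pi, mu, Sigma).
import numpy as np
from scipy.stats import multivariate_normal

def auxiliary_q(X, gamma, pi, mu, Sigma):
    K = len(pi)
    # log_terms[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    return np.sum(gamma * log_terms)
```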

Conceptually, the E step fills in the missing values.

In practice, however, we only need to compute the responsibilities \(\gamma(z_{nk})\) in the E step; the auxiliary function in its full form is only needed to derive the update equations used in the M step.
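A minimal E-step sketch along these lines (the function name e_step and the array shapes are assumed as before) computes the responsibilities in log space for numerical stability:

```python
# Sketch of the E step: responsibilities gamma(z_nk) under the current parameters.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pi, mu, Sigma):
    K = len(pi)
    # unnormalized log posterior: log pi_k + log N(x_n | mu_k, Sigma_k)
    log_weighted = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(K)], axis=1)
    # normalize over components k so that each row sums to one
    return np.exp(log_weighted - logsumexp(log_weighted, axis=1, keepdims=True))
```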

The M Step

The M step uses the data “filled in” during the E step, together with the existing parameters \(\boldsymbol \theta\), to compute revised parameters \(\boldsymbol \theta^*\) such that:

\[ \boldsymbol \theta^* = \underset{\boldsymbol \theta'}{\text{argmax}}\ Q(\boldsymbol \theta', \boldsymbol \theta) \]

For \(\boldsymbol \pi\), maximizing \(Q\) subject to the constraint \(\sum_k \pi_k = 1\) (via a Lagrange multiplier) gives:

(8)\[\pi_k^* = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})\]

For the revised means and covariances, we compute the partial derivatives of \(Q\) with respect to these parameters and set them to zero. One can show that the new parameter estimates are given by:

(9)\[\boldsymbol \mu_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) \boldsymbol x_n}{\sum_{n=1}^N \gamma(z_{nk})}\]
(10)\[\boldsymbol \Sigma_k^* = \frac{\sum_{n=1}^N \gamma(z_{nk}) (\boldsymbol x_n - \boldsymbol \mu_k^*) (\boldsymbol x_n - \boldsymbol \mu_k^*)^\intercal}{\sum_{n=1}^N \gamma(z_{nk})}\]

Computing these revised parameters constitutes the M step.
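A matching M-step sketch that implements Equations (8)-(10) directly follows; the function name m_step and the einsum-based covariance update are my own choices for this illustration.

```python
# Sketch of the M step: revised parameters from responsibilities gamma (N, K).
import numpy as np

def m_step(X, gamma):
    N, D = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per component
    pi = Nk / N                               # Equation (8)
    mu = (gamma.T @ X) / Nk[:, None]          # Equation (9)
    diff = X[None, :, :] - mu[:, None, :]     # (K, N, D): x_n - mu_k*
    Sigma = np.einsum('nk,knd,kne->kde', gamma, diff, diff) / Nk[:, None, None]  # Equation (10)
    return pi, mu, Sigma
```

Alternating e_step and m_step until the average log likelihood stops improving completes the algorithm; the names and shapes above are illustrative rather than part of the original text.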


References

Mur12

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. URL: http://probml.ai/.