Introduction

A Gaussian Mixture Model (GMM) can be used to represent normally distributed subpopulations within an overall population. It is, in essence, a superposition of multiple Gaussians.

We will consider Multivariate Gaussians, since the univariate case follows easily. Most of the material is drawn from Pattern Recognition and Machine Learning by C. Bishop [Bis06].

Multivariate Normal Distribution

The Multivariate Normal \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) is

(1)\[P(\boldsymbol x | \boldsymbol \mu, \boldsymbol \Sigma) = \frac{1}{\sqrt{\det(2\pi\boldsymbol\Sigma)}} \exp\left(-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\intercal\ \boldsymbol \Sigma^{-1}\ (\boldsymbol x - \boldsymbol \mu)\right)\]

Suppose the number of dimensions is \(n\).

Here, \(\boldsymbol \mu \in \mathbb{R}^n\) is the mean, and \(\boldsymbol \Sigma \in \mathbb{R}^{n \times n}\) is the covariance matrix, which must be symmetric positive definite so that the inverse in Equation (1) exists.
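
As a quick sanity check, Equation (1) can be evaluated directly. The sketch below is a minimal NumPy implementation (the function name `gaussian_pdf` is ours, purely for illustration); `scipy.stats.multivariate_normal` computes the same values.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x; x and mu are length-n arrays, sigma is n x n."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))  # sqrt(det(2*pi*Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)          # -(x-mu)^T Sigma^{-1} (x-mu) / 2
    return np.exp(exponent) / norm_const

# Cross-check against SciPy (same value up to floating point error):
# from scipy.stats import multivariate_normal
# multivariate_normal(mean=mu, cov=sigma).pdf(x)
```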

Definitions

Assume

  • There is a mixture of \(K\) Gaussians

  • Each Gaussian is in a \(D\)-dimensional space

  • Denote a single \(D\)-dimensional observation/data point as \(\boldsymbol x\)

  • There are \(N\) such observations/points.

Since there are \(K\) Gaussians (or Clusters from here onwards), we also have the corresponding \(K\) means and covariance matrices.

  • \(\boldsymbol\mu = \{\boldsymbol \mu_1, \boldsymbol \mu_2, \dots, \boldsymbol \mu_K\}\)

  • \(\boldsymbol\Sigma = \{\boldsymbol \Sigma_1, \boldsymbol \Sigma_2, \dots, \boldsymbol \Sigma_K\}\)

And the entire data/observation set: \(X = \{\boldsymbol x_1, \boldsymbol x_2, \dots, \boldsymbol x_N\}\)

A GMM is a mixture of Gaussians. The proportion in which each Gaussian is present is denoted \(\pi_k\). So, \(\pi_k\) is the proportion, also called the mixing probability, which indicates how large a share the \(k^{th}\) Gaussian has in the mixture.

  • \(\boldsymbol \pi = \{\pi_1, \pi_2, \dots, \pi_K\}\)

The proportions or mixing probabilities satisfy the following:

\[\begin{split} 0 \leqslant \pi_k \leqslant 1\quad \forall\ k \\\\ \sum_{k=1}^{K} \pi_k = 1 \end{split}\]
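
To make the notation concrete, here is a hypothetical toy setup (the specific numbers are arbitrary and only need to satisfy the constraints above); the later sketches reuse these names.

```python
import numpy as np

K, D = 2, 2
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]        # K means, each in R^D
sigmas = [np.eye(D), np.array([[1.0, 0.5], [0.5, 1.5]])]  # K covariance matrices
pis = np.array([0.4, 0.6])                                 # mixing probabilities

# The constraints above: each pi_k lies in [0, 1] and they sum to 1.
assert np.all((0 <= pis) & (pis <= 1)) and np.isclose(pis.sum(), 1.0)
```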

Basic Ideas

We have \(N\) observations in a \(D\)-dimensional space, and we want to fit a mixture of \(K\) Gaussians to them. In the end, we wish to find the parameters of the \(K\) normally distributed subpopulations in our data/observations, and we want to know the probability of a point belonging to each of the clusters/Gaussians.

Mixture Distribution

We have \(K\) Gaussians, each with a mixing proportion. The mixture distribution can be written as:

(2)\[P(\boldsymbol x) = \sum_{k=1}^{K} \pi_k\ \mathcal{N}(\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k)\]

where \(\mathcal{N}(\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k)\) is the Multivariate Normal described in Equation (1).
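
A minimal sketch of Equation (2), reusing `gaussian_pdf` and the toy parameters (`pis`, `mus`, `sigmas`) from the earlier sketches:

```python
import numpy as np

def mixture_pdf(x, pis, mus, sigmas):
    """P(x) = sum_k pi_k * N(x | mu_k, Sigma_k), Equation (2)."""
    return sum(pi_k * gaussian_pdf(x, mu_k, sigma_k)
               for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas))

print(mixture_pdf(np.array([1.0, 2.0]), pis, mus, sigmas))
```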

Introducing a Latent Variable

Latent Variables are also called Hidden Variables. Here, we introduce \(\boldsymbol z\), a \(K\)-dimensional binary random variable which follows a 1-of-K representation.

For a given observation point, when it belongs to cluster \(k\), then \(z_k = 1\) and all other \(z_i = 0\). So, \(z_k \in \{0, 1\}\) and \(\sum_k z_k = 1\).

So, \(P(z_{k} = 1 | \boldsymbol x)\) would represent the probability of \(\boldsymbol x\) belonging to cluster \(k\). This is one of the things we wish to find.

The marginal distribution over \(\boldsymbol z\) is specified in terms of the mixing probabilities (or proportions) \(\pi_k\) such that

\[ p(z_k = 1) = \pi_k \]

Intuitively, the marginal distribution over \(\boldsymbol z\) should represent the prior probability of each cluster, which is exactly what the proportions are.

Because \(\boldsymbol z\) uses a 1-of-K representation, we can write this distribution in the form:

(3)\[P(\boldsymbol z) = \prod_{k=1}^{K} \pi_k^{z_k}\]

Similarly, the conditional distribution of \(\boldsymbol x\) given a particular value for \(\boldsymbol z\) is a Gaussian:

\[ P(\boldsymbol x | z_k = 1) = \mathcal{N}(\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k) \]

which can be written in the form

(4)\[P(\boldsymbol x | \boldsymbol z) = \prod_{k = 1}^{K} \mathcal{N}(\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k)^{z_k}\]

This works because only one \(z_k = 1\) at a time, and the rest are 0.

The joint probability distribution is \(P(\boldsymbol x, \boldsymbol z) = P(\boldsymbol z) P(\boldsymbol x | \boldsymbol z)\), so we can find the marginal distribution of \(\boldsymbol x\) by summing over the \(K\) possible states of \(\boldsymbol z\).

\[ P(\boldsymbol x) = \sum_{z} P(\boldsymbol z)P(\boldsymbol x | \boldsymbol z) = \sum_{k = 1}^{K} \pi_k\ \mathcal{N} (\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k) \]

For each data point / observation \(\boldsymbol x_n\), there is a corresponding latent variable \(\boldsymbol z_n\). We have obtained the same formulation of a Gaussian Mixture Model as Equation (2), this time involving a latent variable.
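
The latent variable view also tells us how to generate data from a GMM: draw \(\boldsymbol z\) according to the mixing proportions, then draw \(\boldsymbol x\) from the selected Gaussian. A small sketch, again using the toy parameters defined earlier:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_gmm(n_samples, pis, mus, sigmas):
    """Draw z_n from the mixing proportions, then x_n from the chosen Gaussian."""
    zs = rng.choice(len(pis), size=n_samples, p=pis)                        # latent cluster labels
    xs = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in zs]) # observations
    return xs, zs

X, Z = sample_gmm(500, pis, mus, sigmas)
```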

And so, assuming the \(N\) observations are independent, for the complete observation set \(\boldsymbol X\) we have:

(5)\[P(\boldsymbol X) = \prod_{n=1}^{N} \sum_{k = 1}^{K} \pi_k\ \mathcal{N} (\boldsymbol x_n | \boldsymbol \mu_k, \boldsymbol \Sigma_k)\]
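
In practice one works with the logarithm of Equation (5), the log-likelihood of the data set, since the product of many small densities underflows. A sketch, reusing `mixture_pdf` and the sampled `X` from the earlier sketches:

```python
import numpy as np

def log_likelihood(X, pis, mus, sigmas):
    """log P(X) = sum_n log( sum_k pi_k N(x_n | mu_k, Sigma_k) )."""
    return sum(np.log(mixture_pdf(x_n, pis, mus, sigmas)) for x_n in X)

print(log_likelihood(X, pis, mus, sigmas))
```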

Responsibilities

The quantity \(P(z_k = 1 | \boldsymbol x)\), which was the probability that the observation \(\boldsymbol x\) belongs to cluster \(k\), is also denoted as \(\gamma(z_k)\).

(6)\[\begin{split}\begin{aligned} \gamma(z_k) \equiv P(z_k = 1 | \boldsymbol x) &= \frac{P(z_k = 1) P(\boldsymbol x | z_k = 1)}{\sum_{j=1}^{K} P(z_j = 1) P(\boldsymbol x | z_j = 1)} \\ \\ \implies \gamma(z_k) &= \frac{\pi_k\ \mathcal{N}(\boldsymbol x | \boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{j=1}^{K} \pi_j\ \mathcal{N}(\boldsymbol x | \boldsymbol \mu_j, \boldsymbol \Sigma_j)} \end{aligned}\end{split}\]

This quantity is also called the responsibility that cluster/Gaussian/component \(k\) takes for explaining the observation \(\boldsymbol x\).
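
Equation (6) translates almost directly into code. The sketch below (reusing `gaussian_pdf` and the toy parameters from above) returns the responsibilities of all \(K\) components for a single point; this is the quantity computed in the E-step of the EM algorithm for GMMs.

```python
import numpy as np

def responsibilities(x, pis, mus, sigmas):
    """gamma(z_k) for k = 1..K as a length-K array that sums to 1 (Equation (6))."""
    weighted = np.array([pi_k * gaussian_pdf(x, mu_k, sigma_k)
                         for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas)])
    return weighted / weighted.sum()

print(responsibilities(np.array([1.0, 2.0]), pis, mus, sigmas))
```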

References

Bis06

Christopher Bishop. Pattern Recognition and Machine Learning. Springer, January 2006. URL: https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/.