EM vs Gibbs Sampler - Results¶
This experiment compares the performance of Expectation Maximization (EM) and the Gibbs Sampler (GS) in the context of Gaussian Mixture Models:
500 runs each for K = 3 and K = 6 clusters
1000 data points in each run
Univariate data
During data generation, component means were drawn from a Uniform[-10, 10] distribution and standard deviations from a Uniform[0.25, 5] distribution.
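To make the setup concrete, here is a minimal sketch of how one such dataset could be generated. The function name, seed, and equal-weight mixing are illustrative assumptions, not the original generation code:

```python
import numpy as np

def generate_dataset(k, n=1000, seed=0):
    # Hypothetical sketch of the data generation described above
    rng = np.random.default_rng(seed)
    means = rng.uniform(-10, 10, size=k)   # means ~ Uniform[-10, 10]
    stds = rng.uniform(0.25, 5, size=k)    # standard deviations ~ Uniform[0.25, 5]
    labels = rng.integers(0, k, size=n)    # assumption: equal mixture weights
    x = rng.normal(means[labels], stds[labels])
    return x, labels

x, true_labels = generate_dataset(k=3)
```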
import pandas as pd
import plotly.graph_objects as go
# Load Data
gs3 = pd.read_csv("gs-k-3.csv")
gs6 = pd.read_csv("gs-k-6.csv")
em3 = pd.read_csv("em-k-3.csv")
em6 = pd.read_csv("em-k-6.csv")
The Data¶
The Gibbs sampler results contain the following metrics:
RS: Rand Score
ARS: Adjusted Rand Score
SS: Silhouette Score
for each of the three GS variants:
Base GS
GS with Multiple Initializations
GS with Burn In
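All three metrics are available in scikit-learn. A minimal sketch, assuming `labels_true` holds the generating labels, `labels_pred` the cluster assignments from one run, and `x` the univariate data (all hypothetical names here):

```python
from sklearn.metrics import rand_score, adjusted_rand_score, silhouette_score

rs = rand_score(labels_true, labels_pred)            # RS: pairwise agreement
ars = adjusted_rand_score(labels_true, labels_pred)  # ARS: RS corrected for chance
# SS uses the data only (no true labels); univariate points need shape (n, 1)
ss = silhouette_score(x.reshape(-1, 1), labels_pred)
```

Note that RS and ARS compare against the true generating labels, while the silhouette score judges cluster geometry from the data alone.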
# GS with K = 3
gs3
|   | filename | gs_base_rs | gs_base_ars | gs_base_ss | gs_multi_rs | gs_multi_ars | gs_multi_ss | gs_burnin_rs | gs_burnin_ars | gs_burnin_ss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | data-k-3-0.csv | 0.485956 | -0.000537 | -0.255483 | 0.547105 | 0.017980 | 0.245627 | 0.536328 | 0.003276 | -0.020430 |
| 1 | data-k-3-1.csv | 0.620989 | 0.230360 | 0.023708 | 0.683604 | 0.317101 | 0.473936 | 0.648256 | 0.271258 | 0.085944 |
| 2 | data-k-3-2.csv | 0.797538 | 0.534419 | 0.491607 | 0.765169 | 0.508474 | 0.488942 | 0.803904 | 0.549906 | 0.500306 |
| 3 | data-k-3-3.csv | 0.729241 | 0.335387 | -0.061440 | 0.686635 | 0.367634 | 0.295017 | 0.676150 | 0.236650 | 0.249176 |
| 4 | data-k-3-4.csv | 0.609772 | 0.238812 | 0.384276 | 0.588667 | 0.191067 | 0.519122 | 0.614503 | 0.252042 | 0.619329 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 492 | data-k-3-495.csv | 0.663618 | 0.315058 | 0.227498 | 0.759620 | 0.516875 | 0.588444 | 0.717594 | 0.429114 | 0.455768 |
| 493 | data-k-3-496.csv | 0.625051 | 0.254764 | 0.189085 | 0.668693 | 0.344932 | 0.499449 | 0.638194 | 0.283535 | 0.225605 |
| 494 | data-k-3-497.csv | 0.848749 | 0.279924 | 0.187651 | 0.644931 | 0.155684 | 0.370003 | 0.821467 | 0.258756 | 0.108153 |
| 495 | data-k-3-498.csv | 0.469934 | -0.009351 | -0.380859 | 0.539754 | 0.059766 | 0.372744 | 0.507011 | -0.001281 | -0.084711 |
| 496 | data-k-3-499.csv | 0.765938 | 0.436257 | -0.053071 | 0.583195 | 0.235119 | 0.438554 | 0.736348 | 0.335428 | 0.280845 |
497 rows × 10 columns
The EM dataframe has the Adjusted Rand Score (ARS) results for EM in two modes:
EM with Many Random Initializations (`gmm_mri_ars`)
EM with K-Means Initialization (`gmm_kmeans_ars`)
The final column holds the results of standard K-Means clustering (`kmeans_ars`); a sketch of how these three could be fit follows.
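The two EM modes and the K-Means baseline map naturally onto scikit-learn. A sketch under assumed settings (e.g. `n_init=10`), reusing the hypothetical `x` and `true_labels` from the generation sketch above:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X = x.reshape(-1, 1)  # univariate data as a column vector

# EM with many random initializations: keep the best of n_init random starts
gmm_mri = GaussianMixture(n_components=3, init_params="random", n_init=10).fit(X)
# EM initialized from a K-Means solution (scikit-learn's default init)
gmm_km = GaussianMixture(n_components=3, init_params="kmeans").fit(X)
# Standard K-Means baseline
km = KMeans(n_clusters=3, n_init=10).fit(X)

gmm_mri_ars = adjusted_rand_score(true_labels, gmm_mri.predict(X))
gmm_kmeans_ars = adjusted_rand_score(true_labels, gmm_km.predict(X))
kmeans_ars = adjusted_rand_score(true_labels, km.labels_)
```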
# EM with K = 3
em3
|   | file | gmm_mri_ars | gmm_kmeans_ars | kmeans_ars |
| --- | --- | --- | --- | --- |
| 0 | data-k-3-0.csv | -0.018932 | 0.024058 | 0.024468 |
| 1 | data-k-3-1.csv | 0.365575 | 0.119006 | 0.142713 |
| 2 | data-k-3-2.csv | 0.675526 | 0.306482 | 0.221317 |
| 3 | data-k-3-3.csv | -0.003321 | 0.023570 | 0.053617 |
| 4 | data-k-3-4.csv | 0.260538 | 0.159359 | 0.189642 |
| ... | ... | ... | ... | ... |
| 495 | data-k-3-495.csv | 0.538792 | 0.540371 | 0.455710 |
| 496 | data-k-3-496.csv | 0.481009 | 0.253717 | 0.267321 |
| 497 | data-k-3-497.csv | 0.376996 | 0.061681 | 0.040540 |
| 498 | data-k-3-498.csv | 0.130798 | 0.070550 | 0.086350 |
| 499 | data-k-3-499.csv | 0.472284 | 0.100273 | 0.112142 |
500 rows × 4 columns
Results¶
The plots are interactive.
K = 3¶
fig = go.Figure()
fig.add_trace(go.Box(y=gs3['gs_base_ars'], name="GS Base"))
fig.add_trace(go.Box(y=gs3['gs_burnin_ars'], name="GS Burn In"))
fig.add_trace(go.Box(y=gs3['gs_multi_ars'], name="GS Multi"))
fig.add_trace(go.Box(y=em3['gmm_mri_ars'], name="EM Multi Init"))
fig.add_trace(go.Box(y=em3['gmm_kmeans_ars'], name="EM K-Means Init"))
fig.add_trace(go.Box(y=em3['kmeans_ars'], name="Standard K-Means"))
fig.update_layout(title_text="K = 3")
fig.show()
K = 6¶
fig = go.Figure()
fig.add_trace(go.Box(y=gs6['gs_base_ars'], name="GS Base"))
fig.add_trace(go.Box(y=gs6['gs_burnin_ars'], name="GS Burn In"))
fig.add_trace(go.Box(y=gs6['gs_multi_ars'], name="GS Multi"))
fig.add_trace(go.Box(y=em6['gmm_mri_ars'], name="EM Multi Init"))
fig.add_trace(go.Box(y=em6['gmm_kmeans_ars'], name="EM K-Means Init"))
fig.add_trace(go.Box(y=em6['kmeans_ars'], name="Standard K-Means"))
fig.update_layout(title_text="K = 6")
fig.show()
Conclusions¶
The Gibbs sampler with multiple initializations and both EM variants outperform standard K-Means.
Between GS and EM, GS with multiple initializations appears to perform best, by a slight margin over EM.
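To put numbers behind the box plots, the median ARS per method can be tabulated directly from the dataframes loaded above:

```python
# Median ARS per method and cluster count
summary = pd.DataFrame({
    "K = 3": {
        "GS Base": gs3["gs_base_ars"].median(),
        "GS Burn In": gs3["gs_burnin_ars"].median(),
        "GS Multi": gs3["gs_multi_ars"].median(),
        "EM Multi Init": em3["gmm_mri_ars"].median(),
        "EM K-Means Init": em3["gmm_kmeans_ars"].median(),
        "Standard K-Means": em3["kmeans_ars"].median(),
    },
    "K = 6": {
        "GS Base": gs6["gs_base_ars"].median(),
        "GS Burn In": gs6["gs_burnin_ars"].median(),
        "GS Multi": gs6["gs_multi_ars"].median(),
        "EM Multi Init": em6["gmm_mri_ars"].median(),
        "EM K-Means Init": em6["gmm_kmeans_ars"].median(),
        "Standard K-Means": em6["kmeans_ars"].median(),
    },
})
summary
```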