### Gaussian Mixtures

• GMs are probabilistic models that assume all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
• They use the expectation-maximization (EM) algorithm for model fitting.
• They can draw confidence ellipsoids for multivariate models.
• They can compute the Bayesian Information Criterion (BIC) to assess the number of clusters in the data.
• Once fitted, a GM can assign each sample to the Gaussian it most probably belongs to, using the predict method.
• Different covariance matrix options are supported (see the sketch after this list):
• spherical
• diagonal
• tied
• full
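
A minimal sketch of this workflow with scikit-learn's GaussianMixture; the toy data, n_components, and covariance_type choices below are illustrative assumptions, not taken from the docs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two well-separated Gaussian blobs (illustrative values).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=+3.0, scale=1.0, size=(200, 2)),
])

# Fit a 2-component mixture; covariance_type can be "spherical", "diag", "tied", or "full".
gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gm.fit(X)

labels = gm.predict(X)        # most likely component for each sample
proba = gm.predict_proba(X)   # soft (posterior) assignments
print("BIC:", gm.bic(X))      # Bayesian Information Criterion on this data
print("means:\n", gm.means_)
```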

### Expectation Maximization (EM)

• Gaussian mixture problems are "unlabeled" in the sense that we do not know which latent (hidden) component generated each sample.
• EM resolves this with an iterative approach: it first assumes random components and computes, for each sample, the probability of it being generated by each component of the model. The parameters are then tweaked to maximize the likelihood of the data given those assignments. Repeating this process is guaranteed to converge to (at least) a local optimum (see the sketch below).
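
To make the two steps concrete, here is a minimal 1-D, two-component EM loop in plain NumPy. This is only an illustrative sketch with assumed toy values; scikit-learn's GaussianMixture implements a robust, general version of this loop.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2.0, 0.8, 300), rng.normal(3.0, 1.2, 200)])

# Initial guesses (the "random components").
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

for _ in range(100):
    # E-step: responsibility of each component for each sample.
    dens = np.stack([w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters to maximize the likelihood under these assignments.
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk

print("weights:", weights, "means:", means, "variances:", variances)
```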

### Example: GMM clustering, Iris toy dataset

• Plots predicted labels on training & test data using multiple GMM covariance types.
• Training data = dots; test data = crosses.
• The Iris toy dataset is 4-dimensional; only the first two features are used here (see the sketch below).
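
A simplified sketch of that setup, assuming a plain train_test_split; the full example on the scikit-learn site additionally plots the confidence ellipsoids for each covariance type, which is omitted here:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, :2], iris.target   # only the first two of the four features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One GMM per covariance type; n_components=3 matches the three Iris species.
for cov_type in ("spherical", "diag", "tied", "full"):
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    gm.fit(X_train)
    train_labels = gm.predict(X_train)   # plotted as dots in the example
    test_labels = gm.predict(X_test)     # plotted as crosses in the example
    print(cov_type, "per-sample test log-likelihood:", round(gm.score(X_test), 3))
```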

### Example: Gaussian Mixtures - Density Estimation

• Data is generated from two Gaussians with different centers & covariance matrices.
• 1) spherical data centered at (20,20)
• 2) zero-centered stretched Gaussian
• The two are concatenated (stacked) into the final training set (see the sketch below).
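
A sketch of that data-generation and fitting step; the stretch matrix C is an assumed, illustrative value and the exact numbers in the scikit-learn example may differ:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
n_samples = 300

# 1) spherical data centered at (20, 20)
spherical = rng.randn(n_samples, 2) + np.array([20.0, 20.0])

# 2) zero-centered, stretched Gaussian (C is an assumed stretch matrix)
C = np.array([[0.0, -0.7], [3.5, 0.7]])
stretched = rng.randn(n_samples, 2) @ C

# concatenate the two into the final training set
X_train = np.vstack([spherical, stretched])

# fit a two-component mixture and use it as a density model
gm = GaussianMixture(n_components=2, covariance_type="full").fit(X_train)
log_density = gm.score_samples(X_train)   # per-sample log-likelihood (what the example contours)
```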

### Example: Using the Bayesian Information Criterion (BIC) to Select the Number of Components

• Model selection concerns both the covariance type and the number of components in the model. The AIC also gives the correct result (not shown), but BIC is better suited if the problem is to identify the right model (see the sketch below).
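
A minimal sketch of such a BIC-driven selection loop; the toy data and the 1-6 component search range are assumptions for illustration:

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian components, as in the density-estimation sketch above.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2) + [20, 20],
               rng.randn(300, 2) @ np.array([[0.0, -0.7], [3.5, 0.7]])])

# Grid-search covariance type and number of components, keeping the lowest BIC.
lowest_bic, best_gm = np.inf, None
for cov_type, n_components in itertools.product(
        ("spherical", "tied", "diag", "full"), range(1, 7)):
    gm = GaussianMixture(n_components=n_components,
                         covariance_type=cov_type,
                         random_state=0).fit(X)
    bic = gm.bic(X)
    if bic < lowest_bic:
        lowest_bic, best_gm = bic, gm

print("best:", best_gm.covariance_type, best_gm.n_components, "BIC =", lowest_bic)
```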

### Variational Bayesian GM

• Variational inference is an extension of EM. It maximizes a lower bound on model evidence, including priors, instead of data likelihood.
• Variational methods add regularization by using information from prior distributions. This avoids the singularities found in some EM solutions, but introduces some biases.
• Inference is notably slower.
• Due to its Bayesian nature, variational inference requires more hyperparameters than EM; the most important is weight_concentration_prior (see the sketch after this list).
• Low values make the model put most of the weight on a few components and set the remaining components' weights very close to zero.
• High values allow a larger number of components to be active in the mixture.
• Two types of priors are available for the weights distribution:
• A finite mixture model with a Dirichlet distribution
• An infinite mixture model with the Dirichlet process. (In practice, this algorithm is usually approximated & uses a truncated, "stick-breaking" representation.)
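
A minimal sketch of the effect of weight_concentration_prior, assuming toy two-cluster data and arbitrary low/high prior values: the model is given more components than needed, and the prior decides how many end up with appreciable weight.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Toy data from two Gaussian clusters (illustrative).
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])

# Deliberately over-specify n_components; the concentration prior controls
# how many components actually receive appreciable weight.
for gamma in (0.01, 1000.0):
    bgm = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=gamma,
        max_iter=500,
        random_state=0,
    ).fit(X)
    print(f"prior={gamma}: weights={np.round(bgm.weights_, 3)}")
```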

### Example: Concentration Prior Analysis - Variational Bayesian GMs

• Generate a toy dataset (mixture of 3 Gaussians)
• Fit with a Bayesian GM using 1) a Dirichlet distribution prior (weight_concentration_prior_type="dirichlet_distribution") and 2) a Dirichlet process prior (weight_concentration_prior_type="dirichlet_process"); the concentration value itself is set via weight_concentration_prior.
• Plot the ellipsoids for the three values of the weight concentration prior (see the sketch below).
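
A sketch of that experiment, with an assumed toy 3-Gaussian dataset and arbitrary concentration values (the example's actual dataset and values may differ); the fitted means_ and covariances_ are what the example draws as ellipsoids:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Toy mixture of three Gaussians (illustrative stand-in for the example's dataset).
X = np.vstack([rng.normal(center, 0.7, (150, 2)) for center in (-4.0, 0.0, 4.0)])

for prior_type in ("dirichlet_distribution", "dirichlet_process"):
    for gamma in (0.01, 1.0, 100.0):   # three assumed concentration values
        bgm = BayesianGaussianMixture(
            n_components=5,
            weight_concentration_prior_type=prior_type,
            weight_concentration_prior=gamma,
            max_iter=500,
            random_state=0,
        ).fit(X)
        # bgm.means_ and bgm.covariances_ are what the example plots as ellipsoids.
        print(prior_type, gamma, np.round(bgm.weights_, 2))
```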