### Density Estimation

• Density estimation sits at the intersection of unsupervised learning, feature engineering, and data modeling. Two of the most popular techniques are mixture models (GaussianMixture) and neighbor-based approaches (KernelDensity).

• Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.

• Most people are already familiar with one common density estimation technique: the histogram.

### Histograms

• A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied.

• The choice of bins can have a major effect on the visualization: the upper-right panel of the figure shows a histogram over the same data with the bins shifted right. The two visualizations look entirely different and might lead to different interpretations of the data.

• Think of a histogram as a stack of blocks, one block per point: by stacking the blocks on a fixed grid of bins, we recover the histogram. What if, instead, we center each block on the point it represents and sum the total height at each location? (See the lower-left panel.) The result is not as clean as a histogram, but letting the data drive the block locations makes it a much better reflection of the underlying distribution.

• This visualization is an example of kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point).

• We can recover a smoother distribution by using a smoother kernel. A Gaussian kernel density estimate, where each point contributes a Gaussian curve to the total (see the lower-right panel), yields a powerful non-parametric model of the distribution; a minimal sketch follows below.
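• A minimal sketch of this progression on a toy 1-D sample, using scikit-learn's KernelDensity for the top-hat and Gaussian versions (the data, bandwidth, and variable names are illustrative, not taken from the original figure):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Small 1-D sample (shaped (n_samples, n_features) as scikit-learn expects).
x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, np.newaxis]

# Grid of points at which to evaluate each density estimate.
grid = np.linspace(-5, 10, 1000)[:, np.newaxis]

# Histogram: fixed bins, counts per bin (density=True normalizes the area to 1).
hist, bin_edges = np.histogram(x[:, 0], bins=6, density=True)

# Top-hat KDE: a square block centered on every point.
tophat = KernelDensity(kernel="tophat", bandwidth=0.75).fit(x)
dens_tophat = np.exp(tophat.score_samples(grid))   # score_samples returns log-density

# Gaussian KDE: a smooth bump centered on every point.
gauss = KernelDensity(kernel="gaussian", bandwidth=0.75).fit(x)
dens_gauss = np.exp(gauss.score_samples(grid))
```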

### Kernel Density Estimation (KDE)

• Uses the Ball Tree or KD Tree for queries (see Nearest Neighbors).

• Can be done in any number of dimensions, though performance degrades in high dimensions (see the sketch below).
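• For instance (a sketch with arbitrary data and parameter values), the tree used for neighbor queries can be chosen explicitly via the algorithm parameter, and the same estimator works unchanged on multi-dimensional data:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.RandomState(0).randn(500, 3)   # 500 points in 3 dimensions

# 'kd_tree', 'ball_tree', or 'auto' (default) select the underlying tree structure.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5, algorithm="kd_tree").fit(X)
log_density = kde.score_samples(X[:5])       # log-density at the first five points
```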

• Kernels are positive functions controlled by the bandwidth $h$. The density estimate at a point $y$ within a group of points $x_1, \dots, x_N$ is given by $\rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)$.
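• A hand-rolled illustration of this sum with a Gaussian kernel (note that scikit-learn's KernelDensity normalizes its result so the density integrates to 1, whereas the raw sum below does not):

```python
import numpy as np

def gaussian_kernel_sum(y, x, h):
    """Unnormalized estimate rho_K(y) = sum_i K(y - x_i; h)
    with a Gaussian kernel K(x; h) = exp(-x**2 / (2 * h**2))."""
    return np.sum(np.exp(-(y - x) ** 2 / (2 * h ** 2)))

x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])   # training points
rho = gaussian_kernel_sum(0.0, x, h=1.0)           # estimate at y = 0
```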

• The bandwidth acts as a smoothing parameter, controlling the bias/variance trade-off: a large bandwidth gives a smooth (high-bias) density estimate, while a small bandwidth gives an unsmooth (high-variance) one.
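• A short sketch of the effect (bandwidth values and data are illustrative): the same data fit with a small and a large bandwidth gives a spiky versus an over-smoothed estimate:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, np.newaxis]
grid = np.linspace(-5, 10, 1000)[:, np.newaxis]

# Small bandwidth: every point stands out (low bias, high variance).
narrow = np.exp(KernelDensity(bandwidth=0.1).fit(x).score_samples(grid))

# Large bandwidth: individual points are smoothed away (high bias, low variance).
wide = np.exp(KernelDensity(bandwidth=3.0).fit(x).score_samples(grid))
```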

• Kernel options (all six are exercised in the sketch following this list):

• kernel="gaussian": $K(x; h) \propto \exp(- \frac{x^2}{2h^2} )$
• kernel="tophat": $K(x; h) \propto 1$
• kernel="epanechnikov": $K(x; h) \propto 1 - \frac{x^2}{h^2}$
• kernel="exponential": $K(x; h) \propto \exp(-x/h)$
• kernel="linear": $K(x; h) \propto 1 - x/h$
• kernel="cosine": $K(x; h) \propto \cos(\frac{\pi x}{2h})$
• KDE can be used with any valid distance metric, though the results are properly normalized only for the Euclidean metric.
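• All six kernel names can be passed straight to KernelDensity; a minimal sketch looping over them (bandwidth and data are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, np.newaxis]
grid = np.linspace(-5, 10, 1000)[:, np.newaxis]

densities = {}
for kernel in ["gaussian", "tophat", "epanechnikov",
               "exponential", "linear", "cosine"]:
    kde = KernelDensity(kernel=kernel, bandwidth=0.75).fit(x)
    densities[kernel] = np.exp(kde.score_samples(grid))   # each integrates to 1
```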

• One useful metric is the Haversine distance, which measures the angular distance between points on a sphere.
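• For example (a sketch modeled loosely on scikit-learn's species-distribution example; the coordinates and bandwidth are made up), a geographic density can be fit by passing latitude/longitude in radians with metric="haversine":

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Observation locations as [latitude, longitude] in degrees (made-up values).
latlon_deg = np.array([[-5.6, -65.1],
                       [-7.2, -63.0],
                       [-9.8, -68.4]])
latlon_rad = np.radians(latlon_deg)          # the haversine metric expects radians

# Ball trees support the haversine metric; KD trees do not.
kde = KernelDensity(bandwidth=0.04, metric="haversine",
                    kernel="gaussian", algorithm="ball_tree").fit(latlon_rad)

query = np.radians([[-6.0, -64.0]])
log_density = kde.score_samples(query)       # log-density at the query location
```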

• KDE can learn a non-parametric generative model of a dataset, from which new samples can be drawn. See the example below.

### Example: KDE for learning generative models (Digits)

• Fitting a KDE to the hand-written digits enables drawing new samples that reflect the underlying data model.
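• A condensed sketch along the lines of scikit-learn's digits example (the PCA dimensionality, bandwidth grid, and sample count are illustrative choices): project the digits into a lower-dimensional space, fit a KDE there, draw new samples, and map them back to pixel space:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

digits = load_digits()

# Project the 64-pixel images to 15 dimensions to make the KDE tractable.
pca = PCA(n_components=15, whiten=True)
data = pca.fit_transform(digits.data)

# Pick the bandwidth by cross-validated log-likelihood.
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
kde = grid.best_estimator_

# Draw 44 new points from the fitted density and map them back to pixel space.
new_data = kde.sample(44, random_state=0)
new_digits = pca.inverse_transform(new_data)   # each row is a new 8x8 "digit"
```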