### Gaussian Processes

• Generic supervised method
• Returns a probabilistic (Gaussian) prediction - this enables building empirical confidence intervals.
• A range of kernels - both standard and custom - can be used.
• Gaussian processes lose efficiency in high-dimensional spaces (when the number of features exceeds a few dozen).
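
The probabilistic prediction mentioned above can be sketched with scikit-learn's `GaussianProcessRegressor`; the toy data below is illustrative, not from the original text.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D dataset (illustrative)
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0).fit(X, y)

# return_std=True yields the posterior standard deviation,
# from which a ~95% confidence interval can be built.
X_test = np.linspace(0, 5, 50)[:, np.newaxis]
mean, std = gpr.predict(X_test, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

`return_cov=True` would instead return the full posterior covariance matrix.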

### Gaussian Process Regression (GPR)

• The prior of a Gaussian Process needs to be specified.
• The mean can be constant & zero (normalize_y=False) or derived from training data (normalize_y=True).
• The covariance is specified by passing a kernel object; the kernel's hyperparameters are optimized during fitting by maximizing the log-marginal-likelihood (LML).
• LML can have multiple local optima - you can restart the optimizer with n_restarts_optimizer.
• The 1st run uses the kernel's initial parameter values.
• Subsequent runs use parameters randomly drawn from the range of allowed values; if optimizer=None is passed, the initial parameters are kept fixed.
• Target noise is specified via alpha - either globally (a scalar) or per datapoint (an array).
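
The fitting options listed above can be sketched as follows; the data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy noisy 1-D dataset (illustrative)
rng = np.random.RandomState(1)
X = rng.uniform(0, 5, 30)[:, np.newaxis]
y = 2.0 + np.sin(X).ravel() + rng.normal(0, 0.1, 30)

gpr = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0),  # covariance of the prior
    alpha=0.1**2,             # assumed target-noise variance (global scalar)
    normalize_y=True,         # prior mean derived from the training data
    n_restarts_optimizer=5,   # restart LML optimization from random parameters
    random_state=1,
).fit(X, y)

# The fitted kernel's parameters maximize the log-marginal-likelihood
lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)
```

Passing `optimizer=None` instead would keep the kernel's initial parameters fixed during fitting.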

### Example: GPR with noise-level estimate

• Demonstrates how GPR with a sum kernel can estimate noise level in a dataset.
• Two local LML maxima exist:
• The 1st corresponds to a model with high noise & large length scale.
• The 2nd has a smaller noise level and short length scale, which explains most variation by the noise-free function.
• It is therefore important to repeat the optimization multiple times with different initial values.
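
A minimal sketch of this setup: a sum kernel of an RBF (signal) and a WhiteKernel (noise), where the fitted noise_level serves as the noise estimate. The dataset and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy dataset with Gaussian noise of std 0.2 (illustrative)
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 40)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)

# Sum kernel: the RBF explains the signal, the WhiteKernel absorbs the noise
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=9,  # escape the poorer of the two LML maxima
    random_state=0,
).fit(X, y)

# After fitting, the optimized noise_level of the WhiteKernel
# approximates the noise variance in the data
estimated_noise = gpr.kernel_.k2.noise_level
```

With few or no restarts, the optimizer may instead settle on the high-noise, large-length-scale optimum.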

### GPR vs Kernel Ridge Regression

• Both GPR and KRR learn target functions using the "kernel trick".
• KRR learns a linear function, induced by a kernel, that corresponds to a non-linear function in the original space. The linear function is chosen based on MSE (mean squared error) loss with ridge regularization.
• GPR uses the kernel to define the covariance of the prior distribution over target functions, and uses the training data to define a likelihood function.
• A Gaussian posterior distribution is then derived via Bayes' theorem; the distribution's mean is used for predictions.
• GPR can choose a kernel's parameters via gradient ascent on the marginal likelihood function - KRR needs a grid search on a cross-validated MSE (mean squared error) loss.
• GPR learns a probabilistic model of the target function and can therefore provide confidence intervals; KRR only provides point predictions.
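
The contrast above can be sketched side by side; the data, parameter grid, and kernel settings are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Toy noisy 1-D dataset (illustrative)
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 50)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(0, 0.1, 50)

# KRR: kernel parameters tuned by grid search on cross-validated error
krr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-2, 1e-1, 1.0], "gamma": np.logspace(-2, 2, 5)},
).fit(X, y)

# GPR: kernel parameters tuned by gradient ascent on the marginal likelihood
gpr = GaussianProcessRegressor(
    kernel=RBF(), alpha=0.1**2, random_state=0
).fit(X, y)

X_test = np.linspace(0, 5, 20)[:, np.newaxis]
krr_pred = krr.predict(X_test)                            # point predictions only
gpr_pred, gpr_std = gpr.predict(X_test, return_std=True)  # mean + uncertainty
```

Note that KRR returns only `krr_pred`, while GPR additionally returns the posterior standard deviation from which confidence intervals follow.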