### Validation Curves

• Every estimator has tradeoffs: its generalization error can be decomposed into bias, variance, and noise.

• Bias is the model's average error across different training sets.

• Variance measures how sensitive the model is to varying training sets.

• Noise is a property of the data itself.

• The examples below illustrate these ideas.
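As a rough numeric sketch of the decomposition (the cosine target, noise level, and linear model below are assumptions for illustration, not from the text), bias and variance can be estimated by refitting the same model on many resampled training sets:

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    # Assumed ground-truth function for the demo
    return np.cos(1.5 * np.pi * x)


x_test = np.linspace(0, 1, 50)

preds = []
for _ in range(200):  # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = true_fn(x) + rng.normal(scale=0.1, size=30)  # noise: a property of the data
    coeffs = np.polyfit(x, y, deg=1)                 # linear model (too simple here)
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)
# Squared bias: error of the average prediction vs. the true function
bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
# Variance: how much predictions fluctuate across training sets
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}")
```

Because a straight line cannot capture the curvature of the cosine, the squared bias dominates the variance here, which is the signature of underfitting.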

### Example: Underfit/Overfit

• We want to approximate a cosine function with models that have varying polynomial features.

• A linear function (polynomial with degree 1) is not sufficient (underfitting).

• A polynomial of degree 4 approximates the true function almost perfectly.

• Higher-degree models overfit the training data (they learn the noise of the training data).

• We compute the mean squared error (MSE) on the validation set: the higher the MSE, the less likely the model is to generalize correctly from the training data.
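A minimal sketch of this experiment (the sample sizes, noise level, and target function are assumptions chosen for illustration): fit polynomial pipelines of degree 1, 4, and 15 and compare their validation MSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def sample(n):
    # Noisy observations of an assumed cosine target
    x = rng.uniform(0, 1, n)
    return x[:, None], np.cos(1.5 * np.pi * x) + rng.normal(scale=0.1, size=n)


X_train, y_train = sample(40)
X_val, y_val = sample(40)

mses = {}
for degree in (1, 4, 15):
    # Polynomial regression: expand features, then fit a linear model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    mses[degree] = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree:2d}: validation MSE = {mses[degree]:.3f}")
```

Degree 1 underfits (large MSE from bias), degree 4 tracks the cosine closely, and degree 15 has enough capacity to chase the training noise.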

### Validation Curve

• To validate a model we need a scoring function, e.g. accuracy for classifiers. The proper way to choose multiple hyperparameters is grid search or a similar method.

• If we optimize the hyperparameters based on a validation score, that score becomes biased and is no longer a good estimate of the generalization error. To get a proper estimate, we have to compute the score on a separate test set.

• However, it is sometimes helpful to plot the influence of a single hyperparameter on the training & validation scores to learn if an estimator is overfitting or underfitting.

• If both the training and validation scores are low, the estimator is underfitting. If the training score is high and the validation score is low, it is overfitting; otherwise it is working well. A low training score combined with a high validation score is usually not possible.
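scikit-learn provides `validation_curve` for exactly this kind of plot. A minimal sketch on the digits dataset, varying SVC's `gamma` (the parameter range is chosen for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -1, 5)  # gamma values to sweep

# Cross-validated training and validation scores for each gamma
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5,
)
for g, tr, va in zip(param_range,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"gamma={g:.0e}  train={tr:.3f}  valid={va:.3f}")
```

Very small `gamma` yields low training and validation scores (underfitting); very large `gamma` yields a near-perfect training score but a poor validation score (overfitting).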

### Learning Curve

• A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It shows how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

• For the naive Bayes, both the validation score and the training score converge to a value that is quite low with increasing size of the training set. Thus, we will probably not benefit much from more training data.

• In contrast, for small amounts of data, the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.
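A sketch of this comparison using scikit-learn's `learning_curve` (the digits dataset and the `gamma` value are assumptions mirroring the scenario described above):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

results = {}
for name, est in [("GaussianNB", GaussianNB()), ("SVC(rbf)", SVC(gamma=0.001))]:
    # Scores at 5 training-set sizes between 10% and 100% of the data
    sizes, train_scores, valid_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
        shuffle=True, random_state=0,
    )
    results[name] = valid_scores.mean(axis=1)
    print(name)
    for n, tr, va in zip(sizes,
                         train_scores.mean(axis=1),
                         valid_scores.mean(axis=1)):
        print(f"  n={n:4d}  train={tr:.3f}  valid={va:.3f}")
```

The naive Bayes scores plateau at a modest level (more data will not help much), while the SVM's validation score keeps climbing toward its high training score as samples are added.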

### Example: Learning Curve Analysis

• Upper left: the learning curve of a naive Bayes classifier on the digits dataset. Both the training and cross-validation scores end up fairly low.

• However, the shape of the curve is very common: the training score is high at the beginning and decreases; the cross-validation score is low at the beginning and increases.

• Upper right: the learning curve of an SVM with RBF kernel. The training score stays near the maximum and the validation score increases with more training samples.

• 2nd row: training (fit) times vs. training set sizes — the scalability of the models.

• 3rd row: validation scores vs. fit times — how much performance each unit of training time buys.
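With `return_times=True`, `learning_curve` also returns the fit and score times that these last two rows are built from (the dataset and estimator below are assumptions matching the example above):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# return_times=True adds per-fold fit and score times to the output
sizes, train_scores, valid_scores, fit_times, score_times = learning_curve(
    SVC(gamma=0.001), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, return_times=True,
)

# 2nd row of the figure: mean fit time vs. training set size (scalability)
for n, t in zip(sizes, fit_times.mean(axis=1)):
    print(f"n={n:4d}  mean fit time = {t:.4f}s")

# 3rd row of the figure: validation score vs. mean fit time (performance)
for t, va in zip(fit_times.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"fit time {t:.4f}s -> valid score {va:.3f}")
```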