### Cross Validation¶

• Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model would just repeat the labels of the samples that it has just seen - but fail to predict anything useful on yet-unseen data. This is called overfitting.

• When performing a (supervised) machine learning experiment, reserve part of the available data as a test set (X_test, y_test). The best parameters can be determined by grid search techniques.

• Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) helper function to quickly split a dataset into training & test subsets:
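A minimal sketch of such a split, using the iris dataset (the test_size and random_state values here are illustrative, not from the text):

```python
# Split the iris dataset into training & test subsets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0  # hold out 40% as the test set
)
print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
```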

• There is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. Knowledge about the test set can “leak” into the model.

• Yet another part of the dataset can be reserved as a validation set to avoid this problem. Training proceeds on the training set, preliminary evaluation is done on the validation set, and when the experiment seems successful, final evaluation is done on the test set.

• Partitioning datasets like this drastically reduces the number of samples available for learning; cross-validation ("CV") solves this problem. A test set is still reserved for final evaluation, but the validation set is no longer needed. The basic approach (k-fold CV) splits the training data into k subsets. For each of the k "folds":

• Train using $k-1$ folds as training data.
• Validate the result on the remainder (use it as test set to compute a performance measure.)
• The performance measure is then the average of the values computed in the loop. This can be computationally expensive, but it doesn't waste data - a major advantage in problems such as inverse inference where the number of samples is very small.
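The k-fold loop described above can be sketched directly with KFold (the SVC estimator and k=5 are illustrative assumptions, not from the text):

```python
# Manual k-fold loop: train on k-1 folds, validate on the held-out fold.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="linear", C=1).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
mean_score = np.mean(scores)  # average of the k per-fold scores
```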

### cross_val_score & Metrics¶

• The simplest way to do CV is to call cross_val_score on the estimator object & dataset.

• cv determines the number of "folds". If cv is an integer, cross_val_score uses KFold, or StratifiedKFold for classifiers.

• Use the scoring parameter to specify the scoring method. The available options are listed in the scikit-learn User Guide.

• In this case (using the Iris dataset), the accuracy (default) and F1-score metrics are nearly equal. This is partially due to the Iris samples being balanced across target classes.

• Other CV methods are available by passing an iterator, either off-the-shelf or custom.
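A sketch comparing the default accuracy scorer with the macro-averaged F1 metric on iris (the estimator choice is an assumption):

```python
# cross_val_score with the default scorer vs an explicit scoring= option.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1)
acc = cross_val_score(clf, X, y, cv=5)                     # default scorer: accuracy
f1 = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")  # macro-averaged F1
# For balanced classes like iris, the two arrays come out nearly equal.
```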

### cross_validate (vs cross_val_score)¶

• cross_validate allows using multiple metrics. It returns both test scores and timing values in a dict.

• For single metrics (where scoring is a string, callable or None), it returns these keys: ['test_score', 'fit_time', 'score_time'].

• For multiple metrics, it returns these keys: ['test_<scorer1_name>', 'test_<scorer2_name>', ..., 'fit_time', 'score_time']

• return_train_score=False by default to save computation time. Set it to True to return training set scores too.

• Use return_estimator=True to retain the estimator trained on each fit.

• Multiple metrics can be specified as a list, tuple or set of predefined scorer names.
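A sketch of cross_validate with two metrics (the scorer names are standard scikit-learn scorer strings; the estimator is an assumption):

```python
# cross_validate returns a dict keyed per metric, plus timing values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
res = cross_validate(
    SVC(kernel="linear"), X, y, cv=5,
    scoring=["accuracy", "f1_macro"],  # multiple metrics as a list
    return_train_score=True,           # also report training-set scores
)
sorted(res)  # ['fit_time', 'score_time', 'test_accuracy', 'test_f1_macro',
             #  'train_accuracy', 'train_f1_macro']
```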

### cross_val_predict (CVP)¶

• Similar to cross_val_score, but returns, for each element in the input, the prediction that was obtained when that element was in the test set.

• CVP can only be used with CV strategies that assign all elements to a test set exactly once.

• CVP is appropriate if you are viewing predictions from different models, or are using predictions from one estimator to train another (i.e. "model blending" in ensembles).
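A sketch of cross_val_predict on iris (the estimator is an assumption):

```python
# cross_val_predict returns one out-of-fold prediction per input sample.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
preds = cross_val_predict(SVC(kernel="linear"), X, y, cv=5)
acc = accuracy_score(y, preds)  # one way to summarize the out-of-fold predictions
```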

### Example: ROC classifier metrics with CV¶

• ROC curves display true positive rate (TPR) on the Y axis, and false positive rate (FPR) on the X axis.

• The top left corner is the "ideal" point: FPR = 0 and TPR = 1. This means that a larger area under the curve (AUC) is usually better.

• The “steepness” of ROC curves is important. The goal is to maximize TPR while minimizing FPR.

• This example shows the ROC curve of different K-fold CV iterations. We can calculate the mean area under curve, and see the variance from different subsets. It shows how the classifier output is affected by changes in training data, and how different the splits generated by K-fold CV are from one another.

### Example: Text Feature Evaluation Pipeline with CV¶

• Using the 20 newsgroups dataset. Adjust the number of categories by giving their names to the dataset loader, or pass None to get all 20.

### Example: Prediction Plots with CV¶

• Diabetes toy dataset; linear regression model.

### Example: Nested vs Non-Nested CV¶

• Iris toy dataset.
• Model selection without nested CV uses the same data, so information may “leak” into the model and overfit the data. The magnitude of the problem is mostly dependent on dataset size & model stability.

• Nested CV effectively uses a series of train/validation/test set splits. In the inner loop (here: GridSearchCV), the score is found by fitting a model to each training set, then maximized in selecting (hyper)parameters over the validation set. In the outer loop (here: cross_val_score), generalization error is found by averaging test set scores over several dataset splits.

• The example uses a support vector classifier (SVC) with a non-linear kernel to build the model with optimized hyperparameters by grid search. We compare the performance of non-nested and nested CV strategies by taking the difference between their scores.

### KFold¶

• KFold divides the samples into $k$ groups of equal size (if possible), called folds. The model is learned using $k-1$ folds, and the remaining fold is used for testing.
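A toy sketch on four samples (without shuffling, KFold simply slices the data in order):

```python
# KFold with k=2 on four samples: each fold serves once as the test set.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(4)
splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=2).split(X)]
print(splits)  # [([2, 3], [0, 1]), ([0, 1], [2, 3])]
```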

### Repeated KFold¶

• RepeatedKFold repeats K-Fold n times. Use it to produce different splits in each of n repetitions.

Example of 2-fold K-Fold repeated 2 times:
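A sketch of that example (the random_state value is an arbitrary choice):

```python
# 2-fold KFold repeated twice yields 4 train/test splits in total.
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(4)
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
pairs = list(rkf.split(X))
len(pairs)  # 4 splits: the two repetitions use different shuffles
```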

### Leave One Out (LOO)¶

• Each learning set is created by taking all the samples except one (the test set). For $n$ samples we have $n$ training sets and $n$ different test sets. This method does not waste data as only one sample is removed from the training set.

• Compared with $k$-fold cross validation, LOO builds $n$ models from $n$ samples instead of $k$ models, where $n > k$. Each model is trained on $n-1$ samples, rather than $(k-1)n/k$.

• LOO is more computationally expensive and often yields a high-variance test error estimate. Since $n-1$ of the $n$ samples are used to build each model, the models constructed from the folds are virtually identical to each other and to the model built from the entire training set.

• If the learning curve is steep for the training size in question, then 5- or 10-fold cross validation can overestimate the generalization error.

• In general, 5- or 10-fold cross validation is preferred to LOO.
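A toy sketch of LOO on four samples:

```python
# LeaveOneOut: n samples produce n splits, each holding out a single sample.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(4)
splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
print(splits[0])  # ([1, 2, 3], [0])
```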

### Leave P Out (LPO)¶

• LPO creates all possible training & test sets by removing $p$ samples from the complete set, producing ${n \choose p}$ training/test pairs from $n$ samples.

• Unlike LOO and KFold, the test sets will overlap for $p>1$.

### Shuffle & Split¶

• Generates a user-defined number of independent training / test splits. Samples are first shuffled and then split.

• You can control the randomness for reproducibility by seeding the random_state pseudo random number generator.
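A sketch (the split sizes follow from test_size; the random_state value is arbitrary):

```python
# ShuffleSplit: a user-defined number of independent shuffled splits.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = list(ss.split(X))  # 5 splits; each has 7 train and 3 test indices
```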

### Class label-based stratification¶

• Some classification problems have large class distribution imbalances - for instance, several times more negative samples than positive samples.

• In such cases use stratified sampling to ensure relative class frequencies are preserved in each train and validation fold.

### Stratified KFold¶

• Notice how StratifiedKFold preserves class ratios in the training & test sets.
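A toy sketch with a 40/60 class imbalance (the data is fabricated for illustration):

```python
# StratifiedKFold keeps the 40/60 class ratio in every test fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = np.array([0] * 4 + [1] * 6)  # imbalanced labels: 4 of class 0, 6 of class 1
skf = StratifiedKFold(n_splits=2)
ratios = [np.bincount(y[test]).tolist() for _, test in skf.split(X, y)]
print(ratios)  # [[2, 3], [2, 3]] - each test fold preserves the class ratio
```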

### Stratified Shuffle Split¶

• Creates splits by preserving the same percentage for each target class as in the complete set.

### Cross Validation on "Grouped" Data¶

• The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.

• Data groupings are domain specific, for example medical data from multiple patients with multiple samples taken per patient. In this case the patient id for each sample will be its group identifier.

• We want to know if a model trained on a specific set of groups generalizes well to unseen groups. To measure this, we need to ensure that all samples in the validation fold come from groups that are not represented in the paired training fold.

### GroupKFold¶

• Ensures the same group is not represented in both testing and training sets.

• For example, if data is obtained from different subjects with several samples per subject, and the model can learn from highly person-specific features, it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting situation.
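A toy sketch with three subjects, two samples each (the data is fabricated for illustration):

```python
# GroupKFold: no group appears in both the training and the test fold.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(6)
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. a subject id per sample
splits = list(GroupKFold(n_splits=3).split(X, y, groups=groups))
disjoint = all(set(groups[te]).isdisjoint(groups[tr]) for tr, te in splits)
# disjoint is True: test-fold groups never leak into the paired training fold
```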

### Leave One Group Out (LOGO)¶

• Reserves samples according to a third-party-provided array of integer groups. This group information can be used to encode arbitrary domain-specific cross-validation folds.

• Each training set thus contains all the samples except the ones related to a specific group.

• For example, in the cases of multiple experiments, LOGO can be used to create a cross-validation based on the different experiments.

• Another common application is to use time information: for example, groups could be the year of collection of the samples and thus allow for cross-validation against time-based splits.

### Leave P Groups Out (LPGO)¶

• Similar to LOGO, but removes the samples from $P$ groups for each training/test set.

### Group Shuffle Split (GSS)¶

• Acts as a combination of ShuffleSplit and LPGO. It generates a sequence of random partitions in which a subset of groups are held out for each split.

• This method is useful when the behavior of LPGO is desired, but the number of groups is very large (generating all possible partitions with groups withheld would be prohibitively expensive). GSS provides a random sample (with replacement) of the train / test splits generated by LPGO.

### Predefined Split Methods¶

• For some datasets, a predefined split of the data already exists. For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
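A sketch using scikit-learn's PredefinedSplit with the test_fold convention described above:

```python
# PredefinedSplit: -1 keeps a sample in training; 0 assigns it to fold 0.
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.array([0, 0, -1, -1])  # samples 0 & 1 form the validation set
ps = PredefinedSplit(test_fold)
splits = [(tr.tolist(), te.tolist()) for tr, te in ps.split()]
print(splits)  # [([2, 3], [0, 1])] - a single predefined split
```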

### Using CV Iterators as Dataset Splitters¶

• Note the convenience function train_test_split is a wrapper around ShuffleSplit - it only allows for stratified splitting (using class labels) and cannot account for groups.

• To perform the split, use the indices for the train and test subsets yielded by the generator output by the split method of the cross-validation splitter.

### Time Series Split (TSS)¶

• Time series data contains correlation between observations that are near in time (autocorrelation).

• Classical CV techniques such as KFold and ShuffleSplit assume the samples are i.i.d., and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalization error).

• Therefore, it is important to evaluate time series data on the “future” observations least like those that are used to train the model.

• TSS returns the first $k$ folds as the training set and the $(k+1)$-th fold as the test set.

• Unlike classic CV methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.

• This method is used to cross-validate samples that are observed at fixed time intervals.
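A toy sketch on six time-ordered samples (the fold sizes follow from n_splits):

```python
# TimeSeriesSplit: each training set is a superset of the previous one.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6)
splits = [(tr.tolist(), te.tolist())
          for tr, te in TimeSeriesSplit(n_splits=3).split(X)]
print(splits)
# [([0, 1, 2], [3]), ([0, 1, 2, 3], [4]), ([0, 1, 2, 3, 4], [5])]
```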

### Permutation Testing¶

• permutation_test_score offers another way to evaluate classifier performance. It returns a permutation-based p-value (how likely an observed performance of the classifier would be obtained by chance).

• The null hypothesis in this test is that the classifier fails to leverage any statistical dependency between the features and the labels to make correct predictions on left-out data.

• permutation_test_score generates a null distribution by calculating n_permutations different permutations of the data. In each permutation the labels are randomly shuffled, thereby removing any dependency between the features and the labels.

• The p-value output is the fraction of permutations for which the average CV score is better than the CV score using the original data. For reliable results, n_permutations should typically be greater than 100 and cv between 3 and 10 folds.

• A low p-value indicates the dataset contains a real dependency between features and labels - and the classifier was able to use this to obtain good results.

• A high p-value could be due to a lack of dependency between features and labels (there is no difference in feature values between the classes) or because the classifier was not able to use the dependency in the data. In the latter case, using a more appropriate classifier that is able to utilize the structure in the data, would result in a low p-value.

• A classifier trained on a high dimensional dataset with no structure may still perform better than expected on CV, just by chance. This can happen with small datasets. permutation_test_score indicates whether the classifier has found a real class structure and can help in evaluating its performance.

• This test can produce low p-values even if there is only a weak structure in the data, because in the corresponding permutated datasets there is absolutely no structure. This test is therefore only able to show when the model reliably outperforms random guessing.

• permutation_test_score is computed using brute force and internally fits (n_permutations + 1) * n_cv models. It is therefore only tractable with small datasets for which fitting an individual model is very fast.
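A sketch on iris (n_permutations=30 is deliberately small to keep the sketch fast; the note above recommends more than 100 for reliable results):

```python
# permutation_test_score: CV score on the true labels vs a null distribution.
from sklearn.datasets import load_iris
from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    SVC(kernel="linear"), X, y, cv=5, n_permutations=30, random_state=0
)
# score: CV score on the original labels
# perm_scores: one CV score per shuffled-label permutation (null distribution)
# pvalue: fraction of permutations scoring at least as well as the original
```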

### Example: Permutation testing of Classification Performance¶

• Plot a histogram of the permutation scores (the null distribution).
• Original data: the red line indicates the score obtained by the classifier on the original data. The score is much better than those obtained with permuted data, and the p-value is thus very low. This indicates a low likelihood that such a good score would be obtained by chance alone, providing evidence that the iris dataset contains a real dependency between features and labels, which the classifier was able to utilize.
• Random data: the null distribution is also plotted for randomized data. The permutation scores are similar to those obtained on the original iris dataset, because the permutation always destroys any feature-label dependency. The score obtained on the randomized data itself, though, is very poor, resulting in a large p-value and confirming that there was no feature-label dependency in the original data.