### Metrics/Scoring

• Three scikit-learn APIs of note (sketched below):

• estimator score methods: provide a default evaluation criterion.
• scoring parameter: used by cross-validation tools.
• metric functions: used to compute prediction error in specific situations.
• dummy estimators provide baseline metric values for random predictions.

• pairwise metrics measure distances or affinities between pairs of samples; they are not estimators or predictors.
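
• A minimal sketch of the three APIs, using an SVC on iris (the estimator choice here is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(random_state=0).fit(X, y)

clf.score(X, y)                                 # 1) estimator score method (default metric)
cross_val_score(clf, X, y, scoring="f1_macro")  # 2) scoring parameter used by CV tools
f1_score(y, clf.predict(X), average="macro")    # 3) metric function called directly
```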

### Accuracy score

• returns the fraction (default) or count (normalize=False) of correct predictions. Returns subset accuracy for multilabel classification. 1.0 & 0.0 equal perfect and worst-case scores respectively.
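
• A sketch with toy labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

accuracy_score(y_true, y_pred)                   # 0.5 (fraction correct)
accuracy_score(y_true, y_pred, normalize=False)  # 2   (count of correct predictions)
```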

### Top K Accuracy

• A prediction is correct as long as the true label matches one of the $k$ highest predicted scores. accuracy_score is the special case k=1.

• Covers binary & multiclass classification problems (but not multilabel).
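
• A sketch with k=2 (a prediction counts as correct when the true class is among the two highest scores):

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2],   # per-class scores, one row per sample
                    [0.3, 0.4, 0.2],
                    [0.2, 0.4, 0.3],
                    [0.7, 0.2, 0.1]])

top_k_accuracy_score(y_true, y_score, k=2)  # 0.75 (3 of 4 true labels land in the top 2)
```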

### Balanced Accuracy

• Avoids inflated performance estimates on imbalanced datasets; it is the macro-average of per-class recall.
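
• A sketch on an imbalanced toy set:

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]

balanced_accuracy_score(y_true, y_pred)  # 0.625 = mean(recall_0=0.75, recall_1=0.5)
```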

### Cohen's kappa

• Designed to compare labelings by different human annotators - not a classifier versus a ground truth.

• The kappa score is a number in [-1..+1]. Scores above 0.8 are generally considered good agreement; zero or lower means no agreement (practically random labels).

• Applicable for binary or multiclass problems - not for multilabel problems (except by manually computing a per-label score) and not for more than two annotators.
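
• A sketch comparing two annotators' labelings:

```python
from sklearn.metrics import cohen_kappa_score

labeler1 = [2, 0, 2, 2, 0, 1]
labeler2 = [0, 0, 2, 2, 0, 2]

cohen_kappa_score(labeler1, labeler2)  # ~0.43 (moderate agreement)
```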

### Confusion matrix

• Each entry ($i$,$j$) is the number of observations actually in group $i$, but predicted to be in group $j$.
• normalize reports ratios instead of counts, normalizing over each row (normalize='true'), each column (normalize='pred'), or the entire matrix (normalize='all').
• To get counts of true negatives, false positives, false negatives and true positives in binary problems:
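
```python
from sklearn.metrics import confusion_matrix

# toy labels for illustration; rows = actual class, columns = predicted class
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

# binary case: ravel the 2x2 matrix into the four counts
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
```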

### Classification Report

• Builds a text report of the main classification metrics.
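
• A sketch:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# prints per-class precision, recall, f1-score & support, plus averages
print(classification_report(y_true, y_pred, target_names=["class 0", "class 1", "class 2"]))
```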

### Hamming Loss

• Returns the average Hamming loss: the fraction of individual labels that are incorrectly predicted (the normalized Hamming distance between two sets of labels).
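
• A sketch (one label in four differs, so the loss is 0.25):

```python
import numpy as np
from sklearn.metrics import hamming_loss

hamming_loss([2, 2, 3, 4], [1, 2, 3, 4])                    # 0.25
hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))  # 0.75 (multilabel case)
```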

### Precision, Recall & F-measure

• precision: the ability to avoid labeling a negative sample as positive: $\text{precision} = \frac{tp}{tp + fp}$

• recall: the ability to find all positive samples: $\text{recall} = \frac{tp}{tp + fn}$

• f-measure: the weighted harmonic mean of precision & recall: $F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \text{precision} + \text{recall}}$
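
• A sketch (on the toy labels below, tp=2, fp=1, fn=1):

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

precision_score(y_true, y_pred)        # 2/3
recall_score(y_true, y_pred)           # 2/3
fbeta_score(y_true, y_pred, beta=0.5)  # beta < 1 weights precision more than recall
```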

### Precision Recall Curve

• Computes precision-recall pairs from the ground-truth labels and the classifier's scores by varying a decision threshold.

• Only applicable to binary problems.
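
• A sketch with toy scores:

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# one precision/recall pair per decision threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
```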

### Average Precision Score

• Returns average precision from prediction scores. Values range over [0..1]; higher is better.

• Only applicable to binary classification & multilabel indicator formats.
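
• A sketch (same toy scores as the curve above):

```python
from sklearn.metrics import average_precision_score

average_precision_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # ~0.83
```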

### Binary vs Multiclass vs Multilabel Classification

• Binary classification outcomes:
• TP (true positive) = correct result
• FP (false positive) = unexpected result
• FN (false negative) = missing result
• TN (true negative) = correct absence of result
• Multiclass/multilabel classification: precision, recall & F-measure can be applied to each label independently. Several metrics combine results across labels via the average parameter, as sketched below.
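
• A sketch of the average parameter on a multiclass problem:

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

precision_score(y_true, y_pred, average="macro")  # unweighted mean over classes
precision_score(y_true, y_pred, average="micro")  # global tp / (tp + fp)
precision_score(y_true, y_pred, average=None)     # per-class scores, no averaging
```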

### Jaccard Similarity

• Returns an average of the Jaccard similarity coefficient between pairs of label sets.

• The Jaccard coefficient of the $i$th sample, with ground truth label set $y_i$ and predicted label set $\hat{y}_i$, is $J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}$.
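
• A sketch on multilabel indicator data:

```python
import numpy as np
from sklearn.metrics import jaccard_score

y_true = np.array([[0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 1, 1], [1, 0, 0]])

jaccard_score(y_true[0], y_pred[0])               # 2/3: |intersection| / |union| for sample 0
jaccard_score(y_true, y_pred, average="samples")  # mean of the per-sample coefficients
```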

### Hinge Loss

• A loss that only penalizes prediction errors (margin violations). Used in maximum-margin classifiers such as SVMs.
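
• A sketch using a linear SVM's decision function (toy data):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import hinge_loss

X, y = [[0], [1]], [-1, 1]
est = LinearSVC(random_state=0).fit(X, y)

pred_decision = est.decision_function([[-2], [3], [0.5]])
hinge_loss([-1, 1, 1], pred_decision)  # ~0.3
```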

### Log Loss

• Also called logistic regression loss or cross-entropy loss.

• Commonly used in multinomial logistic regression, neural nets, and some expectation-maximization problems.

• Binary classification: log loss per sample = the negative log-likelihood of the classifier given a true label: $L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y\log(p)+(1-y)\log(1-p))$

• Multiclass classification: using samples coded in a 1-of-K binary indicator matrix: log loss of entire set: $L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$
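
• A sketch (each row of y_prob holds per-class probabilities):

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
y_prob = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.01, 0.99]]

log_loss(y_true, y_prob)  # ~0.17; confident wrong predictions are penalized heavily
```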

### Matthews Correlation Coefficient

• Given TP = #true positives, FP = #false positives, TN = #true negatives, FN = #false negatives:

• binary: $MCC = \frac{tp \times tn-fp \times fn}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}}.$
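
• A sketch:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]

matthews_corrcoef(y_true, y_pred)  # ~-0.33
```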

### Confusion Matrix (Multilabel)

• Returns a class-wise (default) or sample-wise (samplewise=True) confusion matrix to evaluate classifier accuracy.
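
• A sketch on multilabel indicator data:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

multilabel_confusion_matrix(y_true, y_pred)                   # one 2x2 matrix per class
multilabel_confusion_matrix(y_true, y_pred, samplewise=True)  # one 2x2 matrix per sample
```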

### ROC Curve

• Plots True Positive Rate (TPR) vs False Positive Rate (FPR) for binary classifiers.
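
• A sketch with toy scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
roc_auc_score(y == 2, scores)  # 0.75, the area under that curve
```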

### Example: Area Under ROC, Multiclass

• roc_auc_score can be used for multiclass classification.

• Multiclass OvO compares every unique class pair. Below: calculate AUC using OvR & OvO schemes, report macro average, report prevalence-weighted average.
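
• A sketch on iris (the classifier choice here is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
y_prob = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

roc_auc_score(y, y_prob, multi_class="ovr", average="macro")     # OvR, macro average
roc_auc_score(y, y_prob, multi_class="ovo", average="macro")     # OvO, macro average
roc_auc_score(y, y_prob, multi_class="ovr", average="weighted")  # prevalence-weighted
```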

### Detection Error Tradeoff (DET) Curve

• Plots error rates for binary classifiers: false rejection rate vs false acceptance rate. The axes are scaled non-linearly, which yields curves that are more linear than ROC curves.
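
• A sketch:

```python
from sklearn.metrics import det_curve

# false positive (acceptance) & false negative (rejection) rates per threshold
fpr, fnr, thresholds = det_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```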

### Zero One Loss

• Finds the sum or average of the 0-1 classification loss (the fraction of imperfect predictions). The function normalizes by default; use normalize=False to get the sum instead.

• In multilabel classification, a sample incurs zero loss only when its entire label set strictly matches the predictions; any error counts the whole sample as a miss.
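
• A sketch:

```python
import numpy as np
from sklearn.metrics import zero_one_loss

y_true = [2, 2, 3, 4]
y_pred = [1, 2, 3, 4]

zero_one_loss(y_true, y_pred)                   # 0.25 (average)
zero_one_loss(y_true, y_pred, normalize=False)  # 1    (count)

# multilabel: the first sample's label set doesn't match exactly, so it counts as a miss
zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))  # 0.5
```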

### Brier Score

• Returns the Brier score for binary classes. Values lie in [0..1]; the smaller the value, the more accurate the prediction.

• Defined as the mean squared difference between the actual outcome $y_i$ and the predicted probability estimate $p_i = \Pr(y_i = 1)$: $BS = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1}(y_i - p_i)^2$
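
• A sketch:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])

brier_score_loss(y_true, y_prob)  # 0.055 = mean((y_true - y_prob)**2)
```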