### Semi-supervised Learning

• Used when some of your training data is unlabeled.
• Scikit-Learn expects a specific identifier to mark unlabeled points during training: the integer -1 (see the sketch after this list).
• Semi-supervised algorithms need to make assumptions about the distribution of the dataset in order to be effective (Wikipedia).
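A minimal sketch of the -1 convention, using a toy 1-D dataset and LabelSpreading; the data values and knn settings here are illustrative, not prescribed:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy data: two tight clusters; -1 marks the unlabeled samples
X = np.array([[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)
print(model.transduction_)  # inferred labels for every sample
```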

### Self-Training Classifier

• Based on Yarowsky's algorithm.
• Can wrap any classifier that implements predict_proba (passed as the estimator parameter; older releases called it base_estimator). On each iteration, the wrapped classifier predicts labels for the unlabeled data, and a subset of those predictions is added to the labeled dataset.
• The subset choice is controlled by the criterion: either a threshold on prediction probabilities, or the k_best most confident predictions.
• The labels used in the final fit (transduction_) and the iteration in which each sample was labeled (labeled_iter_) are exposed as attributes.
• max_iter caps the number of self-training iterations, and therefore execution time (see the sketch after this list).
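A sketch of the workflow on iris with most labels hidden; the SVC base classifier, threshold, and random seed are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
y_semi = np.copy(y)
y_semi[rng.rand(len(y)) < 0.7] = -1  # hide ~70% of the labels

# The base classifier must implement predict_proba
base = SVC(probability=True, gamma="auto")
clf = SelfTrainingClassifier(base, threshold=0.75, max_iter=10)
clf.fit(X, y_semi)

print(clf.transduction_[:5])  # labels used in the final fit
print(clf.labeled_iter_[:5])  # iteration in which each sample was labeled
```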

### Example: Threshold vs Self-Training Performance

• Labels are deleted from the breast_cancer dataset so that only 50 of the 569 samples are labeled.
• Top graph: number of labeled samples and accuracy score vs. threshold value.
• Bottom graph: the iteration at which each sample was labeled.
• Values are 3-fold cross-validated.
• At thresholds between 0.4 and 0.5, the classifier is learning from low-confidence labels, so accuracy is poor.
• At thresholds between 0.9 and 1.0, the classifier stops adding to its dataset because almost no predictions clear the threshold.
• Accuracy is optimal between these extremes, at a threshold of around 0.7.
• Manual cross-validation is required so that -1 isn't treated as a separate class when computing accuracy (see the sketch after this list).
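A sketch of that manual loop; the seed, base classifier, and threshold below are illustrative, and the key point is that scoring compares predictions against the true labels only, so -1 never enters the metric:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(42)
y_semi = np.full_like(y, -1)
labeled = rng.choice(len(y), size=50, replace=False)  # keep 50 labels
y_semi[labeled] = y[labeled]

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    clf = SelfTrainingClassifier(SVC(probability=True, gamma="auto"),
                                 threshold=0.7)
    clf.fit(X[train_idx], y_semi[train_idx])  # -1s stay in the training fold
    # Score against the true labels of the held-out fold only
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(np.mean(scores))
```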

### Example: Comparison of decision boundaries: Label Spreading, Self-Training & SVM

• Demonstrates that Label Spreading and Self-Training can learn reasonable decision boundaries even when only small amounts of labeled data are available (see the sketch below).
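A sketch of the setup (boundary plotting omitted), assuming two iris features and roughly 30% of labels kept; the gamma values are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # two features so the boundaries can be plotted
rng = np.random.RandomState(0)
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.7] = -1  # keep roughly 30% of the labels

models = {
    "Label Spreading (30% labels)": LabelSpreading().fit(X, y_30),
    "Self-Training (30% labels)": SelfTrainingClassifier(
        SVC(kernel="rbf", gamma=0.5, probability=True)).fit(X, y_30),
    "SVC (100% labels)": SVC(kernel="rbf", gamma=0.5).fit(X, y),
}
```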

### Label Propagation & Label Spreading

• LP & LS both construct a similarity graph over all items in the dataset.
• LP uses the raw similarity graph with no modifications, and "hard clamps" the input labels (i.e. $\alpha=0$).
• LS minimizes a loss function with regularization properties, which makes it more robust to noise. It uses a modified version of the original graph and normalizes edge weights via the normalized graph Laplacian matrix.
• Both models have two built-in kernels; the choice affects both scalability and performance (see the sketch after this list).
• rbf (keyword "gamma"): $\exp(-\gamma |x-y|^2), \gamma>0$
  • Returns a fully connected graph (a dense matrix).
  • The matrix size, plus a full matrix multiplication on each iteration, means this option can lead to long runtimes.
• knn (keyword "n_neighbors"): $1[x' \in kNN(x)]$
  • Returns a sparse matrix, so runtimes are much shorter.
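A sketch contrasting the two kernels on iris with half the labels hidden; the gamma and n_neighbors values are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
y_semi = np.copy(y)
y_semi[np.random.RandomState(0).rand(len(y)) < 0.5] = -1

# rbf: fully connected dense graph; O(n^2) memory plus a full matrix
# multiplication on every iteration
lp_rbf = LabelPropagation(kernel="rbf", gamma=0.5).fit(X, y_semi)

# knn: sparse graph, typically far cheaper on large datasets
lp_knn = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_semi)
```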

### Example: Label Propagation on Complex Structure

• One point in the outer circle is labeled "red" and one point in the inner circle "blue"; all other points are unlabeled, and the labels are propagated across both circles (see the sketch below).
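A sketch of the setup, assuming one labeled point per circle; the knn kernel and alpha value follow the shape of the scikit-learn example but should be treated as illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelSpreading

# shuffle=False keeps the outer circle first, the inner circle second
X, y = make_circles(n_samples=200, shuffle=False)
labels = np.full(200, -1)
labels[0] = 0    # one labeled "red" point on the outer circle
labels[-1] = 1   # one labeled "blue" point on the inner circle

model = LabelSpreading(kernel="knn", alpha=0.8)
model.fit(X, labels)
print((model.transduction_ == y).mean())  # fraction propagated correctly
```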

### Example: Digits Classification with Label Spreading

• The digits dataset has 1797 points; only 30 are labeled.
• Results are reported as a confusion matrix (see the sketch below).
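A sketch of the experiment, assuming the first 30 points keep their labels; the gamma and max_iter values are illustrative hyperparameters:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
n_labeled = 30
labels = np.full(len(y), -1)
labels[:n_labeled] = y[:n_labeled]  # keep labels for the first 30 points

model = LabelSpreading(gamma=0.25, max_iter=20)
model.fit(X, labels)

# Evaluate the propagated labels on the points that started unlabeled
predicted = model.transduction_[n_labeled:]
print(confusion_matrix(y[n_labeled:], predicted))
```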