Scikit-Learn Guides - Jupyter Notebooks
(These are HTML pages, converted using nbconvert. As such, they do not support Jekyll markup schemes.)
(Edits in progress. Not final.)
Getting Started
Estimator basics
Transformers & preprocessors
Model evaluation
Automatic parameter searches
Linear Models
Ordinary Least Squares (OLS)
Ridge regression
Lasso regression
Akaike & Bayes info criteria
Elastic Net regression
Least Angle (LARS) regression
OrthogonalMatchingPursuit (OMP)
BayesianRidge regression
Generalized Linear Regression (GLR)
GLR with Tweedie
Stochastic Gradient Descent (SGD) regressor & classifier
Passive Aggressive algos
RANSAC, Huber, Theil-Sen robustness algos
Polynomial regression
Nearest Neighbors (NNs)
Options
KNN vs Radius-based
Ball tree vs KD tree vs Brute Force
NearestCentroid
NeighborhoodComponentsAnalysis (NCA)
Decision Trees (DTs)
DT classifier
Graphviz
DT regressor
Multiple outputs
Complexity
ID3, C5.0, CART
Impurity functions (Gini, Entropy, Misclassification, MSE, MAE)
Minimal cost-complexity pruning
Decision Trees / Boosting
AdaBoost
Gradient Boosted DTs
Shrinkage vs Learning Rate
Subsampling
Histogram-based Gradient Boosting
Stacked Generalization
Multiclass & Multioutput Algorithms
Label Binarizer
One-vs-Rest classifier
Multilabel classifier
One-vs-One classifier
Output Code classifier
Multioutput classifier
Classifier Chains
Multi Output regressor
Regressor Chains
Feature Selection (FS)
Variance-based
Univariate
Recursive
Model-based
Impurity-based
Sequential
FS & pipelines
Calibration Curves
Using cross validation
Performance scores
Regressors
Multiclass support
Manifolds
Isomap
Locally Linear Embedding (LLE)
Modified LLE
Hessian LLE
Local Tangent Space Alignment (LTSA)
Multi Dimensional Scaling (MDS)
Random Tree Embedding
Spectral Embedding
t-distributed Stochastic Neighbor Embedding (t-SNE)
Neighborhood Components Analysis (NCA)
Clustering
K-Means
Affinity Propagation
Mean Shift
Spectral
Agglomerative
Dendrograms
DBSCAN
OPTICS
Birch
Clustering Metrics
rand_score
mutual_info_score
Homogeneity, completeness & v-measure
Fowlkes-Mallows score
Silhouette coefficient
Calinski-Harabasz index
Davies-Bouldin index
Contingency matrix
Pair confusion matrix
Biclustering
Spectral co-clustering
Spectral bi-clustering
metrics
Component Analysis / Matrix Factorization
Principal Component Analysis (PCA)
Incremental PCA
PCA with random SVD
PCA & sparse data
Kernel PCA
Truncated SVD (aka Latent Semantic Analysis, LSA)
Dictionary Learning
Factor Analysis (FA)
Independent Component Analysis (ICA)
Non-Negative Matrix Factorization (NNMF)
Latent Dirichlet Allocation (LDA)
Covariance
Empirical (observed) covariance
Shrunk covariance
Ledoit-Wolf (LW) shrinkage
Oracle approx shrinkage (OAS)
Precision matrix
Min covariance determinant (MCD) estimators
Mahalanobis distances
Novelty & Outlier Detection
Intro section
One-class SVM vs Elliptic Envelope vs Isolation Forest vs Local Outlier Factor
Novelties
Outliers
Cross Validation (CV)
Intro
cross_val_score
cross_validate
cross_val_predict
KFold, stratified KFold
Leave One Out (LOO)
Leave P Out (LPO)
CV on grouped data
Time series splits
Permutation testing
Visualizations
Hyperparameter Settings
Grid search
Randomized search
Successive Halving (SH)
Alternatives to brute-force search
Info criteria (AIC,BIC) regularization
Classifier Metrics
Accuracy
Top K accuracy
Balanced accuracy
Cohen's kappa
Confusion matrix
Classification report
Hamming loss
Precision, recall, F-measure
Precision-recall curve
Average precision
Jaccard similarity
Hinge loss
Log loss
Matthews correlation coefficient
Receiver operating characteristic (ROC)
Detection error tradeoff (DET)
Zero-one loss
Brier score
Regression Metrics
Explained variance
Max error
Mean absolute error (MAE)
Mean squared error (MSE)
Mean squared log error (MSLE)
Mean absolute pct error (MAPE)
R2 (coefficient of determination)
Tweedie deviance error
Feature Extraction (Text)
Bag of Words (BoW)
Count Vectorizer
TfIdf Transformer
TfIdf Vectorizer
Decoding text files
The Hashing Trick
Hashing Vectorizer
Custom vectorizers
Preprocessing Techniques
Standard scaler
MinMax scaler
MaxAbs scaler
Robust scaler
Kernel centerer
Quantile transform
Power Map
Normalizer
Ordinal encoder
One Hot encoder
K Bins discretizer (aka binning)
Polynomial feature generation
Imputation Techniques
Simple (univariate)
Iterative (multivariate)
Nearest Neighbors
Missing Indicator
Kernel Approximations
Nystroem approximation
RBF sampler
Additive Chi-squared sampler
Skewed Chi-squared sampler
Polynomial sampler
Pairwise Operations
pairwise_distances
pairwise_kernels
Cosine similarity
Kernels: linear, polynomial, sigmoid, RBF, laplacian, chi-squared
Simple Datasets
Boston house prices (regression)
Iris (classification)
Diabetes (regression)
Digits (classification)
Linnerud (regression)
Wine (classification)
Breast cancer (classification)
fetch_olivetti_faces
fetch_20newsgroups
fetch_lfw_people (Labeled faces in the wild)
fetch_covtype (Forest covertype)
fetch_rcv1 (Reuters Newswire corpus)
fetch_kddcup99 (KDD CUP - intrusion detection)
fetch_california_housing
Artificial Data Generators
(classifications)
make_blobs
make_classification
make_gaussian_quantiles
make_circles
make_moons
(multilabel classifications)
make_multilabel_classification
make_hastie_10_2
make_biclusters
make_checkerboard
(regression)
make_regression
make_sparse_uncorrelated
make_friedman(1,2,3)
(manifolds)
make_s_curve
make_swiss_roll
(decompositions)
make_low_rank_matrix
make_sparse_coded_signal
make_spd_matrix (symmetric positive definite)
Other Example Datasets
load_sample_images
fetch_openml
Other API tools - pandas, scipy, numpy, scikit-image, imageio
Performance / Latency
Bulk vs Atomic mode
Validation overhead
#Features
Input datatypes
Feature extraction
Linear algebra - BLAS, LAPACK usage
Memory limits
Model reshaping