• A simple & efficient training algorithm
• Not the only one out there. For example, SGDClassifier(loss='log') is equivalent to Logistic Regression fitted via SGD.

### SGD Classification¶

• Accepts two input arrays: training data X of shape (#samples, #features) and targets (labels) y of shape (#samples,).
• fit_intercept tells the model whether to use an intercept (a biased hyperplane).
• decision_function (a method) returns the signed distance to the hyperplane (the dot product between the coefficients and the input sample, plus the intercept).
• loss sets the loss function for the model. options:

• loss="hinge" - linear support vector machine
• loss="modified_huber" - hinge loss (smoothed)
• loss="log" - logistic regression
• Using log and modified_huber loss functions enables the predict_proba method - which returns a vector of probability estimates per sample x: $P(y|x)$
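A minimal sketch of this API on made-up toy data (`loss="log"` follows the spelling used in these notes; recent scikit-learn releases call it `"log_loss"`):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: 4 samples, 2 features, binary labels.
X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])

# loss="log" -> logistic regression, which enables predict_proba.
clf = SGDClassifier(loss="log", fit_intercept=True, max_iter=1000, tol=1e-3)
clf.fit(X, y)

print(clf.predict([[2.5, 2.5]]))            # predicted label
print(clf.decision_function([[2.5, 2.5]]))  # signed distance to the hyperplane
print(clf.predict_proba([[2.5, 2.5]]))      # P(y|x) per class
print(clf.coef_, clf.intercept_)            # hyperplane coefficients and intercept
```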

• L1 & L2 norm penalties are set using penalty:

• penalty="l2": L2 norm penalty used on coef_.
• penalty="l1": L1 norm penalty used on coef_.
• penalty="elasticnet": convex combination of L1 & L2 norm penalties.
• The default is penalty="l2".

• l1_ratio controls the convex L1/L2 penalty combination.
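The penalty options map directly onto constructor arguments; a small sketch (the `l1_ratio` and `alpha` values here are illustrative, not recommendations):

```python
from sklearn.linear_model import SGDClassifier

# Elastic net: l1_ratio=0.15 means 15% L1 / 85% L2 (l1_ratio=0 is pure L2, 1 is pure L1).
clf = SGDClassifier(loss="hinge", penalty="elasticnet", l1_ratio=0.15, alpha=1e-4)
```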

### Multiclass Classification¶

• Implemented with a one-vs-all (OVA) scheme: for each of the K classes, a binary classifier is trained to separate it from the other (K-1) classes.
• The confidence score (the signed distance to the hyperplane) is found for each classifier; the class with the highest score is returned.
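A small sketch of the OVA behaviour on the iris dataset (3 classes, 4 features); the shapes in the comments show that one binary classifier is fitted per class:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)            # 3 classes
clf = SGDClassifier(max_iter=1000, tol=1e-3).fit(X, y)

print(clf.coef_.shape)                       # (3, 4): one row of coefficients per class
print(clf.intercept_.shape)                  # (3,)

scores = clf.decision_function(X[:1])        # one confidence score per class
print(clf.classes_[np.argmax(scores)])       # highest score wins -> predicted class
```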

### Weighted Classification¶

• SGDClassifier supports weighted classes (via class_weight) and instances (via sample_weight).
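A hedged sketch of both options (the weight values are illustrative only):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])

# Weight whole classes via the constructor (here class 1 counts 5x as much)...
clf = SGDClassifier(class_weight={0: 1.0, 1: 5.0})

# ...and/or weight individual samples at fit time.
clf.fit(X, y, sample_weight=np.array([1.0, 1.0, 2.0, 0.5]))
```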

### Averaged SGD¶

• SGDClassifier supports averaged SGD (ASGD) via average=True.
• ASGD computes the same updates as SGD, except that coef_ is set to the average coefficient values across all updates. (The same distinction happens with intercept_.)
• ASGD often tolerates a larger learning rate, which can translate into faster training.
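Enabling it is a single flag; a minimal sketch:

```python
from sklearn.linear_model import SGDClassifier

# average=True: coef_ / intercept_ are the averages over all SGD updates (ASGD).
clf = SGDClassifier(average=True)

# average can also be an int N: plain SGD until N samples have been seen, averaging afterwards.
clf = SGDClassifier(average=10)
```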

### Example: Solver comparison¶

• compares SGD, ASGD, Perceptron, Passive-Aggressive I/II, SAG

### SGD Regression¶

• Implementation that supports various loss functions & penalties
• Well suited for regression problems with >10K training samples
• Loss function options:
• loss="squared_loss": ordinary least squares regression
• loss="huber": huber loss for robust regression
• loss="epsilon_insensitive: linear support vector regression
• penalty controls regularization (same options as in classification)
• Averaged SGD (ASGD) is supported.
• Stochastic Average Gradient (SAG) is supported.
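A minimal SGDRegressor sketch on synthetic data (the `alpha` / `epsilon` values are illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)

# loss="huber" for robustness to outliers in the targets.
reg = SGDRegressor(loss="huber", penalty="l2", alpha=1e-4, epsilon=0.1,
                   max_iter=1000, tol=1e-3)
reg.fit(X, y)
print(reg.predict(X[:3]))
```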

### SGD & Sparse Data¶

• Built-in support for sparse data given in any scipy.sparse matrix format.
• For max efficiency, use CSR matrix format (scipy.sparse.csr_matrix)
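A short sketch showing that a CSR matrix can be passed to fit/predict directly:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

# CSR stores rows contiguously, which matches SGD's sample-by-sample access pattern.
X = csr_matrix(np.array([[0., 1., 0.],
                         [1., 0., 0.],
                         [0., 0., 2.],
                         [3., 0., 0.]]))
y = np.array([0, 0, 1, 1])

clf = SGDClassifier(max_iter=1000, tol=1e-3).fit(X, y)
print(clf.predict(csr_matrix([[0., 0., 1.5]])))
```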

### Computational Complexity¶

• Major advantage of SGD: training time grows roughly linearly with the number of training samples.
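• Stated more precisely (assuming $k$ epochs over $n$ samples with an average of $\bar{p}$ non-zero features per sample), the training cost is roughly $O(k\,n\,\bar{p})$, i.e., linear in the number of samples for a fixed number of epochs.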

### Stopping & Convergence¶

• SGD Classifier & Regressor methods support two ways to stop when a given level of convergence is reached:
• early_stopping=True: stopping criteria is based on the prediction score (score) found on the validation set.
• early_stopping=False: model is fitted on entire input dataset. Stopping is based on the objective function found on the training dataset.
• Stopping criteria is evaluated once per epoch. The algorithm stops when the criterion does not improve by n_iter_no_change consecutive times.
• The algorithm stops regardless after a max #iterations max_iter.
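Both modes are configured through constructor parameters; a hedged sketch (the specific values are illustrative):

```python
from sklearn.linear_model import SGDClassifier

# Stop on the score of a held-out validation set (10% of the training data here):
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, tol=1e-3, max_iter=1000)

# Or stop on the training objective (the default behaviour):
clf = SGDClassifier(early_stopping=False, n_iter_no_change=5, tol=1e-3, max_iter=1000)
```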

### Tips¶

• scale your data. Easily done using StandardScaler.
• Find a reasonable regularization term $\alpha$ with automatic parameter search (GridSearchCV or RandomizedSearchCV). Consider using a range similar to 10.0**-np.arange(1,7)
• SGD typically converges after observing ~10^6 training samples.
• If applying SGD to extracted features (for example, using PCA), consider scaling feature values by a constant c such that the average L2 norm of the training data equals one.
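A short sketch combining the first two tips: scale inside a pipeline and search $\alpha$ over the suggested range (iris is used only as a stand-in dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features, then grid-search alpha over 10^-1 ... 10^-6.
pipe = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
param_grid = {"sgdclassifier__alpha": 10.0 ** -np.arange(1, 7)}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```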