### Multi-layer Perceptrons (MLPs)

• Learns a function $f(\cdot): R^m \rightarrow R^o$ with $m$-dimensional inputs and $o$-dimensional outputs. Given a set of features $X$ and a target $y$, it can learn a non-linear approximation for either classification or regression.
• MLPs use hidden layers. Each neuron in a hidden layer transforms the previous layer's values with a weighted linear summation $w_1x_1 + w_2x_2 + ... + w_mx_m$, followed by a non-linear activation function $g(\cdot): R \rightarrow R$ - for example, the hyperbolic tangent.
• The output layer receives values from the last hidden layer.
• coefs_ is a list of weight matrices; the $i$-th matrix holds the weights between layer $i$ and layer $i+1$.
• intercepts_ is a list of bias vectors; the $i$-th vector holds the biases added to layer $i+1$.
• Can learn non-linear models.
• Can learn models in real time (aka "online") using partial_fit.
• MLPs with hidden layers have a non-convex loss function, so multiple local minima can occur; different random weight initializations can lead to different validation accuracy.
• Requires tuning several hyperparameters, such as the number of hidden neurons, layers, and iterations.
• Sensitive to feature scaling.
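The coefs_ / intercepts_ layout described above can be checked directly. A minimal sketch (the XOR-style toy data is invented for illustration):

```python
# Fit a small MLP and inspect the shapes of coefs_ and intercepts_.
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]  # toy data for illustration
y = [0, 1, 1, 0]

clf = MLPClassifier(hidden_layer_sizes=(5, 2), solver="lbfgs",
                    alpha=1e-5, random_state=1, max_iter=1000)
clf.fit(X, y)

# coefs_[i] is the weight matrix between layer i and layer i+1:
# input(2) -> hidden(5) -> hidden(2) -> output(1)
print([c.shape for c in clf.coefs_])        # [(2, 5), (5, 2), (2, 1)]
print([b.shape for b in clf.intercepts_])   # [(5,), (2,), (1,)]
```

Note the binary problem gets a single output unit, hence the final (2, 1) weight matrix.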

### Classification

• MLP trains on two arrays: a floating-point array $X$ of shape (n_samples, n_features) holding the training samples, and an array $y$ of class labels.
• MLP can fit a non-linear model; after fitting, clf.coefs_ contains the weight matrices.
• MLP trains by minimizing the cross-entropy loss, so it can return a vector of probability estimates $P(y|x)$ per sample (via predict_proba).
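A minimal classification sketch along those lines (toy data, hyperparameters chosen only for illustration):

```python
# Fit an MLPClassifier and get probability estimates via predict_proba.
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]  # two toy training samples
y = [0, 1]

clf = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)

print(clf.predict([[2., 2.], [-1., -2.]]))  # hard class labels
proba = clf.predict_proba([[2., 2.]])       # one P(y|x) row per sample
print(proba)                                # each row sums to 1
```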

### Multiclass & Multilabel Classification

• Multiclass classification is supported by using Softmax as the output function.
• Multilabel classification is supported. Each output passes through a logistic function; values $\geq$ 0.5 are rounded to 1, otherwise to 0.
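For the multilabel case, $y$ becomes a binary indicator matrix with one column per label. A sketch with invented toy data:

```python
# Multilabel fit: each sample can carry several labels at once.
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [[0, 1], [1, 1]]  # two labels per sample (binary indicator rows)

clf = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=(15,), random_state=1)
clf.fit(X, y)

# predict returns a (n_samples, n_labels) 0/1 indicator matrix,
# thresholding each per-label logistic output at 0.5.
pred = clf.predict([[1., 2.], [0., 0.]])
print(pred)
```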

### Example: MLP Classifier learning strategies, compared

• Plots training loss curves for several stochastic learning strategies (SGD variants, Adam).
• Uses smaller datasets to keep runtime reasonable.
• Results can heavily depend on learning_rate_init.
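The dependence on learning_rate_init can be seen by recording loss_curve_ for a few values; a rough sketch on the digits dataset (the two rates below are arbitrary picks):

```python
# Compare training loss curves for different initial learning rates.
import warnings
warnings.filterwarnings("ignore")  # silence ConvergenceWarning on the short run

from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

losses = {}
for lr in (0.001, 0.1):
    clf = MLPClassifier(solver="sgd", learning_rate_init=lr,
                        max_iter=20, random_state=0)
    clf.fit(X, y)
    losses[lr] = clf.loss_curve_  # one training-loss value per iteration

for lr, curve in losses.items():
    print(lr, curve[-1])  # final training loss for each rate
```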

### Regression

• The regressor trains an MLP using backpropagation with no activation function in the output layer (equivalently, the identity activation).
• It therefore uses squared error as the loss function; outputs are continuous values.
• Multi-output regression is supported.
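A regression sketch with a synthetic two-target problem (the targets are invented for illustration):

```python
# MLPRegressor with a multi-output target: predict() returns one
# column of continuous values per target.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
# Two synthetic continuous targets derived from the features:
Y = np.column_stack([X.sum(axis=1), X[:, 0] - X[:, 1]])

reg = MLPRegressor(hidden_layer_sizes=(50,), solver="lbfgs",
                   random_state=0, max_iter=2000)
reg.fit(X, Y)

pred = reg.predict(X[:5])
print(pred.shape)  # (5, 2): one row per sample, one column per target
```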

### Regularization

• Both the classifier and the regressor use the parameter alpha ($\alpha$) for L2 regularization. This helps prevent overfitting by penalizing large-magnitude weights.
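The shrinking effect of alpha can be observed directly: a larger penalty should produce a smaller total weight magnitude. A sketch on synthetic data (the two alpha values are arbitrary picks):

```python
# Compare total weight magnitude under weak vs. strong L2 penalty.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=100, random_state=0)

norms = {}
for alpha in (1e-5, 10.0):
    clf = MLPClassifier(alpha=alpha, solver="lbfgs",
                        random_state=0, max_iter=2000)
    clf.fit(X, y)
    # Sum of absolute weights across all layers:
    norms[alpha] = sum(np.abs(w).sum() for w in clf.coefs_)

print(norms)  # the alpha=10.0 fit should show the smaller weight norm
```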