### Naive Bayes¶

• Supervised learning algorithms that apply the "naive" assumption of conditional independence between every pair of features given the class. Given class variable $y$ and dependent feature vector $x_1$..$x_n$, Bayes' theorem defines the following relation:

$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}  {P(x_1, \dots, x_n)}$
• Naive Bayes classifiers are useful in many use cases: relatively small training-data requirements, fast computation, and relative immunity to "curse of dimensionality" issues thanks to the decoupling of the class-conditional feature distributions (each distribution can be independently estimated as a 1D function).

• Naive Bayes is a decent classifier, but a bad estimator. Probability outputs from predict_proba should not be taken seriously.

### Gaussian NB classification¶

• Feature probabilities are assumed to be Gaussian: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$

• $\sigma_y$ and $\mu_y$ are estimated via maximum likelihood.
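A minimal sketch of the above, on a tiny made-up dataset (the feature values are illustrative only). The fitted `theta_` and `var_` attributes hold the per-class $\mu_y$ and $\sigma^2_y$ estimates:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy 2-feature dataset (illustrative values)
X = np.array([[1.0, 2.0], [1.2, 1.9], [3.5, 4.1], [3.7, 4.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
# theta_ holds the per-class feature means (mu_y);
# var_ holds the per-class feature variances (sigma^2_y)
print(clf.theta_)
print(clf.predict([[1.1, 2.0]]))  # near the class-0 cluster
```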

### Multinomial NB classification¶

• Implements NB for multinomial distributions.
• Heavily used in text classification where data is often represented as word vector counts.

• The distribution is modeled as a vector $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the #features and $\theta_{yi}$ is the probability of feature $i$ appearing in a sample belonging to class $y$.

• $\theta_{yi}$ is estimated using a smoothed version of max likelihood, aka "relative frequency counting": $\hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n}$, where $N_{yi} = \sum_{x \in T} x_i$ is the #times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

• Smoothing $\alpha$ accounts for features not in the learning samples & prevents zero probabilities. $\alpha$=1 is "Laplace smoothing"; $\alpha$<1 is "Lidstone smoothing".
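A sketch with hand-made count vectors (illustrative data), verifying the smoothed estimate $\hat{\theta}_{yi}$ against the formula above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy word-count matrix: rows = documents, columns = vocabulary terms
X = np.array([[2, 1, 0], [3, 0, 1], [0, 2, 3], [1, 0, 4]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)  # alpha=1 -> Laplace smoothing

# feature_log_prob_ stores log(theta_hat_yi). For class 0:
# N_0i = (5, 1, 1), N_0 = 7, n = 3  ->  theta_0 = (6/10, 2/10, 2/10)
theta_0 = np.exp(clf.feature_log_prob_[0])
print(theta_0)
```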

### Complement NB classification¶

• CNB is adapted from MNB & is well suited for imbalanced datasets. It uses statistics of the complement of each class to find model weights:

\begin{align}\begin{aligned}\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}\\w_{ci} = \log \hat{\theta}_{ci}\\w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}\end{aligned}\end{align}


• using summations of all documents $j$ not in class $c$
• $d_{ij}$ is the count, or tf-idf value, of term $i$ in document $j$
• $\alpha_i$ is a smoothing parameter, similar to MNB, with $\alpha = \sum_i \alpha_i$
• The 2nd normalization addresses the tendency of longer documents to dominate MNB parameter estimates. The classification rule is:

$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$

### Bernoulli NB classification¶

• BNB is used for multivariate Bernoulli distributions (multiple features, each a binary/boolean value), so this method requires feature vectors to be binary-valued.
• BNB can binarize other datatypes via the binarize parameter.
• The decision rule is based on $P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)$.
• It penalizes the absence of feature $i$ for class $y$ - where MNB would simply ignore the non-occurrence.
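A sketch of the binarize parameter on made-up continuous features, which are thresholded to 0/1 internally before fitting:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# continuous features; binarize=0.5 maps values > 0.5 to 1, else 0
X = np.array([[0.9, 0.1, 0.8], [0.7, 0.0, 0.9],
              [0.1, 0.8, 0.2], [0.0, 0.9, 0.1]])
y = np.array([0, 0, 1, 1])

clf = BernoulliNB(binarize=0.5).fit(X, y)
# [0.8, 0.2, 0.7] binarizes to [1, 0, 1], matching the class-0 pattern
print(clf.predict([[0.8, 0.2, 0.7]]))
```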

### Categorical NB¶

• Categorical NB assumes each feature has its own categorical distribution.
• The probability of category $t$ in feature $i$, given class $c$, is:

$P(x_i = t \mid y = c \: ;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i}$


• $N_{tic}$ is the #times category $t$ appears in samples $x_i$ in class $c$
• $N_c$ is the #samples with class $c$
• $\alpha$ is a smoothing parameter
• $n_i$ is the #available categories of feature $i$.
• Categorical NB assumes the sample matrix $X$ is encoded so that all categories of each feature $i$ are represented as $0, \ldots, n_i - 1$.
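A sketch on hand-encoded categorical data (codes and labels are illustrative), where each column holds integer category codes as required:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# each column holds integer category codes 0..n_i-1
X = np.array([[0, 1], [1, 1], [2, 0], [2, 0], [1, 0]])
y = np.array([0, 0, 1, 1, 1])

clf = CategoricalNB(alpha=1.0).fit(X, y)
# [0, 1] matches the categories seen in class-0 samples
print(clf.predict([[0, 1]]))
```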

### Out-of-core fitting¶

• If a training set cannot fit in main memory, MNB, BNB & GNB support a partial_fit method to enable incremental fitting. If used, the first call to partial_fit must be given the full list of expected class labels.
• partial_fit introduces some computational overhead. Use larger data chunks whenever possible to avoid cache/disk thrashing.
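A minimal sketch of incremental fitting (the chunks here are tiny and illustrative; real out-of-core use would stream much larger chunks):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
all_classes = np.array([0, 1])  # full label list, required on the first call

# fit one chunk at a time, as if streaming from disk
clf.partial_fit(np.array([[2, 1], [0, 3]]), np.array([0, 1]),
                classes=all_classes)
clf.partial_fit(np.array([[3, 0], [1, 4]]), np.array([0, 1]))

print(clf.predict([[4, 0]]))
```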