Supervised learning algorithms that apply a "naive" assumption of *conditional independence* between every pair of features. Given class variable $y$ and dependent feature vector $x_1, \dots, x_n$, Bayes' theorem defines the following relation:

$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$

Naive Bayes classifiers are useful in many use cases: they need relatively little training data, are fast to compute, and are relatively immune to "curse of dimensionality" issues thanks to the *decoupling of class-conditional feature distributions* (each distribution can be independently estimated as a 1D function).

Naive Bayes is a decent classifier, but a bad estimator: probability outputs from `predict_proba` should not be taken seriously.

Feature probabilities are assumed to be Gaussian: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$

$\sigma_y$ and $\mu_y$ are estimated via maximum likelihood.

In [3]:

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("#mislabeled points:\t", (y_test != y_pred).sum())
print("#total points:\t\t", X_test.shape[0])
```

#mislabeled points:  4
#total points:       75
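As a quick follow-up to the notes above, the fitted per-class means and variances ($\mu_y$, $\sigma^2_y$) are exposed on the estimator, and `predict_proba` outputs tend to saturate toward 0/1 rather than behave like calibrated probabilities. A small sketch, reusing `gnb` and `X_test` from the cell above (the `theta_`/`var_` attribute names assume a recent scikit-learn version):

```
# Per-class feature means (mu_y) and variances (sigma^2_y), estimated via maximum likelihood
print(gnb.theta_)  # shape (n_classes, n_features)
print(gnb.var_)    # shape (n_classes, n_features)

# Class probabilities for a few test points; these tend to be extreme (near 0 or 1),
# which is why predict_proba outputs should not be read as calibrated probabilities
print(gnb.predict_proba(X_test[:5]).round(3))
```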

- Implements NB for *multinomial distributions*. Heavily used in *text classification*, where data is often represented as word count vectors.
- The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the #features and $\theta_{yi}$ is the probability of feature $i$ appearing in a sample belonging to class $y$.

$\theta_{yi}$ is estimated using a smoothed version of *maximum likelihood*, aka "relative frequency counting": $\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$, where $N_{yi} = \sum_{x \in T} x_i$ is the #times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing term $\alpha$ accounts for features not present in the learning samples & prevents zero probabilities. $\alpha = 1$ is "Laplace smoothing"; $\alpha < 1$ is "Lidstone smoothing".

In [4]:

```
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(); clf.fit(X,y); print(clf.predict(X[2:3]))
```

[3]
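To make the smoothed estimate above concrete, the following sketch recomputes $\hat{\theta}_{yi}$ by hand and compares it to the fitted model, reusing `X`, `y`, and `clf` from the cell above (the comparison against the `feature_log_prob_` attribute assumes scikit-learn's default $\alpha = 1$):

```
import numpy as np

alpha = 1.0                                                     # Laplace smoothing (the default)
N_yi = np.array([X[y == c].sum(axis=0) for c in clf.classes_])  # per-class feature counts
N_y = N_yi.sum(axis=1, keepdims=True)                           # per-class total counts
theta_hat = (N_yi + alpha) / (N_y + alpha * X.shape[1])
print(np.allclose(np.log(theta_hat), clf.feature_log_prob_))    # expected: True
```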

CNB (ComplementNB) is adapted from MNB and is *well suited for imbalanced datasets*. It uses the *complement* of each class to compute model weights:

\begin{align}\begin{aligned}
\hat{\theta}_{ci} &= \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}} \\
w_{ci} &= \log \hat{\theta}_{ci} \\
w_{ci} &= \frac{w_{ci}}{\sum_{j} |w_{cj}|}
\end{aligned}\end{align}

- using summations over all documents $j$ **not** in class $c$
- $d_{ij}$ is the count, or tf-idf value, of term $i$ in document $j$
- $\alpha_i$ is a smoothing parameter, similar to MNB, and $\alpha = \sum_i \alpha_i$

The 2nd normalization addresses the tendency of longer documents to dominate MNB parameter estimates. The classification rule is:

$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$

In [5]:

```
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
```

[3]
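As an illustration only (not a reproduction of scikit-learn's internals, which skip the second normalization unless `norm=True`), the weight equations above can be applied directly with numpy to the toy `X`, `y` from the previous cell:

```
import numpy as np

alpha_i = 1.0
alpha = alpha_i * X.shape[1]                  # alpha = sum_i alpha_i
classes = np.unique(y)
theta = np.array([(alpha_i + X[y != c].sum(axis=0)) /
                  (alpha + X[y != c].sum()) for c in classes])
w = np.log(theta)
w = w / np.abs(w).sum(axis=1, keepdims=True)  # the second (length) normalization
t = X[2]                                      # term counts of one document
print(classes[np.argmin(w @ t)])              # classification rule: argmin_c sum_i t_i w_ci
```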

- BNB (BernoulliNB) is used for multivariate Bernoulli distributions (multiple features, each a binary/boolean variable), so this method requires feature vectors to be binary-valued.
- BNB can binarize other data types via the `binarize` parameter (illustrated after the code cell below).
- The decision rule is based on $P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)$.
- It penalizes the absence of feature $i$ for class $y$, where MNB would simply ignore the non-occurrence.

In [6]:

```
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
```

[3]
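To illustrate the `binarize` parameter mentioned above, here is a small sketch on the same toy counts: values above the threshold are treated as 1 (feature present), everything else as 0, before the Bernoulli model is fit. The threshold value here is arbitrary.

```
from sklearn.naive_bayes import BernoulliNB

clf2 = BernoulliNB(binarize=2.0)  # counts > 2.0 are mapped to 1, the rest to 0
clf2.fit(X, y)
print(clf2.predict(X[2:3]))
```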

- CategoricalNB assumes each feature has its own categorical distribution. The probability of category $t$ in feature $i$, given class $c$, is:

$P(x_i = t \mid y = c \: ;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i}$

where:

- $N_{tic}$ is the #times category $t$ appears in samples $x_i$ in class $c$
- $N_c$ is the #samples with class $c$
- $\alpha$ is a smoothing parameter
- $n_i$ is the #available categories of feature $i$.

- CategoricalNB assumes the sample matrix $X$ is encoded so that all categories of each feature $i$ are represented with values $0, \ldots, n_i - 1$, where $n_i$ is the #available categories (see the encoding sketch after the code cell below).

In [7]:

```
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
```

[3]
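A minimal sketch of the required encoding: arbitrary category values can be mapped to $0, \ldots, n_i - 1$ integer codes with `OrdinalEncoder` before fitting `CategoricalNB`. The toy color/size data below is made up for illustration.

```
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["red", "small"], ["blue", "large"], ["red", "large"], ["blue", "small"]]
y_raw = [0, 1, 1, 0]
X_enc = OrdinalEncoder().fit_transform(X_raw)  # each column becomes codes 0..n_i-1
print(CategoricalNB().fit(X_enc, y_raw).predict(X_enc))
```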

- If a training set cannot fit in main memory, MNB, BNB & GNB support a `partial_fit` method to enable *incremental fitting*.
- If used, the first call to `partial_fit` requires passing the entire list of class labels (via the `classes` parameter).
- `partial_fit` introduces some computational overhead. Use larger data chunks whenever possible to avoid cache/disk thrashing.
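A minimal sketch of incremental fitting with `GaussianNB`, using iris chunks to stand in for data that does not fit in memory; the full label list goes into the first `partial_fit` call via its `classes` parameter:

```
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB()
classes = np.unique(y)  # full label list, required on the first partial_fit call
for i, (X_chunk, y_chunk) in enumerate(zip(np.array_split(X, 5), np.array_split(y, 5))):
    gnb.partial_fit(X_chunk, y_chunk, classes=classes if i == 0 else None)
print(gnb.score(X, y))
```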
