Supervised learning algorithms that apply a "naive" assumption of conditional independence between every pair of features given the class. Given a class variable $y$ and a dependent feature vector $x_1, \dots, x_n$, Bayes' theorem defines the following relation:
$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$
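Spelling out the step this relies on: under the naive independence assumption $P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$ for all $i$, the relation simplifies to $P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$ (the denominator is constant for a given input), giving the decision rule $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$. The variants below differ mainly in what they assume about $P(x_i \mid y)$.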
Naive Bayes classifiers are useful in many use cases: they need relatively little training data, are fast to compute, and are relatively immune to "curse of dimensionality" issues thanks to the decoupling of the class-conditional feature distributions (each distribution can be independently estimated as a 1D function).
Naive Bayes is a decent classifier, but a bad estimator: probability outputs from predict_proba should not be taken seriously.
In Gaussian Naive Bayes (GaussianNB), feature probabilities are assumed to be Gaussian: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$
$\sigma_y$ and $\mu_y$ are estimated via maximum likelihood.
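Concretely, these are the per-class sample statistics of each feature (a standard MLE result; $T_y$ denotes the training samples of class $y$): $\hat{\mu}_y = \frac{1}{|T_y|} \sum_{x \in T_y} x_i$ and $\hat{\sigma}^2_y = \frac{1}{|T_y|} \sum_{x \in T_y} (x_i - \hat{\mu}_y)^2$.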
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("#mislabeled points:\t", (y_test != y_pred).sum())
print("#total points:\t\t", X_test.shape[0])
#mislabeled points:  4
#total points:       75
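A quick sanity check, continuing the snippet above (a sketch; theta_ and var_ are the fitted attribute names in current scikit-learn, with var_ called sigma_ before release 1.0):
import numpy as np
# Hand-computed per-class means/variances should match the fitted parameters
# (var_ additionally includes a tiny var_smoothing term, covered by the tolerance).
means = np.array([X_train[y_train == c].mean(axis=0) for c in gnb.classes_])
variances = np.array([X_train[y_train == c].var(axis=0) for c in gnb.classes_])
print(np.allclose(gnb.theta_, means), np.allclose(gnb.var_, variances, atol=1e-6))  # True True
# The probability estimates exist but are poorly calibrated (see the caveat above).
print(gnb.predict_proba(X_test[:3]).round(3))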
Multinomial Naive Bayes (MultinomialNB) is heavily used in text classification, where data is often represented as word count vectors.
The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, with #features $n$ and $\theta_{yi} = P(x_i \mid y)$, the probability of feature $i$ appearing in a sample belonging to class $y$.
$\theta_{yi}$ is estimated using a smoothed version of max likelihood, aka "relative frequency counting": $\hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n}$, where $N_{yi} = \sum_{x \in T} x_i$ is the #times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
Smoothing $\alpha$ accounts for features not in the learning samples & prevents zero probabilities. $\alpha$=1 is "Laplace smoothing"; $\alpha$<1 is "Lidstone smoothing".
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
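The smoothed estimate can be reproduced by hand, continuing the snippet above (a sketch; feature_log_prob_ is the fitted attribute holding $\log \hat{\theta}_{yi}$, and alpha defaults to 1.0):
alpha = 1.0
N_yi = np.array([X[y == c].sum(axis=0) for c in clf.classes_])  # per-class feature counts
N_y = N_yi.sum(axis=1, keepdims=True)                           # total feature count per class
theta = (N_yi + alpha) / (N_y + alpha * X.shape[1])             # smoothed relative frequencies
print(np.allclose(np.log(theta), clf.feature_log_prob_))        # True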
Complement Naive Bayes (CNB) is adapted from MNB and is well suited for imbalanced datasets. It uses statistics from the complement of each class to compute the model's weights:
\begin{align}\begin{aligned}\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}\\w_{ci} = \log \hat{\theta}_{ci}\\w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}\end{aligned}\end{align}
where the summations run over all documents $j$ not in class $c$, $d_{ij}$ is the count or tf-idf value of term $i$ in document $j$, $\alpha_i$ is a smoothing hyperparameter, and $\alpha = \sum_i \alpha_i$.
The 2nd normalization addresses the tendency of longer documents to dominate MNB parameter estimates. The classification rule is:
$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
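These weights can be computed directly with NumPy as a check, continuing the snippet above (a sketch assuming scikit-learn's default norm=False, under which ComplementNB skips the second normalization, so the argmin rule below reproduces clf.predict):
alpha = 1.0
# Per-feature counts over all samples NOT in class c (the "complement" of c).
comp = np.array([X[y != c].sum(axis=0) for c in clf.classes_])
theta_c = (comp + alpha) / (comp.sum(axis=1, keepdims=True) + alpha * X.shape[1])
w = np.log(theta_c)                                   # unnormalized weights
w_norm = w / np.abs(w).sum(axis=1, keepdims=True)     # second normalization (applied when norm=True)
print(clf.classes_[np.argmin(X[2:3] @ w.T, axis=1)])  # [3], same as clf.predict(X[2:3])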
Bernoulli Naive Bayes (BernoulliNB) is designed for binary/boolean features; non-binary input can be binarized according to the binarize parameter.
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
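A sketch of what the binarize parameter does, continuing the snippet above: feature values greater than the threshold map to 1, the rest to 0, before the Bernoulli model is fitted (the threshold of 2 here is arbitrary; binarize=None means the input is already binary):
X_bin = (X > 2).astype(int)                         # binarize by hand at threshold 2
clf_a = BernoulliNB(binarize=2.0).fit(X, y)         # let the estimator binarize
clf_b = BernoulliNB(binarize=None).fit(X_bin, y)    # feed pre-binarized data
print(np.allclose(clf_a.feature_log_prob_, clf_b.feature_log_prob_))  # True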
Categorical Naive Bayes (CategoricalNB) assumes each feature has its own categorical distribution. The probability of category $t$ in feature $i$, given class $c$, is:
$P(x_i = t \mid y = c \: ;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i}$, where $N_{tic}$ is the #times category $t$ appears in feature $i$ for samples of class $c$, $N_c$ is the #samples with class $c$, and $n_i$ is the number of available categories of feature $i$.
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
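The categorical estimate can be checked the same way, continuing the snippet above (a sketch; feature_log_prob_ and n_categories_ are the fitted attribute names, and alpha defaults to 1.0):
alpha = 1.0
i = 0                                                 # inspect the first feature
n_i = clf.n_categories_[i]                            # categories seen for feature i during fit
counts = np.array([np.bincount(X[y == c, i], minlength=n_i) for c in clf.classes_])
N_c = counts.sum(axis=1, keepdims=True)               # number of samples per class
probs = (counts + alpha) / (N_c + alpha * n_i)
print(np.allclose(np.log(probs), clf.feature_log_prob_[i]))  # True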
All of the above estimators provide a partial_fit method to enable incremental fitting. If used, the first call to partial_fit must be given the entire list of expected class labels (the classes argument). partial_fit introduces some computational overhead, so use larger data chunks whenever possible to avoid cache/disk thrashing.
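A minimal sketch of incremental fitting with MultinomialNB (the same pattern applies to the other variants); the toy data mirrors the examples above:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = MultinomialNB()
# The first call must declare every class that will ever appear.
clf.partial_fit(X[:3], y[:3], classes=np.unique(y))
# Subsequent chunks can be fed as they arrive, without the classes argument.
clf.partial_fit(X[3:], y[3:])
print(clf.predict(X[2:3]))  # [3], identical to fitting on all the data at once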