Supervised learning algorithms that apply a "naive" assumption of conditional independence between every pair of features given the class. Given a class variable $y$ and a dependent feature vector $x_1, \dots, x_n$, Bayes' theorem defines the following relation:
$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$
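Spelling out the step this relies on: under the naive independence assumption $P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$ for all $i$, the relation simplifies to $P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$ (the denominator is constant for a given input), giving the decision rule $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$. The variants below differ mainly in what they assume about $P(x_i \mid y)$.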
Naive Bayes classifiers are useful in many use cases: they need relatively little training data, are fast to compute, and are relatively immune to "curse of dimensionality" issues thanks to the decoupling of the class-conditional feature distributions (each distribution can be independently estimated as a 1D function).
Naive Bayes is a decent classifier, but a bad estimator: probability outputs from predict_proba should not be taken seriously.
In Gaussian Naive Bayes (GaussianNB), feature probabilities are assumed to be Gaussian: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$
$\sigma_y$ and $\mu_y$ are estimated via maximum likelihood.
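Concretely, these are the per-class sample statistics of each feature (a standard MLE result; $T_y$ denotes the training samples of class $y$): $\hat{\mu}_y = \frac{1}{|T_y|} \sum_{x \in T_y} x_i$ and $\hat{\sigma}^2_y = \frac{1}{|T_y|} \sum_{x \in T_y} (x_i - \hat{\mu}_y)^2$.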
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("#mislabeled points:\t", (y_test != y_pred).sum())
print("#total points:\t\t", X_test.shape[0])
#mislabeled points:  4
#total points:       75
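A quick sanity check, continuing the snippet above (a sketch; theta_ and var_ are the fitted attribute names in current scikit-learn, with var_ called sigma_ before release 1.0):
import numpy as np
# Hand-computed per-class means/variances should match the fitted parameters
# (var_ additionally includes a tiny var_smoothing term, covered by the tolerance).
means = np.array([X_train[y_train == c].mean(axis=0) for c in gnb.classes_])
variances = np.array([X_train[y_train == c].var(axis=0) for c in gnb.classes_])
print(np.allclose(gnb.theta_, means), np.allclose(gnb.var_, variances, atol=1e-6))  # True True
# The probability estimates exist but are poorly calibrated (see the caveat above).
print(gnb.predict_proba(X_test[:3]).round(3))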
Multinomial Naive Bayes (MultinomialNB) is heavily used in text classification, where data is often represented as word count vectors.
The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, with #features $n$ and $\theta_{yi} = P(x_i \mid y)$, the probability of feature $i$ appearing in a sample belonging to class $y$.
$\theta_{yi}$ is estimated using a smoothed version of max likelihood, aka "relative frequency counting": $\hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n}$, where $N_{yi} = \sum_{x \in T} x_i$ is the #times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
Smoothing $\alpha$ accounts for features not in the learning samples & prevents zero probabilities. $\alpha$=1 is "Laplace smoothing"; $\alpha$<1 is "Lidstone smoothing".
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
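The smoothed estimate can be reproduced by hand, continuing the snippet above (a sketch; feature_log_prob_ is the fitted attribute holding $\log \hat{\theta}_{yi}$, and alpha defaults to 1.0):
alpha = 1.0
N_yi = np.array([X[y == c].sum(axis=0) for c in clf.classes_])  # per-class feature counts
N_y = N_yi.sum(axis=1, keepdims=True)                           # total feature count per class
theta = (N_yi + alpha) / (N_y + alpha * X.shape[1])             # smoothed relative frequencies
print(np.allclose(np.log(theta), clf.feature_log_prob_))        # True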
Complement Naive Bayes (CNB) is adapted from MNB and is well suited for imbalanced datasets. It uses statistics from the complement of each class to compute the model's weights:
\begin{align}\begin{aligned}\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}\\w_{ci} = \log \hat{\theta}_{ci}\\w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}\end{aligned}\end{align}
where the summations run over all documents $j$ not in class $c$, $d_{ij}$ is the count or tf-idf value of term $i$ in document $j$, $\alpha_i$ is a smoothing hyperparameter, and $\alpha = \sum_i \alpha_i$.
The 2nd normalization addresses the tendency of longer documents to dominate MNB parameter estimates. The classification rule is:
$\hat{c} = \arg\min_c \sum_{i} t_i w_{ci}$
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
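These weights can be computed directly with NumPy as a check, continuing the snippet above (a sketch assuming scikit-learn's default norm=False, under which ComplementNB skips the second normalization, so the argmin rule below reproduces clf.predict):
alpha = 1.0
# Per-feature counts over all samples NOT in class c (the "complement" of c).
comp = np.array([X[y != c].sum(axis=0) for c in clf.classes_])
theta_c = (comp + alpha) / (comp.sum(axis=1, keepdims=True) + alpha * X.shape[1])
w = np.log(theta_c)                                   # unnormalized weights
w_norm = w / np.abs(w).sum(axis=1, keepdims=True)     # second normalization (applied when norm=True)
print(clf.classes_[np.argmin(X[2:3] @ w.T, axis=1)])  # [3], same as clf.predict(X[2:3])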
Bernoulli Naive Bayes (BernoulliNB) is designed for binary/boolean features; non-binary input can be binarized according to the binarize parameter.
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
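A sketch of what the binarize parameter does, continuing the snippet above: feature values greater than the threshold map to 1, the rest to 0, before the Bernoulli model is fitted (the threshold of 2 here is arbitrary; binarize=None means the input is already binary):
X_bin = (X > 2).astype(int)                         # binarize by hand at threshold 2
clf_a = BernoulliNB(binarize=2.0).fit(X, y)         # let the estimator binarize
clf_b = BernoulliNB(binarize=None).fit(X_bin, y)    # feed pre-binarized data
print(np.allclose(clf_a.feature_log_prob_, clf_b.feature_log_prob_))  # True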
Categorical Naive Bayes (CategoricalNB) assumes each feature has its own categorical distribution. The probability of category $t$ in feature $i$, given class $c$, is:
$P(x_i = t \mid y = c \: ;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i}$, where $N_{tic}$ is the #times category $t$ appears in feature $i$ for samples of class $c$, $N_c$ is the #samples with class $c$, and $n_i$ is the number of available categories of feature $i$.
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB(); clf.fit(X, y); print(clf.predict(X[2:3]))
[3]
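The categorical estimate can be checked the same way, continuing the snippet above (a sketch; feature_log_prob_ and n_categories_ are the fitted attribute names, and alpha defaults to 1.0):
alpha = 1.0
i = 0                                                 # inspect the first feature
n_i = clf.n_categories_[i]                            # categories seen for feature i during fit
counts = np.array([np.bincount(X[y == c, i], minlength=n_i) for c in clf.classes_])
N_c = counts.sum(axis=1, keepdims=True)               # number of samples per class
probs = (counts + alpha) / (N_c + alpha * n_i)
print(np.allclose(np.log(probs), clf.feature_log_prob_[i]))  # True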
All of the above estimators provide a partial_fit method to enable incremental fitting. If used, the first call to partial_fit must be given the entire list of expected class labels (the classes argument). partial_fit introduces some computational overhead, so use larger data chunks whenever possible to avoid cache/disk thrashing.
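A minimal sketch of incremental fitting with MultinomialNB (the same pattern applies to the other variants); the toy data mirrors the examples above:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = MultinomialNB()
# The first call must declare every class that will ever appear.
clf.partial_fit(X[:3], y[:3], classes=np.unique(y))
# Subsequent chunks can be fed as they arrive, without the classes argument.
clf.partial_fit(X[3:], y[3:])
print(clf.predict(X[2:3]))  # [3], identical to fitting on all the data at once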