### Restricted Boltzmann Machines (RBMs)¶

• RBMs are unsupervised nonlinear feature learners based on a probabilistic model. The features extracted by an RBM (or hierarchy of RBMs) can produce good results when fed into a linear classifier such as a linear SVM or a perceptron.

• The nodes are random variables whose states depend on the states of the nodes they are connected to, through the connection weights and biases. An energy function defines the quality of a joint assignment: $E(\mathbf{v}, \mathbf{h}) = -\sum_i \sum_j w_{ij}v_ih_j - \sum_i b_iv_i - \sum_j c_jh_j$.
• "Restricted" refers to the bipartite model structure (direct interaction between hidden units, or between visible units, is prohibited). This implies the conditional independencies: $\begin{split}h_i \bot h_j | \mathbf{v} \\ v_i \bot v_j | \mathbf{h}\end{split}$

• The bipartite structure enables using block Gibbs sampling for inference.
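A minimal sketch of block Gibbs sampling in a toy Bernoulli RBM (the sizes and random weights below are hypothetical, chosen only for illustration). Because of the bipartite structure, all hidden units can be sampled in one block given the visible units, and vice versa:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM parameters (hypothetical sizes: 6 visible, 4 hidden units).
W = rng.normal(scale=0.1, size=(6, 4))  # weights w_ij
b = np.zeros(6)                         # visible biases b_i
c = np.zeros(4)                         # hidden biases c_j

def gibbs_step(v):
    """One block Gibbs step: sample all hidden units given v, then all
    visible units given h. Each block is conditionally independent, so
    it can be sampled in a single vectorized operation."""
    p_h = sigmoid(v @ W + c)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# Run the chain from a random binary start state.
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(10):
    v, h = gibbs_step(v)
```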

• RBMs make assumptions about the input distribution. At the moment, scikit-learn only provides BernoulliRBM, which assumes the inputs are either binary values or values between 0 and 1, each encoding the probability that the specific feature would be turned on.

• The conditional probability distribution of each unit is the logistic sigmoid activation function of its inputs: $\sigma(x) = \frac{1}{1 + e^{-x}}$
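Concretely, with the energy function above, a standard derivation for Bernoulli RBMs gives the factorized conditionals:

$P(v_i = 1 \mid \mathbf{h}) = \sigma\Big(\sum_j w_{ij}h_j + b_i\Big), \qquad P(h_j = 1 \mid \mathbf{v}) = \sigma\Big(\sum_i w_{ij}v_i + c_j\Big)$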

### RBM Learning¶

• RBMs are trained with Stochastic Maximum Likelihood (SML), also known as Persistent Contrastive Divergence (PCD). The log-likelihood being maximized is: $\log P(v) = \log \sum_h e^{-E(v, h)} - \log \sum_{x, y} e^{-E(x, y)}$

• The positive and negative gradient terms (arising from the 1st and 2nd terms above, respectively) are estimated over minibatches of samples. The positive gradient can be computed efficiently, but the negative gradient is intractable to compute directly.

• It can be approximated by Markov Chain Monte Carlo (MCMC) using block Gibbs sampling, alternately sampling $\mathbf{h}$ and $\mathbf{v}$ until the chain "mixes". Samples generated in this way are sometimes referred to as "fantasy particles". This is inefficient, and it is hard to determine whether the Markov chain has mixed.

• PCD keeps a number of fantasy particles that are updated with $k$ Gibbs steps after each weight update. This allows the particles to more fully explore the space.
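The PCD training loop can be sketched as follows. This is a toy illustration, not scikit-learn's implementation; all sizes, the learning rate, and the decision to update only the weights (not the biases) are simplifying assumptions made here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, n_particles, k = 6, 4, 8, 1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
# Persistent fantasy particles: chains that survive across weight updates.
particles = rng.integers(0, 2, (n_particles, n_visible)).astype(float)

def sample_h(v):
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h):
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p

def pcd_update(batch, lr=0.05):
    global W, particles
    # Positive phase: hidden probabilities driven by the data.
    _, ph_data = sample_h(batch)
    # Negative phase: advance the persistent chains by k Gibbs steps.
    for _ in range(k):
        h, _ = sample_h(particles)
        particles, _ = sample_v(h)
    _, ph_model = sample_h(particles)
    # Gradient estimate: data statistics minus model statistics.
    W = W + lr * (batch.T @ ph_data / len(batch)
                  - particles.T @ ph_model / len(particles))

batch = rng.integers(0, 2, (8, n_visible)).astype(float)
for _ in range(20):
    pcd_update(batch)
```

Because the chains persist between updates and the weights change only slightly per step, the particles stay close to the model distribution without restarting the Markov chain each time.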

### Example: RBMs for digit classification¶

• The dataset consists of greyscale digit images, where pixel values encode degrees of blackness on a white background.

• Artificially generate more labeled data by perturbing the training data with linear shifts of 1 pixel in each direction, in order to learn latent representations from this small dataset.
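The shifting step could be sketched like this (a hedged sketch: the helper name `nudge_dataset` and the use of `scipy.ndimage.shift` are choices made here, not necessarily what the scikit-learn example uses):

```python
import numpy as np
from scipy.ndimage import shift

def nudge_dataset(X, Y, image_shape=(8, 8)):
    """Expand the dataset with copies of each image shifted by one
    pixel in each of the four directions (up, down, left, right),
    yielding 5x the original number of labeled samples."""
    directions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    shifted = [X]
    for d in directions:
        moved = np.array([
            shift(img.reshape(image_shape), d, mode="constant").ravel()
            for img in X
        ])
        shifted.append(moved)
    X_new = np.concatenate(shifted)
    Y_new = np.concatenate([Y] * (len(directions) + 1))
    return X_new, Y_new
```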

• Build a classification pipeline with a Bernoulli RBM feature extractor and a LogisticRegression classifier. The parameters (learning rate, hidden layer size, regularization) are optimized by grid search, but the search is not reproduced here because of runtime constraints.

• Logistic regression on raw pixel values is shown for comparison. The example shows that the features extracted by the BernoulliRBM help improve classification accuracy.
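A minimal version of the pipeline can be sketched as follows (the hyperparameter values here are illustrative placeholders, not the grid-searched values from the example, and the augmentation step is omitted):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import minmax_scale

X, y = load_digits(return_X_y=True)
# BernoulliRBM expects values in [0, 1], interpreted as probabilities.
X = minmax_scale(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# RBM feature extractor followed by a logistic regression classifier.
rbm = BernoulliRBM(n_components=100, learning_rate=0.06,
                   n_iter=10, random_state=0)
logistic = LogisticRegression(max_iter=1000)
model = Pipeline([("rbm", rbm), ("logistic", logistic)])

model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```

The `Pipeline` ensures the RBM is fit only on the training data and that test images pass through the same learned feature transform before classification.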