Scaling Bayes

Published

Aug 28, 2019

Reading time

2 min read

Neural networks are good at making predictions, but they’re not so good at estimating how certain they are. If the training data set is small and many sets of model parameters fit the data well, for instance, the network may not realize this explicitly, leading to overly confident predictions. Bayesian models, on the other hand, theoretically can sample from the posterior distribution of parameters. However, the computational load becomes overwhelming as the number of parameters rises. New research allows Bayesian modeling of uncertainty to be applied even to large networks.

What’s new: Researchers at Google Brain built neural networks that integrate a Bayesian backpropagation method known as Stochastic Gradient Markov Chain Monte Carlo, fixing issues with noisy updates and slow convergence that affected earlier work. Their technique, Adaptive Thermostat Monte Carlo (ATMC), is the first based on SG-MCMC that scales to larger data sets such as ImageNet.

Key insight: Previous research using SG-MCMC failed to find training procedures that were robust to noise arising from parameter sampling in Bayesian methods. ATMC compensates for these issues by adjusting momentum and noise applied to parameter updates.

How it works: Non-Bayesian learning techniques compute the loss from outputs and labels only. Bayesian techniques add a prior distribution on learnable parameters. All methods based on SG-MCMC are derived from a stochastic differential equation that modifies a neural network’s parameter distribution based on the sampled output.

ATMC samples learnable parameters from the distribution, and the network backpropagates its errors.
Then it modifies the computed gradients to ensure that noisy sampling doesn’t overly influence shifts in the parameter distribution.
It makes convergence faster and more stable than prior variations of SG-MCMC by dynamically adjusting momentum and noise added to each parameter update.
In addition, the authors provide an adjusted ResNet architecture better suited for Bayesian training. The new model replaces batch normalization with SELU activation and uses a different weight initialization.

Results: ATMC is the first SG-MCMC method successfully trained on ImageNet. An ATMC-trained network gains a 1 percent increase over a batch-normalized ResNet in ImageNet top-1 accuracy.

Why it matters: Estimating uncertainty can be crucial in applications such as medical imaging and autonomous driving. ATMC confers this capability on neural networks even when learning large, complex data sets such as ImageNet.

We’re thinking: Bayesian methods have been studied longer than neural networks, and they still define the state of the art in some tasks. The fusion of Bayesian models and neural networks is still evolving. ATMC suggests that such hybrids could deliver the advantages of both approaches.

Subscribe to The Batch