Batch normalization is a technique that normalizes layer outputs to accelerate neural network training. But new research shows that it has other effects that may be more important.
What’s new: Jonathan Frankle and colleagues at MIT, CUNY, and Facebook AI showed that batch normalization’s trainable parameters alone can account for much of a network’s accuracy.
Key insight: As it adjusts a network’s intermediate feature representations for a given minibatch, batch normalization itself learns how to do so in a consistent way for all minibatches. The researchers probed the impact of this learning by training only the batch normalization parameters, gamma (γ) and beta (β), while freezing all other parameters at their random initial values.
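The freezing procedure can be sketched in PyTorch. This is a minimal illustration, not the authors’ code: a tiny stand-in model replaces the ResNets used in the paper, and only the batch normalization layers’ weight (γ) and bias (β) tensors are left trainable.

```python
import torch
import torch.nn as nn

# Toy stand-in for a conv net with batch norm layers (the paper uses
# ResNets; this small model only illustrates the freezing step).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1),
    nn.BatchNorm2d(8),
)

# Freeze every parameter, then re-enable only batch normalization's
# gamma (stored as .weight) and beta (stored as .bias). All other
# weights keep their random initial values throughout training.
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.weight.requires_grad = True  # gamma
        m.bias.requires_grad = True    # beta

# The optimizer sees only the batch normalization parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
```

With two 8-channel batch norm layers, only 32 of the model’s parameters remain trainable; everything else stays random, mirroring the experimental setup described above.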
How it works: The researchers trained variously sized ResNet and Wide ResNet models, which include batch normalization layers, on the CIFAR-10 image dataset.
- Batch normalization normalizes the output of intermediate layers according to its mean and variance across a minibatch. Then it scales the result by γ and shifts it by β. The values of those two parameters are learned.
- After training only the batch normalization parameters, the researchers found that nearly half of the γ values were close to zero. Pruning those values had a negligible impact on performance.
- Deep ResNets achieved much higher accuracy than wide ResNets with similar numbers of trainable batch normalization parameters. Batch normalization is known to have greater impact on deeper networks, and apparently scaling and shifting do as well.
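The normalize-then-scale-and-shift computation in the first bullet can be written out in a few lines of NumPy. This is an illustrative sketch: the minibatch, γ, and β values below are placeholders, not numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# A minibatch of 64 examples with 16 features each.
x = rng.normal(loc=5.0, scale=3.0, size=(64, 16))

# Step 1: standardize each feature using its mean and variance
# over the minibatch (eps guards against division by zero).
mean = x.mean(axis=0)
var = x.var(axis=0)
eps = 1e-5
x_hat = (x - mean) / np.sqrt(var + eps)

# Step 2: scale by gamma and shift by beta. In a real network these
# are learned; here they are set to their typical initial values.
gamma = np.ones(16)
beta = np.zeros(16)
y = gamma * x_hat + beta

# Pruning a near-zero gamma silences its feature entirely, since
# gamma multiplies the whole normalized activation for that channel.
gamma_pruned = np.where(np.abs(gamma) < 1e-2, 0.0, gamma)
```

Because γ gates each normalized feature multiplicatively, a γ near zero makes the feature contribute almost nothing, which is why pruning those values barely affects accuracy.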
Results: Training all the parameters in a ResNet-866 yielded 93 percent accuracy, while training only γ and β yielded 83 percent. This finding is further evidence that networks can perform well even when most of their weights remain random.
Why it matters: Batch normalization is often standard procedure in deep learning, but previous studies failed to recognize the power of its trainable parameters. And batchnorm isn’t the only normalization method that scales and shifts parameter values; so do weight normalization and switchable normalization. Further research may illuminate the impact of these normalizations on network performance.
We’re thinking: Why batch normalization works has been a subject of heated debate since Sergey Ioffe and Christian Szegedy introduced it in 2015. The authors’ original explanation of “reducing internal covariate shift” sounded a bit like black magic. This work sheds light on the mystery.