Pile on the Layers! DeepNorm Allows Transformers to Accommodate More Layers

Published

Jun 22, 2022

Adding layers to a neural network puts the “deep” in deep learning, but it also increases the chance that the network will get stuck during training. A new approach effectively trains transformers with an order of magnitude more layers than previous methods.

What’s new: A team at Microsoft led by Hongyu Wang and Shuming Ma developed DeepNorm, a normalization function that enables transformers to accommodate up to 1,000 layers. (Their models, dubbed DeepNet, topped out at 3.8 billion parameters.)

Key insight: When training a transformer, layer normalization often is used to scale layer inputs, promoting faster learning. The gradient that passes back through a layer normalization is inversely proportional to the magnitude of its input, and that input grows with the total change in the parameter values of all previous layers in a training step. The authors found that the greater the number of layers, the higher the likelihood of a very large update. This results in larger inputs to layer normalization, so earlier layers receive smaller and smaller updates until parameter values stop changing and performance stops improving. (This issue is related to the familiar vanishing gradient problem, but its cause is different. In the familiar scenario, gradients from later layers diminish as they backpropagate through the network. In this case, the combination of layer normalization and unusually large updates results in significantly smaller gradients.) Limiting the total change in parameter values would prevent large updates, which should enable deeper networks to continue training without getting stuck.
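
The link between input magnitude and gradient size can be checked numerically. The sketch below is illustrative only and is not the authors' code: `layer_norm` is a plain-Python layer normalization without learned gain or bias, and `sensitivity` is a hypothetical helper that measures how far the output moves when one input value is nudged by a fixed small amount. Because normalization divides out the input's scale, the same nudge has roughly one-tenth the effect when the input is ten times larger.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sensitivity(x, delta=1e-4):
    """Hypothetical helper: output change per unit of nudge applied to x[0]."""
    nudged = [x[0] + delta] + list(x[1:])
    base, moved = layer_norm(x), layer_norm(nudged)
    change = math.sqrt(sum((m - b) ** 2 for m, b in zip(moved, base)))
    return change / delta

x = [0.5, -1.0, 2.0, 0.25]          # arbitrary example input
small = sensitivity(x)               # sensitivity at the original scale
large = sensitivity([10.0 * v for v in x])  # same input, ten times larger
ratio = small / large                # how much the sensitivity shrank
```

Here `ratio` comes out close to 10, which matches the inverse relationship: the larger the input to the normalization, the weaker the signal that reaches earlier layers.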

How it works: The authors trained a transformer, applying DeepNorm to the residual connections in each attention and feed-forward layer.

To avoid large parameter updates, DeepNorm scaled up the input carried by each residual connection by an author-derived constant a. Mathematically, residual connections usually output x+f(x), where x is the input to an attention or feed-forward sublayer and f(x) is that sublayer's output. DeepNorm changes them to output a*x+f(x).
Given the output of the residual connections, DeepNorm applied layer normalization.
DeepNorm also scaled down the initial parameter values to avoid large updates in early training.
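
The first two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `deepnorm_residual`, `post_ln_residual`, and `toy_sublayer` are hypothetical names, the sublayer is a stand-in function rather than real attention or feed-forward weights, and the value of `a` is an arbitrary placeholder (the article says only that the constant is derived by the authors). The scaled-down initialization is not shown.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_residual(x, f):
    """Conventional residual connection followed by layer normalization: LN(x + f(x))."""
    fx = f(x)
    return layer_norm([xi + fi for xi, fi in zip(x, fx)])

def deepnorm_residual(x, f, a):
    """DeepNorm-style connection as described in the text: LN(a*x + f(x))."""
    fx = f(x)
    return layer_norm([a * xi + fi for xi, fi in zip(x, fx)])

def toy_sublayer(x):
    # Stand-in for an attention or feed-forward sublayer (assumption for illustration).
    return [math.tanh(2.0 * v) for v in reversed(x)]

x = [0.5, -1.0, 2.0, 0.25]   # arbitrary example input
a = 3.0                      # placeholder constant, not the authors' derived value
out = deepnorm_residual(x, toy_sublayer, a)
baseline = deepnorm_residual(x, toy_sublayer, 1.0)  # a = 1 gives LN(x + f(x))
```

Setting `a` to 1 recovers the conventional connection, so the only change is how heavily the skip path is weighted relative to the sublayer's output before normalization.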

Results: The authors evaluated DeepNets of various depths on tasks that involve translating text between English and over 100 other languages. The DeepNets outperformed all competitors of equal depth, between 36 and 1,000 layers, as well as some with an order of magnitude fewer layers (and an order of magnitude more parameters). For instance, translating English into German and back, a 200-layer DeepNet achieved 28.9 BLEU, while a 200-layer dynamic linear combination of layers (a state-of-the-art transformer variant) achieved 27.5 BLEU. Seven other 200-layer models, including a transformer without the authors’ modifications, diverged during training. On the OPUS-100 multilingual dataset, a DeepNet with 200 layers and 3.2 billion parameters achieved 23.0 BLEU, while M2M-100 (a transformer variant with 48 layers and 12 billion parameters) achieved 18.4 BLEU.

Why it matters: Scaling up neural networks has driven a lot of improvement over the past decade. This work points a way toward even deeper models.

We’re thinking: DeepNets are deep and narrow, making previous models look shallow and wide by comparison. Since training ginormous (1,000-layer, super-wide) models is very expensive, we’d do well to find the ideal tradeoff between deep and narrow versus shallow and wide.
