Researchers discovered a new way to reduce memory requirements when training large machine learning models.
What's new: Tim Dettmers and colleagues at the University of Washington released 8-bit optimizers that store gradient statistics as 8-bit values, rather than the usual 32-bit values, while maintaining the same accuracy.
Key insight: Popular optimizers like Adam use statistics derived from gradients to accelerate training. Adam keeps running estimates of each weight’s gradient and squared gradient, and these statistics can occupy as much as 50 percent of the memory required during training. However, at any given time, the optimizer needs only the estimates pertinent to the weights it’s currently processing. The rest can be quantized temporarily (that is, converted into numbers that use fewer bits) to take up less memory.
How it works: The authors used block-wise quantization, which means that gradient statistics were split into blocks and each block was quantized independently.
- During training, an optimizer updated parameters in groups (for example, the group of weights in a neural network’s first layer). After it updated the weights of one group, it quantized the group’s gradient statistics, stored them, and updated the next group.
- To perform quantization, the algorithm split the gradient statistics of one group into blocks of 2,048 numbers. For each block, it recorded the maximum absolute value, then divided each element in the block by that value, so the maximum absolute value became 1. For each normalized element, it looked up the closest value in a table of 256 values representable in 8 bits, then stored the index (0 to 255) of that value.
- When it returned to a particular group, it dequantized the gradient statistics for that group by reversing the steps above. Then it performed another update and quantized the statistics again.
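The quantize and dequantize steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the lookup table here is assumed to be 256 evenly spaced values between -1 and 1, which may differ from the 8-bit data type the authors actually used.

```python
import numpy as np

BLOCK_SIZE = 2048  # block length described in the text

# Assumption: a uniform 256-entry lookup table over [-1, 1].
# The authors' actual 8-bit data type may differ.
CODEBOOK = np.linspace(-1.0, 1.0, 256, dtype=np.float32)
# Midpoints between neighboring entries, used for nearest-value lookup.
MIDPOINTS = (CODEBOOK[1:] + CODEBOOK[:-1]) / 2


def quantize_blockwise(state):
    """Return (uint8 indices, per-block max absolute value, original shape)."""
    state = np.asarray(state, dtype=np.float32)
    flat = state.ravel()
    pad = (-flat.size) % BLOCK_SIZE  # zero-pad so the last block is full
    blocks = np.pad(flat, (0, pad)).reshape(-1, BLOCK_SIZE)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero blocks.
    scale = np.where(absmax == 0, 1.0, absmax).astype(np.float32)
    normalized = blocks / scale  # now every value lies in [-1, 1]
    # Index of the closest lookup-table entry for each element.
    indices = np.searchsorted(MIDPOINTS, normalized).astype(np.uint8)
    return indices, absmax, state.shape


def dequantize_blockwise(indices, absmax, shape):
    """Reverse the steps: look up each value, rescale, drop the padding."""
    blocks = CODEBOOK[indices] * absmax
    size = int(np.prod(shape))
    return blocks.ravel()[:size].reshape(shape)


# Usage: quantize a made-up statistics tensor, then restore it.
rng = np.random.default_rng(0)
example_state = rng.normal(size=5000).astype(np.float32)
idx, block_max, shape = quantize_blockwise(example_state)
restored = dequantize_blockwise(idx, block_max, shape)
print("max round-trip error:", np.abs(restored - example_state).max())
```

Because each block is scaled by its own maximum, an outlier in one block affects the precision of only that block's 2,048 values rather than the whole tensor.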
Results: The authors used their method on a few language tasks including machine translation and GLUE. Models trained with the 8-bit version of Adam achieved BLEU and accuracy scores on those tasks, respectively, nearly identical to those achieved with the 32-bit version. Using 8-bit Adam, the authors fine-tuned a 1.5 billion-parameter GPT-2-large on an Nvidia V100 GPU with 24GB of memory. Using the 32-bit Adam optimizer, the hardware maxed out on a 762 million-parameter GPT-2-medium.
Why it matters: Using an 8-bit optimizer makes it possible to train bigger models (in this work, roughly twice as big) on a given hardware configuration. For instance, now we can train RoBERTa-large, which is 1 percent to 5 percent more accurate than RoBERTa according to the original paper, within the previous memory requirement for the smaller version.
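A back-of-the-envelope calculation gives a feel for the savings. The sketch below assumes Adam keeps two statistics per parameter and that the 8-bit version stores one extra 32-bit maximum per block of 2,048 values. It counts optimizer statistics only, ignoring weights, gradients, and activations, and the function name is hypothetical.

```python
def adam_state_bytes(num_params, bits, block_size=2048):
    """Approximate bytes used by Adam's two per-parameter statistics."""
    state_values = 2 * num_params  # two statistics per parameter
    total = state_values * bits // 8
    if bits == 8:
        # Assumption: one 32-bit (4-byte) maximum stored per block.
        num_blocks = -(-state_values // block_size)  # ceiling division
        total += 4 * num_blocks
    return total


params = 1_500_000_000  # the 1.5 billion-parameter model mentioned above
full = adam_state_bytes(params, bits=32)
small = adam_state_bytes(params, bits=8)
print(f"32-bit statistics: {full / 1e9:.2f} GB")
print(f" 8-bit statistics: {small / 1e9:.2f} GB")
```

Under these assumptions, the statistics shrink to roughly a quarter of their 32-bit size, and the per-block maximums add only a few megabytes.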
We're thinking: Details like how much memory an optimizer uses may not seem worthy of attention when you’re designing and training a model — but, given the memory and processing requirements of deep learning models, sometimes they can have a big impact.