Trillions of Parameters

Are AI models with trillions of parameters the new normal?

Published Dec 22, 2021

The trend toward ever-larger models crossed the threshold from immense to ginormous.

What happened: Google kicked off 2021 with Switch Transformer, the first published work to exceed a trillion parameters, weighing in at 1.6 trillion. Beijing Academy of Artificial Intelligence upped the ante with WuDao 2.0, a 1.75 trillion-parameter behemoth.

Driving the story: There’s nothing magical about the number of zeros in a model’s parameter count. But as processing power and data sources have grown, what once was a tendency in deep learning has become a principle: Bigger is better. Well-funded AI companies are piling on parameters at a feverish pace, both to drive higher performance and to flex their muscles. The trend is most visible in language models, where the internet provides mountains of unlabeled data for unsupervised and semi-supervised pretraining. Since 2018, the parameter-count parade has led through BERT (110 million), GPT-2 (1.5 billion), Megatron-LM (8.3 billion), Turing-NLG (17 billion), and GPT-3 (175 billion) to the latest giants.
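To put that parade in proportion, here is a minimal Python sketch that computes how much each model grew over the one before it. It uses only the parameter counts quoted in this article; the names `PARAMETER_COUNTS`, `growth_factors`, and `total_growth` are illustrative, not from any library. From the smallest model listed to the largest, the increase works out to roughly 16,000-fold.

```python
# Parameter counts as quoted in the article, in the order they are mentioned.
PARAMETER_COUNTS = [
    ("BERT", 110_000_000),
    ("GPT-2", 1_500_000_000),
    ("Megatron-LM", 8_300_000_000),
    ("Turing-NLG", 17_000_000_000),
    ("GPT-3", 175_000_000_000),
    ("Switch Transformer", 1_600_000_000_000),
    ("WuDao 2.0", 1_750_000_000_000),
]


def growth_factors(counts):
    """Return (name, params, ratio_to_previous) for every model after the first."""
    results = []
    for (_, prev_params), (name, params) in zip(counts, counts[1:]):
        results.append((name, params, params / prev_params))
    return results


def total_growth(counts):
    """Ratio of the last model's parameter count to the first model's."""
    return counts[-1][1] / counts[0][1]


if __name__ == "__main__":
    for name, params, ratio in growth_factors(PARAMETER_COUNTS):
        print(f"{name}: {params:,} parameters ({ratio:.1f}x the previous model)")
    print(f"Overall growth: {total_growth(PARAMETER_COUNTS):,.0f}x")
```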

Yes, but: The effort to build bigger and bigger models brings its own challenges. Developers of gargantuan models must overcome four titanic obstacles.

Data: Large models need lots of data, but large-scale sources like the web and digital libraries can lack high-quality data. For example, researchers found that BookCorpus, a collection of 11,000 ebooks that has been used to train over 30 large language models, could propagate bias toward certain religions because it lacks texts that discuss beliefs other than Christianity and Islam. The AI community is increasingly aware that data quality is critical, but no consensus has emerged on efficient ways to compile large-scale, high-quality datasets.
Speed: Today’s hardware struggles to process immense models, which can bog down as bits shuttle repeatedly in and out of memory. To reduce latency, the Google team behind Switch Transformer developed a method that processes a select subset of the model’s layers for each token. Their best model rendered predictions around 66 percent faster than one that had 1/30th as many parameters. Meanwhile, Microsoft developed the DeepSpeed library, which processes data, individual layers, and groups of layers in parallel and cuts redundant processing by dividing tasks between CPUs and GPUs.
Energy: Training such giant networks burns a lot of kilowatt-hours. A 2019 study found that, using fossil fuel, training a 200 million-parameter transformer model on eight Nvidia P100 GPUs emitted nearly as much carbon dioxide as an average car over five years of driving. A new generation of chips that promise to accelerate AI, such as Cerebras’ WSE-2 and Google’s latest TPU, may help reduce emissions while wind, solar, and other cleaner energy sources ramp up to meet demand.
Delivery: These gargantuan models are much too big to run on consumer or edge devices, so deploying them at scale requires either internet access (slower) or slimmed-down implementations (less capable).
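
The idea mentioned under Speed, processing only a subset of the model for each token, can be illustrated with a toy top-1 router. This is a rough sketch under simplifying assumptions, not Google's implementation: each "expert" here is a single small weight matrix rather than a full feed-forward block, the weights are random, and every name (`route_top1`, `router_weights`, `experts`) is hypothetical. What it shows is that adding experts adds parameters, yet each token is still multiplied through only one of them.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route_top1(token, router_weights, experts):
    """Send one token through only its highest-scoring expert.

    Returns (output_vector, chosen_expert_index).
    """
    gate_probs = softmax(matvec(router_weights, token))
    chosen = max(range(len(experts)), key=lambda i: gate_probs[i])
    expert_out = matvec(experts[chosen], token)
    # Scale by the gate probability so the router's confidence shapes the output.
    return [gate_probs[chosen] * y for y in expert_out], chosen


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (illustrative values only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_experts = 4, 8
    router_weights = make_matrix(num_experts, dim, rng)
    experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
    tokens = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(5)]
    for token in tokens:
        output, chosen = route_top1(token, router_weights, experts)
        print(f"token routed to expert {chosen}; output has {len(output)} values")
```

In this toy setup, doubling `num_experts` doubles the expert parameters, while the work per token stays at one router multiplication plus one expert multiplication.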

Where things stand: Natural language modeling leaderboards remain dominated by models with parameter counts up to hundreds of billions — partly due to the difficulties of processing a trillion-plus parameters. No doubt their trillionaire successors will replace them in due course. And there’s no end in sight: Rumors circulate that OpenAI’s upcoming successor to GPT-3 will comprise a server-melting 100 trillion parameters.
