A worldwide collaboration produced the biggest open source language model to date.
What’s new: BLOOM is a family of language models built by the BigScience Research Workshop, a collective of over 1,000 researchers from 250 institutions around the globe.
How it works: BLOOM is a transformer model that emulates OpenAI’s GPT-3. It was trained on a custom 1.6-terabyte dataset to generate output in any of 46 human languages and 13 programming languages.
- The BigScience team hand-curated much of the data in an effort to mitigate bias. For instance, team members filtered out a significant amount of pornographic content, which they believe is over-represented in other datasets.
- The team trained BLOOM to complete text one word at a time using Megatron-DeepSpeed, which combines a transformer framework and a deep learning optimization library for distributed training. Megatron-DeepSpeed accelerated training by splitting the data and model across 384 GPUs.
- BLOOM is available in six sizes from 350 million to 176 billion parameters. Anyone with a Hugging Face account can query the full-size version through a browser app.
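To see why a model of this size has to be split across devices, a rough back-of-envelope calculation helps. The sketch below uses the parameter counts given above; the 2 bytes per parameter (16-bit weights) and the 80 GB of memory per GPU are assumptions made for illustration, not figures reported by BigScience, and the function names are hypothetical.

```python
import math

# Parameter counts taken from the article.
FULL_SIZE_PARAMS = 176_000_000_000
SMALLEST_PARAMS = 350_000_000

# Assumptions for illustration only (not stated in the article).
BYTES_PER_PARAM = 2   # assumes 16-bit weights
GPU_MEMORY_GB = 80    # hypothetical memory per accelerator


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb=GPU_MEMORY_GB):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params) / gpu_memory_gb)


if __name__ == "__main__":
    for name, n in [("350M", SMALLEST_PARAMS), ("176B", FULL_SIZE_PARAMS)]:
        print(f"{name}: {weight_memory_gb(n):.1f} GB of weights, "
              f"at least {min_gpus_for_weights(n)} GPU(s)")
```

Under these assumptions, the smallest model fits comfortably on a single device, while the weights of the full-size model alone exceed the memory of any one GPU. Training needs considerably more than this lower bound, since gradients, optimizer state, and activations must also be stored.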
Behind the news: BigScience began in May 2021 as a year-long series of workshops aimed at developing open source AI models that are more transparent, auditable, and representative of people from diverse backgrounds than their commercial counterparts. Prior to BLOOM, the collaboration released the T0 family of language models, which were English-only and topped out at 11 billion parameters.
Why it matters: Developing large language models tends to be the province of large companies because they can afford to amass gargantuan datasets and expend immense amounts of processing power. This makes it difficult for independent researchers to evaluate the models’ performance, including biased or harmful outputs. Groups like BigScience and EleutherAI, which released its own open source large language model earlier this year, show that researchers can band together as a counterweight to Big AI.
We’re thinking: Just over two years since GPT-3’s debut, we have open access to large language models from Google, Meta, OpenAI, and now BigScience. The rapid progress toward access is bound to stimulate valuable research and commercial projects.