LAION Roars: The story of LAION, the dataset behind Stable Diffusion

The largest dataset for training text-to-image generators was assembled by volunteers for roughly $10,000. Now it’s implicated in fights over whether copyrighted works can be used for training.

What’s new: Christoph Schuhmann, a German high school teacher who helped found the Large-scale Artificial Intelligence Open Network (LAION), told Bloomberg how a cadre of outsiders came together to ensure that large tech companies aren’t the only ones with access to large quantities of training data. The nonprofit group’s datasets — notably LAION-5B (5 billion text-image pairs) — have been used to train Stability AI’s Stable Diffusion, Google’s Imagen, and other text-to-image models.

Volunteer work: Schuhmann and two co-founders met on a Discord server for AI enthusiasts. Catalyzed by the launch of OpenAI’s DALL·E in January 2021, they decided to build their own image dataset. They established a separate Discord server in March 2021, which continues to act as LAION’s nerve center.

  • The group used a Python script to trawl through raw HTML in the Common Crawl dataset to identify images paired with alt text. They used OpenAI’s CLIP to calculate a similarity score between a linked image and its corresponding text and selected pairs with sufficiently high scores.
  • They probed image hosting sites like Pinterest and DeviantArt, ecommerce services like Shopify, cloud services like Amazon Web Services, thumbnails from YouTube, photos from U.S. government websites, and images from news sites. The team did not filter out objectionable content.
  • The team covered its server fees through a combination of crowdfunding, a 2021 donation from Hugging Face for an unspecified amount, and a donation from Stability AI founder Emad Mostaque for between $9,000 and $10,000. Mostaque, who had founded Stability AI in 2020, used a 2 billion-image subset of LAION-5B to train Stable Diffusion, released in August 2022.
  • Schuhmann, who continues to work for LAION pro bono, has refused job offers from several tech firms.
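
How the filtering might look: The collection step described above (find images with alt text in raw HTML, then keep pairs whose image and text embeddings are similar enough) can be sketched in a few lines of Python. This is a minimal illustration, not LAION's actual script. The names `extract_pairs` and `filter_pairs`, the `embed_image` and `embed_text` callables (stand-ins for CLIP's image and text encoders), and the 0.3 threshold are all assumptions made for the example. Downloading the images is omitted.

```python
from html.parser import HTMLParser
import math


class ImgAltParser(HTMLParser):
    """Collects (src, alt) pairs from <img> tags that carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        # Attributes without a value arrive as None, so fall back to "".
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return every (image URL, alt text) pair found in one HTML document."""
    parser = ImgAltParser()
    parser.feed(html)
    return parser.pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, embed_image, embed_text, threshold=0.3):
    """Keep pairs whose image and text embeddings score at or above the threshold.

    embed_image and embed_text are hypothetical callables standing in for
    CLIP's encoders; the default threshold is illustrative only.
    """
    kept = []
    for src, alt in pairs:
        if cosine(embed_image(src), embed_text(alt)) >= threshold:
            kept.append((src, alt))
    return kept
```

In a real pipeline, the two embedding callables would wrap a CLIP model, and the threshold would be tuned by inspecting which pairs survive. Because only URLs and captions are kept, the output of a filter like this is a list of links, not a collection of image files.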

Behind the news: Data scraped from the web is at the center of several disputes.

  • Artists are suing Stability AI and Midjourney for their use of copyrighted works in developing AI models. Developers are suing Microsoft, GitHub, and OpenAI over their use of open source code for the same purpose. Both cases are in progress.
  • LAION may be insulated from claims of copyright violation because it doesn’t host images directly. Its datasets supply web links to images rather than the images themselves. When a photographer who contributes to stock image libraries sent a cease-and-desist request demanding that LAION delete his images from its datasets, LAION responded that it has nothing to delete. Its lawyers sent the photographer an invoice for €979 for filing an unjustified copyright claim.
  • A major recording company has pressured streaming services to block AI developers from downloading music.
  • Such conflicts are set to proliferate. The latest draft of the European Union’s AI Act, which has been approved by the bloc’s assembly and is pending review by a higher authority, mandates that generative AI developers disclose copyrighted materials used to train their models — a tall order when those materials are scraped from the web en masse.

Why it matters: Copyright holders are questioning the ethics of using their materials to build AI models. LAION plays a major role in the controversy. On one hand, it’s a nonprofit effort run by volunteers on a shoestring budget. On the other, the datasets it curates are driving tremendous business value. Stability AI, for instance, seeks a $4 billion valuation.

We’re thinking: The AI community is entering an era in which we are called upon to be more transparent in our collection and use of data. We shouldn’t take resources like LAION for granted, because we may not always have permission to use them.

