As the winter holiday approaches, it occurs to me that, instead of facing AI winter, we are in a boiling-hot summer of AI.
The vast majority of economic value created by AI today comes through the tool of supervised learning, trained to generate short labels (such as spam/not-spam) or a sequence of labels (such as a transcript of audio). This year, generative AI, which is built on top of supervised learning, arrived as a second major tool that enables AI to generate complex and compelling outputs such as images or paragraphs of text.
Some previous attempts to develop major new tools — for example, reinforcement learning — have not yet borne fruit commensurate with their hype. But generative AI is working well enough that it’s creating a new paradigm for AI applications.
And supervised learning is still far from achieving even a small fraction of its potential! Millions of applications that can be solved by supervised learning have not yet been built. Many teams are still trying to figure out best practices for developing products through supervised learning.
In the coming year and beyond, I look forward to wrestling with generative AI to create massive amounts of value for everyone. I feel lucky to be alive in this era, when technology is growing rapidly and we have an opportunity to create the future together! I feel even luckier to share this world with my family and with you.
Top AI Stories of 2022
A Dazzling Year in AI
As we settle into a cup of hot cocoa and badger ChatGPT to suggest holiday gifts for our loved ones, we reflect on a year of tremendous advances in AI. Systems that generate human-like text, images, and code — with video and music on the horizon — delighted users even as they raised questions about the future of creativity. Models that decode chemistry and physics drove scientific discovery, while governments moved to control the supply of specialized microprocessors that make such innovations possible. While such developments give us pause, in this special issue of The Batch — as in past years at this season — we survey the marvels wrought by AI in 2022.
Synthetic Images Everywhere
Pictures produced by AI went viral, stirred controversies, and drove investments.
What happened: A new generation of text-to-image generators inspired a flood of experimentation, transforming text descriptions into mesmerizing artworks and photorealistic fantasies. Commercial enterprises were quick to press the technology into service, making image generation a must-have feature in software for creating and editing graphics.
Driving the story: Models that generate media became the public face of AI thanks to friendly user interfaces, highly entertaining output, and open APIs and models.
- OpenAI introduced DALL·E 2 in April. More than 1.5 million users beta tested the model, and in September, the company made it widely available. Microsoft, which funds OpenAI in exchange for exclusive commercial rights to its work, integrated the model into its Azure AI-as-a-service platform.
- By July, push-button artists were flooding social media platforms with relatively crude images produced by the simpler Craiyon.
- Stability AI soon upped the ante with the open source model Stable Diffusion — updated in November to version 2.0 — that eventually attracted more than $100 million in fresh capital.
- Adobe and stock-photo kingpins Getty Images and Shutterstock integrated image-generation models into their own products and services.
- Such programs produce radically different results depending on the text prompt they’re given. PromptBase opened a marketplace for text strings that generate interesting output.
Yes, but: Such models are trained on images scraped from the web. Like large language models, they inherit biases embedded in online content and can imitate inflammatory styles of expression found there.
- Lensa AI, a photo-editing app that generates artistic avatars from users’ selfies, reached the top of mobile app store charts. Its success came with a dose of controversy as users, particularly women, found that the app sexualized their images.
- ArtStation, an online community for visual artists, launched its own text-to-image features. Many artists, feeling threatened by computer programs that can reproduce an artist’s hard-won personal style in seconds, boycotted the website.
Behind the news: Diffusion models generate output by starting with noise and removing it selectively over a series of steps. Introduced in 2015 by researchers at UC Berkeley and Stanford, they remained in the background for several years until further work showed that they could produce images competitive with the output of generative adversarial networks (GANs). Stability AI put a diffusion model at the heart of Stable Diffusion. OpenAI, which based the initial version of DALL·E on an autoregressive transformer, updated it with a diffusion model at around the same time.
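The denoising loop described above can be sketched in a few lines. This is a toy illustration, not any production sampler: the `toy_denoiser` function stands in for a trained neural network that predicts the noise to remove at each step.

```python
import numpy as np

def toy_denoiser(x, step, total_steps):
    """Stand-in for a trained network: predicts the noise component
    to remove. Here we simply shrink the sample toward zero."""
    return x / (total_steps - step + 1)

def generate(shape=(8, 8), steps=50, seed=0):
    """Reverse-diffusion sketch: start from pure Gaussian noise and
    iteratively subtract the (predicted) noise, a little per step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)           # begin with pure noise
    for t in range(steps):
        predicted_noise = toy_denoiser(x, t, steps)
        x = x - predicted_noise          # remove some noise each step
    return x

sample = generate()
```

A real diffusion model replaces the toy denoiser with a large U-Net or transformer trained to predict the noise added at each step, and the text prompt conditions that prediction.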
Where things stand: The coming year is shaping up for a revolution in computer-aided creativity. And the groundswell of generated imagery isn’t going to stop at pictures. Google and Meta released impressive text-to-video models this year, and OpenAI accelerated text-to-3D-object generation by an order of magnitude.
Programmer’s Best Friend
Behind schedule on a software project? There’s an app for that.
What happened: Language models fine-tuned on computer code proved capable of generating software routines similar to the work of experienced developers — though the results can be hit-or-miss.
Driving the story: AI-powered code generators made their way into large companies, and even small-time developers (and non-developers) gained access to them.
- eBay started the year by placing low-code tools into the hands of non-engineers, enabling them to build and deploy models without prior knowledge of AI or machine learning.
- In February, DeepMind introduced AlphaCode, a transformer pretrained on 86 million programs in 12 programming languages and fine-tuned on entries to coding contests. At inference, it generates a million possible solutions and filters out the bad ones. In this way, it retroactively beat more than half of contestants in 10 coding competitions.
- In June, GitHub opened access to Copilot, an autocomplete system that suggests code in real time. Users pay a subscription fee, though students and verified open-source developers get free access.
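AlphaCode's generate-then-filter strategy can be illustrated with a toy sketch: sample many candidate programs, then keep only those that pass the problem's example tests. Here the hypothetical `candidates` list stands in for model samples, and candidates are simple Python expressions rather than full programs.

```python
def passes_examples(program_src, examples):
    """Run a candidate program (here, a Python expression over `x`)
    against the problem's example input/output pairs."""
    try:
        fn = eval("lambda x: " + program_src)  # toy only: never eval untrusted code
        return all(fn(inp) == out for inp, out in examples)
    except Exception:
        return False

def filter_candidates(candidates, examples):
    """Keep only candidates whose behavior matches the examples,
    mimicking AlphaCode's filtering of sampled solutions."""
    return [c for c in candidates if passes_examples(c, examples)]

# Toy problem: double the input. As with a real model,
# most sampled candidates are wrong.
examples = [(1, 2), (3, 6), (10, 20)]
candidates = ["x + 2", "x * 2", "x ** 2", "x * 2 + 0", "x - 1"]
survivors = filter_candidates(candidates, examples)
```

Filtering on example tests prunes the vast majority of samples; AlphaCode then clusters the survivors by behavior to pick a handful of submissions.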
Behind the news: Users of OpenAI’s GPT-3 language model showed that it could generate working code as early as mid-2020. A year later, OpenAI introduced a fine-tuned version known as Codex, which serves as the foundation for GitHub's Copilot.
Yes, but: The widely available versions of this technology aren’t yet able to write complex programs. Often their output looks right at first glance but turns out to be buggy. Moreover, their legal status may be in jeopardy. A class-action lawsuit against GitHub, OpenAI, and Microsoft claims that the training of Codex violated open source licensing agreements. The outcome could have legal implications for models that generate text, images, and other media as well.
Where things stand: AI-powered coding tools aren’t likely to replace human programmers in the near future, but they may replace the tech question-and-answer site Stack Overflow as the developer’s favorite crutch.
AI’s Eyes Evolve
Work on vision transformers exploded in 2022.
What happened: Researchers published an abundance of vision transformer (ViT) papers during the year. A major theme: combining self-attention and convolution.
Driving the story: A team at Google Brain introduced vision transformers (ViTs) in 2020, and the architecture has undergone nonstop refinement since then. The latest efforts adapt ViTs to new tasks and address their shortcomings.
- ViTs learn best from immense quantities of data, so researchers at Meta and Sorbonne University concentrated on improving performance on datasets of (merely) millions of examples. They boosted performance using transformer-specific adaptations of established procedures such as data augmentation and model regularization.
- Researchers at Inha University modified two key components to make ViTs more like convolutional neural networks. First, they divided images into patches with more overlap. Second, they modified self-attention to focus on a patch's neighbors rather than on the patch itself, and enabled it to learn whether to weigh neighboring patches more evenly or more selectively. These modifications brought a significant boost in accuracy.
- Researchers at the Indian Institute of Technology Bombay outfitted ViTs with convolutional layers. Convolution brings benefits like local processing of pixels and smaller memory footprints due to weight sharing. With respect to accuracy and speed, their convolutional ViT outperformed the usual version as well as runtime optimizations of transformers such as Performer, Nyströmformer, and Linear Transformer. Other teams took similar approaches.
Behind the news: While much ViT research aims to surpass and ultimately replace convolutional neural networks (CNNs), the more potent trend is to marry the two. The ViT’s strength lies in its ability to consider relationships between all pixels in an image at both small and large scales. One downside is that it must learn from additional training data the spatial assumptions that are baked into the CNN architecture from random initialization. The CNN’s local context window (within which only nearby pixels matter) and weight sharing (which enables it to process different image locations identically) help transformers learn more from less data.
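The contrast in inductive bias shows up in how a ViT turns an image into tokens: it splits the image into patches and lets self-attention relate every patch to every other, near or far. A minimal patchify-and-attend sketch with toy dimensions and no learned weights (real ViTs add linear projections, positional embeddings, and multiple heads):

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into flattened non-overlapping patches,
    the tokenization step of a vision transformer."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    patches = [
        image[r*patch:(r+1)*patch, c*patch:(c+1)*patch].ravel()
        for r in range(rows) for c in range(cols)
    ]
    return np.stack(patches)  # shape: (num_patches, patch*patch)

def self_attention(tokens):
    """Single-head self-attention with identity projections:
    every patch attends to every other patch, regardless of distance."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over patches
    return weights @ tokens

image = np.arange(64, dtype=float).reshape(8, 8)
tokens = patchify(image, patch=4)   # 4 patches of 16 pixels each
mixed = self_attention(tokens)
```

A convolution, by contrast, would mix only pixels within a small window and reuse the same weights everywhere, which is exactly the prior knowledge the hybrid architectures above graft back onto the transformer.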
Where things stand: The past year expanded the vision transformer’s scope in a number of applications. ViTs generated plausible successive video frames, generated 3D scenes from 2D image sequences, and detected objects in point clouds. It’s hard to imagine recent advances in text-to-image generators based on diffusion models without them.
A MESSAGE FROM DEEPLEARNING.AI
Mathematics for Machine Learning and Data Science is our next specialization. Set to launch in January 2023, it’s a beginner-friendly way to master the math behind AI algorithms and data analysis techniques. Join the waitlist and be among the first to enroll!
Language Models, Extended
Researchers pushed the boundaries of language models to address persistent problems of trustworthiness, bias, and updatability.
What happened: While many AI labs aimed to make large language models more sophisticated by refining datasets and training methods — including methods that trained a transformer to translate 1,000 languages — others extended model architectures to search the web, consult external documents, and adjust to new information.
Driving the story: The capacity of language models to generate plausible text outstrips their ability to discern facts and resist spinning fantasies and expressing social biases. Researchers worked to make their output more trustworthy and less inflammatory.
- In late 2021, DeepMind proposed RETRO, a model that retrieves passages from the MassiveText dataset and integrates them into its output.
- AI21 Labs' spring launch of Jurassic-X introduced a suite of modules — including a calculator and a system that queries Wikipedia — to fact-check a language model’s answers to math problems, historical facts, and the like.
- Researchers at Stanford and École Polytechnique Fédérale de Lausanne created SERAC, a system that updates language models with new information without retraining them. A separate system stores new data and learns to provide output to queries that are relevant to that data.
- Meta built Atlas, a language model that answers questions by retrieving information from a database of documents. Published in August, this approach enabled an 11 billion-parameter Atlas to outperform a 540 billion-parameter PaLM at answering questions.
- Late in the year, OpenAI fine-tuned ChatGPT to minimize untruthful, biased, or harmful output. Humans ranked sample outputs from the model, then a reinforcement learning algorithm rewarded the model for generating outputs similar to those ranked highly.
- Such developments intensified the need for language benchmarks that evaluate more varied and subtle capabilities. Answering the call, more than 130 institutions collaborated on BIG-bench, which includes tasks like deducing a movie title from emojis, participating in mock trials, and detecting logical fallacies.
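The retrieval idea behind systems like RETRO and Atlas can be sketched generically: embed the query, pull the closest passages from a document store, and prepend them to the model's prompt. A toy sketch using bag-of-words similarity in place of a learned encoder; all names and documents here are illustrative, not any system's actual API:

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: bag-of-words counts.
    Real systems use a trained neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Prepend retrieved passages so the model can ground its answer."""
    context = "\n".join(retrieve(query, documents, k=1))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

docs = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain.",
]
prompt = build_prompt("Where is the Eiffel Tower?", docs)
```

RETRO integrates retrieved chunks inside the transformer's attention layers rather than in the prompt, but the payoff is the same: the model can consult stored text instead of memorizing every fact in its weights.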
Behind the news: Amid the progress came a few notable stumbles. The public demo of Meta’s Galactica, a language model trained to generate text on scientific and technical subjects, lasted three days in November before its developers pulled the plug due to its propensity to generate falsehoods and cite nonexistent sources. In August, the chatbot BlenderBot 3, also from Meta, quickly gained a reputation for spouting racist stereotypes and conspiracy theories.
Where things stand: The toolbox of truth and decency in text generation grew substantially in the past year. Successful techniques will find their way into future waves of blockbuster models.
One Model Does It All
Individual deep learning models proved their mettle in hundreds of tasks.
What happened: The scope of multi-task models expanded dramatically in the past year.
Driving the story: Researchers pushed the limits of how many different skills a neural network can learn. They were inspired by the emergent skills of large language models — say, the ability to compose poetry and write computer programs without architectural tuning for either — as well as the capacity of models trained on both text and images to find correspondences between the disparate data types.
- In spring, Google’s PaLM showed state-of-the-art results in few-shot learning on hundreds of tasks that involve language understanding and generation. In some cases, it outperformed fine-tuned models or average human performance.
- Shortly afterward, DeepMind announced Gato, a transformer that learned over 600 diverse tasks — playing Atari games, stacking blocks using a robot arm, generating image captions, and so on — though not necessarily as well as separate models dedicated to those tasks. The system underwent supervised training on a wide variety of datasets simultaneously, from text and images to actions generated by reinforcement learning agents.
- As the year drew to a close, researchers at Google brought a similar range of abilities to robotics. RT-1 is a transformer that enables robots to perform over 700 tasks. The system, which tokenizes actions as well as images, learned from a dataset of 130,000 episodes collected from a fleet of robots over nearly a year and a half. It achieved outstanding zero-shot performance on new tasks, environments, and objects compared to prior techniques.
Behind the news: The latest draft of the European Union’s proposed AI Act, which could become law in 2023, would require users of general-purpose AI systems to register with the authorities, assess their systems for potential misuse, and conduct regular audits. The draft defines general-purpose systems as those that “perform generally applicable functions such as image/speech recognition, audio/video generation, pattern-detection, question-answering, translation, etc.,” and are able to “have multiple intended and unintended purposes.” Some observers have criticized the definition as too broad. The emerging breed of truly general-purpose models may prompt regulators to sharpen their definition.
Where things stand: We’re still in the early phases of building algorithms that generalize to hundreds of different tasks, but the year showed that deep learning has the potential to get us there.