Dear friends,

As we approach the end of 2021, you may be winding down work and gearing up for the winter holiday. I’m looking forward to taking a break from work and hope you are too.

December is sometimes called the Season of Giving. If you have spare time and are wondering what to do with it, I think one of the best things any of us can do is to reflect on how we can help others.

When the AI community was small, there was a strong spirit of cooperation. It felt like an intrepid band of pioneers taking on the world, and people were eager to help others with advice, an encouraging word, or an introduction. Those who benefited from this often couldn’t pay it back, so we paid it forward by helping those who came after us. As the AI community grows, I would like to preserve this spirit. I promise to keep working to build up the AI community. I hope you will, too!

I also hope that you will consider ways — large or small — that you can lend a helping hand beyond the AI community. Many of us have access to advanced technology that much of the world does not. Collectively, our decisions move billions of dollars and affect billions of lives. This gives us a special opportunity to do good in the world.

“We are what we repeatedly do,” said historian and philosopher Will Durant (often misattributed to Aristotle). If you repeatedly seek to uplift others, not only does this help them but — perhaps equally important — it makes you a better person, too, for it is your repeated actions that define you as a person. There’s also a classic study that shows spending money on others may make you happier than spending money on yourself.

So, during this holiday season, I hope you’ll take some time off. Rest, relax, and recharge! Connect with loved ones if you haven’t done so frequently enough over the past year. And if time permits, find something meaningful you can do to help someone else, be it leaving an encouraging comment on a blog post, sharing advice or encouragement with a friend, answering an AI question in an online forum, or making a donation to a worthy cause. Among charities relevant to education and/or tech, my favorites include the Wikimedia Foundation, Khan Academy, Electronic Frontier Foundation, and Mozilla Foundation. You can pick something meaningful to you from this list of organizations vetted by Charity Watch.

In the U.S., many parents tell their children that Santa Claus, the jolly character who leaves gifts in their homes at this time of year, is a magical being. When the kids grow up, they learn that Santa Claus isn’t real. Can we, as adults, be real-life Santa Clauses ourselves and give the gifts of our time, attention, or funds to someone else?

Love,

Andrew

2021 in the Rear-View Monitor

In the past year, the globe struggled with extreme weather events, economic inflation, disrupted supply chains, and the Darwinian wiles of Covid-19. In tech, it was another year of virtual offices and virtual conferences. The AI community continued its effort to bridge these worlds, advancing machine learning while strengthening its ability to benefit all corners of society. We probed the dark side of 2021 in our Halloween special issue. In this issue, we highlight developments that are primed to change the face of AI in 2022 and beyond.

Multimodal AI Takes Off

While models like GPT-3 and EfficientNet, which work on text and images respectively, are responsible for some of deep learning’s highest-profile successes, approaches that find relationships between text and images made impressive strides.

What happened: OpenAI kicked off a big year in multimodal learning with CLIP, which matches images and text, and Dall·E, which generates images that correspond to input text. DeepMind’s Perceiver IO classifies text, images, video, and point clouds. Stanford’s ConVIRT added text labels to medical X-ray images.

Driving the story: While the latest multimodal systems were mostly experimental, a few real-world applications broke through.

The open source community combined CLIP with generative adversarial networks to produce striking works of digital art. Artist Martin O’Leary used Samuel Coleridge’s epic poem “Kubla Khan” as input to generate the psychedelic scrolling video interpretation, “Sinuous Rills.”
Facebook said its multimodal hate-speech detector flagged 97 percent of the abusive and harmful content it removed from the social network. The system classifies memes and other image-text pairings as benign or harmful based on 10 data types including text, images, and video.
Google said it would add multimodal (and multilingual) capabilities to its search engine. Its Multitask Unified Model returns links to text, audio, images, and videos in response to queries in any of 75 languages.

Behind the news: The year’s multimodal momentum built upon decades of research. In 1989, researchers at Johns Hopkins University and UC San Diego developed a system that classified vowels based on audio and visual data of people speaking. Over the next two decades, various groups attempted multimodal applications like indexing digital video libraries and classifying human emotions based on audiovisual data.

Where things stand: Images and text are so complex that, in the past, researchers had their hands full focusing on one or the other. In doing so, they developed very different techniques. Over the past decade, though, computer vision and natural language processing have converged on neural networks, opening the door to unified models that merge the two modes. Look for models that integrate audio as well.

Trillions of Parameters

The trend toward ever-larger models crossed the threshold from immense to ginormous.

What happened: Google kicked off 2021 with Switch Transformer, the first published work to exceed a trillion parameters, weighing in at 1.6 trillion. Beijing Academy of Artificial Intelligence upped the ante with Wu Dao 2.0, a 1.75 trillion-parameter behemoth.

Driving the story: There's nothing magical about the number of zeros in a model’s parameter count. But as processing power and data sources have grown, what once was a tendency in deep learning has become a principle: Bigger is better. Well-funded AI companies are piling on parameters at a feverish pace — both to drive higher performance and to flex their muscles — notably in language models, where the internet provides mountains of unlabeled data for unsupervised and semi-supervised pretraining. Since 2018, the parameter-count parade has led through BERT (110 million), GPT-2 (1.5 billion), MegatronLM (8.3 billion), Turing-NLG (17 billion), and GPT-3 (175 billion) to the latest giants.

Yes, but: The effort to build bigger and bigger models brings its own challenges. Developers of gargantuan models must overcome four titanic obstacles.

Data: Large models need lots of data, but large-scale sources like the web and digital libraries can lack high-quality data. For example, researchers found that BookCorpus, a collection of 11,000 ebooks that has been used to train over 30 large language models, could propagate bias toward certain religions because it lacks texts that discuss beliefs other than Christianity and Islam. The AI community is increasingly aware that data quality is critical, but no consensus has emerged on efficient ways to compile large-scale, high-quality datasets.
Speed: Today’s hardware struggles to process immense models, which can bog down as bits shuttle repeatedly in and out of memory. To reduce latency, the Google team behind Switch Transformer developed a method that processes a select subset of the model’s layers for each token. Their best model rendered predictions around 66 percent faster than one that had 1/30th as many parameters. Meanwhile, Microsoft developed the DeepSpeed library, which processes data, individual layers, and groups of layers in parallel and cuts redundant processing by dividing tasks between CPUs and GPUs.
Energy: Training such giant networks burns a lot of kilowatts. A 2019 study found that, using fossil fuel, training a 200 million-parameter transformer model on eight Nvidia P100 GPUs emitted nearly as much carbon dioxide as an average car over five years of driving. A new generation of chips that promise to accelerate AI, such as Cerebras’ WSE-2 and Google’s latest TPU, may help reduce emissions while wind, solar, and other cleaner energy sources ramp up to meet demand.
Delivery: These gargantuan models are much too big to run on consumer or edge devices, so deploying them at scale requires either internet access (slower) or slimmed-down implementations (less capable).
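
The Speed item above turns on running only part of a model for each token. Here is a toy sketch of that idea, assuming a simple router that sends each token to the single highest-scoring "expert" sub-network. The names route_token, sparse_layer, router_weights, and experts are hypothetical, and the experts are stand-in functions; this is an illustration of sparse routing in general, not Google's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights):
    """Score every expert for this token and pick the best one.

    router_weights holds one weight vector per expert (hypothetical values).
    Returns (expert_index, gate), where gate is the router's probability
    for the chosen expert.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def sparse_layer(tokens, router_weights, experts):
    """Run exactly one expert per token and count how often each is used."""
    outputs = []
    calls = [0] * len(experts)
    for token in tokens:
        index, gate = route_token(token, router_weights)
        calls[index] += 1
        # Scale the chosen expert's output by the router's confidence.
        outputs.append([gate * y for y in experts[index](token)])
    return outputs, calls


# Three stand-in experts; a real model would use feed-forward networks.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
tokens = [[3.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

outputs, calls = sparse_layer(tokens, router_weights, experts)
print(calls)
```

The total parameter count grows with the number of experts, but each token pays for only one expert's computation, which is why a sparsely routed model can be far larger without a matching increase in work per token.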

Where things stand: Natural language modeling leaderboards remain dominated by models with parameter counts up to hundreds of billions — partly due to the difficulties of processing a trillion-plus parameters. No doubt their trillionaire successors will replace them in due course. And there’s no end in sight: Rumors circulate that OpenAI’s upcoming successor to GPT-3 will comprise a server-melting 100 trillion parameters.

Voices for the Voiceless

Musicians and filmmakers adopted AI as a standard part of the audio-production toolbox.

What happened: Professional media makers embraced neural networks that generate new sounds and modify old ones. Voice actors bristled.

Driving the story: Generative models can learn from existing recordings to create convincing facsimiles. Some producers used the technology to generate original voices, some to mimic existing voices. You can hear their work via the links below.

Modulate, a U.S. startup, uses generative adversarial networks to synthesize a new voice for a human speaker in real time. It enables gamers and voice chatters to inhabit a fictional character, and trans people have used it to adjust their voices closer to their gender identities.
Sonantic, a startup that specializes in synthetic voices, created a new voice for actor Val Kilmer, who lost much of his vocal ability to throat surgery in 2015. The company trained its model on audio from the Top Gun star’s body of work.
Filmmaker Morgan Neville hired a software company to re-create the voice of the late travel-show host Anthony Bourdain for his documentary Roadrunner: A Film About Anthony Bourdain. The move prompted outrage from Bourdain’s widow, who said she had not given her permission.

Yes, but: Bourdain’s widow isn’t the only one who’s disturbed by AI’s ability to mimic deceased performers. Voice actors expressed worry that the technology threatens their livelihoods; they were upset by a fan-built modification of the 2015 video game The Witcher 3: Wild Hunt that included cloned voices of the original actors.

Behind the news: The recent mainstreaming of generated audio followed earlier research milestones.

OpenAI’s Jukebox, which was trained on a database of 1.2 million songs, employs a pipeline of autoencoders, transformers, and decoders to produce fully realized recordings (with lyrics co-written by the company’s engineers) in styles from Elvis to Eminem.
In 2019, an anonymous AI developer devised a technique that allows users to clone the voices of animated and video game characters from lines of text in as little as 15 seconds.

Where things stand: Generative audio — not to mention video — models give media producers the ability not only to buff up archival recordings but to create new, sound-alike recordings from scratch. But the ethical and legal issues are mounting. How should voice actors be compensated when AI stands in for them? Who has the right to commercialize cloned voices of a deceased person? Is there a market for a brand-new, AI-generated Nirvana album — and should there be?

One Architecture to Do Them All

The transformer architecture extended its reach to a variety of new domains.

What happened: Originally developed for natural language processing, transformers are becoming the Swiss Army Knife of deep learning. In 2021, they were harnessed to discover drugs, recognize speech, and paint pictures — and much more.

Driving the story: Transformers had already proven adept at vision tasks, predicting earthquakes, and classifying and generating proteins. Over the past year, researchers pushed them into expansive new territory.

TransGAN is a generative adversarial network that incorporates transformers to make sure each generated pixel is consistent with those generated before it. The work achieved state-of-the-art results in measurements of how closely generated images resembled the training data.
Facebook’s TimeSformer used the architecture to recognize actions in video clips. Rather than the usual sequence of words in text, it interprets the sequence of video frames. It outperformed convolutional neural networks, analyzing longer clips in less time and using less power.
Researchers at Facebook, Google, and UC Berkeley trained a GPT-2 on text and then froze its self-attention and feed-forward layers. They were able to fine-tune it for a wide variety of domains including mathematics, logic problems, and computer vision.
DeepMind released an open-source version of AlphaFold 2, which uses transformers to find the 3D shapes of proteins based on their sequences of amino acids. The model has excited the medical community for its potential to fuel drug discovery and reveal biological insights.

Behind the news: The transformer debuted in 2017 and quickly revolutionized language modeling. Its self-attention mechanism, which tracks how each element in a sequence relates to every other element, suits it to analyze sequences of not only words but also pixels, video frames, amino acids, seismic waves, and so on. Large language models based on transformers have taken center stage as examples of an emerging breed of foundation models: models pretrained on a large, unlabeled corpus that can be fine-tuned for specialized tasks on a limited number of labeled examples. The fact that transformers work well in a variety of domains may portend transformer-based foundation models beyond language.
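
To make "how each element relates to every other element" concrete, here is a minimal sketch of single-head self-attention in plain Python. It makes a simplifying assumption: the queries, keys, and values are the input vectors themselves, whereas a real transformer first passes the inputs through learned projections. The function names and the toy vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head self-attention with identity projections (assumption).

    sequence: list of equal-length vectors, one per element
    (a word, pixel, video frame, amino acid, and so on).
    Returns (outputs, weights), where weights[i][j] is how strongly
    element i attends to element j.
    """
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in sequence:
        # Compare this element with every element, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in sequence
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all elements.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, sequence)) for d in range(dim)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Nothing in the computation depends on what the vectors represent, which is one way to see why the same mechanism carries over from words to other kinds of sequences.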

Where things stand: The history of deep learning has seen a few ideas that rapidly became pervasive: the ReLU activation function, Adam optimizer, attention mechanism, and now transformers. The past year’s developments suggest that this architecture is still coming into its own.

Governments Lay Down the Law

Legislators worldwide wrote new laws — some proposed, some enacted — to rein in societal impacts of automation.

What happened: Authorities at all levels ratcheted up regulatory pressure as AI’s potential impact on privacy, fairness, safety, and international competition became ever more apparent.

Driving the story: AI-related laws tend to reflect the values of the world’s varied political orders, favoring some balance of social equity with individual liberty.

The European Union drafted rules that would ban or restrict machine learning applications based on categories of risk. Real-time facial recognition and social credit systems would be forbidden. Systems that control vital infrastructure, aid law enforcement, and identify people based on biometrics would need to come with detailed documentation, demonstrate their safety, and undergo ongoing human supervision. Issued in April, the draft rules must undergo a legislative process including amendments and likely won’t be implemented for at least another 12 months.
Beginning next year, China’s internet regulator will enforce laws governing recommendation algorithms and other AI systems that it deems disruptive to social order. Targets include systems that spread disinformation, promote addictive behavior, and harm national security. Companies would have to gain approval before deploying algorithms that might affect public sentiment, and those that defy the rules would face a ban.
The U.S. administration proposed an AI Bill of Rights that would protect citizens against systems that infringe on privacy and civil rights. The government will collect public comments on the proposal until January 15. Below the federal level, a number of U.S. cities and states restricted face recognition systems, and New York City passed a law that requires hiring algorithms to be audited for bias.
The United Nations’ High Commissioner for Human Rights called on member states to suspend certain uses of AI including those that infringe on human rights, limit access to essential services, and exploit private data.

Behind the news: The AI community may be approaching a consensus on regulation. A recent survey of 534 machine learning researchers found that 68 percent believed that deployments should put greater emphasis on trustworthiness and reliability. The respondents generally had greater trust in international institutions such as the European Union or United Nations than in national governments.

Where things stand: Outside China, most AI-related regulations are pending approval. But the patchwork of proposals suggests a future in which AI practitioners must adapt their work to a variety of national regimes.