Dear friends,

In this letter, I’d like to address the serious matter of newcomers to AI sometimes experiencing imposter syndrome, where someone, regardless of their success in the field, wonders whether they’re a fraud and whether they really belong in the AI community. I want to make sure this doesn’t discourage you or anyone else.

Let me be clear: If you want to be part of the AI community, then I welcome you with open arms. If you want to join us, you fully belong with us!

An estimated 70 percent of people experience some form of imposter syndrome at some point. Many talented people have spoken publicly about this experience, including former Facebook COO Sheryl Sandberg, former U.S. first lady Michelle Obama, actor Tom Hanks, and Atlassian co-CEO Mike Cannon-Brookes. It happens in our community even among accomplished people. If you’ve never experienced this yourself, that’s great! I hope you’ll join me in encouraging and welcoming everyone who wants to join our community.

AI is technically complex, and it has its fair share of smart and highly capable people. But, of course, it is easy to forget that to become good at anything, the first step is to suck at it. If you’ve succeeded at sucking at AI, congratulations, you’re on your way!

I once struggled to understand the math behind linear regression. I was mystified when logistic regression performed strangely on my data, and it took me days to find a bug in my implementation of a basic neural network. Today, I still find many research papers challenging to read, and just yesterday I made an obvious mistake while tuning a neural network hyperparameter (which, fortunately, a fellow engineer caught and fixed).

Andrew Ng with a welcome message displayed on a laptop

So if you, too, find parts of AI challenging, it’s okay. We’ve all been there. I guarantee that everyone who has published a seminal AI paper struggled with similar technical challenges at some point.
Here are some things that can help.

Do you have supportive mentors or peers? If you don’t yet, attend Pie & AI or other events, use discussion boards, and work on finding some. If your mentors or manager don’t support your growth, find ones who do. I’m also working on how to grow a supportive AI community and hope to make finding and giving support easier for everyone.
No one is an expert at everything. Recognize what you do well. If what you do well is understand and explain to your friends one-tenth of the articles in The Batch, then you’re on your way! Let’s work on getting you to understand two-tenths of the articles.

My three-year-old daughter (who can barely count to 12) regularly tries to teach things to my one-year-old son. No matter how far along you are — if you’re at least as knowledgeable as a three-year-old — you can encourage and lift up others behind you. Doing so will help you, too, as others behind you will recognize your expertise and also encourage you to keep developing. When you invite others to join the AI community, which I hope you will do, it also reduces any doubts that you are already one of us.

AI is such an important part of our world that I would like everyone who wants to be part of it to feel at home as a member of our community. Let’s work together to make it happen.

Your supporter and ally,

Andrew

DeepLearning.AI Exclusive

From Outsider to Educator

When Jagriti Agrawal started her career, she felt hopelessly behind her peers. She caught up with help from friends and teachers. The experience led to work at NASA and co-founding her own education startup, as she explains in a new edition of our Breaking Into AI series. Read her story

News

Different Nvidia cloud-computing services

Chipmaker Boosts AI as a Service

Nvidia, known for chips designed to process AI systems, is providing access to large language models.

What’s new: Nvidia announced early access to NeMo LLM and BioNeMo, cloud-computing services that enable developers to generate text and biological sequences, respectively. The services include methods that tune inputs, rather than the models themselves, so that models trained on web data work well with a particular user’s data and task without fine-tuning. Users can deploy a variety of models in the cloud, on-premises, or via an API.

How it works: The new services are based on Nvidia’s pre-existing NeMo toolkit for speech recognition, text-to-speech, and natural language processing.

NeMo LLM provides access to large language models including Megatron 530B, T5, and GPT-3. Users can apply two methods of so-called prompt learning to improve the models’ performance.
The prompt learning method called p-tuning enlists an LSTM to map input tokens to representations that elicit better performance from a given model. The LSTM learns this mapping via supervised training on a small number of user-supplied examples.
A second prompt learning approach, prompt tuning, appends a learned representation of a task to the end of the tokens before feeding them to the model. The representation is learned via supervised training on a small number of user-supplied examples.
BioNeMo enables users to harness large language models for drug discovery. BioNeMo includes pretrained models such as the molecular-structure model MegaMolBART, the protein-structure model ESM-1, and the protein-folding model OpenFold.
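To make the prompt tuning idea concrete, here is a minimal sketch in plain Python of how the model input might be assembled. It is not Nvidia's implementation. The embedding table, the vector sizes, and the names `build_model_input` and `task_prompt` are hypothetical, and the placement of the learned vectors after the tokens simply follows the description above.

```python
import random

EMBED_DIM = 4            # toy embedding width (assumption)
NUM_PROMPT_VECTORS = 3   # number of learned task vectors (assumption)

def make_embedding_table(vocab, dim, seed=0):
    """Stand-in for a frozen model's token-embedding table."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

def build_model_input(tokens, embedding_table, task_prompt):
    """Embed the tokens with the frozen table, then attach the learned
    task vectors after them."""
    token_vectors = [embedding_table[t] for t in tokens]
    return token_vectors + task_prompt

vocab = ["classify", "this", "review", ":", "great", "movie"]
table = make_embedding_table(vocab, EMBED_DIM)

# In training, only these vectors would be updated from the user's
# labeled examples; the language model and its table stay frozen.
task_prompt = [[0.0] * EMBED_DIM for _ in range(NUM_PROMPT_VECTORS)]

model_input = build_model_input(["great", "movie"], table, task_prompt)
print(len(model_input), "vectors of width", len(model_input[0]))
```

The point of the sketch is the division of labor: the pretrained weights never change, so one deployed model can serve many tasks, each with its own small set of learned vectors.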

Behind the news: Nvidia’s focus on prompt learning and biological applications differentiates it from other companies that provide large language models as a service.

HuggingFace’s Accelerated Inference API allows users to implement over 20,000 transformer-based models.
NLP Cloud allows users to fine-tune and deploy open-source language models including EleutherAI’s GPT-J and GPT-NeoX 20B.
In December 2021, OpenAI enabled customers to fine-tune its large language model, GPT-3.

Why it matters: Until recently, large language models were the province of organizations with the vast computational resources required to train and deploy them. Cloud services make these models available to a wide range of startups and researchers, dramatically increasing their potential to drive new developments and discoveries.

We’re thinking: These services will take advantage of Nvidia’s H100 GPUs, developed specifically to process transformer models. Nvidia CEO Jensen Huang recently said the public no longer should expect chip prices to fall over time. If that’s true, AI as a service could become the only option for many individuals and organizations that aim to use cutting-edge AI.

Robot with an arm, camera, and gripper handing over a plastic bottle to a person

Parsing Commands Into Actions

A new method enables robots to respond helpfully to verbal commands by pairing a natural language model with a repertoire of existing skills.

What’s new: SayCan, a system developed by researchers at Google and its spinoff Everyday Robots, enabled a robot equipped with an arm, camera, and gripper to take a high-level command such as “I spilled my drink, can you help?” and choose low-level actions appropriate to a given environment such as “find a sponge” and “go to table.”

Key insight: A pretrained large language model can grasp verbal instructions well enough to propose a general response. But it can’t adapt that response to local conditions; for instance, an environment that includes a sponge but not a mop. Combining a large language model with a model that determines which actions are possible in the current environment makes for a system that can interpret instructions and respond according to the local context.

How it works: SayCan drew from over 550 kitchen-related actions that the authors had trained it to perform using a combination of image-based behavioral cloning and reinforcement learning. Actions included picking up, putting down, and rearranging objects; opening and closing drawers; and navigating to various locations.

Given a command, PaLM, a large language model, considered each action in turn and calculated the probability that it would respond with the description of that action. For instance, if instructed to clean up a spill, PaLM calculated the probability that it would respond, “find a sponge.”
A reinforcement learning model trained via temporal difference learning learned to estimate the likelihood that the robot would execute the action successfully, accounting for its surroundings. For instance, the robot could pick up a sponge if it saw one, but it couldn’t otherwise. Human judges determined whether the robot had completed a given skill in videos and applied a reward accordingly.
SayCan multiplied the two probabilities into a single score to determine the most appropriate action. It used a set of convolutional neural networks to decide how to move the robot arm. These networks learned either by copying recorded actions or by reinforcement learning in a simulation.
After the robot performed an action, SayCan appended the description to the initial PaLM query and repeated the process until it chose the “done” action.
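The selection loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: `toy_language_model` and `toy_value_model` are hypothetical stand-ins for PaLM and the learned success estimator, and all probabilities are made up.

```python
def choose_action(llm_prob, success_prob):
    """Pick the skill with the highest product of the two probabilities."""
    scores = {skill: llm_prob[skill] * success_prob.get(skill, 0.0)
              for skill in llm_prob}
    return max(scores, key=scores.get)

def plan(command, language_model, value_model, max_steps=10):
    """Score every skill, act on the best one, record it in the history,
    and repeat until the "done" action wins."""
    history = []
    for _ in range(max_steps):
        llm_prob = language_model(command, history)   # how useful is each skill?
        success_prob = value_model(history)           # how feasible is it here?
        action = choose_action(llm_prob, success_prob)
        if action == "done":
            break
        history.append(action)
    return history

def toy_language_model(command, history):
    # Hypothetical stand-in for PaLM; it slightly prefers a mop for spills.
    if not history:
        return {"find a sponge": 0.4, "find a mop": 0.5,
                "go to table": 0.08, "done": 0.02}
    if history[-1] == "find a sponge":
        return {"find a sponge": 0.05, "find a mop": 0.05,
                "go to table": 0.8, "done": 0.1}
    return {"find a sponge": 0.05, "find a mop": 0.05,
            "go to table": 0.1, "done": 0.8}

def toy_value_model(history):
    # Hypothetical stand-in for the success estimator; this kitchen has no mop.
    return {"find a sponge": 0.9, "find a mop": 0.0,
            "go to table": 0.8, "done": 1.0}

steps = plan("I spilled my drink, can you help?",
             toy_language_model, toy_value_model)
print(steps)
```

In this toy run, the language model alone would ask for a mop, but the zero success estimate rules that out, so the plan falls back to the sponge. That is the division of labor the key insight describes.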

Results: The authors tested the system by giving the robot 101 commands in a mock kitchen that contained 15 objects such as fruits, drinks, snacks, and a sponge. Human judges determined that the robot planned valid actions 84 percent of the time and carried them out 74 percent of the time. In a real-life kitchen, the robot achieved 81 percent success in planning and 61 percent success in execution.

Why it matters: The dream of a domestic robot has held the public imagination since the dawn of the industrial revolution. But robots favor controlled environments, while households are highly varied and variable. The team took on the challenge by devising a way to choose among 551 skills and 17 objects. These are large numbers, but they may not encompass mundane requests like “find granny’s glasses” and “discard the expired food in the fridge.”

We’re thinking: This system requires a well-staged environment with a small number of items. We imagine that it could execute the command “get the chips from the drawer” if the drawer contained only a single bag of chips. But we wonder whether it would do well if the drawer were full and messy. Its success rate in completing tasks suggests that, as interesting as this approach is, we’re still a long way from building a viable robot household assistant.

A MESSAGE FROM DEEPLEARNING.AI

Learner Robert Wydler quote about the DeepLearning.ai Machine Learning Specialization

Robert Wydler was always drawn to AI. After 35 years in IT, he finally decided to pursue his passion by taking Andrew Ng’s Machine Learning course. Ready for a change? Enroll in the Machine Learning Specialization!

Panopticon Down Under

A state in Australia plans to outfit prisons with face recognition.

What’s new: Corrective Services NSW, the government agency that operates nearly every prison in New South Wales, contracted the U.S.-based IT firm Unisys to replace a previous system, which required a fingerprint scan to identify people, with one that requires only that subjects pass before a camera, InnovationAus.com reported.

How it works: The new system will use face recognition to identify inmates and visitors as they enter or exit correctional facilities.

Neither Corrective Services NSW nor Unisys disclosed details on the technology. Unisys offers a system called Stealth(identity) that scans a person’s face, irises, voice, and fingerprints. It places faces of people it has identified in a registry. Then, when it encounters any face, it scans the registry for a match.
The new system scans faces and irises simultaneously and does not require fingerprinting. It will process individuals faster and improve categorization of people coming and going, according to a prison representative.
Sixteen correctional centers will complete installation in early 2023 at a total cost of 12.8 million Australian dollars. Corrective Services NSW said it expects the system to reduce operational expenses by 12 percent.

Yes, but: Samantha Floreani of Digital Rights Watch raised concerns that face recognition may exacerbate biases in the Australian corrective system, which incarcerates Indigenous people disproportionately. Additionally, Floreani said that contracting to Unisys, a U.S.-based firm, raises questions about whether personal data on Australians will be transferred to another country and whether the data will be secure and handled properly. The Australian public, too, is wary. A 2021 poll found that 55 percent of Australians supported a moratorium on face recognition until stronger safeguards are in place.

Behind the news: England and Wales tested face recognition for screening prison visitors in 2019, mostly in an effort to crack down on smuggling of drugs into prisons. In the United States, the federal Justice Department has funded several initiatives to apply face recognition. The U.S. Marshals Service, which handles fugitive investigations, is developing a face recognition system to aid in transporting prisoners.

Why it matters: The flow of visitors, contractors, and prisoners into and out of correctional facilities creates opportunities for security breaches. Face recognition promises to help manage this traffic more safely. However, the technology, which is relatively new, largely unregulated, and developing rapidly, brings with it potential for abuse, mission creep, and other adverse consequences, especially in a high-stakes field like criminal justice.

We’re thinking: Surveillance has always been an inextricable part of incarceration, but it shouldn’t encroach on the rights of prisoners or the people who guard, visit, and provide services to them. More optimistically, if technology can generate indelible, auditable records of the activities of both guards and prisoners, it can help protect against abuses and address them when they occur.

Animation showing 3 main types of data augmentation and random cropping of a picture

Cookbook for Vision Transformers

Vision Transformers (ViTs) are overtaking convolutional neural networks (CNNs) in many vision tasks, but procedures for training them are still tailored for CNNs. New research investigated how various training ingredients affect ViT performance.

What's new: Hugo Touvron and colleagues at Meta and Sorbonne University formulated a new recipe for training ViTs. They call their third-generation approach Data-efficient Image Transformers (DeiT III).

Key insight: The CNN and transformer architectures differ. For instance, when processing an image, a CNN works on one group of pixels at a time, while a transformer processes all pixels simultaneously. Moreover, while the computational cost of a CNN scales proportionally to input size, the cost of a transformer’s self-attention mechanism grows quadratically with the number of input tokens. Training recipes that take these differences (and other, less obvious ones) into account should impart better performance.
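A little arithmetic shows how quickly self-attention cost grows with image size. The sketch below assumes 16x16-pixel patches (a common ViT setting that is not stated here), ignores the extra class token, and counts only token-to-token comparisons, not the rest of the network. The two image sizes are illustrative.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch tokens a ViT would see (patch size assumed)."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Self-attention compares every token with every token."""
    return n_tokens * n_tokens

low = num_patches(192, 192)
high = num_patches(224, 224)

print("tokens:", low, "vs", high)
print("attention pairs:", attention_pairs(low), "vs", attention_pairs(high))
print("pixel ratio:", round((224 * 224) / (192 * 192), 2))
print("attention ratio:", round(attention_pairs(high) / attention_pairs(low), 2))
```

Under these assumptions, going from 192x192 to 224x224 raises the pixel count by about 1.36 times but the number of attention comparisons by about 1.85 times, from 20,736 to 38,416.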

How it works: The authors pretrained ViTs to classify images in ImageNet using various combinations of training data, data augmentation, and regularization. (They also experimented with variables such as weight decay, dropout, and type of optimizer, for which they didn’t describe results in detail.) They fine-tuned and tested on ImageNet.

The authors pretrained the transformers on ImageNet-21K using lower image resolutions, such as 192x192 pixels, before fine-tuning on full-res 224x224-pixel images. Pretraining transformers on lower-res versions is faster and less memory-intensive and has been shown to result in better classification of full-res images.
ImageNet-21K includes roughly 10 times as many images as the more common ImageNet. The larger dataset makes augmenting data via random cropping unnecessary to prevent overfitting. Instead, they used a cropping procedure that was more likely to retain an image’s subject. First, they resized training examples so their smaller dimension matched the training resolution (say, from 224x448 to 192x384). Then they cropped the larger dimension to form a square (192x192) with a random offset.
The authors altered the colors of training examples by blurring, grayscaling, or solarizing (that is, inverting colors above a certain intensity). They also randomly changed brightness, contrast, and saturation. Less consistent color information may have forced the transformers — which are less sensitive than CNNs to object outlines — to focus more on shapes.
They used two regularization schemes. Stochastic depth forces individual layers to play a greater role in the output by skipping layers at random during training. LayerScale achieves a similar end by multiplying layer outputs by small, learnable weights. Because a transformer’s residual connections connect every other layer, this scaling enables the network to begin learning with a small number of layers and add more as training progresses. The gradual accumulation helps it continue to learn despite having large numbers of layers, which can impede convergence.
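The cropping procedure in the list above comes down to simple geometry, sketched here in plain Python. The function name `simple_random_crop_box` and its return format are hypothetical. The sketch computes only the resized dimensions and the crop box, leaving the actual pixel operations to whatever image library you use.

```python
import random

def simple_random_crop_box(width, height, target, rng=random):
    """Resize so the smaller side equals `target`, then choose a square
    crop of side `target` at a random offset along the larger side.

    Returns ((resized_width, resized_height), (left, top, right, bottom)).
    """
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Pin the smaller side to exactly `target` in case of rounding drift.
    if width <= height:
        new_w = target
    else:
        new_h = target
    # Only the larger side has room to slide; the other offset is always 0.
    left = rng.randint(0, new_w - target)
    top = rng.randint(0, new_h - target)
    return (new_w, new_h), (left, top, left + target, top + target)

resized, box = simple_random_crop_box(224, 448, 192, rng=random.Random(0))
print("resized to", resized, "crop box", box)
```

Because the crop always spans the full smaller dimension, it keeps more of the picture than a crop of random size and position would. That fits the stated goal of retaining the image's subject.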

Results: The authors’ approach substantially improved ViT performance. An 86 million-parameter ViT-B pretrained on ImageNet-21K and fine-tuned on ImageNet using the full recipe achieved 85.7 percent accuracy. Their cropping technique alone yielded 84.8 percent accuracy. In contrast, the same architecture trained on the same datasets using full-resolution examples augmented via RandAugment achieved 84.6 percent accuracy.

Why it matters: Deep learning is evolving at a breakneck pace, and familiar hyperparameter choices may no longer be the most productive. This work is an early step toward bringing recipes developed when CNNs ruled computer vision into the transformer era.

We're thinking: The transformer architecture’s hunger for data makes it especially important to reconsider habits around data-related training procedures like augmentation and regularization.