Why Active Learning Fails

Pairing active learning with visual question answering.

Published

Jan 05, 2022

Where labeled training data is scarce, an algorithm can learn to request labels for key examples. While this practice, known as active learning, can supply labeled examples that improve performance in some tasks, it fails in others. A new study sheds light on why.

What's new: Siddharth Karamcheti and colleagues at Stanford University showed that examples of a certain kind hinder active learning in visual question answering (VQA), where a model answers questions about images.

Key insight: Most active learning methods aim to label examples that a model is least certain about. This approach assumes that providing labels that resolve the model’s uncertainty will improve performance faster than providing labels that confirm its certainty. However, some examples that prompt uncertainty are also difficult to learn, and the uncertainty doesn’t dissipate with additional learning. For instance, in VQA, some questions about an image may refer to information that’s absent from the image itself; consider a photo of a car and the question, “What is the symbol on the hood often associated with?” If an active learning algorithm were to choose many such examples, the additional labels would contribute little to learning. For active learning to work, it needs to choose examples the model can learn from. Thus, removing hard-to-learn examples prior to active learning should improve the results.
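To make the selection rule concrete, here is a minimal sketch of least-confidence selection, one common uncertainty-based strategy. The function name and the toy probabilities are illustrative assumptions, not the authors' code; a real pool would hold a VQA model's softmax outputs over candidate answers.

```python
from typing import List, Sequence


def least_confidence_select(probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k unlabeled examples whose top predicted
    probability is lowest, i.e. the model's least confident predictions."""
    # Confidence of each example = probability assigned to its top answer.
    confidences = [max(p) for p in probs]
    # Rank pool indices from least to most confident (sorted() is stable).
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]


# Hypothetical softmax outputs for four unlabeled pool examples.
pool_probs = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.99, 0.01],
]
selected = least_confidence_select(pool_probs, k=2)
print(selected)  # [1, 2]
```

Note that this rule cannot tell a learnable example from an unanswerable one: both simply look uncertain, which is the failure mode described above.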

How it works: The authors trained several VQA models on a variety of datasets. They fine-tuned the models using five diverse active-learning strategies and compared their impact to labeling examples at random.

The authors applied each active learning strategy to each model-dataset pair. They measured sample efficiency: the number of additional labeled examples needed to reach a given level of accuracy.
They computed the model’s confidence in its classification of each training example. They also computed a variability score that quantifies how much its confidence varied over the course of training. Low confidence and low variability indicated the most difficult-to-learn examples.
They removed the 10 percent, 25 percent, or 50 percent of examples that had the lowest product of confidence and variability. Then they repeated the first step, using each active learning strategy and measuring its impact on performance.
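The filtering step can be sketched as follows. The article does not give the exact formulas, so this sketch assumes confidence is the mean probability the model assigned to the correct answer across training epochs and variability is the standard deviation of that probability. The example ids and probability histories are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def confidence_and_variability(gold_probs: Sequence[float]) -> Tuple[float, float]:
    """gold_probs: probability given to the correct answer at each epoch.
    Assumed definitions: confidence = mean, variability = std deviation."""
    return mean(gold_probs), pstdev(gold_probs)


def drop_hardest(history: Dict[str, Sequence[float]], fraction: float) -> List[str]:
    """Return the ids that remain after removing the given fraction of
    examples with the lowest confidence * variability product."""
    scores = {}
    for ex_id, gold_probs in history.items():
        conf, var = confidence_and_variability(gold_probs)
        scores[ex_id] = conf * var
    ranked = sorted(scores, key=scores.get)  # lowest product first
    dropped = set(ranked[: int(len(ranked) * fraction)])
    return [ex_id for ex_id in history if ex_id not in dropped]


# Hypothetical per-epoch probabilities of the correct answer.
history = {
    "easy": [0.80, 0.90, 0.95, 0.97],
    "ambiguous": [0.20, 0.70, 0.30, 0.80],
    "hard": [0.05, 0.04, 0.06, 0.05],  # never learned, never wavers
    "learnable": [0.10, 0.40, 0.70, 0.90],
}
kept = drop_hardest(history, fraction=0.25)
print(kept)  # ['easy', 'ambiguous', 'learnable']
```

In this toy data, the example the model consistently gets wrong has the smallest product and is removed first, leaving a pool from which an active learning strategy would then select.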

Results: Culling the most difficult-to-learn training examples (those that elicited the lowest product of confidence and variability) enabled all five active learning strategies to train VQA models using fewer examples. For instance, the authors used the active learning strategy called least-confidence, which requests labels for the examples in which the model is least confident in its classification, to fine-tune a Bottom-Up Top-Down Attention model on the VQA-2 dataset. With no examples removed, the model achieved 50 percent accuracy with 120,000 labeled examples, no better than labeling at random. After the authors removed the 10 percent most difficult-to-learn examples, it achieved the same accuracy with 100,000 labeled examples. After removing 25 percent, it needed 70,000 labeled examples. After removing 50 percent, it needed only 50,000 labeled examples (while labeling additional examples at random required 70,000 labeled examples).
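As a quick check on what these figures mean for labeling cost, the short calculation below converts them into percentage savings. It uses only the numbers reported above; the variable and function names are illustrative.

```python
# Labels needed to reach 50 percent accuracy with least-confidence selection.
baseline_labels = 120_000  # no examples removed
labels_needed = {10: 100_000, 25: 70_000, 50: 50_000}  # percent removed -> labels
random_labels_at_50 = 70_000  # random labeling after removing 50 percent


def percent_saved(baseline: int, needed: int) -> float:
    """Percentage reduction in labels relative to a baseline."""
    return round(100 * (baseline - needed) / baseline, 1)


savings = {pct: percent_saved(baseline_labels, n) for pct, n in labels_needed.items()}
print(savings)  # {10: 16.7, 25: 41.7, 50: 58.3}

# Least-confidence vs. random labeling, both after removing 50 percent.
vs_random = percent_saved(random_labels_at_50, labels_needed[50])
print(vs_random)  # 28.6
```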

Why it matters: VQA is data-hungry, while active learning is sample-efficient. They make a handy combo — when they work well together. This study identifies one problem with the pairing and how to solve it.

We're thinking: The authors focused on how difficult-to-learn examples affect active learning in VQA, but the same issue may hinder active learning in other tasks. We hope that further studies will shed more light.
