AI Knows Who Labeled the Data

Published

Sep 25, 2019

Reading time

2 min read

The latest language models are great at answering questions about a given text passage. However, these models are also powerful enough to recognize an individual writer’s style, which can clue them in to the right answers. New research measures such annotator bias in several data sets.

What’s new: Researchers from Tel Aviv and Bar-Ilan Universities uncovered annotator bias in several crowdsourced data sets.

Key insight: Only a few dozen people may generate the lion’s share of examples in a crowdsourced natural-language data set (see graph above). Having an overly small team of annotators introduces bias that can influence a model’s behavior.

How it works: Mor Geva, Yoav Goldberg, and Jonathan Berant studied three data sets: MNLI, OpenBookQA, and CommonsenseQA. They fine-tuned the BERT architecture for each of three experiments:

The authors measured the change in BERT’s performance after giving input sentences an annotator label. This experiment probed the degree to which the annotator’s identity encoded the correct answer.
Then they used BERT to predict the annotator of individual text samples. This tested whether the annotator’s style encoded the person’s identity.
Finally, they observed the difference in performance when the test and training sets had no annotators in common versus when the training set included samples from test-set annotators. An increase in performance further confirmed the presence of annotator bias.

Results: Performance improved an average of 4 percent across the three data sets when input text included an annotator label. The model inferred annotators most accurately in data sets created by fewer contributors. In two of three data sets, mixing in samples from test-set annotators during training improved test accuracy, implying that the model doesn’t generalize to novel annotators.

Why it matters: Annotator bias is pernicious and difficult to detect. This work raises a red flag around the number of contributors to data sets used in natural-language research.

We’re thinking: Benchmark data sets are used to identify the best-performing models, which drives further research. If the data is biased, it may lead that research astray. Here’s hoping this work inspires further enquiry into sources of bias and ways to assess and mitigate it.

Subscribe to The Batch