The latest pretrained language models have shown a remarkable ability to learn facts. A new study drills down on issues of scale, showing that such models might learn the approximate weight of a dog or cost of an apple, at least to the right order of magnitude.

What’s new: Xikun Zhang and Deepak Ramachandran with colleagues at Stanford, Google, AI2 Israel, Bar-Ilan University, and the University of Pennsylvania probed whether word embeddings produced by pretrained models encode knowledge of objects’ mass, length, or price.

Key insight: Pretrained features that represent words may or may not capture scale-bound attributes. To find out, the authors built simple linear models that took the pretrained embeddings as a starting point and trained them on a dataset that explicitly associates words with such attributes. If the models learned to estimate such attributes, they reasoned, then the pretrained embeddings did, indeed, represent them.

How it works: The authors analyzed features generated by ELMo and BERT, whose embeddings vary depending on context, as well as the earlier word2vec, a fixed set of embeddings. They also tested features generated by their own model, NumBERT, which is identical to BERT except that numerals in its pretraining data were replaced by the same numbers in scientific notation.
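NumBERT’s exact preprocessing isn’t spelled out above, but the idea of rewriting numerals in scientific notation can be sketched with a simple regex over integers and decimals (a minimal illustration, not the authors’ actual pipeline):

```python
import re


def to_scientific(match: re.Match) -> str:
    """Convert a numeral like '1500' to scientific notation like '1.5e3'."""
    value = float(match.group(0))
    if value == 0:
        return "0e0"
    mantissa, exponent = f"{value:e}".split("e")
    mantissa = mantissa.rstrip("0").rstrip(".")  # drop trailing zeros
    return f"{mantissa}e{int(exponent)}"


def convert_numerals(text: str) -> str:
    """Replace every numeral in the text with its scientific-notation form."""
    return re.sub(r"\d+(?:\.\d+)?", to_scientific, text)


print(convert_numerals("The dog weighs 25 kg and costs 1500 dollars."))
# → The dog weighs 2.5e1 kg and costs 1.5e3 dollars.
```

The intuition is that scientific notation surfaces the order of magnitude as an explicit token, making scale easier for the model to pick up during pretraining.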

  • The researchers built two linear models that accepted embeddings from each language model. One linear model used regression to produce a median estimate of mass, length, or price. The other produced a distribution of probabilities among 12 orders of magnitude.
  • Freezing the language models’ weights, the researchers trained the linear models on the Distribution over Quantities (DoQ) dataset, which contains nouns and distributions of their masses, lengths, and prices.
  • They fed the language models sentences like “The dog is heavy” or “The ring is expensive,” and passed the embeddings of the key word (here, “dog” or “ring”) to the linear models to produce an estimate or distribution.

Results: The linear models matched the DoQ measures with greater-than-random accuracy. Those that used embeddings from ELMo, BERT, and NumBERT outperformed those that used word2vec. To evaluate whether the linear models generalized beyond DoQ, the authors tested them on comparing sizes and weights between pairs of objects. The regression model that used NumBERT embeddings achieved accuracy of 0.76, outperforming BERT (0.71), ELMo (0.72), and word2vec (0.74). The classification model that used NumBERT embeddings likewise outperformed its counterparts, though at lower absolute accuracy than the regression model.

Why it matters: The latest language models have come under fire for being less smart than their accomplishments might suggest. But how much less smart? Studies like this help quantify the deficits so we can work toward closing them.

We’re thinking: Language models also need to understand scale distinctions that hinge on modifying words, such as the difference between “watch” and “gold watch,” or between “Yoda” and “Baby Yoda.”

