Better Language Through Vision

Study improved BERT performance using visual tokens.

Published: Feb 10, 2021

For children, associating a word with a picture that illustrates it helps them learn the word’s meaning. New research aims to do something similar for machine learning models.

What’s new: Hao Tan and Mohit Bansal at the University of North Carolina at Chapel Hill improved a BERT model’s performance on some language tasks by training it on a large dataset of image-word pairs, which they call visualized tokens, or vokens.

Key insight: Images can illuminate word meanings, but current datasets that associate images with words have a small vocabulary relative to the corpuses typically used to train language models. However, these smaller datasets can be used to train a model to find correspondences between words and images. Then that model can find such pairings in separate, much larger datasets of images and words. The resulting pairings can help an established language model understand words better.

How it works: The authors trained a system called the vokenizer to pair BERT-style tokens — generally individual words or characters — with related images. They used the resulting visualized tokens to train BERT to predict such pairings and fine-tuned it on various language tasks.

The vokenizer comprised a pretrained ResNeXt-101 vision model and a pretrained BERT, each followed by a two-layer neural network that generated representations separately for input images and tokens. To train it, the authors split COCO, which depicts dozens of object types with captions, into token-image pairs, associating an image with every token in a given caption. They trained the vokenizer to predict pairings by encouraging it to make the distance between paired images and tokens smaller than the distance between unpaired images and tokens.
To create a large number of token-image pairs, the vokenizer paired images in Visual Genome, which depicts millions of objects, with words from English Wikipedia. First it generated a representation for each image. Then, for each token, it used a nearest-neighbor search to find the image whose representation was closest.
The authors then took a separate BERT with an extra fully connected layer and removed some tokens from Wikipedia sentences at random. They pretrained this model to predict both the missing tokens and the image paired with each token. Then they fine-tuned the model on GLUE (which includes several language understanding tasks), SQuAD (question answering), and SWAG (language reasoning).
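The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: the function names, the use of cosine distance, the margin value of 0.5, and the tiny two-dimensional vectors are all assumptions made for readability. It shows a margin-based loss that favors paired images over unpaired ones, a nearest-neighbor lookup that assigns each token an image, and a pretraining loss that sums the error on the missing token and on its paired image.

```python
import math

def normalize(v):
    """Scale a (non-zero) vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more closely related."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def hinge_loss(token_vec, paired_img, unpaired_img, margin=0.5):
    """Vokenizer training signal (assumed form): the loss is zero once the
    paired image is closer to the token than the unpaired image by `margin`."""
    d_pos = cosine_distance(token_vec, paired_img)
    d_neg = cosine_distance(token_vec, unpaired_img)
    return max(0.0, margin + d_pos - d_neg)

def assign_vokens(token_vecs, image_vecs):
    """For each token, return the index of the nearest image (its voken).
    A brute-force search stands in for a real nearest-neighbor index."""
    return [
        min(range(len(image_vecs)),
            key=lambda i: cosine_distance(t, image_vecs[i]))
        for t in token_vecs
    ]

def joint_loss(token_probs, voken_probs, token_id, voken_id):
    """Pretraining objective (assumed form): cross-entropy on the missing
    token plus cross-entropy on the image paired with that token."""
    return -math.log(token_probs[token_id]) - math.log(voken_probs[voken_id])

# Toy data: hypothetical 2-D representations.
token = [1.0, 0.0]
paired, unpaired = [0.9, 0.1], [0.0, 1.0]
good = hinge_loss(token, paired, unpaired)   # pairing already respected
bad = hinge_loss(token, unpaired, paired)    # pairing violated, so penalized
vokens = assign_vokens([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]])
```

In this toy setup, `good` is zero because the paired image is already much closer to the token, while `bad` is positive, which is the signal that would push the two representations apart during training. At the scale of Wikipedia and Visual Genome, the brute-force lookup in `assign_vokens` would need to be replaced by an efficient nearest-neighbor index.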

Results: BERT pretrained with the token-image pairs outperformed the same architecture trained in the same way but without the pairs on tasks in GLUE, SQuAD, and SWAG. For instance, it achieved 92.2 percent accuracy on SST-2, which involves predicting the sentiment of movie reviews, compared to 89.3 percent for BERT without visual training. Similarly, on SQuAD v1.1, it achieved an F1 score of .867 compared to .853 for BERT without visual training.

Why it matters: This work suggests that visual learning has the potential to improve even the best language models.

We’re thinking: If associating words with images helps a model learn word meaning, why not sounds? Sonic tokens — sokens! — would pair, say, “horn” with the tone of a trumpet and “cat” with the sound of a meow.
