Every Picture Tells a Story

Facebook expands automated alternative text.

Published: Jan 27, 2021

Facebook expanded a system of vision, language, and speech models designed to open the social network to users who are visually impaired.

What’s new: A Facebook service that describes photos in a synthesized voice now recognizes 1,200 visual concepts, more than 10 times as many as the previous version. Known as automatic alternative text, the system can recognize and explain what’s happening in a picture, including the relative size and position of people and objects, in any of 45 languages.

How it works: Launched in 2016, the system initially learned from hand-labeled data to recognize 100 common concepts, like tree and mountain. Facebook added face recognition the following year, allowing users to opt into a more personalized experience. The new upgrade extends automatic alternative text in several ways:

Facebook engineers used a weakly supervised approach to train ResNeXt image recognition models on 3.5 billion Instagram images, using 17,000 hashtags that users had attached to them as labels. Using a similar architecture, they applied transfer learning to train linear classification heads to recognize concepts including selfies, national monuments, and foods like rice and French fries.
They used an existing object detection library to build a Fast R-CNN that recognizes the number, size, and position of various items in an image and determines its primary subject.
The system starts each description with the humble phrase, “May be…,” and it doesn’t describe concepts that it can’t identify reliably. Users can request extra details, and the system will display a page that itemizes a picture’s elements by their position (top, middle, left, or bottom), relative size (primary, secondary, or minor), and category (people, activities, animals, and so on).
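
To make the last two steps concrete, here is a minimal Python sketch of how detector output might be turned into a hedged description and an itemized detail list. It is an illustration, not Facebook's implementation: the `Detection` class, the 0.8 confidence threshold, the one-third grid for position, the area cutoffs for relative size, the added "right" position, and the fallback message are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """One detected concept (hypothetical structure for this sketch)."""
    label: str
    score: float                            # model confidence in [0, 1]
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def area(box: Tuple[float, float, float, float]) -> float:
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def position(box: Tuple[float, float, float, float]) -> str:
    """Coarse position bucket based on the box center (assumed 3x3 grid)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    if cy < 1 / 3:
        return "top"
    if cy > 2 / 3:
        return "bottom"
    if cx < 1 / 3:
        return "left"
    if cx > 2 / 3:
        return "right"  # assumption: not in the article's list of positions
    return "middle"


def relative_size(box: Tuple[float, float, float, float]) -> str:
    """Bucket by fraction of the image covered (cutoffs are assumptions)."""
    a = area(box)
    if a >= 0.25:
        return "primary"
    if a >= 0.05:
        return "secondary"
    return "minor"


def reliable(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep confident detections only, largest first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: area(d.box), reverse=True)


def describe(detections: List[Detection], threshold: float = 0.8) -> str:
    """Short hedged description; low-confidence concepts are left out."""
    kept = reliable(detections, threshold)
    if not kept:
        return "No description available."  # hypothetical fallback
    return "May be: " + ", ".join(d.label for d in kept)


def details(detections: List[Detection], threshold: float = 0.8) -> List[Dict[str, str]]:
    """Itemized view of each reliable element with position and size."""
    return [
        {"label": d.label, "position": position(d.box), "size": relative_size(d.box)}
        for d in reliable(detections, threshold)
    ]


example = [
    Detection("dog", 0.95, (0.2, 0.3, 0.8, 0.9)),
    Detection("tree", 0.90, (0.0, 0.0, 0.2, 0.3)),
    Detection("cat", 0.40, (0.5, 0.5, 0.6, 0.6)),  # below threshold, omitted
]
print(describe(example))
print(details(example))
```

In this sketch the low-confidence "cat" never reaches the description, which mirrors the article's point that the system stays silent about concepts it cannot identify reliably. Sorting by box area is one simple way to put the likely primary subject first; a production system could weigh other signals.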

Behind the news: Facebook, along with other popular websites, has struggled with how to serve visually impaired users. Some have complained that the site doesn’t work well with common accessibility equipment like screen readers that speak text aloud. For instance, earlier versions of automatic alternative text didn’t inform users when the images it described were advertisements. However, some users have applauded Facebook’s use of face recognition with automatic alternative text, which can tell them when a photo depicts a friend or loved one.

Why it matters: Around 285 million people worldwide are visually impaired and 39 million are blind, the World Health Organization estimates. People who don’t see well are as reliant on information as anyone — and they represent a sizable market.

We’re thinking: Disabled web users in the U.S. file hundreds of lawsuits annually against Internet companies that don’t make their services accessible. Increasingly, online accessibility is recognized as a right, not a privilege.
