Outing Hidden Hatred How Facebook built a hate speech detector

Published

Jun 17, 2020

Reading time

2 min read

Facebook uses automated systems to block hate speech, but hateful posts can slip through when seemingly benign words and pictures combine to create a nasty message. The social network is tackling this problem by enhancing AI’s ability to recognize context.

What’s new: Facebook built a hate speech detector designed to recognize that a statement like, “You are welcome here,” is benign by itself but threatening when accompanied by a picture of a graveyard. The model automatically blocks some hateful speech, but in most cases it flags content for humans to review.

Key insight: Facebook extracts separate features from various aspects of a post. Then it melds the features to represent the post as a whole.

How it works: The system examines 10 different aspects of each post including text, images, video, comments, and external context from the web. Separate NLP and vision models extract feature vectors from these elements, fuse them, and classify the post as benign or hate speech. The training and test data came from the company’s own Hateful Memes dataset. The researchers trained the system using a self-supervised method, hiding portions of input data and training the model to predict the missing pieces. They fine-tuned the resulting features on a labeled dataset of hateful speech.

To extract vectors from text, the researchers used XLM-R, a pre-trained multilingual model trained on 100 languages.
They used an object detection network to extract features from images and video. Facebook doesn’t specify the architecture in its production system, but the best baseline model on this dataset used Faster R-CNN.
They fused vectors from various inputs using the approach known as early fusion, in which a model learns to combine features into a unified representation.

Results: A BERT model achieved 59.2 percent accuracy on a text-only subset of Hateful Memes. The best multimodal classifier released by Facebook, ViLBERT, achieved 63.2 percent accuracy.

Free money: If you think you can do better, there’s cash up for grabs in a competition for models that recognize hateful combinations of words and imagery. The contest is set to end in October.

Why it matters: The need to stop the viral spread of hatred, fear, and distrust through social media seems to grow only more urgent with the passage of time. Numerous experts have drawn a connection between online hate speech and real-world violence.

We’re thinking: What constitutes hate speech is hard for humans to agree on, never mind neural networks. There is a danger in policing speech either way. But there is greater danger in fanning flames of hostility on a global scale. Companies need strong, ethical leadership that can work with stakeholders to define limits on expressions of hatred. Then AI will be key in implementing such standards at scale. Meanwhile, we hope that blocking examples that are easiest to recognize opens room for reasoned debate about the edge cases.

Subscribe to The Batch