Automatically generated text summaries are becoming common in search engines and news websites. But existing summarizers often mix up facts. For instance, a victim’s name might get switched for the perpetrator’s. New research offers a way to evaluate factual consistency between source documents along with a measure to evaluate it.

What’s new: Wojciech Kryściński and colleagues at Salesforce Research introduce FactCC, a network that classifies such inconsistencies. They also propose a variant called FactCCX that justifies the classifications by pointing out specific inconsistencies.

Key insight: Earlier approaches to checking factual consistency determine whether a single source sentence implies a single generated sentence. But summaries typically draw information from many sentences. FactCC evaluates whether a source document as a whole implies a generated sentence.

How it works: The authors identified major causes of factual inconsistency in automated abstractive summaries (that is, summaries that don’t copy phrases directly from the source document). Then they developed programmatic methods to introduce such errors into existing summaries to generate a large training dataset. FactCC is based on a BERT language model fine-tuned on the custom dataset.

  • The researchers created a training dataset by altering sentences from CNN news articles. Transformations included swapping entities, numbers, or pronouns; repeating or removing random words, and negating phrases (“snow is in the forecast” versus “snow is not in the forecast”).
  • Some transformations resulted in sentences whose meaning was consistent with the source, while others resulted in sentences with altered meaning. The authors labeled them accordingly.
  • The development and test sets consisted of sentences from abstractive summaries generated by existing models. Each sentence was labeled depending on whether it was factually consistent with the source.
  • BERT received the source document and a sentence from the generated summary. It predicted a binary classification of consistent or inconsistent.

Results: FactCC classified summary sentences with an F1 score of 0.51. By contrast, a BERT model trained on MNLI, a dataset of roughly 400,000 sentence pairs labeled as either concordant or contradictory, achieved an F1 score of 0.08. In a separate task, FactCC ranked pairs of new sentences (one consistent, one not) for consistency with a source. It awarded consistent sentences a higher rank 70 percent of the time, better by 2.2 percent than the best previous model ranking the same dataset.

Why it matters: A tide of automatically generated text is surging into mainstream communications. Measuring factual consistency is a first step towards establishing further standards for generated text—indeed, an urgent step as worries intensify over online disinformation.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox