The Long and Short of It

Published

Sep 25, 2019

Reading time

1 min read

Not long ago, text-to-speech systems could read only a sentence at a time, and they were ranked according to their ability to accomplish that limited task. Now that they can orate entire books, we need new benchmarks.

What’s new: A Google research team discovered that the usual measure of text-to-speech quality — having human judges rate single-sentence examples for human-like realism — varies widely depending on how samples are presented. That makes the standard procedure insufficient to evaluate performance on longer texts.

Key insight: Rob Clark and colleagues tested samples of various lengths and formats to see how they affected quality ratings.
How it works: Judges rated human and synthesized voices reading identical news articles and conversational transcripts.

The judges evaluated samples in three forms: paragraphs, isolated sentences making up those paragraphs, and sentences preceded by the prior sentence or two (which were not rated).
For sentences accompanied by preceding material, the preceding material was presented in human, synthesized, or text versions.

Results: Samples that included prior sentences earned higher scores than sentences in isolation, regardless of whether they were spoken by humans or machines. That is, the additional context made the synthesized voices seem more realistic. Moreover, readings of paragraphs scored higher than readings of their component sentences, showing that isolated sentences aren’t a good gauge of long-form text-to-speech.

Why it matters: Metrics that reflect AI’s performance relative to human capabilities are essential to progress. The authors show that the usual measure of text-to-speech performance doesn’t reflect performance with respect to longer texts. They conclude that several measures are necessary.

We’re thinking: As natural language processing evolves to encompass longer forms, researchers are setting their sights on problems that are meaningful in that context. This work demonstrates that they also need to reconsider the metrics they use to evaluate success.

Subscribe to The Batch