Summarizing a document using original words is a longstanding problem for natural language processing. Researchers recently took a step toward human-level performance in this task, known as abstractive summarization, as opposed to extractive summarization consisting of sentences drawn from the input text. “We present a method to produce abstractive summaries of long documents,” their abstract reads — quoting words generated by the model they propose.
What’s new: Rather than generating abstractive summaries directly, researchers from Element AI and Montreal Institute for Learning Algorithms started with an extractive summary that guides the generated language.
Key insight: Providing an extractive summary along with source text can help a pre-trained language model generate a higher-quality abstractive summary.
How it works: Summarization proceeds in two steps: extraction and abstraction.
- The researchers trained a neural network to identify the most important sentences in a document. In essence, the network assigns a real-valued score to each sentence based on relationships among all sentences (in terms of content and style, for example). The highest-scoring sentences form an extractive summary.
- A GPT-like architecture, trained on ground-truth abstractive summaries, generates an abstractive summary by predicting words in a sequence. The model receives the extractive summary after the source document, so the summary has greater influence over its output.
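The two steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word-frequency scorer is a stand-in for their trained sentence-scoring network, and the function names, the `EXTRACT:` and `SUMMARY:` markers, and the parameter `k` are all hypothetical. It shows only the overall flow: score each sentence, keep the highest-scoring ones, and place that extract after the source text in the input a language model would continue from.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(sentences):
    """Stand-in scorer: mean document-wide frequency of a sentence's words.

    The paper uses a trained neural network here; this heuristic only
    illustrates the idea of one real-valued score per sentence.
    """
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    return [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]


def extractive_summary(text, k=2):
    """Keep the k highest-scoring sentences, in original document order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def build_model_input(text, k=2):
    """Place the extractive summary after the source document.

    The marker strings are assumptions made for this sketch.
    """
    extract = " ".join(extractive_summary(text, k))
    return f"{text}\n\nEXTRACT: {extract}\n\nSUMMARY:"
```

A language model trained on ground-truth summaries would then continue the string returned by `build_model_input` one word at a time to produce the abstractive summary.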
Results: The authors tested their method on four corpora, all of which include human-written summaries: arXiv (research papers), PubMed (medical research papers), BigPatent (patent documents) and Newsroom (news articles). They compared summarization quality using ROUGE scores, which capture the overlap between generated and ground-truth summaries. For three of the four datasets, the proposed method achieved state-of-the-art summarization quality without copying entire sentences from the input. On the Newsroom corpus, extractive summarization models yielded the best ROUGE scores.
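To make the idea of overlap concrete, here is a simplified sketch of ROUGE-1, the unigram variant of the metric. It assumes lowercase alphanumeric tokenization and no stemming, so its numbers will not match those of a standard ROUGE toolkit; the function name `rouge_1` is our own. The ROUGE family also includes variants such as ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence), which are not shown.

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram overlap."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Counter intersection keeps, for each word, the smaller of the two counts,
    # so a word repeated in the candidate is not credited more than it appears
    # in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" gives a precision of 1.0 (every candidate word appears in the reference) and a recall of 0.5 (three of the six reference words are covered).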
Why it matters: The ability to generate high-quality abstractive summaries could boost worker productivity by replacing long texts with concise synopses.
We’re thinking: Yikes! We hope this doesn’t put The Batch team out of a job.