If you want a model to answer questions correctly, then enriching the input with reference text retrieved from the web is a reliable way to increase the accuracy of its output. But the web isn’t necessarily the best source of reference text.

What's new: Wenhao Yu at University of Notre Dame and colleagues at Microsoft and University of Southern California used a pretrained language model to generate reference text. They fed that material, along with a question, to a second pretrained language model that answered more accurately than a comparable model that was able to retrieve relevant text from the web.

Key insight: Given a question, documents retrieved from the web, even if they’re relevant, often contain information that doesn’t help to answer it. For instance, considering the question “How tall is Mount Everest?,” the Wikipedia page on Mount Everest contains the answer but also a lot of confusing information such as elevations attained in various attempts to reach the summit and irrelevant information that might distract the model. A language model pretrained on web pages can generate a document that draws on the web but focuses on the question at hand. When fed to a separate language model along with the question, this model-generated reference text can make it easier for that model to answer questions correctly.

How it works: The authors used a pretrained InstructGPT (175 billion parameters) to generate reference text related to questions in trivia question-answer datasets such as TriviaQA. They generated answers using FiD (3 billion parameters), which they had fine-tuned on the dataset plus the reference text. (A given question may have more than one valid answer.)

  • InstructGPT generated reference text for each question in the dataset based upon a prompt such as, “Generate a background document to answer the given question,” followed by the question.
  • The authors embedded each question-reference pair using GPT-3 and clustered the embeddings via k-means.
  • At inference, the system randomly selected five question-reference pairs from each cluster — think of them as guide questions and answers.
  • For each cluster, given an input question (such as, "What type of music did Mozart compose?") and the question-reference pairs, InstructGPT generated a document — information related to the question.
  • Given the question and documents, FiD generated an answer. (Valid answers to the Mozart question include, "classical music," "opera," and "ballet.")

Results: The authors evaluated their fine-tuned FiD on TriviaQA according to the percentage of answers that exactly matched one of a list of correct answers. Provided with generated documents, FiD answered 71.6 percent of the questions correctly compared to 66.3 percent for FiD fine-tuned on TriviaQA and provided with text retrieved from Wikipedia using DPR.

Yes, but: The authors’ approach performed best (74.3 percent) when it had access to both Wikipedia and the generated documents. While generated documents may be better than retrieved documents alone, they worked best together.

Why it matters: Good reference text substantially improves a language model’s question-answering ability. While a relevant Wikipedia entry is helpful, a document that’s directly related to the question is better — even if that document is a product of text generation.

We're thinking: Your teachers were right — Wikipedia isn’t the best source.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox