Tuning LLMs for Better RAG Meta’s RA-DIT boosts language model output by optimizing text retrieval

Published

Apr 17, 2024

Reading time

2 min read

Retrieval-augmented generation (RAG) enables large language models to generate better output by retrieving documents that are relevant to a user’s prompt. Fine-tuning further improves RAG performance.

What’s new: Xi Victoria Lin, Xilun Chen, Mingda Chen, and colleagues at Meta proposed RA-DIT, a fine-tuning procedure that trains an LLM and retrieval model together to improve the LLM’s ability to capitalize on retrieved content.

Retrieval augmented generation (RAG) basics: When a user prompts an LLM, RAG supplies documents that are relevant to the prompt. A separate retrieval model computes the probability that each chunk of text in a separate dataset is relevant to the prompt. Then it grabs the chunks with the highest probability and provides them to the LLM to append to the prompt. The LLM generates each token based on the chunks plus the prompt and tokens generated so far.

Key insight: Typically LLMs are not exposed to retrieval-augmented inputs during pretraining, which limits how well they can use retrieved text to improve their output. Such methods have been proposed, but they’re costly because they require processing a lot of data. A more data-efficient, and therefore compute-efficient, approach is to (i) fine-tune the LLM to better use retrieved knowledge and then (ii) fine-tune the retrieval model to select more relevant text.

How it works: The authors fine-tuned Llama 2 (65 billion parameters) and DRAGON+, a retriever. They call the system RA-DIT 65B.

The authors fine-tuned Llama 2 on prompts that consist of retrieved text and a question or instruction. They used 20 datasets including dialogue, question-answering, answering questions about a given text passage, summarization, and datasets in which the model must answer questions and explain its reasoning.
They fine-tuned DRAGON+’s encoder to increase the probability that it retrieved a given chunk if the chunk improved the LLM’s chance of generating the correct answer. Fine-tuning was supervised for the tasks listed above. Fine-tuning was self-supervised for completion of 37 million text chunks from Wikipedia and 362 million text chunks from CommonCrawl.

Results: On average, across four collections of questions from datasets such as MMLU that cover topics like elementary mathematics, United States history, computer science, and law, RA-DIT 65B achieved 49.1 percent accuracy, while the combination of LLaMA 2 65B and DRAGON+ without fine-tuning achieved 45.1 percent accuracy, and LLaMA 2 65B without retrieval achieved 32.9 percent accuracy. When the input included five examples that showed the model how to perform the task, RA-DIT 65B achieved 51.8 percent accuracy, LLaMA 2 65B combined with DRAGON+ achieved 51.1 percent accuracy, and LLaMA 2 65B alone achieved 47.2 percent accuracy. On average, over eight common-sense reasoning tasks such as ARC-C, which involves common-sense physics such as the buoyancy of wood, RA-DIT 65B achieved 74.9 percent accuracy, LLaMA 2 65B with DRAGON+ achieved 74.5 percent accuracy, and LLaMA 2 achieved 72.1 percent accuracy.

Why it matters: This method offers an inexpensive way to improve LLM performance with RAG.

We’re thinking: Many developers have found that putting more effort into the retriever, to make sure it provides the most relevant text, improves RAG performance. Putting more effort into the LLM helps, too.

Subscribe to The Batch