Richer Context for RAG: RAPTOR, a recursive summarizer, captures more relevant context for LLM inputs

Published: May 29, 2024
Reading time: 2 min read

Text excerpts used in retrieval augmented generation (RAG) tend to be short. Researchers used summarization to pack more relevant context into the same amount of text.

What’s new: Parth Sarthi and colleagues at Stanford built Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR), a retrieval system for LLMs. RAPTOR can choose to deliver original text or summaries at graduated levels of detail, depending on the LLM’s maximum input length.

Key insight: RAG improves the output of large language models by gathering excerpts relevant to a user’s prompt from documents and/or web pages. These excerpts tend to be brief to avoid exceeding an LLM’s maximum input length. For instance, Amazon Bedrock’s default excerpt length is 200 tokens (words or parts of words). But important details may be scattered throughout longer passages, so short excerpts can miss them. A summarizer can condense longer passages into shorter ones, and summarizing summaries can condense large amounts of text into short passages.

How it works: RAPTOR retrieved material from QASPER, a question answering corpus that contains around 1,600 research papers on natural language processing. The authors processed QASPER through an iterative cycle of summarizing, embedding, and clustering. The result was a graduated series of summaries at ever higher levels of abstraction.

  • The authors divided the corpus into excerpts of 100 tokens each. The SBERT encoder embedded the excerpts. 
  • A Gaussian mixture model (GMM) clustered the embeddings into groups of similar excerpts. GPT-3.5-turbo summarized each group of excerpts. 
  • This cycle repeated — SBERT embedded the summaries, GMM clustered the embeddings into groups, and GPT-3.5-turbo summarized each group of summaries — until no further groups could be formed. 
  • At inference, to retrieve passages relevant to a user’s prompt, the system computed the cosine similarity between SBERT’s embedding of the prompt and the embedding of each excerpt and summary. It ranked the excerpts and summaries by their similarity to the prompt, retrieved the highest-scoring ones, and prepended them to the input. It stopped when adding another excerpt or summary would exceed the LLM’s maximum input length (see the sketch after this list). 
  • The LLM received the concatenated prompt plus excerpts and/or summaries and generated its response.
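
The build-and-retrieve cycle above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the authors’ implementation: it assumes the sentence-transformers library for SBERT embeddings, scikit-learn’s GaussianMixture for clustering, and a placeholder summarize() standing in for the GPT-3.5-turbo call. The model name, cluster count, token budget, and whitespace token counting are illustrative choices.

```python
# Sketch of a RAPTOR-style pipeline: embed excerpts, cluster, summarize,
# repeat on the summaries, then retrieve by cosine similarity at inference.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

sbert = SentenceTransformer("multi-qa-mpnet-base-cos-v1")  # an SBERT encoder (illustrative choice)

def summarize(texts: list[str]) -> str:
    """Placeholder for the LLM call (e.g., GPT-3.5-turbo) that condenses a
    cluster of excerpts or summaries into one shorter passage."""
    raise NotImplementedError("call a summarization LLM here")

def build_tree(excerpts: list[str], max_clusters: int = 8) -> list[str]:
    """Embed, cluster, and summarize repeatedly until no further groups can
    be formed. Returns all nodes: the original excerpts plus every level of
    summaries."""
    nodes, layer = list(excerpts), list(excerpts)
    while True:
        k = min(max_clusters, len(layer) // 2)
        if k < 1:  # nothing left to condense
            break
        embeddings = sbert.encode(layer)
        labels = GaussianMixture(n_components=k).fit_predict(embeddings)
        groups = [[t for t, lab in zip(layer, labels) if lab == c] for c in range(k)]
        summaries = [summarize(group) for group in groups if group]
        nodes.extend(summaries)
        layer = summaries  # next iteration summarizes the summaries
    return nodes

def retrieve(prompt: str, nodes: list[str], token_budget: int = 2000) -> list[str]:
    """Rank every excerpt and summary by cosine similarity to the prompt and
    keep the highest-scoring ones until the token budget would be exceeded."""
    query = sbert.encode([prompt])[0]
    embeddings = sbert.encode(nodes)
    similarities = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )
    selected, used = [], 0
    for idx in np.argsort(-similarities):  # highest similarity first
        n_tokens = len(nodes[idx].split())  # crude token count for the sketch
        if used + n_tokens > token_budget:
            break
        selected.append(nodes[idx])
        used += n_tokens
    return selected
```

In the work described above, excerpts are 100 tokens each and retrieval stops once the next excerpt or summary would push the prompt past the LLM’s maximum input length; the token_budget parameter and whitespace token count here are rough stand-ins for those limits.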

Results: Paired with a variety of LLMs, RAPTOR exceeded other retrievers in RAG performance on QASPER’s test set. Paired with the UnifiedQA LLM, RAPTOR achieved 36.7 percent F1 score (here, a measure of overlap between the tokens in the output and those in the ground-truth answer), while SBERT (with access to only the 100-token excerpts) achieved 36.23 percent F1 score. Paired with GPT-4, RAPTOR achieved 55.7 percent F1 score (setting a new state of the art for QASPER), DPR achieved 53.0 percent F1 score, and providing paper titles and abstracts achieved 22.2 percent F1 score.
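
For readers unfamiliar with the metric, here is a minimal sketch of token-level F1 as described above; the whitespace tokenization and function name are simplifications rather than QASPER’s official scoring script.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of precision and recall over the tokens shared between
    the model's output and the reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: token_f1("the model uses attention", "the model uses self-attention")
# returns 0.75 (3 shared tokens out of 4 on each side).
```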

Why it matters: Recent LLMs can process very long inputs, notably Gemini 1.5 (up to 2 million tokens) and Claude 3 (200,000 tokens). But it takes time to process so many tokens. Further, prompting with long inputs can be expensive, approaching a few dollars for a single prompt in extreme cases. RAPTOR enables models with tighter input limits to get more context from fewer tokens.

We’re thinking: This may be the technique that developers who struggle with input context length have been long-ing for!
