Retrieval Faces Hard Limits

Google and Johns Hopkins researchers show embedding models can’t search unlimited documents

Matrix links queries to documents, illustrating embedding limits in representing relevance combinations.

Can your retriever find all the relevant documents for any query your users might enter? Maybe not, research shows.

What’s new: Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee at Google and Johns Hopkins University pinpointed the critical number of documents beyond which a retriever’s embedding model is no longer sufficient to retrieve all documents relevant to a given query.

Retriever basics: Some retrievers find documents by comparing keywords, while others compare embeddings of a query and documents. In this case, the retriever’s embedding model typically learns to produce embeddings via contrastive learning: Given a query and a set of documents, the model learns to embed them so that the query’s embedding is similar to the embeddings of relevant documents and dissimilar to the embeddings of irrelevant ones. The trained embedding model then is used to build a vector store of document embeddings. At inference, the retriever embeds the query, compares it to the stored document embeddings, and returns the documents whose embeddings are most similar. Most retrieval models produce a single embedding per query or document. Less commonly, some produce multiple embeddings per query or document.
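
To make the pipeline concrete, here is a minimal sketch of single-embedding retrieval. Random vectors stand in for the output of a trained embedding model, and the array sizes and cosine-similarity scoring are illustrative assumptions rather than details from the paper:

```python
import numpy as np

# Minimal sketch of single-embedding retrieval: one vector per document,
# one vector per query, ranked by cosine similarity. Random vectors stand
# in for the output of a trained embedding model.
rng = np.random.default_rng(0)
d = 64                                   # embedding size
doc_embs = rng.normal(size=(1000, d))    # vector store: one embedding per document
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of the k documents most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embs @ q                # cosine similarity against every stored document
    return np.argsort(-scores)[:k]

query_emb = rng.normal(size=d)           # would come from the same embedding model
print(retrieve(query_emb, k=5))
```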

Key insight: Ideally, a single-embedding retriever should be able to return any subset of documents in a database, because any subset may be relevant to a given query (“give me the documents about X and Y but not Z”). But in reality, as the number of documents rises, some pairs of documents inevitably lie so far apart in the embedding space that no single query embedding can be a nearest neighbor to both. Another, less relevant document will be more similar to the query than one of the distant pair. The more diverse the relevant subsets, the larger the embedding must be to distinguish the most-relevant documents. In other words, the number of distinct document pairs (or larger sets) a retriever can find is fundamentally limited by its embedding size.
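
A toy example illustrates the geometric constraint. Assuming one-dimensional embeddings and Euclidean distance (hypothetical numbers, not from the paper), three documents already contain a pair that no query can retrieve as its two nearest neighbors:

```python
import numpy as np

# Toy illustration of the key insight, with 1-D embeddings and Euclidean
# distance (hypothetical numbers, not from the paper). Documents A, B, C sit
# at positions 0, 1, and 10. No query position can make {A, C} its two
# nearest documents: B is always closer to the query than A or C is.
docs = np.array([0.0, 1.0, 10.0])    # document "embeddings"
target_pair = {0, 2}                  # we want A and C as the top-2 results

found = False
for q in np.linspace(-100, 100, 20001):   # brute-force scan of query positions
    dists = np.abs(docs - q)
    top2 = set(np.argsort(dists)[:2])
    if top2 == target_pair:
        found = True
        break

print("pair {A, C} retrievable as top-2:", found)   # False: the space is too small
```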

How it works: To measure the maximum number of document pairs an embedding space can represent, the authors conducted two experiments. (i) They started with a best-case experiment: they skipped the embedding model and used learnable vectors to represent query and document embeddings, then tested how well the learnable queries could retrieve the learnable documents. (ii) They constructed a dataset of simple documents and queries and tested how well existing retrievers performed on it.

  • In the best-case setup, the authors varied the size of the embedding (they tried sizes less than 46). For each size, they built a set of learnable document embeddings (initially random). For each possible pair of document embeddings, they created a corresponding learnable query embedding (also initially random). They adjusted these query and document embeddings via gradient descent to see whether the queries could retrieve the document pairs correctly; specifically, they encouraged the embedding of each document in a pair to have higher similarity to the corresponding query embedding than all other document embeddings. They gradually increased the number of documents, and once it passed a certain threshold, the query embeddings could no longer retrieve all document pairs, regardless of further optimization. (A minimal sketch of this setup appears after this list.)
  • Using retrievers and natural language, the authors built 50,000 documents and 1,000 queries with exactly two relevant documents per query. Each document described a person and their likes (“Jon likes apples and quokkas”), and each query asked “Who likes X?” The authors selected 46 documents as the relevant pool (46 is the smallest number whose pairwise combinations exceed 1,000). The remaining 49,954 documents served as distractors. The language itself was extremely simple (no negation, no ambiguity, no long context), but the task was difficult because any pair of the 46 relevant documents could be the correct answer to some query.
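
The sketch below approximates the best-case, free-embedding setup: skip the encoder entirely and directly optimize one vector per document and one vector per document pair. The toy sizes, loss function, and optimizer are assumptions for illustration; the authors’ exact training recipe may differ.

```python
import itertools
import torch

# Minimal sketch of the "best case" free-embedding test: no encoder, just
# learnable vectors for every document and every document pair (the query).
torch.manual_seed(0)
n_docs, dim = 12, 4
pairs = list(itertools.combinations(range(n_docs), 2))       # every 2-document subset

docs = torch.nn.Parameter(torch.randn(n_docs, dim))
queries = torch.nn.Parameter(torch.randn(len(pairs), dim))
targets = torch.zeros(len(pairs), n_docs)
for i, (a, b) in enumerate(pairs):
    targets[i, a] = targets[i, b] = 1.0                       # relevant documents per query

opt = torch.optim.Adam([docs, queries], lr=0.05)
for step in range(2000):
    scores = queries @ docs.T                                  # similarity of every query to every doc
    # Push relevant documents' scores high and irrelevant ones low
    # (a stand-in for the paper's ranking objective).
    loss = torch.nn.functional.binary_cross_entropy_with_logits(scores, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# A pair counts as "solved" if its two documents are exactly the query's top-2 results.
top2 = (queries @ docs.T).topk(2, dim=1).indices
solved = sum(set(row.tolist()) == set(p) for row, p in zip(top2, pairs))
print(f"solved {solved}/{len(pairs)} pairs at dim={dim}")
```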

Results: The authors’ experiments demonstrate that, once a document collection grows large enough, no single-embedding model of a given embedding size can produce embeddings of queries and documents such that the queries can be used to retrieve all possible pairs of documents.

  • In the best-case setup, the number of two-document combinations that could be perfectly represented grew roughly cubically with the embedding size d. The authors fit a cubic polynomial to their data (R² = 0.999, where 1 indicates a perfect fit) and extrapolated to much larger embedding sizes. At d = 512, optimization no longer improved the matching of queries to document pairs beyond roughly 500,000 documents. At d = 768, the limit rose to 1.7 million; at d = 1,024, to about 4 million; at d = 3,072, to 107 million; and at d = 4,096, to 250 million.
  • In the second experiment, all pretrained, single-embedding retrievers performed poorly. Even at an embedding size of 4,096, Promptriever Llama3 (8 billion parameters) reached 19 percent recall@100, GritLM (7 billion parameters) achieved 16 percent, and Gemini Embeddings (undisclosed parameter count) managed about 10 percent. In contrast, BM25 (a traditional keyword-based retriever that ranks documents by how often they contain the query words) achieved nearly 90 percent, and ModernColBERT, which represents each query and document with multiple embeddings (one small embedding per token), reached 65 percent. (A sketch of how recall@k is computed follows this list.)
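
Recall@100 measures the fraction of a query’s relevant documents that appear among the top 100 retrieved results. Here is a minimal sketch of the metric; the function and example IDs are illustrative, not the authors’ evaluation code:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

# Example: a query with two relevant documents, only one of which is retrieved
# in the top 100, scores recall@100 = 0.5.
print(recall_at_k(retrieved_ids=["doc_17", "doc_4021", "doc_9"],
                  relevant_ids={"doc_17", "doc_88"}, k=100))   # -> 0.5
```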

Why it matters: Understanding the theoretical limits of single-embedding retrieval helps set realistic expectations for a retriever’s performance and guides the choice of embedding size for a given task. These limits become particularly relevant as agentic retrieval systems grow in scale.

We’re thinking: That a single-embedding retriever cannot, in principle, represent every possible query-document combination is a real limitation, but not an alarming one. In practice, users usually ask for related information, so everyday retrieval tasks likely operate far below the limits. When queries become especially complex, agentic retrieval, in which an agent iteratively decides whether and where to retrieve additional documents, offers a promising alternative.
