What is RAG? More accurate and reliable LLMs


Retrieval-augmented generation (RAG) is a technique used to “ground” large language models (LLMs) with specific data sources, often sources that weren’t included in the models’ original training. RAG’s three steps are retrieval from a specified source, augmentation of the prompt with the context retrieved from the source, and then generation using the model and the augmented prompt.

In an exercise in dogfooding, I asked the GPT-4 large language model “What is retrieval-augmented generation?” using its Browse plug-in, which is one implementation of retrieval-augmented generation.

This is what GPT-4 eventually told me, after I asked it to regenerate its answer:

—————–

Retrieval-Augmented Generation (RAG) is an AI framework that significantly impacts the field of Natural Language Processing (NLP). It is designed to improve the accuracy and richness of content produced by language models. Here’s a synthesis of the key points regarding RAG from various sources:

– RAG is a system that retrieves facts from an external knowledge base to provide grounding for large language models (LLMs). This grounding ensures that the information generated by the LLMs is based on accurate and current data, which is particularly important given that LLMs can sometimes produce inconsistent outputs【5†source】.

– The framework operates as a hybrid model, integrating both retrieval and generative models. This integration allows RAG to produce text that is not only contextually accurate but also rich in information. The capability of RAG to draw from extensive databases of information enables it to contribute contextually relevant and detailed content to the generative process【6†source】.

– RAG addresses a limitation of foundational language models, which are generally trained offline on broad domain corpora and are not updated with new information post-training. This makes them less effective for tasks requiring domain-specific knowledge. By contrast, RAG can access the latest data, making it more adaptable and capable of performing well in domain-specific applications【7†source】.

– In comparison to traditional text generation models, RAG has shown remarkable advantages and has achieved state-of-the-art performance in many NLP tasks. Its ability to augment the generation process with retrieved information makes it a powerful tool for creating high-quality, informative text【8†source】.

The combined capabilities of RAG to retrieve and generate text make it an innovative tool in NLP, enhancing the relevance and quality of the output from language models.

——————

With the exception of a couple of exaggerated points, that's a decent answer. It did, however, leave out the fact that RAG originated at Facebook (now Meta) AI Research.

I’ve had fairly good results using RAG implementations for cases where unenhanced LLMs lack current information, and especially in cases where the base LLM goes off the rails and starts hallucinating its own facts. On the other hand, I’ve seen LLMs hallucinate even when they were using RAG.

Using external sources of information (the retrieval portion of RAG) doesn’t only mean running a web search. It can also mean providing your own documents for the LLM to use or pointing the LLM at specific web documentation (no additional search required), for example as described by my InfoWorld colleague Sharon Machlis in this article.

The problems: LLM hallucinations and limited context

LLMs often take a long time and expensive resources to train, sometimes months of run time using dozens of state-of-the-art server GPUs such as NVIDIA H100s. Keeping LLMs completely up to date by retraining from scratch is a non-starter, although the less expensive process of fine-tuning the base model on newer data can help.

Fine-tuning sometimes has its drawbacks, however, as it can reduce functionality present in the base model (such as general-purpose queries handled well in Llama) when adding new functionality by fine-tuning (such as code generation added to Code Llama).

What happens if you ask an LLM that was trained on data that ended in 2022 about something that occurred in 2023? Two possibilities: It will either realize it doesn’t know, or it won’t. If the former, it will typically tell you about its training data, e.g. “As of my last update in January 2022, I had information on….” If the latter, it will try to give you an answer based on older, similar but irrelevant data, or it might outright make stuff up (hallucinate).

To avoid triggering LLM hallucinations, it sometimes helps to mention the date of an event or a relevant web URL in your prompt. You can also supply a relevant document, but providing long documents (whether as text or as a URL) works only until the LLM's context limit is reached; after that, the model simply stops reading. Note that context limits differ among models: two Claude models offer a 100K-token context window, which works out to about 75,000 words and is much larger than most other LLMs offer.
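Using the rough conversion above (100K tokens to about 75,000 words, i.e. roughly 0.75 words per token), you can estimate whether a document will fit a model's context window before sending it. This is a back-of-the-envelope sketch; a real tokenizer (such as tiktoken for OpenAI models) gives exact counts, and the 0.75 ratio is only an approximation.

```python
# Rough token-budget check before sending a document to an LLM.
# Uses the ~0.75 words-per-token rule of thumb mentioned above;
# a real tokenizer library gives exact counts.

def estimate_tokens(text: str) -> int:
    """Estimate the token count of text from its word count."""
    words = len(text.split())
    return int(words / 0.75)  # roughly 4 tokens per 3 words

def fits_in_context(text: str, context_limit: int = 100_000) -> bool:
    """Check whether the document likely fits the model's context window."""
    return estimate_tokens(text) <= context_limit

doc = "word " * 75_000
print(fits_in_context(doc))  # a 75,000-word document just fits a 100K window
```

If the document doesn't fit, you must split it into chunks, which is exactly the situation RAG's retrieval step is designed to handle.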

The solution: Ground the LLM with facts

As you can guess from the title and beginning of this article, one answer to both of these problems is retrieval-augmented generation. At a high level, RAG works by combining an internet or document search with a language model, in ways that get around the issues you would encounter by trying to do the two steps manually, for example the problem of having the output from the search exceed the language model’s context limit.

The first step in RAG is to use the query for an internet, document, or database search, and to vectorize the source information into a dense, high-dimensional form, typically by generating an embedding vector for each chunk and storing it in a vector database. This is the retrieval phase.

Then you vectorize the query itself and run a similarity search against the vector database, using FAISS or a similar library and typically a cosine metric for similarity, to extract the most relevant portions (the top K items) of the source information and present them to the LLM along with the query text. This is the augmentation phase.
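The retrieval and similarity-search steps above can be sketched without any external services. In this toy example, a bag-of-words vector stands in for a real embedding model, and a plain Python list stands in for a vector database such as FAISS; the documents, query, and helper names are all made up for illustration.

```python
# Sketch of the retrieval step plus similarity search: vectorize source
# chunks, vectorize the query, and select the top-K chunks by cosine
# similarity. A bag-of-words vector stands in for a real embedding model,
# and a plain list stands in for a vector database such as FAISS.
import math
from collections import Counter

documents = [
    "RAG grounds large language models with facts from external sources.",
    "Fine-tuning updates a base model on newer domain data.",
    "A vector database stores embeddings for fast similarity search.",
]

vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text):
    """Toy embedding: word counts over the corpus vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Retrieval phase: store (vector, chunk) pairs in the "vector database."
vector_db = [(embed(d), d) for d in documents]

def top_k(query, k=2):
    """Rank stored chunks against the query vector; return the top K."""
    qv = embed(query)
    ranked = sorted(vector_db, key=lambda pair: cosine(qv, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

print(top_k("How does RAG ground a large language model?", k=1))
```

A production system would replace the bag-of-words vectors with embeddings from a trained model and the linear scan with an approximate-nearest-neighbor index, but the top-K cosine-similarity logic is the same.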

Finally, the LLM, referred to in the original Facebook AI paper as a seq2seq model, generates an answer. This is the generation phase.
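The hand-off between augmentation and generation amounts to stitching the retrieved chunks into the prompt the LLM actually sees. Here is a minimal sketch of that step; the chunk text is invented, and the generate() function is a stub standing in for a real LLM API call.

```python
# Sketch of the augmentation-to-generation hand-off: retrieved chunks
# are stitched into the prompt ("grounding" the model), and the
# augmented prompt is what the LLM receives. The generate() stub is a
# placeholder; a real pipeline would call an LLM API here.

def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user's question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Placeholder for the LLM call (the generation phase)."""
    return f"[LLM completion for a {len(prompt)}-character prompt]"

chunks = ["RAG retrieves facts from an external knowledge base to ground LLMs."]
prompt = build_augmented_prompt("What is RAG?", chunks)
print(generate(prompt))
```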

That all seems complicated, but it’s really as little as five lines of Python if you use the LangChain framework for orchestration:

# Load a web page, build a vector-store index over it, and query it.
# (LangChain's defaults here use OpenAI embeddings and an OpenAI LLM,
# so an OPENAI_API_KEY must be set in the environment.)
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator

loader = WebBaseLoader("https://www.promptingguide.ai/techniques/rag")
index = VectorstoreIndexCreator().from_loaders([loader])
print(index.query("What is RAG?"))

Thus RAG addresses two problems with large language models: out-of-date training sets and reference documents that exceed the LLMs’ context windows. By combining retrieval of current information, vectorization, augmentation of the information using vector similarity search, and generative AI, you can obtain more current, more concise, and more grounded results than you could using either search or generative AI alone.

Copyright © 2024 IDG Communications, Inc.


