Retrieval-augmented generation (RAG) vs. fine-tuning
Both RAG and fine-tuning aim to improve large language models (LLMs). RAG does this without modifying the underlying LLM, while fine-tuning requires adjusting the model’s weights. Often, you can customize a model by using both fine-tuning and a RAG architecture.
Building on top of large language models
An LLM is a type of artificial intelligence (AI) that uses machine learning (ML) techniques to understand and produce human language. These ML models can generate, summarize, translate, rewrite, classify, categorize, and analyze text—and more. The most popular use for these models at an enterprise level is to create a question-answering system, like a chatbot.
LLM foundation models are trained with general knowledge to support a broad range of use cases. However, they likely aren’t equipped with domain-specific knowledge that’s unique to your organization. RAG and fine-tuning are 2 ways to adjust and inform the LLM with the data you want so it produces the output you want.
For example, let’s say you’re building a chatbot to interact with customers. In this scenario, the chatbot is a representative of your company, so you’ll want it to act like a high-performing employee. You’ll want the chatbot to understand nuances about your company, like the products you sell and the policies you uphold. Just as you’d train an employee by giving them documents to study and scripts to follow, you train a chatbot by using RAG and fine-tuning to build upon the foundation of knowledge it arrives with.
What is RAG and how does it work?
RAG supplements the data within an LLM by retrieving information from sources of your choosing, such as data repositories, collections of text, and pre-existing documentation. After retrieving the data, RAG architectures process it into an LLM’s context and generate an answer based on the blended sources.
RAG is most useful for supplementing your model with information that’s regularly updated. By giving an LLM a line of communication to your chosen external sources, you ground its output in current data, which tends to make that output more accurate. And because you can engineer RAG to cite its source, it’s easy to trace how an output is formulated, which creates more transparency and builds trust.
Back to our example: If you were to build a chatbot that answers questions like, “What is your return policy?”, you could use a RAG architecture. You could connect your LLM to a document that details your company’s return policy and direct the chatbot to pull information from it. You could even instruct the chatbot to cite its source and provide a link for further reading. And if your return-policy document were to change, the RAG model would pull the most recent information and serve it to the user.
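The return-policy flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the file names, the policy text, and the helper functions `retrieve` and `build_prompt` are all hypothetical, and a simple word-overlap score stands in for the embedding search a real RAG system would typically use. The final call to an LLM is left out; the sketch stops at the augmented prompt that would be sent to it.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: source name -> document text (invented for illustration).
DOCUMENTS = {
    "return-policy.txt": "Our return policy: items can be returned within 30 days of delivery with a receipt.",
    "shipping-policy.txt": "Our shipping policy: standard shipping takes 5 to 7 business days.",
    "warranty.txt": "Hardware is covered by a one year limited warranty.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=1):
    """Return the names of the top_k documents most similar to the question."""
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(text))), name) for name, text in documents.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


def build_prompt(question, documents, top_k=1):
    """Place the retrieved text in the prompt and ask the model to cite it."""
    sources = retrieve(question, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in sources)
    return (
        "Answer using only the context below and cite the source in brackets.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


print(build_prompt("What is your return policy?", DOCUMENTS))
```

Because the document text is read when the question arrives, editing the entry for `return-policy.txt` changes the next prompt without any change to the model itself.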
Use cases for RAG
RAG can source and organize information in a way that makes it simple for people to interact with data. With a RAG architecture, models can fetch insights and provide an LLM with context from both on-premises and cloud-based data sources. This means external data, internal documents, and even social media feeds can be used to answer questions, provide context, and inform decision making.
For example, you can create a RAG architecture that, when queried, provides specific answers regarding company policies, procedures, and documentation. This saves time that would otherwise be spent searching for and interpreting a document manually.
What is fine-tuning?
Think of fine-tuning as a way to communicate intent to the LLM so the model can tailor its output to fit your goals. Fine-tuning is the process of training a pretrained model further with a smaller, more targeted data set so it can more effectively perform domain-specific tasks. What the model learns from this additional training data is encoded in its weights.
Low-rank adaptation (LoRA) and quantized LoRA (QLoRA) are both parameter-efficient fine-tuning (PEFT) techniques that can help users optimize costs and compute resources.
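To see where the savings come from, consider a rough parameter count. LoRA keeps a pretrained weight matrix of shape d × k frozen and trains two much smaller matrices, of shapes d × r and r × k, whose product is added to it. QLoRA applies the same idea while storing the frozen base weights in a quantized, lower-precision format. The sketch below only does the arithmetic; the layer size and rank are illustrative assumptions, not figures from any particular model.

```python
def full_finetune_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for LoRA: a d x r matrix plus an r x k matrix."""
    return r * (d + k)


# Assumed, illustrative sizes: one 4096 x 4096 layer adapted with rank 8.
d, k, r = 4096, 4096, 8

full = full_finetune_params(d, k)
lora = lora_params(d, k, r)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):    {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")
```

Under these assumptions, the adapter trains well under 1% of the parameters in the layer, which is why PEFT methods can reduce the memory and compute needed for fine-tuning.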
Let’s return to our chatbot example. Say you want your chatbot to interact with patients in a medical context. It’s important that the model understands medical terminology related to your work. Using fine-tuning techniques, you can ensure that when a patient asks the chatbot about “PT services,” it will understand that as “physical therapy services” and direct them to the right resources.
Use cases for fine-tuning
Fine-tuning is most useful for training your model to interpret the information it has access to. For instance, you can train a model to understand the nuances and terminologies of your specific industry, such as acronyms and organizational values.
Fine-tuning is also useful for image-classification tasks. For example, if you’re working with magnetic resonance imaging (MRI), you can use fine-tuning to train your predictive AI model to identify abnormalities.
Fine-tuning can also help your organization apply the right tone when communicating with others, especially in a customer-support context. It lets you train a chatbot to analyze the sentiment or emotion of the person it’s interacting with. Further, you can train your generative AI model to respond in a way that serves the user while upholding your organization’s values.
Considerations for choosing RAG vs. fine-tuning
Understanding the differences between RAG and fine-tuning can help you make strategic decisions about which AI resource to deploy to suit your needs. Here are some basic questions to consider:
What’s your team’s skill set?
Customizing a model with RAG requires coding and architectural skills. Compared to traditional fine-tuning methods, RAG provides a more accessible and straightforward way to get feedback, troubleshoot, and fix applications. Fine-tuning a model requires experience with natural language processing (NLP), deep learning, model configuration, data preprocessing, and evaluation. Overall, it can be more technical and time-consuming.
Is your data static or dynamic?
Fine-tuning teaches a model common patterns that don’t change over time. Because it’s based on static snapshots of training data sets, the model’s information can become outdated and require retraining. Conversely, RAG directs the LLM to retrieve specific, up-to-date information from your chosen sources at query time. This means your application draws on the most recent data available, promoting accurate and relevant output.
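The difference between a static snapshot and a query-time lookup can be shown with a deliberately simplified analogy. In the sketch below, a copied dictionary stands in for knowledge captured during training, and a live dictionary stands in for an external source that a RAG pipeline reads on each request. The names and the return-window values are hypothetical, and a real fine-tuned model stores what it learns in its weights rather than in a lookup table.

```python
import copy

# Hypothetical external data source that changes over time.
source = {"return_window_days": 30}

# Fine-tuning analogy: knowledge is captured once, at training time.
training_snapshot = copy.deepcopy(source)


def answer_from_snapshot():
    """Answer from the frozen copy taken at training time."""
    return f"Returns are accepted within {training_snapshot['return_window_days']} days."


def answer_with_retrieval():
    """Answer by reading the source when the question is asked (the RAG analogy)."""
    return f"Returns are accepted within {source['return_window_days']} days."


# The policy changes after training has finished.
source["return_window_days"] = 60

print("snapshot: ", answer_from_snapshot())
print("retrieval:", answer_with_retrieval())
```

The snapshot keeps giving the old answer until it is rebuilt, which corresponds to retraining, while the retrieval path reflects the change on the next request.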
What’s your budget?
RAG is typically considered to be more cost efficient than fine-tuning. To implement a RAG architecture, you build pipeline systems to connect your data to your LLM. This approach saves on cost because it uses existing data to inform your LLM. This stands in contrast to the significant resources required by fine-tuning to perform specialized data labeling and the intensive computational power needed for repeated model training.
While fine-tuning is historically considered the more expensive option, developments like vLLM are helping to close the budget gap. vLLM is an inference server and engine that improves the cost efficiency of serving fine-tuned models.
How Red Hat can help
Red Hat® AI is built for fast, flexible, and efficient inference through its vLLM-powered server. It reliably connects models to your data to unify the customization and development of specialized agents on a single platform. Built on an open source foundation, our products give you full control of AI workflows, end to end, at any scale.
The Red Hat AI portfolio includes Red Hat AI Inference, an inference stack that provides the operational control to run any model on any accelerator across the hybrid cloud. Get fast, efficient, and cost-effective inference at scale.