You built an LLM RAG and … it doesn’t work very well. Here’s how you make it better.
Welcome to Episode 30 in Prolego’s Generative AI series. Retrieval Augmented Generation, or RAG, is the most efficient way to optimize LLM applications with your data, and almost every company is exploring it. If you’re new to LLM RAG, watch our RAG playlist.
So you followed the online tutorials and built a basic RAG solution. Unfortunately, it can't handle complex or ambiguous customer questions. Building a RAG demo is easy; building a production-ready application is significantly harder.
You need to improve it through the 13 techniques we describe in our LLM Optimization Playbook. Download your free copy at prolego.com/playbook. I’m going to show an example, and then summarize the key lessons for your application.
In Episode 17 I demonstrated an LLM RAG solution for the Formula 1 rulebooks. We built a chat interface that fans or teams can use to ask complex questions about the sport. The example mirrors real business scenarios: multiple documents, a meaningful document hierarchy, and arcane terminology.
We took this Formula 1 RAG solution and conducted an ablation study based on our LLM Optimization Playbook, measuring the performance impact of 4 of the 13 optimization techniques: choosing the model size and version, adding relevant context through RAG, integrating multiple information sources, and implementing an agent.
We created an evaluation framework of 18 questions derived from racing enthusiasts and incident reports. We ran each question 3 times and evaluated the answers manually or through GPT-4. Let's walk through the results.
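To make this concrete, here's a rough sketch of what an evaluation harness like this can look like. It assumes the OpenAI Python SDK; the placeholder question, the grading prompt, and the helper names ask_model and grade_answer are illustrative, not the exact code from our study.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask_model(model: str, question: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content

    def grade_answer(question: str, reference: str, answer: str) -> bool:
        # Use GPT-4 as the grader: does the candidate answer agree with the reference?
        prompt = (
            f"Question: {question}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Does the candidate answer agree with the reference? Reply YES or NO."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    # 18 questions derived from racing enthusiasts and incident reports (placeholder shown)
    questions = [
        {"text": "<a question about where a specific rule lives in the regulations>",
         "reference": "Section 31.1 of the FIA Formula One Sporting Regulations"},
        # ...17 more questions
    ]

    runs = 3  # each question is run 3 times
    results = [grade_answer(q["text"], q["reference"], ask_model("gpt-3.5-turbo", q["text"]))
               for q in questions for _ in range(runs)]
    print(f"Accuracy: {sum(results) / len(results):.0%}")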
First, choose the model size/version
We began by asking GPT-4 and GPT-3.5 questions about the Formula 1 rules with no added context. For example, users often ask where a specific rule is located. Here GPT-3.5 can discuss the topic generally but cannot provide the correct answer, Section 31.1 of the FIA Formula One Sporting Regulations. GPT-3.5 achieved 56% accuracy and GPT-4 63%. Not surprisingly, GPT-4 performs better because it is a larger model trained on more data.
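Reusing the harness sketched above, comparing model choices is just a loop over candidate model names (the IDs below are OpenAI's published model names, shown for illustration):

    for model in ["gpt-3.5-turbo", "gpt-4"]:
        results = [grade_answer(q["text"], q["reference"], ask_model(model, q["text"]))
                   for q in questions for _ in range(runs)]
        print(f"{model}: {sum(results) / len(results):.0%}")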
Second, add relevant context, or RAG
We then supplemented our solution with relevant sections from the Formula One Regulations. This required parsing the documents so their hierarchy is preserved and converting the text to embeddings. As expected, RAG delivered big improvements: accuracy rose to 87% for GPT-3.5 and 91% for GPT-4. Surprisingly, the gap between 3.5 and 4 is now smaller, a significant finding because 3.5 is both faster and cheaper.
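Here's a hedged sketch of the retrieval step, again using the OpenAI SDK. The section contents, the hierarchy labels, and the embedding model name are assumptions for illustration; the real pipeline also needs the document parsing that produces these sections.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    # Sections parsed from the regulations, each tagged with its place in the hierarchy.
    sections = [
        {"path": "FIA Sporting Regulations > 31 > 31.1", "text": "<section text>"},
        # ...one entry per parsed section
    ]

    def embed(text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    section_vectors = [embed(s["text"]) for s in sections]

    def retrieve(question: str, k: int = 3) -> list:
        # Rank sections by cosine similarity to the question embedding.
        q = embed(question)
        sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in section_vectors]
        top = sorted(range(len(sections)), key=lambda i: sims[i], reverse=True)[:k]
        return [sections[i] for i in top]

    def answer_with_rag(question: str, model: str = "gpt-3.5-turbo") -> str:
        context = "\n\n".join(f"[{s['path']}]\n{s['text']}" for s in retrieve(question))
        prompt = f"Use these regulation excerpts to answer.\n\n{context}\n\nQuestion: {question}"
        resp = client.chat.completions.create(model=model,
                                              messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content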
Third, integrate multiple information sources
As is common in many business applications, the solution still underperforms because the LLM doesn't understand unique or arcane terms. To overcome this we integrated a second information source: definitions of terms such as "bargeboard." This produced the most significant finding of the study: GPT-3.5 and GPT-4 achieved the same accuracy, 93%. A little engineering work lets us use a faster, cheaper model and get the same results.
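A sketch of this second source might look like the following. The glossary contents and the simple term matching are illustrative, and it reuses the retrieve function and client from the sketch above.

    # A small glossary serves as the second information source. The definition text is
    # paraphrased for illustration; in practice the definitions come from the regulations.
    glossary = {
        "bargeboard": "Bodywork between the front wheels and the sidepods that conditions airflow.",
        # ...other defined terms
    }

    def matching_definitions(question: str) -> str:
        hits = [f"{term}: {definition}" for term, definition in glossary.items()
                if term in question.lower()]
        return "\n".join(hits)

    def answer_with_rag_and_definitions(question: str, model: str = "gpt-3.5-turbo") -> str:
        context = "\n\n".join(f"[{s['path']}]\n{s['text']}" for s in retrieve(question))
        definitions = matching_definitions(question) or "none"
        prompt = (f"Definitions:\n{definitions}\n\n"
                  f"Regulation excerpts:\n{context}\n\n"
                  f"Question: {question}")
        resp = client.chat.completions.create(model=model,
                                              messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content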
Finally, implement an agent
We gave the agents the ability to perform additional searches over the regulations or the definitions. They did not improve performance; in fact, results degraded slightly for both models.
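As a rough illustration, an agent loop of this kind can look like the sketch below, which lets the model request extra searches over the regulations or the definitions before answering. The JSON tool-call format and the step limit are illustrative choices, not the exact setup from our study; it builds on the functions defined above.

    import json

    # Two search tools the agent can call, built on the pieces above.
    def search_regulations(query: str) -> str:
        return "\n\n".join(f"[{s['path']}]\n{s['text']}" for s in retrieve(query))

    def search_definitions(query: str) -> str:
        return matching_definitions(query) or "no matching definitions"

    def agent_answer(question: str, model: str = "gpt-4", max_steps: int = 3) -> str:
        messages = [
            {"role": "system",
             "content": ("Answer questions about the Formula 1 regulations. If you need to look "
                         'something up, reply only with JSON: {"tool": "regulations" or '
                         '"definitions", "query": "..."}. Otherwise reply with your final answer.')},
            {"role": "user", "content": question},
        ]
        for _ in range(max_steps):
            reply = client.chat.completions.create(model=model, messages=messages)
            content = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": content})
            try:
                call = json.loads(content)
            except json.JSONDecodeError:
                return content  # plain text: treat it as the final answer
            if not isinstance(call, dict) or "tool" not in call:
                return content
            tool = search_regulations if call["tool"] == "regulations" else search_definitions
            messages.append({"role": "user",
                             "content": "Search results:\n" + tool(call.get("query", ""))})
        return content  # stop after max_steps and return the last reply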
These are the results from our study. Here are the key learnings for you.
First, the evaluation framework is the most important part of your solution. In this study we had only 18 questions. More would help, but what matters most is that the questions represent real user behavior.
Second, it takes creativity and engineering work to get a production RAG solution. Download a copy of our LLM Optimization Playbook, carefully review your options, and systematically test them. I’ve watched teams waste months pursuing expensive options like fine-tuning instead of straightforward ones like passing definitions.
Finally, revisit your choice of model to see if a smaller one can do the job. We demonstrated GPT-3.5, but we expect similar results from small open-source LLMs like Mistral-7B. Depending on your application, costs can be 10 to 100x lower.
Follow these steps and you’ll turn your LLM RAG demo into a production solution.