Small Llama 8B Model Matches 70B on Multi-Hop Q&A With Better Prompting

March 21, 2026

A small language model can match one nearly nine times its size, not through expensive training but through smarter prompting. New experiments with Graph RAG show that Llama 8B can hit the same accuracy as Llama 70B on multi-hop question answering, the kind of task where a model needs to chain together multiple facts to reach an answer.

The research used KET-RAG, a Graph RAG technique that structures retrieved information as a knowledge graph rather than dumping raw text into the prompt. RAG (retrieval-augmented generation) is the method where an AI pulls in external documents before answering a question, and the "graph" variant organizes those documents into connected relationships instead of flat chunks.
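
To make the contrast between flat chunks and connected relationships concrete, here is a minimal sketch that stores a few facts as (subject, relation, object) triples and follows two hops to reach an answer. The facts, the triple format, and the `build_graph` and `follow` helpers are all assumptions made for illustration; this is not KET-RAG's actual data structure or retrieval algorithm.

```python
from collections import defaultdict

# Hypothetical facts, written as (subject, relation, object) triples.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "physics"),
]


def build_graph(triples):
    """Index triples by subject so each entity points at its outgoing edges."""
    graph = defaultdict(dict)
    for subject, relation, obj in triples:
        graph[subject][relation] = obj
    return graph


def follow(graph, start, relations):
    """Walk a chain of relations from a start entity.

    Returns the entity reached, or None if any hop is missing.
    """
    current = start
    for relation in relations:
        current = graph.get(current, {}).get(relation)
        if current is None:
            return None
    return current


graph = build_graph(TRIPLES)

# Two-hop question: "The city where Marie Curie was born is the capital of which country?"
answer = follow(graph, "Marie Curie", ["born_in", "capital_of"])
print(answer)
```

In a flat-chunk setup the two facts would sit in separate passages and the model would have to notice that "Warsaw" links them. In the graph form that link is explicit.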

Retrieval Is Solved. Reasoning Is the Bottleneck.

The most striking finding is how lopsided the problem actually is. The correct answer was present in the retrieved context 77% to 91% of the time. The information was right there. Models just couldn't connect the dots.

Between 73% and 84% of wrong answers came from reasoning failures, not retrieval failures. The model had the facts it needed but failed to chain them together logically. This flips the common assumption that RAG systems mostly fail because they retrieve the wrong documents.
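
One way to picture how wrong answers get split into these two buckets is the sketch below. It assumes a crude substring check: if the gold answer appears in the retrieved context but not in the model's output, the miss counts as a reasoning failure; if it is absent from the context, it counts as a retrieval failure. The article does not say how the researchers classified errors, so the `attribute_errors` function and the sample records are hypothetical.

```python
def attribute_errors(examples):
    """Split wrong answers into retrieval failures and reasoning failures.

    Each example is a dict with string fields 'gold', 'prediction', 'context'.
    Matching is a case-insensitive substring test (an assumption, and a crude one).
    """
    counts = {"correct": 0, "retrieval_failure": 0, "reasoning_failure": 0}
    for ex in examples:
        gold = ex["gold"].lower()
        if gold in ex["prediction"].lower():
            counts["correct"] += 1
        elif gold in ex["context"].lower():
            # The answer was retrieved, but the model did not produce it.
            counts["reasoning_failure"] += 1
        else:
            # The answer never made it into the context.
            counts["retrieval_failure"] += 1
    return counts


# Hypothetical records, invented for illustration only.
examples = [
    {"gold": "Poland", "prediction": "Poland",
     "context": "Warsaw is the capital of Poland."},
    {"gold": "Paris", "prediction": "Lyon",
     "context": "Paris is the capital of France. Lyon is a city in France."},
    {"gold": "Canberra", "prediction": "Sydney",
     "context": "Sydney is the largest city in Australia."},
]

counts = attribute_errors(examples)
wrong = counts["retrieval_failure"] + counts["reasoning_failure"]
reasoning_share = counts["reasoning_failure"] / wrong if wrong else 0.0
print(counts, reasoning_share)
```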

What Structured Prompting Actually Fixed

Smaller models like Llama 8B choke on reasoning even when the answer sits in their context window (the amount of text the model can process at once). But by structuring the prompt to explicitly lay out the relationships between retrieved facts, the 8B model performed at the same level as the 70B model on these multi-hop tasks - without any fine-tuning (the process of retraining a model on specialized data).
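
A structured prompt of this kind might look something like the sketch below, which numbers each retrieved relationship and asks the model to chain them before answering. The wording of the prompt, the arrow notation, and the `build_structured_prompt` function are assumptions for illustration; the article does not give the exact template used in the experiments.

```python
def build_structured_prompt(question, triples):
    """Render (subject, relation, object) triples as a numbered fact list,
    followed by an instruction to chain the facts and then the question."""
    lines = ["Facts (each line is one relationship):"]
    for i, (subject, relation, obj) in enumerate(triples, start=1):
        lines.append(f"{i}. {subject} --{relation}--> {obj}")
    lines.append("")
    lines.append("Chain the facts above step by step, citing fact numbers, then answer.")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical retrieved facts and question.
facts = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]
prompt = build_structured_prompt(
    "The city where Marie Curie was born is the capital of which country?",
    facts,
)
print(prompt)
```

The resulting string can be passed to any model as ordinary text. No model weights change, which is the point of the comparison with fine-tuning.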

This matters for anyone running AI locally or building RAG applications. If prompt structure can close the gap between an 8B model that runs on a laptop and a 70B model that needs serious hardware, the cost and infrastructure implications are significant. You might not need the bigger model if you're willing to invest in better prompt structure instead.
