16 Comments
Mitchell Kosowski:

The "evaluator paradox" is the most underrated point here: asking the same LLM that might hallucinate to judge retrieval quality is a fundamental circular dependency.

In practice, you don't need the full ReAct loop to get 80% of the value. Just adding a single relevance checkpoint between retrieval and generation cuts down on confidently wrong answers dramatically.

And for production, a hybrid approach (standard RAG for simple lookups, agentic for ambiguous queries, escalation to a human only when necessary) is the practical sweet spot given the latency/cost trade-offs.
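A minimal sketch of the single-checkpoint idea, assuming the grader returns a score in [0, 1]; all function names (`retrieve`, `grade_relevance`, `generate`, `escalate`) and the 0.7 threshold are illustrative stand-ins, not any specific library's API:

```python
def answer(query, retrieve, grade_relevance, generate, escalate):
    """One relevance checkpoint between retrieval and generation."""
    chunks = retrieve(query)
    # Grade could be an LLM-as-judge call or a cross-encoder score.
    if grade_relevance(query, chunks) >= 0.7:
        return generate(query, chunks)          # simple lookup: answer directly
    # One retry with a reformulated query before giving up.
    chunks = retrieve(f"rephrased: {query}")
    if grade_relevance(query, chunks) >= 0.7:
        return generate(query, chunks)
    return escalate(query)                      # hand off rather than answer confidently
```

The point is that even this single branch, without a full ReAct loop, stops the worst failure mode: generating fluently from chunks that never matched the question.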

Mike P:

What if the relevance checkpoint determines the quality is unacceptable? You would then start the loop to reason and act (again). What does your simplified approach look like compared to what the author posted to get 80% value?

ToxSec:

“Agentic RAG turns retrieval from a one-shot pipeline into a loop with decision points. Those decision points are the entire value add”

excellent takeaway. it’s been really interesting watching everything go agentic in 2026. this was a great breakdown.

Santosh Vaza:

Excellent explanation

Thomas Aldren:

This is a good framework for thinking about the problem. One thing I'd add: the "loop" framing can undersell what's possible once you commit to it.

We built a system that runs 20+ steps between retrieval and final output: multiple retrieval passes against different source types, a combinatorial exploration phase that doesn't use the LLM at all, and specialized evaluation steps that each look for different failure modes. The key insight was that "retrieve, evaluate, retry" is just the beginning. Once you have the control loop, you can insert reasoning steps between retrieval and generation that change the quality of what comes out. Not just "did I find the right chunks" but "what is the strongest possible answer given everything I found, and does it survive scrutiny before I start writing?"

The evaluator paradox you mention is real. One thing that helped: for certain decision points, we replaced LLM self-evaluation with deterministic methods that enumerate the option space programmatically. Keeps the agentic flexibility without trusting the model to grade its own homework at every step.

Chris:

Interesting. Do you have an example of how you would add reasoning steps between retrieval and generation for the above scenarios?

Thomas Aldren:

Sure. One example: after retrieval, we add a step that pressure-tests the argument before any writing happens — strongest objections, what evidence would make you abandon the thesis. That changes what the model reaches for during generation.

General principle: any time the model is doing two cognitive tasks at once during generation, split them into separate steps with an explicit artifact in between.
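To make the principle concrete, here is a hypothetical sketch of the two-step split, where the pressure-test step emits an explicit artifact (a dict of objections) that the generation step consumes as data. `llm` stands in for any text-completion function; none of this is the commenter's actual code:

```python
def pressure_test(llm, thesis, evidence):
    """Step 1: attack the draft thesis before any writing happens."""
    prompt = (
        f"Thesis: {thesis}\nEvidence: {evidence}\n"
        "List the strongest objections, and state what evidence "
        "would make you abandon the thesis."
    )
    # The returned dict is the explicit artifact between the two steps.
    return {"thesis": thesis, "objections": llm(prompt)}

def generate_answer(llm, artifact, evidence):
    """Step 2: write, with the objections supplied as a separate input."""
    prompt = (
        f"Write an answer defending: {artifact['thesis']}\n"
        f"Evidence: {evidence}\n"
        f"Address these objections first: {artifact['objections']}"
    )
    return llm(prompt)
```

Because the objections exist as an artifact, they can also be logged, inspected, or gated on, which is what "explicit" buys you over a single combined prompt.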

Chris:

So you're saying add another evaluator step that makes a first pass on whether the data is logical enough to send to the primary evaluator? I just don't understand how any responses can be pressure-tested or iterated on without running into the evaluator paradox mentioned above. I'm also still in the learning phase and have only built one-shot RAG agents, so I'm still learning agentic theory without knowing how it's actually built out. Forgive me if I don't understand.

Brian Kuchta:

Great breakdown of Agentic RAG, and I appreciate the honest look at the trade-offs. But the "evaluator paradox" deserves more than a bullet point.

The whole value proposition here is that the system can evaluate whether what it retrieved is actually good enough before generating a response, and that sounds like a meaningful upgrade until you ask one simple question: who is doing the evaluating? An LLM. The same type of system that confidently generated a response from a bad retrieval in the first place.

Think about how we train humans. If a new employee keeps making the same mistake, we don't ask that employee to grade their own work and assume the problem is solved. The feedback mechanism has to be meaningfully different from the thing producing the output, and agentic RAG doesn't fully do that, which means the ceiling on self-correction is set by the same model that needed correcting. That’s a pretty significant architectural constraint.

And this is exactly where a human AI manager embedded in the process could make a real difference. This isn’t someone hovering over every query, but someone who is actively reviewing where the loop breaks down, identifying patterns in failed retrievals, and feeding that insight back into the training and evaluation criteria.

Just like a good manager doesn't micromanage but does stay close enough to catch systemic problems before they compound, a human in this role could provide the external perspective that the LLM evaluator simply can’t give itself. The machine will never know what it doesn't know, but a human manager might.

This is a topic I have been writing about extensively on LinkedIn and Substack, specifically the transition humans must make from being AI users to AI managers, and the evaluator paradox is a perfect example of why that transition matters.

The technology is evolving fast, but it's humans who must transform their mindset and skill set to AI manager as they are still the best checkpoint that we have.

Chris:

Thanks, Brian. I'm confused by the comments here. I understand that people are agent savvy and that they need to stay cryptic so they can stay ahead, but the idea that you can increase the quality of returns by adding another AI checkpoint just seems bizarre to me. If this were a human workflow, I'd agree with checkpoints, because humans have discourse, and at work we have a point (whether human- or time-driven) where discourse needs to wrap up and we need to summarize and tally action items to inform next steps.

Maybe these commenters work for the most prestigious AI companies, so they have technology in beta that can absolutely increase quality of returns, but it seems to me the comments are saying "to get better quality responses, have a policeman police the police." Or worse, LLM your results to death.

I think you're right about human managers managing the flow, but it's a sad thought considering how much of management has been decimated in tech over the last few years. I do understand that "manager" could be anyone in the loop who is knowledgeable on the subject, and also that 65% of managers are beyond useless garbage... but I also know one good manager can oversee and manage a battalion of agents and responses better than three knowledgeable humans who have never had to actually manage anything in their careers.

I view AI managers as people who are great people managers with technical know-how. We want our AI to be human but it's tech. We want our managers to be machines but they're human. I think competent people managers that produce streamlined productive teams will be the AI managers that make AI successful. But still, good managers aside, how do you allow agent autonomy and trust human oversight will be piped in at the right time for quality control? A human can only monitor a fraction of what AI can compute in a minute. But I guess that's where being a good manager comes in.

Pawel Jozefiak:

The control loop between retrieval and generation is where the real latency cost lands. Used agentic RAG in a personal knowledge system for a few months. The 10-second wait is acceptable for complex queries, annoying for simple ones.

Ended up routing by query complexity upfront. Obvious lookups skip the agentic loop entirely. Ambiguous or multi-source queries go through the full cycle. The 'false confidence on irrelevant results' failure mode is the hardest one to catch because the output sounds reasonable. Standard RAG fails loudly.

Agentic RAG fails quietly with a plausible answer that happens to be wrong. tbh the routing logic to decide when to use which is most of the actual engineering work.
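A rough sketch of the upfront routing Pawel describes, assuming a cheap heuristic classifier runs before any retrieval; the keywords and thresholds are illustrative assumptions, not his actual logic:

```python
def route(query: str) -> str:
    """Decide upfront whether a query needs the full agentic loop."""
    q = query.lower()
    # Signals that the query spans multiple sources or comparisons.
    multi_source = sum(kw in q for kw in ("compare", "versus", " and ", "across"))
    # Long open-ended questions tend to be ambiguous.
    ambiguous = q.endswith("?") and len(q.split()) > 15
    if multi_source or ambiguous:
        return "agentic"   # full retrieve-evaluate-retry cycle
    return "standard"      # one-shot RAG, skip the loop entirely
```

In practice this router is often a small classifier model rather than keyword rules, but the shape is the same: the expensive loop only runs when the query earns it.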

Chris:

How are you defining obvious lookups to your agent so it'll skip the loop?

Pawel Jozefiak:

Good catch. My setup doesn't use RAG in the traditional sense - agents pull from structured files and state rather than a vector store. But the failure pattern is the same: when context is ambiguous, the agent commits to an answer instead of flagging uncertainty.

The fix I landed on: explicit escalation triggers.

Low confidence = ping me, don't proceed. It added friction but made failures visible before they compound.
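A minimal sketch of that escalation trigger; the confidence source, threshold, and notification channel are all assumptions for illustration, not the commenter's actual setup:

```python
CONFIDENCE_FLOOR = 0.6  # illustrative threshold

def act_or_escalate(confidence, answer, notify_human):
    """Proceed only when confident; otherwise surface the failure."""
    if confidence < CONFIDENCE_FLOOR:
        notify_human(f"low confidence ({confidence:.2f}): {answer!r}")
        return None   # don't proceed; make the failure visible now
    return answer
```

The friction is the point: a `None` that pings a human is cheaper than a plausible wrong answer that compounds downstream.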

Chris:

Very interesting. So if it needs to probe for clarity then ping you. And if it needs to use multiple tools then go through the loop.

Kumar Vuppala:

Interesting explanation. How do you apply this to sourcing?

Rakia Ben Sassi:

Here’s the thing nobody tells you when you graduate from “I deploy to a VPS” to “I’m cloud-native now”:

Kubernetes is not a more reliable version of your old server. It’s a fundamentally different relationship with reliability. And if you approach it the same way, your pods will keep dying and you’ll keep losing sleep.

Let’s talk about it.

https://rakiabensassi.substack.com/p/the-kubernetes-mortality-rate-everything?utm_campaign=post-expanded-share&utm_medium=web