4 Comments
Scenarica's avatar

The honest answer to your closing question is the one that isn't on the list: the exit condition. The while-loop framing is correct, but it hides the thing that actually kills agents in production. Knowing when to stop is significantly harder than knowing what to do next. An agent that can't recognise that its output is good enough, or that the task is genuinely impossible, will burn tokens, take destructive actions, or loop indefinitely. Planning, tools, and memory all solve the "what next" problem. The exit condition solves the "when to stop" problem, and in my experience that's where roughly 70% of production failures originate.
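To make the "when to stop" point concrete, here is a minimal sketch of a loop with explicit exit conditions. Everything in it is an assumption for illustration: `step` and `is_good_enough` are hypothetical callables standing in for one plan/act cycle and an acceptance check, the state is a plain string, and "no change in state" is used as a crude proxy for "probably impossible".

```python
from enum import Enum
from typing import Callable


class Stop(Enum):
    DONE = "output passed the acceptance check"
    BUDGET = "step budget exhausted"
    STUCK = "no progress for several consecutive steps"


def run_agent(
    step: Callable[[str], str],             # hypothetical: one plan/act cycle
    is_good_enough: Callable[[str], bool],  # hypothetical acceptance check
    state: str,
    max_steps: int = 20,
    max_stalls: int = 3,
) -> tuple[Stop, str]:
    """Run until the output is accepted, the budget runs out, or progress stalls."""
    stalls = 0
    for _ in range(max_steps):
        if is_good_enough(state):
            return Stop.DONE, state
        new_state = step(state)
        # Count consecutive steps that changed nothing.
        stalls = stalls + 1 if new_state == state else 0
        if stalls >= max_stalls:
            return Stop.STUCK, new_state
        state = new_state
    # The budget is spent; the last step may still have produced a good result.
    return (Stop.DONE if is_good_enough(state) else Stop.BUDGET), state
```

The point of returning a `Stop` reason rather than just the final state is that the caller can tell "finished" apart from "gave up", which is usually the first thing you need when debugging a production run.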

The memory split is also missing a category that's more important in practice than either short-term or long-term: episodic memory. What did I try, what failed, and why. Without it, an agent in a failing loop will attempt the same broken approach repeatedly because it has no record of having already tried it. The Reflexion mention in planning gets closest, but in production episodic memory is the difference between an agent that converges on a solution and one that oscillates between the same two broken states until you kill it manually.

Isaac Steiner's avatar

What heuristic do you recommend for using (or learning about) episodic memory?

Scenarica's avatar

Simplest heuristic that works in production: append a structured log to the agent's context after every tool call or decision point. Three fields: what I tried, what happened, why it failed or succeeded. Before the next action, the agent checks the log for prior attempts on the same subtask. If it finds one, it must choose a different approach or escalate. Keeps the agent from repeating itself and gives you a debugging trail when something goes wrong. Start there and add sophistication later.
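A rough sketch of that log, assuming approaches can be identified by a short string and that attempts are keyed by a subtask name. The `subtask` key, the class names, and returning `None` to mean "escalate" are my own illustrative choices, not a fixed recipe.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attempt:
    subtask: str   # hypothetical key used to group attempts
    tried: str     # what I tried
    happened: str  # what happened
    why: str       # why it failed or succeeded


@dataclass
class EpisodicLog:
    attempts: list[Attempt] = field(default_factory=list)

    def record(self, attempt: Attempt) -> None:
        self.attempts.append(attempt)

    def prior(self, subtask: str) -> list[Attempt]:
        return [a for a in self.attempts if a.subtask == subtask]

    def next_approach(self, subtask: str, candidates: list[str]) -> Optional[str]:
        """Return the first candidate not already tried, or None to escalate."""
        tried = {a.tried for a in self.prior(subtask)}
        for candidate in candidates:
            if candidate not in tried:
                return candidate
        return None

    def render(self) -> str:
        """Plain-text form of the log, suitable for appending to the context."""
        return "\n".join(
            f"[{a.subtask}] tried: {a.tried} | happened: {a.happened} | why: {a.why}"
            for a in self.attempts
        )
```

In use, you call `record` after every tool call, call `next_approach` before the next action, and paste `render()` into the prompt. The same rendered text doubles as the debugging trail.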

Ex-Consultant in Tech's avatar

I think the hardest part is evaluation. Most agent writeups assume the agent knows what “good” means. In practice, that’s the squishiest part. The model can make a plan, call tools, summarize results, and still be optimizing for the wrong definition of done.

That’s where agents get weird. If you don’t externalize that judgment into checks, tests, rubrics, budgets, constraints, and human review points, the agent just invents its own grading system.
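One way to read "externalize that judgment" is to keep the definition of done as named checks that live outside the model. A minimal sketch, assuming the output is a string; the check names and thresholds below are made up for illustration.

```python
from typing import Callable

Check = Callable[[str], bool]


def failed_checks(output: str, checks: dict[str, Check]) -> list[str]:
    """Return the names of checks the output fails; an empty list means done."""
    return [name for name, check in checks.items() if not check(output)]


# Hypothetical checks; real ones would be tests, rubrics, or budget limits.
checks: dict[str, Check] = {
    "non_empty": lambda out: bool(out.strip()),
    "within_length_budget": lambda out: len(out) <= 200,
    "cites_a_source": lambda out: "source:" in out.lower(),
}
```

Anything still listed by `failed_checks` is a reason to keep working or to hand off to a human reviewer, and the agent never gets to decide that for itself.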