Most Agents Are Just Prompt Chains With Better Branding
A few months ago, one of my side projects sent 500 unsolicited messages to my wife. I've also written about the canyon between AI demos and production systems. This post is the synthesis — the framework I wish I'd had before either of those lessons cost me sleep and spousal goodwill. It's about what "agentic AI" actually means when you strip the branding off, what the real production stack looks like, and where your system will betray you first.
The taxonomy nobody agrees on (but should)
The industry uses "agent" the way enterprise sales uses "platform" — loosely, aspirationally, and to justify pricing. Here's a more honest breakdown:
Workflow. A fixed DAG of steps. Model calls might happen at several nodes, but the execution path is predetermined. Think: ingest document → extract fields → validate → write to database. There's an LLM in the loop, but you decided the loop. This is the workhorse. Most of what ships in production today is this.
Copilot. A human-in-the-loop system where the model drafts and the human decides. GitHub Copilot, obviously, but also any UI where the model suggests a next action and waits. The human is the execution loop. This is the safety net that actually works.
Agent. A system where the model decides which tools to call, in what order, with the ability to observe results and change its plan. The execution loop belongs to the model. This is where things get exciting in demos and catastrophic on a Tuesday at 3 AM.
The problem isn't that agents are impossible. It's that most teams claiming to build agents are actually building workflows with an LLM picking which `if` branch to take — and calling it autonomy. That's fine! Workflows are great. Just don't architect for agent-grade failure modes if you're building a workflow, and don't pretend you have workflow-grade reliability if you're actually building an agent.
The real stack
Every production agentic system — whether you call it a workflow or a proper agent — has six layers. Miss one and you'll find out in prod.
1. Model. The LLM doing the reasoning. Model choice matters less than people think and prompt engineering matters more than people admit. The model is the least likely layer to be your bottleneck.
2. Tools. Functions the model can invoke: API calls, database queries, code execution, file operations. Every tool is an attack surface and a failure surface. The more tools you expose, the more creative your incident reports get.
3. State. The memory of the current run: conversation history, intermediate results, accumulated context. State management is the difference between a working system and a system that hallucinates its own past. Most frameworks punt on this. You cannot.
4. Execution loop. The orchestration logic that takes model output, parses tool calls, executes them, feeds results back, and decides when to stop. This is where "agentic" lives. In a workflow, you wrote this loop. In an agent, the model is this loop, and your job is to put a cage around it.
5. Constraints. Token budgets, step limits, tool-call caps, permission scopes, timeout windows. Constraints are not a nice-to-have. They are structural load-bearing walls. Remove them and the system doesn't get smarter — it gets expensive and dangerous.
6. Evaluation. How you know the system did the right thing. Not vibes. Not "looks good in the demo." Automated checks on tool-call sequences, output validation, regression tests against known-good traces. If you can't evaluate an agentic run programmatically, you cannot ship it.
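Layers 4 and 5 are easiest to see in code. Here's a minimal sketch of a caged execution loop: the model proposes tool calls, the loop executes them and feeds observations back, and a hard step limit guarantees termination. `call_model` and the tool registry are stand-ins for illustration, not any particular framework's API.

```python
# Hypothetical model interface: returns either a tool call or a final answer.
# In a real system this wraps an LLM API; here it is a stub for illustration.
def call_model(history):
    last = history[-1]["content"] if history else ""
    if "result" in last:
        return {"type": "final", "content": "done: " + last}
    return {"type": "tool_call", "tool": "lookup", "args": {"q": "x"}}

TOOLS = {"lookup": lambda q: f"result for {q}"}

MAX_STEPS = 8  # layer 5: a circuit breaker, so the loop cannot run forever

def run_agent(goal):
    history = [{"role": "user", "content": goal}]
    for step in range(MAX_STEPS):
        decision = call_model(history)
        if decision["type"] == "final":
            return decision["content"]
        tool = TOOLS.get(decision["tool"])
        if tool is None:
            # Unknown tool: tell the model rather than crashing the run.
            history.append({"role": "system",
                            "content": f"unknown tool {decision['tool']}"})
            continue
        # Execute the requested tool and feed the observation back (layer 3).
        observation = tool(**decision["args"])
        history.append({"role": "tool", "content": observation})
    raise RuntimeError(f"step limit {MAX_STEPS} reached; aborting with logged state")
```

The cage is the point: the model owns the decisions inside the loop, but the loop itself, the stop conditions, and the tool registry stay in your code.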
Most tutorials cover layers 1 and 2, gesture at 4, and ignore 3, 5, and 6 entirely. This is why most agent demos break the moment you point them at real data.
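Layer 6 doesn't need heavy machinery to start. A minimal sketch, assuming your loop logs each run as a list of `(tool_name, args)` tuples, is a regression check that diffs a new run's tool-call sequence against a known-good trace:

```python
def diff_trace(expected, actual):
    """Return human-readable mismatches between two tool-call traces."""
    problems = []
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp[0] != act[0]:
            problems.append(f"step {i}: expected tool {exp[0]!r}, got {act[0]!r}")
    if len(expected) != len(actual):
        problems.append(f"length mismatch: expected {len(expected)} calls, "
                        f"got {len(actual)}")
    return problems

# Trace format is assumed: (tool_name, args) tuples logged by the loop.
known_good = [("search", {"q": "acme"}), ("summarize", {"max_tokens": 200})]
new_run    = [("search", {"q": "acme"}), ("send_email", {"to": "team"})]

assert diff_trace(known_good, known_good) == []
assert "send_email" in diff_trace(known_good, new_run)[0]
```

A check this crude already catches the scariest regression class: the agent quietly deciding to call a different tool than it did last week.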
What breaks first
In order of "how quickly this will ruin your week":
Tool failures with no handler. The model calls an API. The API returns a 429 or a 503. The agent either hallucinates success or spirals into a retry loop until your bill has opinions. You need explicit failure handling at every tool boundary — not just a try/catch, but a defined recovery strategy per failure type: retry with backoff, fallback to alternate tool, escalate to human, abort with logged state.
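A sketch of what "recovery strategy per failure type" can look like at one tool boundary. The status codes and strategies mirror the list above; `ToolError` and the tool functions are hypothetical stand-ins:

```python
import time

RECOVERY = {
    429: ("retry_with_backoff", 3),   # rate limited: retry a few times
    503: ("fallback", None),          # upstream down: use alternate tool
    401: ("escalate", None),          # auth broken: needs a human
}

class ToolError(Exception):
    def __init__(self, status):
        self.status = status

def guarded_call(tool, fallback_tool, *args):
    for attempt in range(4):
        try:
            return tool(*args)
        except ToolError as e:
            strategy, retries = RECOVERY.get(e.status, ("abort", None))
            if strategy == "retry_with_backoff" and attempt < retries:
                time.sleep(0)  # would be e.g. 2 ** attempt seconds in production
                continue
            if strategy == "fallback":
                return fallback_tool(*args)
            if strategy == "escalate":
                return {"needs_human": True, "reason": f"status {e.status}"}
            raise RuntimeError(f"aborting, logged state for status {e.status}")
```

The point is that each failure type has a named, deliberate outcome. "Whatever the exception handler happens to do" is not a strategy.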
Loops with no exit. The model decides it needs more context, calls a tool, gets a partial result, decides it needs more context. Iteration cap: not set. What happens next: bad things. Hard step limits are not a crutch — they're a circuit breaker.
State drift in multi-step runs. Step 3 makes an assumption based on step 1, but step 2 changed something the agent didn't track. The model reasons correctly given what it sees. What it sees is wrong. This is the bug that looks like a hallucination but is actually an architecture problem.
Non-idempotent tool calls. The agent sends a Slack message. Something fails downstream. The agent retries. Now two Slack messages went out and someone is asking questions. Every tool call with side effects needs idempotency keys and a clear answer to "what happens if this runs twice."
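One way to answer "what happens if this runs twice": derive an idempotency key from the call's content and refuse duplicates. A sketch, where the in-memory set stands in for a durable store (Redis, a database table) and `send_slack_message` is a made-up wrapper:

```python
import hashlib

sent_keys = set()  # stand-in for a durable idempotency store
outbox = []        # stand-in for the actual Slack API

def send_slack_message(channel, text):
    # Same channel + same text => same key => at most one real send.
    key = hashlib.sha256(f"{channel}:{text}".encode()).hexdigest()
    if key in sent_keys:
        return {"status": "duplicate_suppressed", "key": key}
    sent_keys.add(key)
    outbox.append((channel, text))
    return {"status": "sent", "key": key}

first = send_slack_message("#ops", "deploy finished")
retry = send_slack_message("#ops", "deploy finished")
assert first["status"] == "sent"
assert retry["status"] == "duplicate_suppressed"
assert len(outbox) == 1  # the retry did not send a second message
```

Now the agent can retry as aggressively as it likes and the outside world sees exactly one message.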
Runaway permissions. You gave the agent write access because it needed to update one row. It updated the whole table. Least-privilege is not just a security principle in agentic systems — it is the primary mechanism that keeps the agent inside the problem you gave it.
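Least-privilege can be enforced mechanically at the tool registry: each run gets an explicit scope set, and the registry refuses anything outside it. The tool and scope names here are invented for illustration:

```python
TOOL_SCOPES = {
    "read_row":   {"db:read"},
    "update_row": {"db:write:one_row"},
    "drop_table": {"db:admin"},
}

def get_tool(name, granted_scopes):
    # The agent never sees tools its run wasn't granted.
    required = TOOL_SCOPES[name]
    if not required <= granted_scopes:
        raise PermissionError(f"{name} needs {required}, run has {granted_scopes}")
    return lambda *args: f"{name} ok"

run_scopes = {"db:read", "db:write:one_row"}
assert get_tool("update_row", run_scopes)() == "update_row ok"
try:
    get_tool("drop_table", run_scopes)
    raise AssertionError("should have been denied")
except PermissionError:
    pass
```

The model can still ask for `drop_table`; it just gets a refusal it has to reason around, which is the behavior you want.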
The pattern that looks great in demos and breaks in production
Fully autonomous research-and-execute. You give the agent a goal — "research our top three competitors and update the positioning doc" — and it browses the web, synthesizes content, writes copy, and commits changes to a live document. No checkpoints. No review. Just outputs.
In a demo: genuinely impressive. In production: a confidence-maximizing system is making brand decisions from open-web sources with no human review, writing directly to assets your team depends on. The failure mode isn't a crash. It's a confident, well-formatted, completely wrong update that sits in your positioning doc for three weeks before someone notices.
The fix isn't to abandon the pattern. The fix is a human approval gate before any write operation, explicit source scoping, and treating every draft as a draft until signed off. Now it's a copilot with agentic research — less dramatic in the demo, dramatically more trustworthy in the field.
The pattern actually worth building
Agentic triage and routing. Inbound item arrives — ticket, email, form submission, support request. Agent classifies it, extracts structured fields, checks against policy or routing rules, sends it to the right queue or triggers the right downstream action, logs everything with a confidence score and a reasoning trace.
Why this works:
- The task is bounded and the success criteria are clear
- Most tool calls are reads, not writes
- Human review is easy to inject at the routing decision for low-confidence cases
- Misclassification has an obvious fallback: route to a human
- The output is fully auditable
- Cost per run is small and predictable
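The whole pattern fits in a page. A sketch, where `classify` is a stub for the LLM call and the confidence floor is an invented threshold you'd tune against your own data:

```python
CONFIDENCE_FLOOR = 0.8  # below this, route to a human instead of acting

def classify(text):
    # Stub: a real system would call the model here and parse its output.
    if "refund" in text.lower():
        return {"label": "billing", "confidence": 0.93}
    return {"label": "unknown", "confidence": 0.40}

def triage(item):
    result = classify(item["body"])
    confident = result["confidence"] >= CONFIDENCE_FLOOR
    return {
        # Misclassification has an obvious fallback: a human queue.
        "route": result["label"] if confident else "human_review",
        "confidence": result["confidence"],
        # The reasoning trace makes every routing decision auditable.
        "trace": f"classified as {result['label']} at {result['confidence']:.2f}",
    }

assert triage({"body": "I want a refund"})["route"] == "billing"
assert triage({"body": "hello?"})["route"] == "human_review"
```

Every bullet above maps to a line or two: bounded task, read-only classification, a human fallback below the confidence floor, and a logged trace per decision.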
This is not a conference keynote demo. But it can be in production in two weeks, and it can save hundreds of hours a month. Build the boring one first.
The production concerns that never make the tutorial
Observability. Logs are not enough. You need traces — every model call, every tool call, inputs and outputs, latency, cost, and the model's reasoning at each step. If you cannot replay a run and understand exactly what happened, you cannot debug it and you cannot improve it. LangSmith, Weave, Arize, roll your own — pick something, use it from day one.
Cost controls. Agentic loops are the fastest way to generate a surprise invoice. Set per-run budget limits. Alert on cost anomalies. Know your cost per successful task completion — that's the unit economics number that determines whether this thing is a product or a science project.
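Per-run budget limits are cheap to enforce: track spend as the run goes and abort before the next call would exceed the cap. A sketch with illustrative prices (not real API rates):

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, tokens, usd_per_1k=0.01):
        # Refuse the call *before* spending, not after.
        cost = tokens / 1000 * usd_per_1k
        if self.spent + cost > self.limit:
            raise BudgetExceeded(
                f"run at ${self.spent:.4f}, next call ${cost:.4f}, "
                f"limit ${self.limit:.2f}")
        self.spent += cost
        return cost

budget = RunBudget(limit_usd=0.05)
for _ in range(4):
    budget.charge(tokens=1000)   # $0.01 each, $0.04 total: fine
try:
    budget.charge(tokens=2000)   # would push the run to $0.06: abort
    raise AssertionError("should have hit the budget cap")
except BudgetExceeded:
    pass
```

Catching `BudgetExceeded` in the execution loop turns a runaway run into a logged, bounded incident instead of a surprise invoice.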
Auditability. For anything touching money, customers, compliance, or regulated data: you need a complete record of what the agent did, what it was instructed to do, and what a human approved. This is a legal requirement in some industries and a good idea everywhere else.
Human approval gates. Not as a fallback for when things break — as a designed feature of the workflow. The question is not "do we need a human in the loop" but "where does a human checkpoint create the best risk/value tradeoff in this specific workflow."
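Designed-in, a gate means write operations queue a pending action instead of executing. A sketch, where the in-memory queue stands in for whatever review UI your team actually uses:

```python
pending = {}   # stand-in for a review queue with a real UI
applied = []   # record of what a human actually approved

def request_write(action_id, description, fn):
    # The agent proposes; nothing executes yet.
    pending[action_id] = (description, fn)
    return {"status": "awaiting_approval", "id": action_id}

def approve(action_id):
    # Only an explicit human sign-off runs the deferred action.
    description, fn = pending.pop(action_id)
    applied.append(description)
    return fn()

resp = request_write("doc-update-1", "update positioning doc",
                     lambda: "written")
assert resp["status"] == "awaiting_approval"
assert applied == []            # nothing happened until sign-off
assert approve("doc-update-1") == "written"
assert applied == ["update positioning doc"]
```

The deferred-callable shape is the useful part: the agent does all the work of preparing the change, and the human decision is the only thing standing between "prepared" and "applied".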
The takeaway
Agents are not a replacement for system design. They are a component in a system you still have to design.
The teams shipping reliable agentic systems right now are doing it by treating the model as one piece of a carefully constrained workflow — not as the workflow itself. They have hard limits, explicit failure handling, human checkpoints at high-stakes decisions, and enough observability to understand what happened after the fact.
Start with a workflow. Pick one bounded, high-value, repetitive task. Add LLM steps where the task requires judgment. Instrument everything. Put a human checkpoint at the highest-stakes decision point. Ship it. Learn from the traces. Expand scope only when the narrow version is running cleanly.
The demo that impresses the room is fully autonomous and does something remarkable. The system that actually runs your business is boring, well-observed, and appropriately constrained — with a smart model inside it.
Build the second one first.
Related Posts
Agentic Workflows That Actually Work
How to build production agentic workflows with retry logic, audit trails, and human-in-the-loop checkpoints that survive real-world failure modes.
The Gap Between AI Demos and Production
The gap between AI demos and production: what happens when you deploy AI agents into incomplete data, hostile inputs, and users who don't read instructions.
OpenClaw Sent 500 Messages to My Wife
A real-world OpenClaw safety failure: my home automation agent sent 500 messages, got stuck in a loop, and ended up in Bloomberg.