The Seven-Layer AI Agent Stack

By Chris Boyd

Most teams building agentic AI systems focus on the model. That's the wrong layer to obsess over. The model is table stakes. What breaks in production isn't the LLM, it's everything around it.

I laid out the full framework for thinking about agents, copilots, and workflows in Most Agents Are Just Prompt Chains With Better Branding. This post goes one level deeper on the production stack itself: the seven layers every agentic system needs, what each one actually does, and the mistake teams make with each one.


Layer 1: Model

What it is: The LLM doing the reasoning, inference, decision-making, tool selection, and language generation.

Why it matters: The model is the only layer with probabilistic output. Everything else in your stack is deterministic. That asymmetry is the entire design problem.

The common mistake: Treating model selection as the primary architectural decision. Teams spend weeks benchmarking GPT-5.4 vs. Claude vs. Gemini and ship with no constraints layer and no evaluation harness. Prompt and context engineering will handle the tone and the vibes, regardless of the model, in 99% of cases. The model will perform fine. The system around it will not.

Model choice matters at the margins. Context window, cost per token, and tool-calling reliability are the real selection criteria for production, not benchmark leaderboard position.


Layer 2: Tools

What it is: The functions the model can invoke, including API calls, database reads and writes, code execution, web search, file operations, and external service integrations.

Why it matters: Tools are where the model touches the real world. Every tool you expose is both a capability and a risk surface. The model can call the right tool at the wrong time, with the wrong parameters, more times than you intended, in a sequence you didn't anticipate.

The common mistake: Treating tools as a feature list instead of a threat model. Before you give your agent a tool, answer two questions: what's the worst it can do with this, and do you have a rollback path? If you can't answer both, the tool isn't ready for production.

This is also where scope creep kills you. More tools means more surface area means harder evaluation. Start minimal. Add tools when you have evidence they're needed, not because the model might find them useful.
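One way to make the two threat-model questions non-optional is to enforce them at registration time. The sketch below is illustrative, not a real framework API: `Tool`, `ToolRegistry`, and the field names are assumptions, but the idea is that a tool without a worst-case analysis and a rollback path simply cannot be registered.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    func: Callable
    worst_case: str                      # answer to: what's the worst it can do?
    rollback: Optional[Callable] = None  # answer to: how do we undo it?

class ToolRegistry:
    """Hypothetical registry that refuses tools missing a threat model."""

    def __init__(self):
        self._tools = {}

    def register(self, tool: Tool):
        if not tool.worst_case:
            raise ValueError(f"{tool.name}: no worst-case analysis")
        if tool.rollback is None:
            raise ValueError(f"{tool.name}: no rollback path")
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]
```

The registry also doubles as your minimal-surface-area inventory: if a tool isn't in it, the agent can't see it.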


Layer 3: State

What it is: The memory of the current run, including conversation history, intermediate results, accumulated context, retrieved documents, and tool outputs.

Why it matters: State management is the difference between a system that reasons coherently across a multi-step task and one that hallucinates its own prior decisions. The model has no native memory between calls. You are responsible for what it knows about what it's already done.

The common mistake: Either ignoring state entirely (the model loses context mid-task and starts over) or stuffing everything into the context window and wondering why performance degrades and costs spike. State needs a strategy: what to keep, what to compress, what to drop, and when.

Most frameworks have an opinion on state. Most of those opinions are optimistic. Build your own handling layer and treat the framework's defaults as a starting point, not a solution.
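A keep/compress/drop strategy can be surprisingly small. This is a sketch under simplifying assumptions (character counts stand in for tokens, truncation stands in for real summarization): recent turns are kept verbatim, older turns are folded into a running summary, and the summary is trimmed from the front when it exceeds a budget.

```python
class AgentState:
    """Illustrative state manager: keep recent, compress old, drop oldest."""

    def __init__(self, keep_recent=6, budget_chars=4000):
        self.keep_recent = keep_recent
        self.budget_chars = budget_chars
        self.summary = ""    # compressed older context
        self.messages = []   # recent turns, kept verbatim

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Compress: fold turns beyond the recent window into the summary.
        while len(self.messages) > self.keep_recent:
            old = self.messages.pop(0)
            self.summary += f"[{old['role']}] {old['content'][:200]}\n"
        # Drop: trim the summary from the front when over budget.
        if len(self.summary) > self.budget_chars:
            self.summary = self.summary[-self.budget_chars:]

    def context(self):
        prefix = ([{"role": "system", "content": "Prior context:\n" + self.summary}]
                  if self.summary else [])
        return prefix + self.messages
```

In a real system you'd replace the truncation with an actual summarization call and count tokens, but the shape of the decision (keep, compress, drop, and when) stays the same.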


Layer 4: Execution Loop

What it is: The orchestration logic that takes model output, parses tool calls, executes them, feeds results back, and decides when to stop. In a workflow, you write this loop yourself. In a true agent, the model drives it, and you constrain it.

Why it matters: The execution loop is where agentic behavior actually happens. It's also where infinite loops, runaway tool calls, and compounding errors live.

The common mistake: No exit conditions. Teams build agents that can loop indefinitely, with no max step count, no cost ceiling, no timeout, and no graceful degradation path. The model doesn't know when to stop unless you tell it. Define that before you define anything else.
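A loop with every exit condition explicit might look like the following sketch. `call_model` and `execute_tool` are assumed callables standing in for your real model client and tool executor; the point is that the step count, cost ceiling, and timeout are checked in code on every iteration, before any model call.

```python
import time

class BudgetExceeded(Exception):
    pass

def run_agent(call_model, execute_tool, max_steps=10,
              max_cost_usd=1.00, timeout_s=120):
    """Illustrative execution loop where no exit condition is implicit."""
    start, cost, steps = time.monotonic(), 0.0, 0
    history = []
    while True:
        if steps >= max_steps:
            raise BudgetExceeded(f"max steps ({max_steps}) reached")
        if cost >= max_cost_usd:
            raise BudgetExceeded(f"cost ceiling (${max_cost_usd}) reached")
        if time.monotonic() - start > timeout_s:
            raise BudgetExceeded(f"timeout ({timeout_s}s) reached")

        reply = call_model(history)          # assumed: returns a dict
        cost += reply.get("cost", 0.0)
        steps += 1

        if reply.get("tool_call"):
            result = execute_tool(reply["tool_call"])
            history.append({"role": "tool", "content": result})
        else:
            return reply["content"]          # model finished: graceful exit
```

Raising on budget exhaustion is one choice; returning a partial result with a degraded-mode flag is another. Either way, the decision is made by your code, not by the model.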


Layer 5: Constraints

What it is: The guardrails on model behavior, including what topics are off-limits, what actions require confirmation, what outputs get blocked, and what the agent is explicitly not allowed to do.

Why it matters: Constraints are your production safety net. Without them, you're relying entirely on the model's judgment, which is probabilistic, not reliable.

The common mistake: Writing constraints as suggestions. "Don't do X" in a system prompt is not a constraint. It's a preference. Real constraints are enforced at the execution layer, not requested in the prompt. If a guardrail only exists in natural language, assume it will eventually be violated.
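The difference between a preference and a constraint is where it lives. A minimal sketch of execution-layer enforcement, with hypothetical tool names: the check runs in code before the tool executes, so it holds no matter what the prompt said or what the model decided.

```python
BLOCKED_TOOLS = {"delete_account"}            # never allowed
NEEDS_CONFIRMATION = {"send_email", "write_db"}  # allowed only with sign-off

class ConstraintViolation(Exception):
    pass

def guarded_execute(tool_name, args, execute, confirmed=False):
    """Enforce guardrails in code, not in the prompt.

    `execute` is an assumed callable that actually runs the tool.
    """
    if tool_name in BLOCKED_TOOLS:
        raise ConstraintViolation(f"{tool_name} is not allowed")
    if tool_name in NEEDS_CONFIRMATION and not confirmed:
        raise ConstraintViolation(f"{tool_name} requires confirmation")
    return execute(tool_name, args)
```

You can still put "don't do X" in the system prompt to steer the model away from wasted attempts, but the prompt is a hint and this wrapper is the constraint.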


Layer 6: Security

What it is: The controls that protect your system from adversarial inputs, data leakage, and unauthorized actions. This includes input sanitization, output filtering, prompt injection defense, and audit logging of every tool call and model decision.

Why it matters: An agentic system with access to real tools and real data is a high-value target. Prompt injection, jailbreaks, and indirect attacks through external data sources are not theoretical. They happen in production, and they're harder to detect than a failed API call.

The common mistake: Treating security as a deployment checklist item instead of a layer of the architecture. A system without security controls isn't just exposed, it's a liability you own. Build security into the execution loop, not onto it after the fact. Log everything. Review the logs.
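"Log everything" is easiest when it's impossible to skip. One sketch, assuming a generic `execute` callable: every tool call goes through a wrapper that emits a structured audit record whether the call succeeds or fails, so the log is a property of the execution loop rather than something individual tools remember to do.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("agent.audit")

def audited_call(tool_name, args, execute):
    """Wrap a tool call so an audit record is always written."""
    record = {"ts": time.time(), "tool": tool_name, "args": args}
    try:
        result = execute(tool_name, args)
        record["status"] = "ok"
        return result
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # finally guarantees the record is emitted on success and on failure
        audit.info(json.dumps(record))
```

Input sanitization and injection defense need their own controls on top of this; the audit trail is what lets you detect the ones that get through.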


Layer 7: Evaluation

What it is: The system for measuring whether your agent is doing what you think it's doing. This includes unit-level prompt tests, end-to-end task completion checks, regression suites, and human review pipelines.

Why it matters: You cannot eyeball a production agent. The output space is too large, the failure modes are too subtle, and the cost of silent degradation is too high. Without eval, every deploy is a guess.

The common mistake: Skipping eval entirely because it's hard to define success for open-ended tasks. That difficulty is exactly why it matters. Start with the things you can measure: tool call accuracy, task completion rate, output format compliance. Build toward the harder stuff. Something is better than nothing, and nothing is how teams end up with agents they can't trust and can't improve.
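The three starter metrics fit in a few lines. This harness is a sketch: the record structure (`tool_calls`, `expected_tool_calls`, `task_completed`, `output`) is an assumption about how you log runs, and "format compliance" here just means the output parses as JSON.

```python
import json

def evaluate(cases):
    """Score recorded agent runs on three measurable criteria."""
    totals = {"tool_accuracy": 0, "completed": 0, "format_ok": 0}
    for case in cases:
        # Tool call accuracy: did the agent call what we expected, in order?
        if case["tool_calls"] == case["expected_tool_calls"]:
            totals["tool_accuracy"] += 1
        # Task completion rate, as judged by the run's recorded outcome.
        if case["task_completed"]:
            totals["completed"] += 1
        # Output format compliance: valid JSON in this sketch.
        try:
            json.loads(case["output"])
            totals["format_ok"] += 1
        except (ValueError, TypeError):
            pass
    n = len(cases)
    return {k: v / n for k, v in totals.items()}
```

Run it on every deploy candidate and you have a regression suite; the harder, open-ended judgments can layer on later without replacing it.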


Seven layers. Most production failures trace back to one of them being absent or underdeveloped. The model is rarely the problem. The stack around it almost always is.
