Building Systems That Survive Contact With Humans
Designing for incomplete inputs, confused users, and adversarial reality.
TL;DR: Systems don’t fail because of bad code; they fail because humans are involved. This post covers practical patterns for building software that handles incomplete inputs, confused users, ambiguous requirements, and all the other ways reality diverges from the spec.
Humans Are the Spec
There’s a fiction we all inherited from CS programs and whiteboard interviews: the rational user. The person who reads the docs, fills in every field, follows the happy path, and never hits “submit” twice because their connection hiccuped. That person doesn’t exist. Never has.
I’ve spent the last few years building agentic AI systems: pipelines where an LLM calls tools, orchestrates multi-step workflows, and interfaces with humans at various checkpoints. And if there’s one thing this work has taught me, it’s that the model is never the hard part. The hard part is the seam where silicon meets skin.
Your user will paste a 47-page PDF into a field that expects a paragraph. They’ll approve step 3 of a workflow, disappear for six days, then come back and ask why step 4 “isn’t working.” They’ll give your agent instructions that contradict the instructions they gave it twenty minutes ago, and they’ll be frustrated with you when the output is incoherent.
This isn’t a bug in the users. This is the spec. The sooner you internalize that, the sooner your systems stop being fragile academic exercises and start being things that work in production.
Four Failure Modes to Design For
Most failure taxonomies focus on infrastructure: network partitions, disk failures, OOM kills. Those are solved problems with solved tooling. The failures that actually take your system down at 2 AM are human failures, and they’re not even dramatic ones. They’re mundane, repetitive, and completely predictable if you’re paying attention.
1. Incomplete Inputs
This is the big one, especially in AI systems. A user kicks off a workflow with “analyze this data and give me insights.” Which data? What kind of insights? For whom? By when? They don’t know. They’re figuring it out as they go. And honestly, so are you.
In traditional software, we handled this with form validation: red asterisks and error toasts. In agentic systems, incomplete input is the default state. Every prompt is underspecified. Every tool call is missing context. If your system can’t operate gracefully on partial information, it can’t operate at all.
2. Ambiguous Requirements
I once built an internal AI tool where three different VPs were “the stakeholder.” Each had a different definition of success. One wanted speed, one wanted accuracy, one wanted the output to “feel more human.” Those are three different systems. We built one and satisfied nobody.
Ambiguous requirements aren’t a phase you get past in discovery. They’re a permanent condition. Stakeholders rotate. Priorities shift quarterly. The person who signed off on the PRD left the company. Requirements aren’t a foundation you build on; they’re weather you build through.
3. Long-Running Workflows
Here’s where agentic AI gets really interesting and really dangerous. You spin up a multi-step pipeline: ingest data, enrich it, generate a draft, route it for human review, incorporate feedback, produce a final output. Elapsed time: maybe three days if everyone’s responsive. Maybe three weeks if they’re not.
In that window, everything drifts. The underlying data changes. The user’s mental model of what they asked for evolves. The LLM you’re calling ships a new version with subtly different behavior. The human reviewer forgets what they were reviewing and why. Your system has to handle all of this, or it has to handle none of it and be honest that it’s a one-shot tool, not a workflow.
4. Murphy Conditions
I don’t mean adversarial attacks; those are a topic for a different post. I mean the sheer combinatorial weirdness that emerges when real humans interact with real systems at any kind of scale.
A user copy-pastes a prompt that includes invisible Unicode characters. Someone’s browser auto-translates the UI into Portuguese before they submit a form. A power user discovers they can chain your API endpoints in a way that creates an infinite loop. Nobody planned this. Nobody is being malicious. It’s just the universe doing QA.
In AI systems, Murphy conditions are especially brutal because the LLM will try to be helpful no matter what garbage it receives. It won’t throw an exception; it’ll hallucinate a reasonable-looking response to an unreasonable input, and now you have a confident-sounding wrong answer propagating through your pipeline.
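The defenses are often cheap, though. Here’s a minimal sketch of a sanitization pass that catches the invisible-Unicode case above; the function is illustrative, not any particular framework’s API.

```python
import unicodedata

def sanitize_prompt(raw: str) -> str:
    """Normalize pasted text before it ever reaches the model."""
    # NFKC collapses visually identical variants (full-width characters, ligatures).
    text = unicodedata.normalize("NFKC", raw)
    # Drop invisible format characters (zero-width spaces, joiners, BOMs):
    # Unicode category "Cf" covers the usual copy-paste stowaways.
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return cleaned.strip()
```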
Patterns That Actually Help
None of these are revolutionary. That’s the point. The revolutionary stuff is the model architecture, the training runs, the novel agent frameworks. The stuff that keeps it all alive in production is boring on purpose.
Make “Incomplete” a Valid State
Stop treating missing information as an error condition. It’s the most common condition.
In practice, this means progressive disclosure and staged completion. Let users kick off a workflow with minimal input and refine as they go. Build your data models with nullable fields and your UIs with sensible defaults. If your AI agent needs clarification, have it ask, but also have it make its best attempt with what it has and show its work.
I’ve started designing every agentic pipeline with an explicit “confidence threshold.” Below it, the system asks for more input. Above it, it proceeds and flags assumptions. The threshold is tunable per use case. This alone has cut our “user abandoned the workflow” rate by about 40%, because we stopped blocking people with questions they couldn’t answer yet.
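The gate itself is small. Here’s a rough sketch of the shape, assuming the agent returns a structured self-assessment alongside its draft; the names and the 0.6 default below are illustrative, not production code.

```python
from dataclasses import dataclass, field

@dataclass
class IntakeAssessment:
    """What the agent believes it knows about an underspecified request."""
    confidence: float                      # 0.0-1.0, estimated by the agent
    assumptions: list[str] = field(default_factory=list)
    missing: list[str] = field(default_factory=list)

def route_intake(assessment: IntakeAssessment, threshold: float = 0.6) -> dict:
    """Below the threshold, ask for more input; above it, proceed and flag assumptions."""
    if assessment.confidence < threshold:
        return {
            "action": "ask_user",
            "questions": assessment.missing or ["What outcome are you after?"],
        }
    return {
        "action": "proceed",
        "flagged_assumptions": assessment.assumptions,
    }
```

The hard part isn’t the gate; it’s calibrating the threshold per use case and making the flagged assumptions visible to the user.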
Build for Change
When requirements are still forming (and they’re always still forming), hard-coding behavior is debt that comes due fast.
Configuration over code. Feature flags. Prompt templates stored in a database, not in your source tree. If you’re building with LLMs, this is doubly important: you will need to change the prompt. You will need to swap models. You will need to A/B test different system instructions for different user segments. Bake that flexibility in from day one or accept that every “small tweak” will require a deploy.
The pattern I keep coming back to is treating the entire AI orchestration layer as a directed graph where nodes (tool calls, LLM invocations, human checkpoints) are configurable and edges (routing logic, fallback paths) are data-driven. When a stakeholder says “actually, we want a human review step before the final output,” that’s a config change, not a refactor.
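Concretely, the graph can be as plain as a dictionary stored wherever your config lives. The sketch below is illustrative rather than any particular framework’s schema; the node names, template identifiers, and routing helper are all assumptions.

```python
# A workflow defined as data. Adding a human review step means editing this
# structure (or the database row it lives in), not the code that walks it.
WORKFLOW = {
    "entry": "ingest",
    "nodes": {
        "ingest":       {"kind": "tool",  "tool": "fetch_documents"},
        "draft":        {"kind": "llm",   "prompt_template": "draft_v3"},
        "human_review": {"kind": "human", "timeout_hours": 72},
        "finalize":     {"kind": "llm",   "prompt_template": "finalize_v1"},
    },
    "edges": {
        "ingest":       {"default": "draft"},
        "draft":        {"default": "human_review"},
        "human_review": {"approved": "finalize", "rejected": "draft"},
        "finalize":     {"default": None},  # terminal node
    },
}

def next_node(graph: dict, current: str, outcome: str = "default") -> str | None:
    """Routing is a dictionary lookup, so rerouting is a data change."""
    edges = graph["edges"][current]
    return edges.get(outcome, edges.get("default"))
```

The human review step the stakeholder asked for is the `human_review` entry plus two rewired edges; adding or removing it never touches the code that walks the graph.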
Design for Resumption
If a user can’t walk away from your system and come back tomorrow without losing progress, your system is a toy.
Checkpointing is table stakes. Every meaningful state transition should be persisted. But the subtler piece is idempotency: making it safe for a user (or an agent) to retry any step without side effects. This matters enormously in AI pipelines where a single step might call an external API, charge a credit, or send an email. “The user hit refresh” cannot be a catastrophic event.
Explicit timeouts are the other half. Long-running workflows need deadlines: not to punish users, but to force the system to make a decision. If a human review step hasn’t been completed in 72 hours, what happens? If the answer is “nothing, it just sits there forever,” you don’t have a workflow. You have a queue that grows until someone notices.
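Idempotency and checkpointing often collapse into one small wrapper around each step. A sketch, assuming a generic persistent key-value `store` with `get`/`set`; the names are illustrative.

```python
import hashlib
import json

def step_key(workflow_id: str, step_name: str, payload: dict) -> str:
    """Deterministic idempotency key: same step plus same inputs yields the same key."""
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return f"{workflow_id}:{step_name}:{digest}"

def run_step(store, workflow_id: str, step_name: str, payload: dict, fn):
    """Replay-safe execution: a retry (or an impatient refresh) returns the
    checkpointed result instead of re-charging, re-sending, or re-calling."""
    key = step_key(workflow_id, step_name, payload)
    cached = store.get(key)
    if cached is not None:
        return cached
    result = fn(payload)        # the side-effecting work runs only when there's no checkpoint
    store.set(key, result)      # checkpoint before the workflow moves on
    return result
```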
Add Bulkheads
Borrowed from ship design: compartmentalize so that one failure doesn’t sink everything.
In agentic systems, this means guardrails around every LLM call. Output validation. Schema enforcement on structured outputs. Circuit breakers that trip when an external service starts returning garbage. And critically, sane default behavior: when something unexpected happens, your system should do the least dangerous thing, not the most helpful thing. Helpful is what got us hallucinated legal citations in court filings.
I run every LLM output through a lightweight validation layer before it touches anything downstream. Is it valid JSON if I asked for JSON? Does it reference entities that actually exist in our system? Is the sentiment within expected bounds for this use case? These checks catch maybe 5% of outputs, but that 5% is where the catastrophic failures live.
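A validation layer like this doesn’t need to be sophisticated. Here’s a hand-rolled sketch with illustrative field names; in practice a JSON Schema or Pydantic model does the same job.

```python
import json

def validate_output(raw: str, known_entity_ids: set[str]) -> dict:
    """Reject model output that is malformed or references things that don't exist,
    before anything downstream can act on it."""
    try:
        data = json.loads(raw)                     # asked for JSON, so it must parse
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned non-JSON output: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = {"entity_id", "summary"} - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    if data["entity_id"] not in known_entity_ids:  # does it reference something real?
        raise ValueError(f"unknown entity: {data['entity_id']!r}")

    return data                                    # only validated output flows downstream
```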
Observe the Humans
Your dashboards are probably showing you the wrong things. Error rates, latency percentiles, throughput: those are infrastructure metrics. They tell you if the machine is healthy. They tell you nothing about whether the humans are succeeding.
What you actually need: workflow completion rates. Time-to-completion distributions. Where users drop off. Where they retry. Where they override the AI’s suggestion versus accept it. Which prompts produce outputs that users actually use versus outputs they immediately regenerate.
I’ve started treating every human-AI interaction point as an implicit feedback signal. If a user regenerates a response, the first one was wrong, or at least unsatisfying. If they edit 80% of a generated draft, your generation isn’t saving them time. If they consistently skip a step, that step shouldn’t exist. This telemetry is more valuable than any eval benchmark.
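Rolling those signals up doesn’t take much code. A sketch, with illustrative event names:

```python
from collections import Counter

def interaction_signals(events: list[dict]) -> dict:
    """Turn raw human-AI interaction events into the metrics worth watching."""
    counts = Counter(e["action"] for e in events)   # e.g. {"action": "regenerated", ...}
    total = sum(counts.values()) or 1               # avoid division by zero
    return {
        "acceptance_rate":   counts["accepted"] / total,
        "regeneration_rate": counts["regenerated"] / total,  # "the first one was wrong"
        "edit_rate":         counts["edited"] / total,       # is generation saving time?
        "skip_rate":         counts["skipped"] / total,      # should this step exist?
    }
```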
A Case Study
I’ll keep this vague enough not to violate any NDAs, and specific enough to be useful.
Last year, I built a document processing pipeline for a financial services client. The system ingested compliance documents (PDFs, Word files, occasionally scanned images), extracted key obligations, mapped them to internal policies, and generated a gap analysis report. Standard RAG-plus-orchestration stuff.
The first version was architecturally sound and completely unusable.
What went wrong: We designed for the happy path. Clean PDFs with selectable text, one document per submission, a compliance analyst who would review the output and provide structured feedback. Reality: analysts submitted batch uploads of 30+ documents at once, half of which were scanned at an angle. They’d start a review, get pulled into a meeting, come back two days later, and expect their session state to be intact. The gap analysis output was too long and too detailed: analysts wanted a summary they could skim in 60 seconds, not a 15-page report.
What we changed:
Incomplete inputs: We added an ingestion preprocessing step that classified documents by quality and type, then routed low-quality inputs to an OCR enhancement pipeline automatically. Users stopped getting cryptic errors about “unparseable content” and started getting a progress indicator showing their documents moving through cleanup.
Long-running workflows: We broke the monolithic pipeline into independently resumable stages with persistent state. Each stage had an explicit SLA. If human review stalled beyond 48 hours, the system sent a nudge. Beyond 96 hours, it escalated to a manager. The workflow could be resumed from any checkpoint.
Ambiguous requirements: We replaced the single 15-page output with three views: executive summary (the 60-second skim), detailed findings (the working document), and raw evidence (the audit trail). Analysts self-selected the view they needed. This was a config change, not a rebuild, because we’d built the output layer as a templating system from the start.
Murphy conditions: One analyst discovered they could submit the same document to multiple concurrent workflows, creating duplicate obligations in the system. We added deduplication at the ingestion layer and idempotency keys on every document (a sketch of the fingerprinting approach follows below). Another analyst submitted a document in Mandarin. We added language detection and routing to a translation step. We didn’t predict these, but because each stage was isolated with its own validation, the blast radius was contained.
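For the duplicate-submission fix in particular, a content fingerprint at the ingestion boundary was enough. The sketch below is illustrative: `store` stands in for whatever persistence layer you use, and its `get`/`link`/`create` methods are hypothetical.

```python
import hashlib

def document_fingerprint(content: bytes) -> str:
    """Identical documents get identical keys, regardless of filename or workflow."""
    return hashlib.sha256(content).hexdigest()

def ingest_document(store, workflow_id: str, filename: str, content: bytes) -> str:
    """Deduplicate at the door: the same document submitted to concurrent workflows
    maps to one canonical record instead of duplicate obligations."""
    key = document_fingerprint(content)
    existing_id = store.get(key)                 # hypothetical persistent lookup
    if existing_id is not None:
        store.link(existing_id, workflow_id)     # attach to this workflow, don't re-create
        return existing_id
    return store.create(key, filename, content)  # the fingerprint doubles as the idempotency key
```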
The result wasn’t a more sophisticated AI. The model and the prompts barely changed. What changed was everything around the model: the scaffolding that let the system bend instead of break when humans did human things.
Key Takeaways
- Incomplete input is the default, not the exception. Design your systems, especially AI systems, to operate usefully on partial information and refine iteratively.
- Requirements are weather, not foundation. Build orchestration layers that treat routing, prompts, and workflow steps as configuration, not code.
- If it can’t be resumed, it’s not a workflow. Checkpoint state, enforce idempotency, and set explicit timeouts on every human-dependent step.
- Validate everything the model produces. LLMs fail silently and confidently. A lightweight validation layer between your model and your downstream systems is the cheapest insurance you’ll ever buy.
- Instrument the humans, not just the machines. Completion rates, drop-off points, and override patterns will tell you more about your system’s health than any infrastructure dashboard.
Every system is a human system. The sooner you accept that, the better your software gets.