8 min read • John McBride
# Agent Harnesses Are the New Assembly Line
LLM demos are easy. Dependable delivery needs a harness: deterministic orchestration, budget caps, verification agents, and human gates. What to ask vendors.
Tags: agents, ai-strategy, automation, engineering
Ford didn't win because his workers were better than everyone else's. He won because of the line. The conveyor, the fixed stations, the inspection gates, the takt time — the system around the workers is what made output predictable.
We're at the same moment with AI agents. The models are the workers, and frankly, everybody has access to roughly the same workers. The thing that separates a flashy demo from a system you'd bet a deadline on is the structure wrapped around the model. I call that structure a harness, and after running several of them in production, I've come to a blunt conclusion: the harness is the product.
## What a harness actually is
A harness is the deterministic machinery around a non-deterministic model. Four parts, every time:
**Orchestration.** A job queue, defined job types, and a fixed pipeline of stages. Which agent runs, in what order, with which tools, writing to which workspace. None of this is left to the model's discretion.
**Budgets.** Hard caps on cost, turns, and wall-clock time, per job. Enforced by code that can interrupt the model mid-task — not by asking the model to please be frugal.
**Verification.** Independent checkers whose only job is to attack the output: compilers, test suites, and a reviewer agent that was never involved in producing the work.
**Human gates.** Specific points where a person must approve before anything irreversible happens. Merges, sends, deploys, spend.
If a vendor shows you an agent without showing you these four things, you're looking at a demo, not a system.
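One way to make the four parts concrete is to require every job type to declare them up front. The sketch below is illustrative only: `Budget`, `JobType`, `missing_parts`, and every number in it are hypothetical names and values, not taken from any particular system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Budget:
    max_usd: float       # hard cost cap per job
    max_turns: int       # hard cap on model turns per job
    max_seconds: float   # hard wall-clock cap per job


@dataclass(frozen=True)
class JobType:
    name: str
    stages: Tuple[str, ...]       # fixed order, chosen by the harness, not the model
    budget: Budget
    verifiers: Tuple[str, ...]    # independent checks run against the output
    human_gates: Tuple[str, ...]  # irreversible actions that need a signature


def missing_parts(job: JobType) -> List[str]:
    """Return which of the four harness parts this job type fails to declare."""
    missing = []
    if not job.stages:
        missing.append("orchestration")
    b = job.budget
    if b.max_usd <= 0 or b.max_turns <= 0 or b.max_seconds <= 0:
        missing.append("budgets")
    if not job.verifiers:
        missing.append("verification")
    if not job.human_gates:
        missing.append("human gates")
    return missing


# Hypothetical job types; the caps are placeholders, not recommendations.
BUILD = JobType(
    name="build",
    stages=("plan", "code", "test", "review"),
    budget=Budget(max_usd=25.0, max_turns=200, max_seconds=3600),
    verifiers=("test_suite", "reviewer_agent"),
    human_gates=("merge",),
)
DEMO = JobType("demo", ("code",), Budget(0, 0, 0), (), ())
```

Under these assumptions, `missing_parts(BUILD)` comes back empty, while `missing_parts(DEMO)` names three of the four parts as absent, which is roughly what "a demo, not a system" looks like in code.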
## Chat-assisted coding is a power tool. A harness is a factory.
Most teams today use AI the way you'd use a very fast intern sitting next to you: prompt, read, correct, prompt again. That's genuinely useful — I work that way every day — but it scales exactly as far as your attention does. The human is the orchestrator, the budget, the verifier, and the gate, all at once. The moment you look away, work stops.
A harnessed run is a different animal. Last year I ran a 28-hour autonomous build: twelve specialized agents building a full-stack Next.js application end to end — planning, coding, writing 236 automated tests, iterating on failures — with zero human coding in the loop. I wasn't typing prompts for 28 hours. I designed the harness, started the run, and reviewed the output at the gates.
The interesting part isn't that the agents wrote code. Any model can write code. The interesting part is what made 28 unattended hours survivable: every agent had a bounded job, every job had a budget, failures rolled back instead of compounding, and the test suite acted as a continuous, merciless inspector. Take away the harness and the same twelve agents would have drifted into confident nonsense by hour three.
That's the contrast executives should internalize. Chat-assisted AI multiplies one person. A harness keeps multiplying output while you sleep, but only because the determinism lives in the harness, not in the model's good intentions.
## The dark factory pattern
The most advanced version of this I've built runs what I think of as a dark factory — a manufacturing term for a plant that runs with the lights off because no humans are on the floor. Humans are still in the building. They're just stationed at the gates instead of on the line.
The pipeline looks like this:
1. **Intake.** A scheduled job scans incoming material — meeting transcripts, work requests — on a fixed interval.
2. **Classification.** A triage agent decides whether there's actionable work and what kind. Most material is correctly thrown away. This stage exists so expensive agents never see junk.
3. **Specification.** A writer agent produces a PRD and deliverable briefs. A human approves the plan before anything gets built. That's gate one.
4. **Production.** Builder agents do the work in isolated git worktrees on their own branches. Isolation matters: an agent that goes sideways can't touch anything outside its sandbox.
5. **Adversarial review.** A separate reviewer agent — different prompt, different incentives, no stake in the work — hunts for security issues, breakage, scope creep, and subtle errors like wrong model identifiers. It emits a structured verdict: approve, fix, or reject.
6. **Human gate.** A person approves every merge. That's gate two. Nothing the factory produces reaches production without a human signature.
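The six steps above can be sketched as a fixed stage list that the harness walks in order, halting whenever it reaches a gate that has not been approved. This is a minimal sketch under stated assumptions: the stage names, the `Job` fields, and the convention that a handler returns `False` to stop the job are all hypothetical, and the handlers stand in for real agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Fixed order, owned by the harness. Entries starting with "gate:" are human gates.
STAGES = [
    "intake", "classification", "specification", "gate:plan",
    "production", "adversarial_review", "gate:merge",
]


@dataclass
class Job:
    payload: dict
    approvals: Set[str] = field(default_factory=set)  # gates a human has signed
    history: List[str] = field(default_factory=list)  # stages already completed
    status: str = "running"


def run_pipeline(job: Job, handlers: Dict[str, Callable[[Job], bool]]) -> Job:
    """Advance the job until it finishes, is stopped, or hits an unapproved gate."""
    if job.status.startswith("stopped"):
        return job  # a stopped job never resumes
    for stage in STAGES:
        if stage in job.history:
            continue  # already done on an earlier call; resume after it
        if stage.startswith("gate:"):
            if stage not in job.approvals:
                job.status = f"waiting:{stage}"  # halt and wait for a person
                return job
        elif not handlers[stage](job):
            # Handler declined to continue (e.g. triage found nothing actionable).
            job.history.append(stage)
            job.status = f"stopped:{stage}"
            return job
        job.history.append(stage)
    job.status = "done"
    return job
```

Calling `run_pipeline` again after a human adds `"gate:plan"` to `job.approvals` resumes from production. The point of the design is that the model never decides which stage comes next; it only does the work inside one.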
The adversarial review step is the one most teams skip, and it's the one that matters most. Asking an agent to check its own work is like asking a student to grade their own exam — the same blind spots that produced the error will excuse it. The reviewer has to be a different agent with an explicitly hostile mandate. In my runs, the reviewer catches real problems regularly: things that compile, pass a casual read, and would have quietly broken something downstream.
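A structured verdict is only useful if the harness parses it strictly and fails closed. The sketch below assumes, purely for illustration, that the reviewer emits JSON with a `verdict` field and an optional `findings` list; that wire format, the names `Verdict`, `Review`, and `parse_review`, and the rule that an approval carrying findings is downgraded to a fix are all assumptions of this example.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    APPROVE = "approve"
    FIX = "fix"
    REJECT = "reject"


@dataclass(frozen=True)
class Review:
    verdict: Verdict
    findings: List[str]


def parse_review(raw: str) -> Review:
    """Parse reviewer output; anything malformed is treated as a rejection."""
    try:
        data = json.loads(raw)
        verdict = Verdict(data["verdict"])  # ValueError on an unknown verdict
        findings = [str(f) for f in data.get("findings", [])]
    except (ValueError, KeyError, TypeError, AttributeError):
        # Fail closed: unparseable output never counts as approval.
        return Review(Verdict.REJECT, ["unparseable reviewer output"])
    if verdict is Verdict.APPROVE and findings:
        # Assumption of this sketch: approval with open findings is really a fix.
        return Review(Verdict.FIX, findings)
    return Review(verdict, findings)
```

The choice to map every parse failure to a rejection mirrors the reviewer's hostile mandate: the default answer is no, and approval has to be earned with well-formed output.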
## Budgets are a control system, not a finance feature
Here's a detail that sounds like accounting but is actually safety engineering: every job type in my factory carries a hard budget — a dollar cap and a turn cap, set per job type, with a build job allowed more room than a triage job. The runtime prices every model interaction itself, in real time, and interrupts the agent mid-stream the moment a cap trips. It never trusts the model's self-reporting, and it never waits for an after-the-fact invoice.
Why be this paranoid? Because an unbounded agent loop is the AI equivalent of an unattended machine with no emergency stop. Agents fail in a particular way: they don't crash, they persist. A confused agent will happily spend hours and real money pursuing a doomed approach with total confidence. Budgets convert "this could run away" into "this fails closed at a known cost."
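To make "interrupts the agent mid-stream" concrete, here is a minimal sketch of a meter that prices each streamed chunk as it arrives and raises the moment a cap trips. Everything in it is an assumption for illustration: the per-token rates, the `(text, token_count)` chunk shape, and the names `BudgetMeter`, `BudgetExceeded`, and `run_turn` do not come from any real provider SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Iterable, Tuple


class BudgetExceeded(Exception):
    """Raised by the harness, not the model, when any cap is crossed."""


@dataclass
class BudgetMeter:
    max_usd: float
    max_turns: int
    max_seconds: float
    usd_per_1k_input: float    # hypothetical price, supplied by the operator
    usd_per_1k_output: float   # hypothetical price, supplied by the operator
    spent_usd: float = 0.0
    turns: int = 0
    started: float = field(default_factory=time.monotonic)

    def start_turn(self) -> None:
        self.turns += 1
        self._check()

    def charge(self, input_tokens: int = 0, output_tokens: int = 0) -> None:
        self.spent_usd += input_tokens / 1000 * self.usd_per_1k_input
        self.spent_usd += output_tokens / 1000 * self.usd_per_1k_output
        self._check()

    def _check(self) -> None:
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded("cost cap")
        if self.turns > self.max_turns:
            raise BudgetExceeded("turn cap")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock cap")


def run_turn(meter: BudgetMeter, prompt_tokens: int,
             stream: Iterable[Tuple[str, int]]) -> str:
    """Consume one streamed model turn, charging the meter chunk by chunk."""
    meter.start_turn()
    meter.charge(input_tokens=prompt_tokens)
    parts = []
    for text, tokens in stream:
        meter.charge(output_tokens=tokens)  # may raise and abandon the stream
        parts.append(text)
    return "".join(parts)
```

Because the check runs inside the loop that reads the stream, a tripped cap stops consumption at the next chunk rather than after the turn completes. A caller would catch `BudgetExceeded`, mark the job failed, and roll back its workspace.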
On top of per-job budgets sits a global kill switch — one flag that pauses the entire factory. I've come to think of the kill switch the way manufacturers think of the andon cord: the system isn't trustworthy because it never fails, it's trustworthy because anyone can stop it instantly.
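A global kill switch can be as small as one shared flag that every worker checks before picking up the next job. The sketch below uses a `threading.Event` for that flag; the `KillSwitch` and `drain` names and the in-memory `deque` queue are hypothetical stand-ins for whatever a real factory uses.

```python
import threading
from collections import deque
from typing import Callable, Deque, List


class KillSwitch:
    """One flag for the whole factory; safe to set from any thread."""

    def __init__(self) -> None:
        self._pulled = threading.Event()

    def pull(self) -> None:
        self._pulled.set()

    def reset(self) -> None:
        self._pulled.clear()

    @property
    def pulled(self) -> bool:
        return self._pulled.is_set()


def drain(queue: Deque, run_job: Callable, switch: KillSwitch) -> List:
    """Run queued jobs in order, stopping as soon as the switch is pulled."""
    completed = []
    while queue and not switch.pulled:
        job = queue.popleft()
        completed.append(run_job(job))
    return completed  # anything still in the queue stays there, untouched
```

This version only checks between jobs. To pause work already in flight, the same flag could also be consulted wherever per-job caps are checked, so a pulled switch interrupts a running agent the same way a tripped budget does.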
## Three questions to ask any agent vendor
If you're a CTO or founder evaluating an agentic AI pitch — internal or external — the demo will look great. Demos always look great. Ask these instead:
**Where are the verification agents?** Who checks the work, and is the checker independent of the producer? If the answer is "the agent validates its own output" or "the model is very accurate," walk away. You want a named adversarial stage with structured verdicts and a track record of rejecting work.
**Where are the budget caps?** What's the maximum this system can spend or do before a human notices? Ask for the number. If they can't give you a per-job cost ceiling and show you the code path that enforces it mid-execution, the real cap is your credit card limit.
**Where is the human gate?** What, specifically, requires a human signature? Merges? Outbound emails? Production deploys? "Human in the loop" is a slogan; a gate is a place in the pipeline where the system halts and waits. Ask to see the queue of things waiting for approval. A real system has one.
A vendor running a genuine harness will answer all three in under a minute, because these decisions were the hard part of the build. A vendor running a wrapped chat model will pivot to talking about the model.
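The "queue of things waiting for approval" from the third question is a small data structure, which is part of why its absence is telling. A minimal in-memory sketch follows; `ApprovalQueue`, `Pending`, and the audit tuple layout are hypothetical, and a real system would persist all of this rather than hold it in a dict.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Pending:
    id: int
    action: str                    # e.g. "merge", "send_email", "deploy"
    summary: str                   # what the human is being asked to sign
    execute: Callable[[], object]  # the irreversible step, not yet run


class ApprovalQueue:
    def __init__(self) -> None:
        self._pending: Dict[int, Pending] = {}
        self._ids = itertools.count(1)
        self.audit: List[Tuple[int, str, str]] = []  # (id, decision, approver)

    def request(self, action: str, summary: str,
                execute: Callable[[], object]) -> int:
        """Park an irreversible action; nothing runs until a human approves."""
        item = Pending(next(self._ids), action, summary, execute)
        self._pending[item.id] = item
        return item.id

    def waiting(self) -> List[Pending]:
        return list(self._pending.values())

    def approve(self, item_id: int, approver: str) -> object:
        item = self._pending.pop(item_id)  # KeyError if unknown or already decided
        self.audit.append((item_id, "approved", approver))
        return item.execute()

    def reject(self, item_id: int, approver: str) -> None:
        self._pending.pop(item_id)
        self.audit.append((item_id, "rejected", approver))
```

The agent only ever calls `request`; the callable that performs the merge or send sits inert until `approve` is invoked by a named person, and `waiting()` is the list a vendor should be able to show on demand.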
## The line is the moat
The uncomfortable truth for anyone selling "we use the best model" is that model advantages evaporate every few months. The harness doesn't. The orchestration logic, the budget enforcement, the reviewer prompts hardened by months of real failures, the judgment about where the gates belong — that's accumulated engineering, and it compounds.
A century ago, the companies that won weren't the ones with access to assembly-line theory. Everyone had access to the theory. The winners were the ones who had actually run a line, broken it, and fixed it enough times to trust it.
I've been running these lines in production — the 28-hour autonomous build, the dark factory with adversarial review, multi-agent research swarms that compress 40 hours of work into 15 minutes. If you're trying to figure out where a harness fits in your business, the [AI Business Empire case study](/projects/ai-business-empire) shows one of these runs in detail, and I'm always glad to compare notes — [get in touch](/contact).