John McBride · 8 min read
Inside a Dark Factory: What Autonomous Software Delivery Actually Looks Like
A walk through a real autonomous delivery pipeline: transcript to PRD to working code, with budget kill switches and a human on the merge button.
Tags: ai-agents, automation, software-delivery, engineering
In manufacturing, a "dark factory" is a plant that runs with the lights off. No people on the floor, so nobody needs to see. I borrowed the name for something I built this year: a software and content factory where AI agents do the production work end to end, and humans show up for exactly two kinds of decision: what gets built (the request and the plan approval) and what ships (the merge).
I want to walk through how it actually works, because most writing about autonomous agents is either hype or doom, and almost none of it describes the plumbing. The plumbing is where the interesting decisions live.
## The trigger: nobody files a ticket
The factory's main input isn't a backlog. It's meeting transcripts.
A scheduled job polls recent transcripts every fifteen minutes. A classifier agent reads each one and asks a narrow question: is there actionable work in here? Someone said "we really need a one-pager comparing these two vendors" or "can we get a tool that does X" — that's actionable. General discussion is not. The classifier triages, and anything that looks like real work becomes a job in a Postgres-backed queue.
There's also a direct intake path — a simple request form where anyone can describe what they want built. Those land in the same queue. Either way, the work originates in plain language from people who were never going to write a spec.
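Here's a rough sketch of how both intake paths could normalize into one job record. The `TriageResult` and `QueueJob` shapes and both helper functions are hypothetical names for illustration, not the factory's actual schema, and it assumes form requests skip the classifier.

```typescript
// Hypothetical shapes: field names are illustrative assumptions,
// not the factory's real schema.
type IntakeSource = "transcript" | "request_form";

interface TriageResult {
  actionable: boolean; // did the classifier find real work?
  summary: string;     // one-line description of the requested work
}

interface QueueJob {
  source: IntakeSource;
  summary: string;
  status: "queued";
}

// Transcript snippets only become jobs when the classifier flags them.
function fromTranscript(triage: TriageResult): QueueJob | null {
  if (!triage.actionable) return null;
  return { source: "transcript", summary: triage.summary.trim(), status: "queued" };
}

// Assumption: form requests skip classification, since someone asked directly.
function fromRequestForm(description: string): QueueJob {
  return { source: "request_form", summary: description.trim(), status: "queued" };
}
```

Whatever the real schema looks like, the useful property is that everything downstream sees one job shape and never needs to know where the request came from.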
That's the first design decision worth stealing: meet the demand where it already exists. People state what they need out loud in meetings constantly. They almost never file tickets.
## From intent to spec: the PRD agent
Once a job is triaged as real, a PRD writer agent picks it up. It has web search, so it can ground the spec in current information rather than whatever the model remembers. It produces a product requirements document plus briefs for any deliverables — a deck, a report, a tool, whatever the request implies.
This is the step people underestimate. The hard part of automating delivery isn't the code generation. It's turning a half-sentence from a meeting into something specific enough to build against. Giving that job to a dedicated agent, one whose only output is a reviewable document, means a human can sanity-check the plan before any expensive work starts. That check is mandatory: the factory commits no resources to a plan until a human has approved it.
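The approval gate can be as small as a single field that only a human action sets. This is a minimal sketch under that assumption; the `Plan` shape and the function names are hypothetical.

```typescript
// Hypothetical plan record; field names are assumptions for illustration.
interface Plan {
  jobId: string;
  prdMarkdown: string;
  approvedBy: string | null; // set only when a human signs off
}

// Production may start only for a plan that a named human has approved.
function canStartProduction(plan: Plan): boolean {
  return plan.approvedBy !== null && plan.approvedBy.trim().length > 0;
}

// Returns a new, approved copy of the plan; refuses an empty approver name.
function approve(plan: Plan, approver: string): Plan {
  if (approver.trim().length === 0) throw new Error("approver is required");
  return { ...plan, approvedBy: approver.trim() };
}
```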
## Production: producers and builders
After approval, the work splits by type.
Producer agents handle content deliverables. They have tools for saving reports, generating slide decks, and generating images. A request for "a summary deck on this topic" goes in; a finished deck comes out.
The builder agent is the one that writes actual software. This is where isolation matters. Every build job runs in its own git worktree on a dedicated `factory/*` branch. The agent writes code, registers any new components, and runs the TypeScript compiler before it's allowed to call the work done. It cannot touch main. It cannot touch another job's worktree. The blast radius of a bad build is one disposable branch.
Worktrees turned out to be the right primitive here. Containers would work too, but worktrees are cheap, fast, and native to the workflow the humans already use. When a job finishes, you have a branch you can diff, review, and throw away if it's garbage.
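To make the isolation concrete, here's a sketch that builds the git and compiler invocations for one job. It only constructs argument lists and leaves execution to whatever process runner you use. The `factory/<jobId>` naming follows the branch pattern above; the worktree root path, the job ID format, and the use of `npx tsc --noEmit` as the type-check gate are assumptions.

```typescript
// Builds the git/tsc invocations for one isolated build job.
// Assumptions: the worktree root, the job ID format, and `npx tsc --noEmit`
// as the gate are illustrative, not the factory's actual configuration.
interface BuildCommands {
  branch: string;
  worktreePath: string;
  create: string[];    // argv: create the worktree on a new branch off main
  typecheck: string[]; // argv: run inside the worktree before calling it done
  cleanup: string[][]; // argv lists: discard the job's output entirely
}

function buildCommands(jobId: string, root = "/tmp/factory"): BuildCommands {
  // Reject anything that could escape the factory/* namespace or the root dir.
  if (!/^[a-z0-9][a-z0-9-]*$/.test(jobId)) {
    throw new Error(`unsafe job id: ${jobId}`);
  }
  const branch = `factory/${jobId}`;
  const worktreePath = `${root}/${jobId}`;
  return {
    branch,
    worktreePath,
    create: ["git", "worktree", "add", "-b", branch, worktreePath, "main"],
    typecheck: ["npx", "tsc", "--noEmit"],
    cleanup: [
      ["git", "worktree", "remove", "--force", worktreePath],
      ["git", "branch", "-D", branch],
    ],
  };
}
```

Validating the job ID before it reaches a command line matters more than it looks: the branch name and the directory path are both derived from it, so a job ID like `../main` has to fail before anything runs.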
## The adversarial reviewer
Here's the part I'd argue is non-negotiable: a separate reviewer agent whose entire job is to attack the output.
Not "review" in the polite sense. The reviewer is prompted adversarially — hunt for security problems, hunt for things that would break existing functionality, hunt for scope creep, hunt for hallucinated details like model identifiers that don't exist in our registry. It emits a structured JSON verdict: approve, fix, or reject.
Why a separate agent? Because the builder grading its own homework doesn't work. A model that just wrote code is anchored on that code. A fresh context with an explicitly hostile mandate finds problems the builder genuinely cannot see. It's the same reason human code review exists, and it transfers to agents almost perfectly.
A "fix" verdict sends the job back with specific findings. A "reject" kills it. Only "approve" moves anything toward a human.
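A sketch of how the pipeline side might parse and route that verdict. The three verdict values come from the description above; the `findings` and `reason` field names, the step names, and the choice to treat malformed reviewer output as a rejection are assumptions.

```typescript
// Field names besides `verdict` are assumptions for illustration.
type Verdict =
  | { verdict: "approve" }
  | { verdict: "fix"; findings: string[] }
  | { verdict: "reject"; reason: string };

type NextStep = "human_merge_queue" | "back_to_builder" | "killed";

const rejected = (reason: string): Verdict => ({ verdict: "reject", reason });

// Parse the reviewer's raw output. Anything malformed becomes a rejection,
// so a confused reviewer can never wave work through (a design assumption).
function parseVerdict(raw: string): Verdict {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_e) {
    return rejected("reviewer output was not valid JSON");
  }
  if (typeof data !== "object" || data === null) {
    return rejected("reviewer output was not an object");
  }
  const obj = data as Record<string, unknown>;
  const kind = obj["verdict"];
  if (kind === "approve") return { verdict: "approve" };
  if (kind === "fix") {
    const rawFindings = obj["findings"];
    const findings = Array.isArray(rawFindings)
      ? rawFindings.filter((f): f is string => typeof f === "string")
      : [];
    // A fix with nothing to fix gives the builder no way to act on it.
    return findings.length > 0 ? { verdict: "fix", findings } : rejected("fix verdict without findings");
  }
  if (kind === "reject") {
    const reason = obj["reason"];
    return rejected(typeof reason === "string" ? reason : "unspecified");
  }
  return rejected("unrecognized verdict");
}

function nextStep(v: Verdict): NextStep {
  switch (v.verdict) {
    case "approve": return "human_merge_queue";
    case "fix": return "back_to_builder";
    case "reject": return "killed";
  }
}
```

The point of the discriminated union is that the pipeline never reads reviewer prose to decide what happens next. It reads one field, and everything it can't recognize fails closed.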
## Money: the kill switches
Autonomous systems fail in a specific, expensive way: they loop. An agent gets stuck, retries, burns tokens, and nobody notices until the invoice does. So budget enforcement isn't a feature of the factory — it's the foundation.
Every job type carries a hard cap, both in dollars and in agent turns. A PRD job gets one dollar and twenty turns. A content deliverable gets a dollar fifty and twenty-five turns. A code build gets four dollars, sixty turns, and a twenty-minute wall clock limit. Cheap jobs stay cheap by construction; an expensive job has to be expensive on purpose.
The enforcement detail that matters: the runtime prices every single message itself, in-stream, and interrupts the agent the moment a cap trips. It never waits for the agent SDK to report cost after the fact, because after the fact is exactly when the damage is already done. Mid-stream interruption is the difference between a four-dollar failure and a four-hundred-dollar one.
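Here's a minimal sketch of in-stream metering. The cap values are the ones quoted above. The per-token prices are placeholders, and it assumes one streamed message counts as one turn and that each message carries its own token usage.

```typescript
type JobType = "prd" | "content" | "build";

interface Cap {
  maxUsd: number;
  maxTurns: number;
  maxWallClockMs?: number;
}

// Cap values as quoted in the text; everything else here is an assumption.
const CAPS: Record<JobType, Cap> = {
  prd: { maxUsd: 1.0, maxTurns: 20 },
  content: { maxUsd: 1.5, maxTurns: 25 },
  build: { maxUsd: 4.0, maxTurns: 60, maxWallClockMs: 20 * 60 * 1000 },
};

// Placeholder per-token prices in USD; substitute your provider's real rates.
const USD_PER_INPUT_TOKEN = 3 / 1e6;
const USD_PER_OUTPUT_TOKEN = 15 / 1e6;

interface MessageUsage {
  inputTokens: number;
  outputTokens: number;
}

type Trip = "usd_cap" | "turn_cap" | "wall_clock" | null;

function createMeter(cap: Cap, startedAtMs: number) {
  let spentUsd = 0;
  let turns = 0;
  return {
    // Call once per streamed message; a non-null result means "interrupt now".
    record(usage: MessageUsage, nowMs: number): Trip {
      spentUsd +=
        usage.inputTokens * USD_PER_INPUT_TOKEN +
        usage.outputTokens * USD_PER_OUTPUT_TOKEN;
      turns += 1;
      if (spentUsd >= cap.maxUsd) return "usd_cap";
      if (turns >= cap.maxTurns) return "turn_cap";
      if (cap.maxWallClockMs !== undefined && nowMs - startedAtMs >= cap.maxWallClockMs) {
        return "wall_clock";
      }
      return null;
    },
    spentUsd: () => spentUsd,
  };
}
```

One caveat with this shape: the wall-clock check only fires when a message arrives, so an agent that hangs silently would also need a separate timer to catch it.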
Above the per-job caps sits a global pause — one switch that halts the entire factory. And above that, an alerting path that notifies a human when something trips. Three layers: per-job limits, a master kill switch, and a pager.
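The three layers can collapse into one decision made on every message. This is a sketch with hypothetical names; it assumes the global pause is flipped by a human (so it doesn't page anyone) while a tripped per-job cap always does.

```typescript
type Decision =
  | { action: "continue" }
  | { action: "interrupt"; reason: string; page: boolean };

// Checked on every streamed message and before any job is dequeued.
// `jobTrip` is whatever the per-job budget check reported, or null.
function decide(globalPaused: boolean, jobTrip: string | null): Decision {
  // Layer 2: the master switch wins over everything else.
  if (globalPaused) return { action: "interrupt", reason: "global_pause", page: false };
  // Layers 1 and 3: a tripped per-job cap stops the job and alerts a human.
  if (jobTrip !== null) return { action: "interrupt", reason: jobTrip, page: true };
  return { action: "continue" };
}
```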
If you build one of these and skip the budget layer, you haven't built a factory. You've built a slot machine.
## The control room
The factory has a dashboard — kill switch front and center, budget burn-down, a live job board streaming status over server-sent events, the approval queue, and health checks on the scheduled jobs. I made it deliberately phone-friendly, because the whole point of an autonomous system is that you're not at your desk when it's running. If you can't pause the factory from your phone in the parking lot, the kill switch is decorative.
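For the job board, the wire format is the simple part. Here's a sketch of serializing one status update as a server-sent event frame; the `JobUpdate` fields and the `job` event name are assumptions.

```typescript
// Hypothetical update shape; the dashboard's real payload may differ.
interface JobUpdate {
  jobId: string;
  status: string;
  spentUsd: number;
}

// Serialize one update as a server-sent event frame. JSON.stringify without
// an indent argument emits no raw newlines, so the payload fits one data line.
function toSseFrame(update: JobUpdate, eventId: number): string {
  return `id: ${eventId}\nevent: job\ndata: ${JSON.stringify(update)}\n\n`;
}
```

The server writes these frames to a response with `Content-Type: text/event-stream`, and the browser side listens with `new EventSource(url).addEventListener("job", handler)`. The blank line at the end of each frame is what tells the client the event is complete.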
One more piece of self-maintenance worth mentioning: the factory audits its own model registry on a weekly schedule. Model IDs drift — providers rename things, deprecate things — and a stale identifier is a silent production failure. So an agent checks the registry against reality once a week. The factory does maintenance work on itself, through the same pipeline, with the same caps.
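The core of that audit is a set difference. In the factory an agent does the checking; the sketch below swaps in a plain deterministic comparison to show the idea, with a hypothetical `RegistryEntry` shape and made-up model IDs. Where the list of currently available IDs comes from is left to the caller.

```typescript
// Hypothetical registry shape for illustration.
interface RegistryEntry {
  alias: string;   // the name the rest of the code refers to
  modelId: string; // the provider-facing identifier that can drift
}

// Returns registry entries whose model ID is missing from the provider's
// current list, i.e. the ones that would fail silently in production.
function findStaleEntries(
  registry: RegistryEntry[],
  availableIds: Iterable<string>,
): RegistryEntry[] {
  const available = new Set(availableIds);
  return registry.filter((entry) => !available.has(entry.modelId));
}
```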
## Why humans stay on the merge button
Nothing the factory produces ships without a person clicking approve. Every plan, every merge. That's not a temporary training-wheels phase I'm planning to remove. It's the design.
The reasoning is simple. Agents are good at producing work and surprisingly good at critiquing work, but they're bad at owning consequences. The adversarial reviewer catches most defects; it catches zero questions of judgment. Should this exist? Does it conflict with something the agents can't see — a policy, a politics, a plan that lives in someone's head? That's the merge button's job.
And the economics support it. The expensive part of delivery was never the final review — it was everything before it. If agents compress spec-writing, building, and first-pass review down to minutes and a few dollars, the human approval step stops being a bottleneck and becomes the cheapest insurance you can buy. Thirty seconds of human judgment gating four dollars of autonomous work is a trade I'll take every time.
So the lights are off on the factory floor, but there's still a person at the loading dock checking every pallet before the truck leaves. That's the actual shape of autonomous delivery in 2026 — not "no humans," but humans repositioned to the two points where they're irreplaceable: deciding what's worth building, and deciding what's safe to ship.
## If you're building one: where to start
A few things I'd tell anyone wiring up their own version:
- **Build the budget enforcement first**, before the agents get interesting. Per-job dollar and turn caps, priced in-stream, with mid-stream interruption. Add a global kill switch. Everything else can be ugly at the start; this can't.
- **Separate the builder from the reviewer.** Different agents, different contexts, and prompt the reviewer to be hostile. Make it return a structured verdict — approve, fix, reject — so the pipeline can act on it without a human parsing prose.
- **Isolate every build.** Git worktrees on throwaway branches are cheap and give you diffable, disposable output. Never let an agent write where production lives.
- **Take input where people already express it** — transcripts, chat, a dead-simple request form — and let a classifier agent do the triage. The spec-writing agent turns vague intent into a reviewable plan before money gets spent.
- **Keep a human on every plan approval and every merge.** Not because the agents are bad, but because judgment and accountability don't automate. The factory makes that gate cheap; it shouldn't make it optional.
Start with one job type, one budget cap, and one approval queue. The factory gets dark one light at a time.