John McBride · 8 min read

# We Let AI Agents Build an App for 28 Hours Straight. Here Is What Happened.

Twelve AI agents, 236 Puppeteer tests, a full-stack Next.js app, zero human coding. What broke, what held, and what I'd change in the harness.

Tags: agents, automation, engineering, case-study

I wanted an answer to a question that demos never answer: if you take the human out of the loop completely — no prompting, no nudging, no quietly fixing things between runs — can a team of AI agents build a real application?

Not a todo list. A full-stack app with authentication, a PostgreSQL database, API endpoints, and a test suite. Production patterns, not happy-path code.

So I built a harness, gave twelve specialized agents their assignments, kicked off the run, and let it go for 28 hours straight. Zero human coding. I didn't touch a file.

The run finished with a complete Next.js application: 15,000+ lines of TypeScript, auth flows, Prisma talking to Postgres, working API routes, and 236 automated Puppeteer tests covering the whole thing.

That's the headline. The useful part is everything underneath it — what broke, what held, and what I now do differently when I design these systems.

## The setup

Twelve agents, each with a fixed role. Not twelve copies of the same chatbot — twelve narrow jobs:

- A **planning agent** that architected the application structure and the build roadmap before any code existed.
- **Coding agents** split across the frontend, backend, and database layers, each working its own slice.
- A **testing agent** that owned the Puppeteer suite — writing tests, running them, and reporting failures back into the queue.
- A **DevOps agent** handling deployment, monitoring, and infrastructure.
- **QA agents** running continuous integration on everything the coding agents produced.
- A **documentation agent** writing technical docs and user guides as the app took shape.

The orchestration ran on the Claude Agent SDK with Claude 4.5 doing the reasoning. Every agent had a bounded job, a defined set of tools, and a workspace it was allowed to write to. Work flowed through a queue, not a group chat. No agent ever decided for itself what to do next; the harness decided, and the agent executed.
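To make "a queue, not a group chat" concrete, here is a minimal sketch of harness-side dispatch. It is not the code from the run: the `Job` shape, the role names, and the `JobQueue` class are all hypothetical, and a real harness would persist the queue rather than hold it in memory.

```typescript
// Hypothetical sketch: the harness, not the agent, decides what runs next.
type Role = "planning" | "frontend" | "backend" | "database" | "testing" | "qa";

interface Job {
  id: number;
  role: Role;          // which kind of agent may execute this job
  description: string;
  dependsOn: number[]; // ids of jobs that must be finished first
}

class JobQueue {
  private pending: Job[] = [];
  private done: number[] = [];

  enqueue(job: Job): void {
    this.pending.push(job);
  }

  // Deterministic rule: the oldest queued job for this role whose
  // dependencies are all finished. The agent never picks its own work.
  nextJob(role: Role): Job | undefined {
    for (let i = 0; i < this.pending.length; i++) {
      const job = this.pending[i];
      const ready = job.dependsOn.every((id) => this.done.indexOf(id) !== -1);
      if (job.role === role && ready) {
        this.pending.splice(i, 1);
        return job;
      }
    }
    return undefined;
  }

  markDone(jobId: number): void {
    this.done.push(jobId);
  }
}

const queue = new JobQueue();
queue.enqueue({ id: 1, role: "planning", description: "Draft the roadmap", dependsOn: [] });
queue.enqueue({ id: 2, role: "backend", description: "Add the signup route", dependsOn: [1] });
```

In this sketch a backend agent asking for work gets nothing until the planning job is marked done, so the ordering lives in code instead of in any agent's judgment.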

That distinction sounds bureaucratic. It's the entire reason the run survived 28 hours.

## The test suite was the real manager

Of everything in the system, the 236 Puppeteer tests did the most work per dollar.

I chose browser-level tests deliberately. A coding agent can write unit tests that pass against its own wrong assumptions — same author, same blind spots, green checkmarks all the way down. It's much harder to fake a headless browser actually completing a signup flow, clicking through the UI, and hitting a real API route backed by a real database.

Puppeteer became the one source of feedback the agents couldn't argue with. A model will defend bad code in fluent, confident prose. It cannot negotiate with a failing browser test. The test fails, the failure goes back in the queue, an agent picks it up, and the cycle repeats until the suite is green.

By the end of the run, that loop — build, test in a real browser, feed failures back — had executed continuously for over a day. Every one of the 236 tests passed. Not because the agents were careful, but because the harness made passing the only way out.
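The loop itself is small enough to sketch. This is an illustration under assumptions, not the harness from the run: `runSuite` stands in for launching the browser tests, `fixFailure` stands in for queueing an agent job, and both are injected and synchronous here so the example stays self-contained (a real version would be async).

```typescript
// Hypothetical sketch of the build, test, requeue loop.
interface TestFailure {
  testName: string;
  message: string;
}

interface LoopResult {
  green: boolean;
  rounds: number;
}

function runUntilGreen(
  runSuite: () => TestFailure[],
  fixFailure: (failure: TestFailure) => void,
  maxRounds: number,
): LoopResult {
  for (let round = 1; round <= maxRounds; round++) {
    const failures = runSuite();
    if (failures.length === 0) return { green: true, rounds: round };
    // Every failure goes back to the agents as a work item; nothing is waived.
    for (const failure of failures) fixFailure(failure);
  }
  // Out of rounds: report red instead of looping forever.
  return { green: false, rounds: maxRounds };
}

// Stand-ins: two broken tests, and an "agent" that repairs whatever it is handed.
let broken: string[] = ["signup flow", "login redirect"];
const loopResult = runUntilGreen(
  () => broken.map((testName) => ({ testName, message: "assertion failed" })),
  (failure) => {
    broken = broken.filter((name) => name !== failure.testName);
  },
  5,
);
```

With these stand-ins the first round reports two failures, both are fixed, and the second round comes back green. The only exits are a green suite or an exhausted round limit.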

## What broke

Plenty broke. That was the point of running it this long.

**Agents don't crash. They persist.** This is the failure mode that surprises people most. A confused agent doesn't throw an error and stop — it keeps going, confidently, down a doomed path. Left alone, it will burn hours and real money producing plausible-looking work that's wrong at the foundation. Software fails loud; agents fail polite. The fix wasn't a better prompt. It was hard caps — turns, cost, wall-clock time — enforced by the harness, with the authority to cut an agent off mid-task.
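A hard cap can be as plain as the sketch below. The field names, the limits, and the `runJob` wrapper are assumptions for illustration, not the run's actual configuration. The clock is injected so the behavior is deterministic.

```typescript
// Hypothetical budget guard: the harness checks it before every agent turn.
interface Budget {
  maxTurns: number;
  maxCostUsd: number;
  maxWallClockMs: number;
}

interface Usage {
  turns: number;
  costUsd: number;
  startedAtMs: number;
}

// Returns the reason the job must stop, or null if it may continue.
function exceededLimit(budget: Budget, usage: Usage, nowMs: number): string | null {
  if (usage.turns >= budget.maxTurns) return "turn limit reached";
  if (usage.costUsd >= budget.maxCostUsd) return "cost limit reached";
  if (nowMs - usage.startedAtMs >= budget.maxWallClockMs) return "wall-clock limit reached";
  return null;
}

function runJob(
  budget: Budget,
  takeTurn: () => { costUsd: number; finished: boolean }, // one agent step
  now: () => number,
): string {
  const usage: Usage = { turns: 0, costUsd: 0, startedAtMs: now() };
  while (true) {
    const reason = exceededLimit(budget, usage, now());
    if (reason !== null) return "cut off: " + reason;
    const turn = takeTurn();
    usage.turns += 1;
    usage.costUsd += turn.costUsd;
    if (turn.finished) return "finished";
  }
}

// A confused agent that never finishes and spends $0.50 per turn.
const budgetOutcome = runJob(
  { maxTurns: 10, maxCostUsd: 2, maxWallClockMs: 60000 },
  () => ({ costUsd: 0.5, finished: false }),
  () => Date.now(),
);
```

Here the agent never reports completion, so the cost cap is what ends the job. This sketch only checks between turns; interrupting a turn that is already in flight would need cancellation support in whatever runs the agent, which is outside what this example shows.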

**Shared state is poison.** Early on, agents working in the same space would step on each other — one refactoring a module while another wrote code against the old version. Neither agent did anything wrong by its own lights. The harness had to enforce isolation: each job in its own workspace, changes integrated through a controlled path instead of a free-for-all.

**Self-review is worthless.** Asking an agent to check its own output is asking a student to grade their own exam. The blind spot that produced the bug excuses the bug. The only reviews that caught real problems came from agents that had no part in producing the work — the QA agents and the test suite, with different prompts and a different mandate.

**Failures compound unless you roll them back.** A bad change that survives becomes the foundation for the next three changes, and now you're unwinding a tower instead of reverting a commit. The harness treated failed work as disposable: roll back, requeue, try again clean. Repair-in-place sounds efficient. Over 28 hours, it's how small errors become structural ones.
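As a rough sketch of "roll back, requeue, try again clean": each attempt works on a scratch copy of the last accepted state, and a failed attempt is simply dropped. The in-memory `Files` map is a stand-in I'm assuming for illustration; a real harness would more likely discard a workspace or reset a branch.

```typescript
// Hypothetical sketch: failed work is discarded, never patched in place.
type Files = Record<string, string>; // path -> contents, standing in for a workspace

interface RollbackOutcome {
  files: Files;
  attemptsUsed: number;
  succeeded: boolean;
}

function attemptWithRollback(
  accepted: Files,
  attempts: Array<(scratch: Files) => void>,
  verify: (files: Files) => boolean,
): RollbackOutcome {
  let used = 0;
  for (const change of attempts) {
    used += 1;
    // Work on a copy, so rolling back is just throwing the copy away.
    const scratch: Files = { ...accepted };
    change(scratch);
    if (verify(scratch)) return { files: scratch, attemptsUsed: used, succeeded: true };
    // Verification failed: the next attempt starts from the accepted state again.
  }
  return { files: accepted, attemptsUsed: used, succeeded: false };
}

const acceptedFiles: Files = { "schema.prisma": "model User {}" };
const rollbackOutcome = attemptWithRollback(
  acceptedFiles,
  [
    (scratch) => {
      scratch["route.ts"] = "broken";
      scratch["debug.ts"] = "leftover";
    },
    (scratch) => {
      scratch["route.ts"] = "ok";
    },
  ],
  (files) => files["route.ts"] === "ok",
);
```

The second attempt succeeds, and the result contains no trace of the first attempt's `debug.ts`. That is the property repair-in-place gives up: the retry never inherits leftovers from the failure.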

None of these are model problems. A smarter model fails the same ways, just more fluently. They're system problems, and they have system fixes.

## What worked

**Narrow roles beat general intelligence.** Twelve specialists, each with a small bounded job, outperformed every generalist setup I'd tried. A frontend agent that only does frontend doesn't wander into the schema. The planning agent's roadmap gave every other agent a contract to build against instead of a vibe to interpret.

**Bounded jobs kept context fresh.** Long-running agents drift — the longer the session, the further the output wanders from the original intent. Short jobs with clear completion criteria meant every agent started sharp and finished before it could lose the plot.

**Verification as a separate power.** Producers produced, checkers checked, and the two never shared a prompt. The testing and QA agents had no stake in the code being good — only in finding where it wasn't. That adversarial split is what let me sleep while the thing ran.
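One way to enforce that split in code rather than by convention is to have the harness assign reviewers and refuse self-review outright. This is a sketch under assumptions: the `Agent` and `WorkItem` shapes and the round-robin choice are hypothetical, not taken from the run.

```typescript
// Hypothetical sketch: the harness never lets a producer review its own work.
interface Agent {
  id: string;
  kind: "producer" | "checker";
}

interface WorkItem {
  id: number;
  producedBy: string; // id of the agent that wrote the change
}

function assignReviewer(item: WorkItem, agents: Agent[]): Agent {
  // A checker can also produce work (for example, tests it wrote itself),
  // so compare ids as well as kinds.
  const eligible = agents.filter(
    (agent) => agent.kind === "checker" && agent.id !== item.producedBy,
  );
  if (eligible.length === 0) {
    // No independent checker means no review. Fail loudly instead of
    // quietly falling back to self-review.
    throw new Error("no independent reviewer for work item " + item.id);
  }
  // Spread the review load deterministically across eligible checkers.
  return eligible[item.id % eligible.length];
}

const team: Agent[] = [
  { id: "frontend", kind: "producer" },
  { id: "qa-1", kind: "checker" },
  { id: "qa-2", kind: "checker" },
];
const reviewer = assignReviewer({ id: 3, producedBy: "frontend" }, team);
```

The throw is the important line. A harness that degrades to self-review when no checker is free has only the appearance of a checker.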

**Determinism in the harness, not the model.** Every decision that mattered — what runs next, who reviews it, when to stop, when to roll back — lived in plain code. The model brought capability. The harness brought predictability. You need both, and they don't live in the same place.

## The 28 hours were the easy part

Here's the part that doesn't fit in a headline: the run itself was uneventful. I checked in, watched the job queue churn, read test reports, and went about my day. The drama was all in the weeks before, designing the harness — deciding the roles, the budgets, the isolation rules, the feedback loops.

That inversion is the real finding. With chat-assisted coding, the human effort is spent during the work: prompt, read, correct, repeat. With a harnessed run, the effort moves upstream into system design, and the run becomes boring. Boring is the goal. Boring is what unattended means.

It also reframes what you're actually building. The application was the output, but the asset is the harness. The app took 28 hours. The harness is reusable for the next run, and the one after that, and it gets better every time something breaks and I fix the system instead of the symptom.

## Takeaways if you're building one of these

If you want to attempt unattended agent runs — even short ones — here's what this experiment earned the hard way:

1. **Build the verification layer first.** Before any agent writes production code, you need tests the agents can't game. Browser-level end-to-end tests are worth the setup cost; they're the one judge that doesn't take the model's word for anything.
2. **Cap everything in code.** Turns, dollars, and time are capped per job and enforced by the runtime, with the power to interrupt mid-task. An agent loop without hard limits isn't autonomous; it's unsupervised.
3. **Never let a producer review itself.** Separate agents, separate prompts, no shared stake in the outcome. If your checker helped make the thing, you don't have a checker.
4. **Isolate every job.** Own workspace, controlled integration. Two agents in one sandbox is a race condition with a language model attached.
5. **Roll back, don't repair.** Failed work gets discarded and requeued. Patching on top of a bad foundation is how one wrong turn becomes a wrong architecture.
6. **Spend your effort on the harness.** The model is rented and replaceable. The orchestration, the budgets, the review structure — that's the part you own, and the part that compounds.

The agents wrote every line of the app. But the thing I'm proudest of is the part with no AI in it at all: the plain, deterministic machinery that kept twelve non-deterministic workers productive for 28 hours without me in the room.

The full breakdown — metrics, architecture, agent roles — is in the [AI Business Empire case study](/projects/ai-business-empire). And if you're weighing an autonomous build for your own team, [get in touch](/contact) — I'm happy to talk through where a harness fits and where it doesn't.