From 40 Hours to 15 Minutes: Multi-Agent Research That Actually Verifies Itself

Last August I pointed nine AI agents at a single Irish village: Dromahair, County Leitrim. By the end of the run I had a 4,021-word heritage narrative, a verification report covering 247 individual facts, and an audit trail connecting every claim in the story back to a source.

A quick word on the headline. Researching one Irish townland properly (official monument records, academic papers, folklore archives, parish newsletters) takes a historian days of manual cross-referencing. Call it a working week; the 40 hours is my estimate, not a measured benchmark. The agent pipeline runs in about 15 minutes. Those are the two numbers I actually care about: days of skilled human time compressed into a coffee break, without giving up the part historians are rightly precious about. Sourcing.

That last part is the whole post. Anyone can chain agents together to generate text fast. The interesting engineering problem is making the output trustworthy enough to publish.

## The problem: nobody reads all of it

Irish local history lives in fragments. The official record sits in buildingsofireland.ie and archaeology.ie. Folklore sits in duchas.ie and the National Folklore Collection. Placename etymology sits in logainm.ie. Academic work hides behind journal walls. Community memory lives in blogs and parish newsletters.

Nobody reads all of it for one village. So what gets published about places like Dromahair is thin, uncited, and frequently wrong. Verified fact, oral tradition, and outright guesswork get blended into the same confident paragraph. Folklore either gets dropped entirely or flattened into cliche because it's been cut off from the historical record that gives it context.

This is not unique to Irish heritage. It's the shape of almost every research problem: the truth is distributed across sources that disagree with each other, and the cost of cross-referencing them all by hand is so high that nobody does it.

## The architecture: three tiers, nine agents

I built the system as a three-tier pipeline of Claude Code subagents. Each agent is a markdown file defining a role, a restricted tool set, and an output contract. The tiers hand off through files on disk, which matters more than it sounds — I'll come back to that.

**Tier 1 — Source Discovery.** Four agents launch in parallel, each owning one slice of the source landscape: official heritage databases, local community sources, folklore archives, and academic research. Each returns ten vetted URLs as structured JSON. Splitting discovery by source type instead of by topic was deliberate — a folklore specialist agent knows to search duchas.ie, and an official-records agent knows the National Monuments Service exists. A generalist agent finds the same five tourism pages everyone else finds.
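The fan-out can be sketched in a few lines of Python. This is an illustration only: the real agents are Claude Code subagents defined in markdown, so `run_discovery_agent`, the `SOURCE_TYPES` labels, and the result shape are all hypothetical stand-ins.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# The four source slices named above; the labels themselves are hypothetical.
SOURCE_TYPES = ["official", "community", "folklore", "academic"]


def run_discovery_agent(source_type: str, place: str) -> dict:
    """Stand-in for one Tier 1 agent.

    A real agent would search its slice of the web and return vetted URLs.
    This stub only returns an empty result of the assumed shape.
    """
    return {"agent": f"{source_type}-discovery", "place": place, "sources": []}


def discover(place: str) -> dict:
    # Fan out: one worker per source type, all running concurrently.
    with ThreadPoolExecutor(max_workers=len(SOURCE_TYPES)) as pool:
        results = pool.map(lambda s: run_discovery_agent(s, place), SOURCE_TYPES)
        return dict(zip(SOURCE_TYPES, results))


inventory = discover("Dromahair")
print(json.dumps(inventory, indent=2))
```

Keying the inventory by source type keeps each specialist's output separate, so later tiers can tell an official record from a community blog.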

**Tier 2 — Content Extraction.** Three agents work through what Tier 1 found. One does a rapid survey across 20+ sources to map what's out there. One does deep extraction from the high-value sources. The third does nothing but cross-referencing — its only job is finding places where sources confirm or contradict each other.
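To make the cross-referencing role concrete, here is a minimal sketch of the core check, assuming each extracted claim has been reduced to a `subject`, a `value`, and a `source`. Those field names and the three status labels are assumptions for illustration, not the pipeline's actual schema.

```python
from collections import defaultdict


def cross_reference(claims):
    """Group claims by subject and report whether sources agree or conflict."""
    values_by_subject = defaultdict(lambda: defaultdict(set))
    for claim in claims:
        values_by_subject[claim["subject"]][claim["value"]].add(claim["source"])

    report = {}
    for subject, values in values_by_subject.items():
        source_count = sum(len(sources) for sources in values.values())
        if len(values) > 1:
            status = "contradicted"   # sources give different values
        elif source_count > 1:
            status = "confirmed"      # several sources, one value
        else:
            status = "single-source"  # nothing to compare against
        report[subject] = {
            "status": status,
            "values": {value: sorted(srcs) for value, srcs in values.items()},
        }
    return report


# Placeholder claims, not data from the Dromahair run.
claims = [
    {"subject": "subject-a", "value": "value-1", "source": "source-x"},
    {"subject": "subject-a", "value": "value-1", "source": "source-y"},
    {"subject": "subject-b", "value": "value-2", "source": "source-x"},
    {"subject": "subject-b", "value": "value-3", "source": "source-z"},
]
report = cross_reference(claims)
print(report["subject-a"]["status"], report["subject-b"]["status"])
```

Exact-match comparison on `value` is the simplest possible rule. Real sources phrase the same fact differently, so an agent doing this job has to normalize claims before comparing them.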

**Tier 3 — Verification and Synthesis.** Two agents. A fact-verifier cross-references every claim from Tier 2 and assigns it a confidence score. Only then does a story-synthesizer write the narrative, working exclusively from the verified fact base.

Every stage writes its output to disk: a `1-Discovery/` folder of source JSON, a `2-Processing/` folder of extractions, a `3-Final/` folder with the narrative and verification reports. When something in the final story looks wrong, I can open the intermediate artifacts and watch the claim move through the pipeline. Agents that pass context invisibly, inside one long conversation, give you no such forensics. File-based handoffs turn the whole research trail into something you can audit after the fact.
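A rough sketch of what that audit looks like in code, using the three folder names above. The artifact file names, the `fact-001` and `src-01` identifiers, and the JSON fields are hypothetical, and the substring search is a deliberately crude stand-in for a proper ID lookup.

```python
import json
import tempfile
from pathlib import Path

TIERS = ["1-Discovery", "2-Processing", "3-Final"]


def write_artifact(run_dir: Path, tier: str, name: str, payload: dict) -> Path:
    """Write one tier's output as JSON so later stages read from disk."""
    folder = run_dir / tier
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def trace(run_dir: Path, needle: str) -> list:
    """List every artifact mentioning `needle`, in pipeline order."""
    hits = []
    for tier in TIERS:
        folder = run_dir / tier
        if not folder.is_dir():
            continue
        for path in sorted(folder.glob("*.json")):
            if needle in path.read_text(encoding="utf-8"):
                hits.append(f"{tier}/{path.name}")
    return hits


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    write_artifact(run_dir, "1-Discovery", "official",
                   {"sources": [{"id": "src-01"}]})
    write_artifact(run_dir, "2-Processing", "extraction",
                   {"facts": [{"id": "fact-001", "source_ids": ["src-01"]}]})
    write_artifact(run_dir, "3-Final", "verification",
                   {"facts": [{"id": "fact-001", "band": "PROBABLE"}]})
    fact_trail = trace(run_dir, "fact-001")
    source_trail = trace(run_dir, "src-01")

print(fact_trail)    # ['2-Processing/extraction.json', '3-Final/verification.json']
print(source_trail)  # ['1-Discovery/official.json', '2-Processing/extraction.json']
```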

## Confidence scoring: an explicit formula, not vibes

The fact-verifier doesn't get to say a claim "seems reliable." It scores every fact on a 0-to-1 scale using a deterministic formula. Five or more agreeing sources earn a 0.9 base score. Confirmation from an official source adds 0.1. A contradiction anywhere in the source set subtracts 0.1. A fact supported only by blogs loses 0.2.
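Written as a function, the stated rules look roughly like this. Two things here are assumptions rather than part of the formula as described: the base score for facts with fewer than five agreeing sources (the `low_source_base` placeholder) and the clamping of the result to the 0-to-1 range.

```python
def confidence(
    agreeing_sources: int,
    official: bool = False,
    contradicted: bool = False,
    blogs_only: bool = False,
    low_source_base: float = 0.5,  # ASSUMPTION: placeholder, not from the post
) -> float:
    """Score one fact using the adjustments described above."""
    # Five or more agreeing sources earn the 0.9 base score.
    score = 0.9 if agreeing_sources >= 5 else low_source_base
    if official:
        score += 0.1  # confirmed by an official source
    if contradicted:
        score -= 0.1  # contradicted somewhere in the source set
    if blogs_only:
        score -= 0.2  # supported only by blogs
    # ASSUMPTION: keep the result inside the 0-to-1 range.
    return round(min(1.0, max(0.0, score)), 2)


print(confidence(5, official=True))      # 1.0
print(confidence(6, contradicted=True))  # 0.8
print(confidence(5, blogs_only=True))    # 0.7
```

Because every adjustment is a fixed number, two runs over the same sources give the same score, which is what makes the verification report reviewable.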

The score lands each fact in one of five bands: VERIFIED, PROBABLE, POSSIBLE, DOUBTFUL, or REJECTED.

Those bands aren't decorative. They directly control the language the synthesis agent is allowed to use. A verified fact gets "the castle was built." A probable one gets "evidence indicates." Folklore gets "local folklore tells of," with attribution to the National Folklore Collection. The narrative's prose never claims more certainty than the verification layer granted, because the writer literally isn't permitted to.
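One way to enforce that rule is a lookup from band to permitted sentence frame, sketched below. The band cut-offs in `BAND_FLOORS`, the POSSIBLE wording, and the choice to keep DOUBTFUL and REJECTED facts out of the narrative entirely are assumptions for illustration; only the VERIFIED, PROBABLE, and folklore phrasings come from the description above.

```python
from typing import Optional

# ASSUMPTION: illustrative cut-offs; the actual band boundaries are not stated.
BAND_FLOORS = [(0.9, "VERIFIED"), (0.7, "PROBABLE"), (0.5, "POSSIBLE"), (0.3, "DOUBTFUL")]


def band(score: float) -> str:
    """Map a 0-to-1 confidence score to one of the five bands."""
    for floor, name in BAND_FLOORS:
        if score >= floor:
            return name
    return "REJECTED"


TEMPLATES = {
    "VERIFIED": "{claim}.",
    "PROBABLE": "Evidence indicates that {claim}.",
    "POSSIBLE": "It is possible that {claim}.",  # ASSUMPTION: wording not from the post
}
FOLKLORE_TEMPLATE = "Local folklore tells of {claim} (National Folklore Collection)."


def phrase(claim: str, band_name: str, folklore: bool = False) -> Optional[str]:
    """Return the sentence the writer may use, or None if the fact is excluded."""
    if folklore:
        sentence = FOLKLORE_TEMPLATE.format(claim=claim)
    elif band_name in TEMPLATES:
        sentence = TEMPLATES[band_name].format(claim=claim)
    else:
        return None  # ASSUMPTION: DOUBTFUL and REJECTED never reach the narrative
    return sentence[0].upper() + sentence[1:]


print(phrase("the castle was built", "VERIFIED"))  # The castle was built.
print(phrase("the castle was built", "PROBABLE"))  # Evidence indicates that the castle was built.
```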

Contradictions get resolved, not hidden. When two sources disagree, the verifier weighs them by authority, picks a position, and documents the reasoning in the report. Gaps get logged too — the things the pipeline looked for and couldn't establish. A research system that won't tell you what it doesn't know isn't doing research.
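A minimal sketch of authority-weighted resolution with gap logging follows. The `AUTHORITY` ranking, the source-type labels, and the report fields are hypothetical; the description above only says that sources are weighed by authority and that the reasoning is documented.

```python
# ASSUMPTION: illustrative ranking; the real weighting is not specified.
AUTHORITY = {"official": 4, "academic": 3, "folklore_archive": 2, "community": 1}


def resolve(subject, candidates):
    """Pick a position for one subject and record why, or log a gap."""
    if not candidates:
        return {
            "subject": subject,
            "status": "gap",
            "position": None,
            "reasoning": "Searched for but not established by any source.",
        }
    # Highest-authority source first; ties keep their original order.
    ranked = sorted(candidates, key=lambda c: AUTHORITY[c["source_type"]], reverse=True)
    chosen = ranked[0]
    conflicting = [c for c in ranked[1:] if c["value"] != chosen["value"]]
    reasoning = f"Preferred {chosen['source_type']} source"
    if conflicting:
        losers = ", ".join(sorted({c["source_type"] for c in conflicting}))
        reasoning += f" over conflicting {losers} source(s)"
    return {
        "subject": subject,
        "status": "resolved" if conflicting else "uncontested",
        "position": chosen["value"],
        "reasoning": reasoning + ".",
    }


# Placeholder candidates, not data from the Dromahair run.
decision = resolve("subject-b", [
    {"value": "value-2", "source_type": "community"},
    {"value": "value-3", "source_type": "official"},
])
gap = resolve("subject-c", [])
print(decision["position"], "|", decision["reasoning"])
print(gap["status"])
```

Returning a gap record instead of raising an error is the design point: the absence of evidence becomes a line in the report rather than a silent omission.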

## Why verification agents beat generation agents

Here's the opinion the whole project converges on: in a multi-agent research system, the verification agents are worth more than the generation agents.

Generation is the easy part now. Any frontier model will hand you 4,000 fluent words about an Irish village on request. The words will be confident, well-structured, and partly wrong, and you won't be able to tell which part. That failure mode is exactly why most organizations still don't trust AI research output — not because the writing is weak, but because there's no way to know which sentences to believe.

The Dromahair numbers show what verification actually does. The pipeline evaluated 247 facts. It verified 156 of them at high confidence. The rest were downgraded, flagged as contradicted, or rejected outright. That leaves 91 facts, a little over a third of the candidate fact base, that never made it into the narrative as settled truth. A single-pass generation system would have stated those claims flat-out, in the same confident voice as everything else.
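The arithmetic behind that fraction, using only the two figures above:

```python
evaluated = 247  # facts the pipeline evaluated
verified = 156   # facts verified at high confidence

not_high_confidence = evaluated - verified
share = not_high_confidence / evaluated

print(f"{not_high_confidence} of {evaluated} = {share:.1%}")  # 91 of 247 = 36.8%
```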

The overall reliability score for the pilot came out at 0.87, and the point isn't whether 0.87 is good. The point is that the number exists. A reader — or a tourism board's legal reviewer — can see exactly how much of the story is bedrock and how much is attributed tradition. Trust comes from the system showing its work, and only a dedicated verification layer produces that work to show.

So when I design these pipelines now, I budget agent roles backwards from verification. The cross-referencing agent in Tier 2 and the fact-verifier in Tier 3 exist purely to challenge what the other agents found. One in three of my extraction agents generates nothing. That ratio felt extravagant when I sketched it. After seeing 91 of 247 facts fail to reach high confidence, it feels like the minimum.

## The same pattern, beyond Ireland

The pilot delivered the full pipeline end to end: source inventories from all four discovery agents, survey and deep extractions, the 247-fact verification report, and a final narrative that came in at 4,021 words against a 2,000-word minimum — covering etymology, medieval history, three properly attributed folklore traditions, and practical visiting information.

But the discover → extract → verify → synthesize pattern isn't about heritage. I've since reused the same tiered architecture in enterprise market-intelligence work for the mobility sector, where parallel discovery agents verify 160+ sources per research cycle and a full cycle completes in 5 to 10 minutes. Different domain, different stakes, same skeleton: fan discovery out in parallel, run extraction and cross-referencing as separate concerns, and put a scoring gate between research and writing.

Hidden Ireland is the public, low-sensitivity demonstration of that capability. The village makes a better story than the spreadsheet, but the architecture is identical.

## The takeaway: build the skeptic first

If you're building a multi-agent research system, here's the order of operations I'd argue for.

**Start with the verifier, not the writer.** Define your confidence formula before you write a single synthesis prompt. If you can't state, deterministically, what makes a fact trustworthy in your domain, the rest of the pipeline is decoration.

**Separate discovery, extraction, and verification into agents that can't see each other's reasoning.** An agent that found a source is a bad judge of that source. The cross-referencer should encounter claims cold.

**Make handoffs file-based and auditable.** Every tier writes structured artifacts to disk. When a claim in the output looks wrong, you trace it backwards in minutes instead of re-running the whole pipeline and hoping.

**Let confidence scores control the output language.** Banded verdicts that map to specific phrasing — "was built" versus "evidence indicates" versus "folklore tells of" — are the difference between a report people can act on and one they have to independently re-check.

**Log gaps and contradictions as first-class outputs.** The verification report is a deliverable, not a byproduct. In most professional contexts it's the deliverable that earns the trust.

The speed is what gets attention: days of historian work running while you make coffee. But speed without verification just means producing wrong answers faster. The agents that made Dromahair work weren't the ones writing the story. They were the ones allowed to downgrade or reject a third of it.