•9 min read•John McBride
Context Engineering: Why What You Feed the Model Matters More Than the Model
Million-token context windows are here, and stuffing them is still a mistake. Lessons from production agents on retrieval, memory, and compaction.
context-engineeringai-agentsragllmarchitecture
Every few months a new model ships with a bigger context window, and every few months I watch teams make the same mistake: they treat the window like a bucket and fill it to the brim. Entire codebases. Every past conversation. The full document library, just in case.
It feels safe. The model can technically see everything, so nothing can be missed, right?
In practice, the opposite happens. The agents I've shipped that work reliably are the ones that see the *least* — the smallest set of tokens that's actually relevant to the current step. The ones that struggle are the ones drowning in their own context. After a couple of years of building agent systems in production, I've come to believe context engineering is the discipline that separates demos from systems. The model matters. What you feed it matters more.
## The bucket fallacy
A million-token context window does not mean a million tokens of *attention*. Models degrade as context grows: instructions buried in the middle of a long prompt get ignored, earlier constraints get overridden by later noise, and the model starts pattern-matching against irrelevant material you included "just in case."
I see this constantly in agent work. Give a coding agent the whole repository and it will confidently modify a file that has nothing to do with the task, because something in that file looked related. Give it the three files that actually matter plus a one-paragraph map of the rest, and it behaves.
There's a second problem that has nothing to do with model quality: you pay for every token, every turn. That's where the economics get interesting.
## The economics nobody runs
An agent conversation isn't a single API call. Every turn — every tool use, every observation — replays the accumulated context back through the model. If your agent carries 200,000 tokens of stuffed context and takes forty turns to finish a job, you're not paying for 200,000 tokens. You're paying for something closer to forty passes over them.
Prompt caching softens this, but it has rules people ignore. Cache entries expire on short TTLs — minutes, not hours — so an agent that pauses for a slow tool call or waits on a human can come back to a cold cache and re-pay full price for the entire prefix. And caches work on prefixes: change anything early in the context and everything after it is invalidated. An agent that "helpfully" rewrites its own system context mid-run torches its cache every time.
When I build agent pipelines now, I price the context before I price the model. In one multi-agent pipeline I built for an enterprise client, every job type has a hard dollar budget and a turn cap, and the runtime computes spend per message and interrupts mid-stream when a cap trips. The single biggest lever on those budgets was never the model tier. It was how much context each agent carried per turn. Trimming what an agent re-reads on every step did more for cost than any model downgrade we tested — without giving up capability.
## Progressive disclosure: load the index, not the library
The pattern that changed how I build agents is progressive disclosure: give the model a cheap index of what exists, and let it pull detail only when it needs it.
Concretely, that looks like layers:
- **Level 1:** one line per capability or document. A name and a sentence. Always loaded.
- **Level 2:** a short summary file, loaded when the agent decides that capability is relevant.
- **Level 3:** the full material — source files, complete docs, raw data — fetched on demand through tools.
This is how good agent harnesses already work internally, and it's worth copying in your own systems. An agent with fifty tools described in full detail burns thousands of tokens before it does anything; an agent with a fifty-line tool index and a "describe tool" call burns almost nothing until it commits to a path. Same with knowledge: the agent reads the table of contents for free, then spends tokens on the two chapters that matter.
The discipline is resisting the urge to "save a round trip" by preloading level 3. Every time I've done that, I've paid for it twice — once in tokens, once in degraded focus.
## Retrieval versus long context
So when do you retrieve and when do you just include?
My working rule: long context is for material the model needs *coherently and completely* — a contract being analyzed clause by clause, a single codebase module being refactored, a transcript being summarized. Retrieval is for everything the model needs *occasionally and partially* — reference docs, history, prior decisions, the other 95% of the corpus.
The math backs this up. A retrieval call costs you an embedding lookup and a few hundred tokens of results. Stuffing the same corpus into context costs you the whole corpus, re-read on every turn, with cache expiry as a recurring tax. Retrieval also scales where context doesn't: your document library grows forever; your context window doesn't, and your attention budget grows even slower than that.
The mistake I see in the other direction: teams that chunk a 30-page document into retrieval fragments and then wonder why the model misses cross-references. If the task is "understand this document," give it the document. Retrieval is not a religion. It's a cost and relevance tool.
## Memory is a database problem, not a prompt problem
Long-running systems need memory, and the worst way to build it is appending everything to an ever-growing prompt.
Two patterns have worked for me in production.
**File-based memory with an index.** The agent maintains a small index file — a few lines, one per topic — plus separate topic files it reads only when relevant. Across sessions, it loads the index for pennies and pulls detail on demand. It's progressive disclosure applied to the agent's own history, and it's almost embarrassingly simple: plain markdown files, no infrastructure. For a single agent serving one team, this gets you 80% of what a fancy memory system promises.
**Entity extraction into a graph.** For richer recall, I extract structure at write time instead of paying for comprehension at read time. In a memory application I built for families dealing with Alzheimer's, every uploaded memory passes through a small, fast model that pulls out entities — people, places, approximate dates, relationships — and writes them into Postgres alongside pgvector embeddings of the text. Recall queries then hit a structured graph plus a vector search and return a handful of relevant memories, instead of dumping a family's entire history into the prompt and hoping. The cheap model does the extraction once; the expensive model never has to read everything.
That write-time/read-time trade is the heart of memory design. Pay a small cost when information arrives, and you avoid a large cost every single time it's needed.
## Compaction: agents need to forget on purpose
Any agent that runs long enough will fill its window. What you do at that moment defines whether it's a long-running system or a slow-motion failure.
Naive truncation — drop the oldest messages — loses exactly the wrong things: the original task definition and early constraints. What works is deliberate compaction: periodically summarize the transcript into a structured checkpoint (task, decisions made, current state, open items, files touched) and restart the context from that checkpoint plus the system prompt.
In the agent pipeline I mentioned earlier, I went a step further and avoided long transcripts altogether. Each stage — classification, spec writing, building, review — starts with a fresh context and a written brief from the previous stage, not the previous stage's conversation. The brief is the compaction. An adversarial review agent gets a scoped diff and a spec, never the builder's full meandering transcript. That isolation made the reviewer noticeably sharper, for the same reason a human code reviewer does better with a clean PR description than with a screen recording of the author's afternoon.
## Harnesses live or die on this
Here's the broader point. An agent harness — the scaffolding around the model that manages tools, memory, budgets, and control flow — is mostly a context management machine. Strip away the branding and the core questions of any harness are: what goes into the window this turn, what gets summarized, what gets fetched, and what gets dropped.
Every production failure I've debugged in agent systems traces back to one of those questions answered badly. The agent that ignored an instruction had it buried under 80,000 tokens of retrieved noise. The agent that blew its budget was re-reading a cold-cached corpus forty times. The agent that "forgot" its task had truncated the wrong end of its history.
None of those are model problems. A bigger model makes each of them more expensive, not better.
## What I'd tell a CTO
If you're evaluating an AI initiative right now, the questions that predict success aren't about which model is on the leaderboard this month. They're:
- What's the minimum context each step actually needs, and who decided that?
- What does one full agent run cost, including every re-read, and what enforces the cap?
- Where does long-lived knowledge live — in prompts, or in storage the agent queries?
- What happens when the context fills up?
Teams with crisp answers ship systems that survive contact with production. Teams without them ship demos that get quietly retired.
Models will keep getting bigger, and that's fine — more headroom is useful. But headroom rewards discipline; it doesn't replace it. The teams winning with AI in 2026 aren't the ones with access to the biggest window. Everyone has that. They're the ones deliberate about every token that goes through it.
If you're working through these decisions for your own systems, the [projects](/projects) page shows how this thinking plays out in real builds — or [get in touch](/contact) and we can talk through your architecture directly.