John McBride · 8 min read

Talking to Your Graph Database: GraphRAG in Production

Twelve iterations of natural-language-to-Cypher over a 174M-node Neo4j graph. What broke, what fixed it, and when graphs beat vector RAG.

Tags: graphrag, neo4j, llm, voice, enterprise-ai

For the past year and change I've been building a tool that lets business users ask a 174-million-node Neo4j graph questions in plain English. You type — or say — "show me the largest vendor bills from 2025," the model writes a Cypher query, the graph runs it, and the model explains the result back to you in a sentence instead of a table.

It took twelve-plus iterations to get from a single-file Streamlit prototype to a voice-enabled FastAPI and Next.js stack that answers reliably. Most of what I learned along the way isn't in the neo4j-graphrag docs, and almost none of it was about the model. It was about the schema.

## The setup

The graph combines data from an ERP, a financial planning system, and a learning platform — all merged into one enterprise knowledge graph. The data was rich. The problem was access. Answering any question required Cypher, Neo4j's query language, and the number of business users who write Cypher rounds to zero.

The pitch was simple: put an LLM between the user and the database. GPT-5 generates the Cypher, the official neo4j driver executes it, GPT-5 interprets the rows that come back. The neo4j-graphrag package (we're on 1.9.1) gives you the scaffolding for this pattern.
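To make the shape of that loop concrete, here is a minimal sketch of the three steps: generate, execute, interpret. It is not the production code and it does not use the neo4j-graphrag API. The function name `answer_question` and the three injected callables are hypothetical stand-ins for the LLM call, the driver call, and the second LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Answer:
    question: str
    cypher: str
    rows: list[Row]
    summary: str


def answer_question(
    question: str,
    schema_text: str,
    generate_cypher: Callable[[str, str], str],   # hypothetical: wraps the LLM call
    run_query: Callable[[str], list[Row]],        # hypothetical: wraps the database driver
    summarize: Callable[[str, list[Row]], str],   # hypothetical: wraps the second LLM call
) -> Answer:
    """Run one question through the generate -> execute -> interpret loop."""
    cypher = generate_cypher(question, schema_text)
    rows = run_query(cypher)
    summary = summarize(question, rows)
    return Answer(question=question, cypher=cypher, rows=rows, summary=summary)
```

Passing the three steps in as callables keeps the loop testable with stubs, and it makes clear that the only thing the model knows about the graph is whatever goes into `schema_text`.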

The first version worked in a demo and fell apart in real use. Here's why.

## Lesson one: the model can't query data it can't find

The schema had 200-plus possible fields across the node types. Many of them were empty. Nothing in the type definitions tells you which is which.

So the model would guess. Ask about product groups, and it would write a perfectly valid query against the product-group field on item records, which happened to be empty. The actual data lived on transaction lines. The result: valid Cypher, zero rows, and a confidently wrong answer of "no data found." The data was there the whole time.

This is the failure mode I'd warn anyone about before they start a text-to-Cypher project: an LLM will write syntactically beautiful queries against fields that contain nothing, and it will do so with total confidence. The query doesn't error. It just returns empty.

The fix was unglamorous: I annotated the schema. Every field that actually contains data got an explicit [POPULATED] marker, and the prompt tells the model to prefer those. We couldn't send the full database context — at 174 million nodes that's not a conversation — so the annotated schema had to do the teaching inside a small token budget.
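A rough sketch of what rendering an annotated schema into prompt text could look like. The `[POPULATED]` marker comes from the approach described above. Everything else here is an assumption for illustration: the `Field` class, the node labels, the property names, and the choice to list populated fields first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    populated: bool  # True if the field actually holds data in the graph


def render_schema(node_types: dict[str, list[Field]]) -> str:
    """Render a compact schema description, marking fields that contain data."""
    lines = []
    for label, fields in node_types.items():
        lines.append(f"Node {label}:")
        # Stable sort: populated fields first, original order otherwise preserved.
        for field in sorted(fields, key=lambda f: not f.populated):
            marker = " [POPULATED]" if field.populated else ""
            lines.append(f"  - {field.name}{marker}")
    return "\n".join(lines)


# Hypothetical node types mirroring the product-group trap described earlier.
SCHEMA = {
    "Item": [Field("product_group", False), Field("name", True)],
    "TransactionLine": [Field("product_group", True), Field("amount", True)],
}

schema_text = render_schema(SCHEMA)
```

With this toy schema, `product_group` appears unmarked under `Item` and marked under `TransactionLine`, which is the hint the model needs to look in the right place.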

## Lesson two: examples beat instructions

Annotation got us from "wrong field" to "right neighborhood." What got us to consistently correct was a curated library of 19 real query examples — actual business questions paired with the Cypher that answers them, using the fields that actually hold the data.

These aren't generic few-shot examples. Each one encodes a trap we'd already fallen into. The best illustration: in this graph, one vendor bill is many line-item nodes. A literal "count the bills" query counts line items and reports a number several times too high. The example library bakes in the DISTINCT and grouping logic, so the model copies a correct pattern instead of inventing a broken one.
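As an illustration of the idea, here is one way an entry in such a library and its prompt rendering might look. This is a sketch, not an excerpt from the real 19-example library. The `BillLine` label and the `bill_id` and `year` properties are hypothetical. The toy rows at the end show why the naive count overshoots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryExample:
    question: str
    cypher: str
    trap: str  # the mistake this example exists to prevent


# Hypothetical label and property names, for illustration only.
EXAMPLES = [
    QueryExample(
        question="How many vendor bills did we receive in 2025?",
        cypher=(
            "MATCH (l:BillLine) "
            "WHERE l.year = 2025 "
            "RETURN count(DISTINCT l.bill_id) AS bill_count"
        ),
        trap="One bill spans many line nodes; count distinct bill ids, not lines.",
    ),
]


def build_few_shot_block(examples: list[QueryExample]) -> str:
    """Format the example library as text to include in the generation prompt."""
    parts = [
        f"Question: {ex.question}\nNote: {ex.trap}\nCypher: {ex.cypher}"
        for ex in examples
    ]
    return "\n\n".join(parts)


# The trap on toy data: five line items that belong to only two bills.
LINES = [{"bill_id": "B1"}] * 3 + [{"bill_id": "B2"}] * 2
naive_count = len(LINES)                                    # counts line items
distinct_count = len({line["bill_id"] for line in LINES})   # counts bills
```

Storing the trap next to each example also documents why the example exists, which helps when someone later wonders whether it can be pruned.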

After the annotated schema plus the example library went in, our internal review showed a 100% success rate on the supported question types. Before that fix, the tool was a coin flip. The model never changed. The context did.

If you take one thing from this post, take that: in GraphRAG, your accuracy lives in the schema annotations and the example library, not in the model tier.

## Lesson three: you can push real analytics into the graph

Once basic querying was reliable, we got greedy. Finance users didn't just want lookups — they wanted projections.

It turns out you can express a surprising amount of analysis directly in Cypher. We built four revenue-projection algorithms — trend analysis and weighted moving average among them — as reusable Cypher patterns the model can adapt to a question. Ask for a six-month revenue projection and the computation happens in-graph, over live transaction data, no export to a spreadsheet, no separate analytics service.
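For readers who want the arithmetic behind a weighted moving average projection, here is a small Python illustration. The patterns described above are written in Cypher and run in-graph. This sketch only shows the calculation, and the weights and revenue figures are made up.

```python
def weighted_moving_average(values: list[float], weights: list[float]) -> float:
    """Weighted average of the most recent len(weights) values.

    The last weight applies to the most recent value.
    """
    if len(values) < len(weights):
        raise ValueError("need at least as many values as weights")
    window = values[-len(weights):]
    return sum(v * w for v, w in zip(window, weights)) / sum(weights)


def project(values: list[float], weights: list[float], periods: int) -> list[float]:
    """Project forward by feeding each forecast back in as the newest value."""
    history = list(values)
    forecasts = []
    for _ in range(periods):
        nxt = weighted_moving_average(history, weights)
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


# Made-up monthly revenue; the most recent month is weighted double.
monthly_revenue = [100.0, 120.0, 140.0]
forecast = project(monthly_revenue, weights=[1, 1, 2], periods=2)
# First step: (100*1 + 120*1 + 140*2) / 4 = 125.0
# Second step: (120*1 + 140*1 + 125*2) / 4 = 127.5
```

Because each forecast is fed back into the window, long horizons flatten toward a constant. That is one reason to offer this alongside trend-based methods rather than on its own.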

That's the part that made me a believer in the graph-native approach. The projection patterns sit in the same example library as everything else, so "project revenue for the next six months" is just another question the model knows how to translate.

## When graphs beat vector RAG

I build vector RAG systems too, so this isn't a tribal position. But after living with both, the dividing line is clear to me.

Vector RAG answers "find me passages about X." It's the right tool when the answer is *written down somewhere* and your job is to retrieve and summarize it. Documents, policies, support articles — embed them, search them, done.

Graph queries answer "compute me the truth about X." Top vendors by spend. Transactions that touch a specific account across two systems. A trend line over eighteen months of line items. There is no document containing those answers. They have to be computed from structured relationships, and similarity search has nothing to grab onto. Embedding your transaction records and hoping the right ones surface near a question is not analytics; it's astrology.

My rule now: if the question contains words like *largest*, *total*, *trend*, *across*, or *between*, you want a graph (or at least a real query engine) behind the conversation. If the question is "what does our policy say about X," vectors are cheaper and easier. Plenty of systems need both. But teams keep reaching for vector RAG on aggregation problems because it's familiar, and then they wonder why the numbers are wrong.
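That rule of thumb is simple enough to write down as code. The sketch below is a deliberately naive illustration of the heuristic, not a router I would ship as-is. The function name `pick_backend` and the two return labels are hypothetical.

```python
import re

# The cue words from the rule of thumb above.
AGGREGATION_CUES = {"largest", "total", "trend", "across", "between"}


def pick_backend(question: str) -> str:
    """Return "graph" if the question contains an aggregation cue, else "vector"."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return "graph" if words & AGGREGATION_CUES else "vector"
```

For example, `pick_backend("show me the largest vendor bills from 2025")` returns `"graph"`, while `pick_backend("What does our policy say about travel?")` returns `"vector"`. Exact word matching misses variants such as "totals" or "biggest", so a real router would need a broader vocabulary or a classifier.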

## Lesson four: voice changes who uses it

Around iteration eleven I added voice: the browser's Web Speech API for speech-to-text in, OpenAI's TTS reading the answer back out, with play and stop controls. Speak a question, the query auto-executes, the answer plays as audio.

I expected a gimmick. It wasn't. Typing a question into a query box still *feels* like using a database tool. Asking out loud and hearing an answer feels like talking to an analyst. The interaction cost drops to near zero, and that's exactly what the people who most need this — leaders who will never open a BI dashboard — respond to.

We've also been exploring OpenAI's Realtime API for full speech-to-speech, aiming at open-mic follow-up questions with conversational context. That's roadmap, not shipped, but the direction is obvious: the endgame for enterprise data access is a conversation, not a console.

## Lesson five: the prototype stack will hit a wall

Versions one through ten were Streamlit. I'll defend Streamlit for prototypes forever — it's how this thing existed at all. But by the time voice and real users entered the picture, it was the bottleneck.

Iteration twelve moved to a FastAPI backend with a Next.js and Tailwind frontend. Internal notes put it at roughly 10x faster than the Streamlit version. End-to-end, complex questions answer in about 3 to 5 seconds, including Cypher generation, and cost lands around two to five cents per query. Cheap enough that nobody thinks about it, fast enough that voice doesn't feel broken.

The boring hardening mattered too: credentials out of code and into environment variables, graceful handling when the database is unreachable, connection status surfaced in the UI. The whole thing runs against an internal database on a private network — nothing publicly exposed — which is the right default for a system that can query your financials.
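A minimal sketch of that hardening, under stated assumptions: the environment variable names (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) and the helper names are hypothetical, and `ping` stands in for whatever connectivity check the backend uses. With the official Python driver, the driver's `verify_connectivity` method is a plausible candidate.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class GraphConfig:
    uri: str
    user: str
    password: str


def load_config(env: Mapping[str, str] = os.environ) -> GraphConfig:
    """Read connection settings from the environment instead of source code."""
    required = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")  # assumed names
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return GraphConfig(env["NEO4J_URI"], env["NEO4J_USERNAME"], env["NEO4J_PASSWORD"])


def connection_status(ping: Callable[[], object]) -> dict[str, object]:
    """Report connectivity for the UI without letting a failure crash the request."""
    try:
        ping()
    except Exception as exc:  # any failure is reported as "not connected"
        return {"connected": False, "detail": f"{type(exc).__name__}: {exc}"}
    return {"connected": True, "detail": "ok"}
```

Returning a small status dictionary instead of raising lets the frontend show a connection indicator and a readable message when the database is unreachable.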

## Practical takeaways

If you're putting a natural-language layer on a graph database, here's the short version of twelve iterations:

- **Annotate your schema before you prompt-engineer anything.** Mark which fields actually contain data. Empty-field guessing is the number one silent failure in text-to-Cypher.
- **Build an example library from real questions.** Ours is 19 examples, each encoding a structural trap (line-item counting, fields living on unexpected node types). This was the single highest-leverage artifact in the project.
- **Treat empty results as a red flag, not an answer.** A query that returns nothing usually means the model picked the wrong field, not that the data doesn't exist.
- **Push computation into the graph.** Projections and aggregations as reusable Cypher patterns beat exporting data to compute elsewhere.
- **Use graphs for computed answers, vectors for written ones.** Aggregation questions don't belong in a vector store.
- **Prototype in Streamlit, ship on something else.** Plan the migration before users arrive, not after.
- **Try voice earlier than you think.** It's a different product with the same backend, and it reaches people dashboards never will.
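The empty-result takeaway in particular is easy to enforce in code. Here is a small sketch of a guard that sits between query execution and the interpretation step. The function name `check_rows` and the status labels are hypothetical.

```python
from typing import Any


def check_rows(rows: list[dict[str, Any]], cypher: str) -> dict[str, Any]:
    """Flag empty results as suspect instead of reporting "no data found"."""
    if rows:
        return {"status": "ok", "rows": rows}
    return {
        "status": "suspect_empty",
        "rows": [],
        "cypher": cypher,  # keep the query so a retry or a human can inspect it
        "message": (
            "The query ran but returned nothing. "
            "The data may live on a different field or node type."
        ),
    }
```

What happens on `suspect_empty` is a design choice: retry generation with a hint, show the query to the user, or log it for review. The point is only that zero rows should not flow straight into a confident answer.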

The model gets the headlines, but the work — the part that turned an unreliable demo into a tool people trust — was teaching it where the data actually lives. That part doesn't ship in any SDK.