John McBride · 9 min read

The 200-Millisecond Problem: Real-Time AI Coaching That Keeps Up With Conversation

Why speech coaching has to land inside a 200ms window — latency budgets, invisible overlay UX, emotion AI, and the design behind Verbal Victory.

Tags: voice-ai, realtime, latency, coaching, product-design

Human conversation runs on a turn-taking gap of roughly 200 milliseconds. Linguists have measured it across languages, and it holds up remarkably well: when you finish a sentence, the other person starts speaking about a fifth of a second later. We don't notice the rhythm until something violates it.

A coaching tool that wants to help you *while* you speak has to live inside that rhythm. That's the problem I've been working on with Verbal Victory, my AI communication coaching product, and it's a much harder problem than the one most tools in this category actually solve.

This post is about where the milliseconds go, why the user interface matters as much as the model, and a design lesson about habit that I didn't expect when I started.

## Coaching after the fact is the easy version

The version of Verbal Victory that exists today works like this: you pick a scenario — job interview, public speaking, conflict resolution, sales call, or a hard conversation with a partner — record yourself, and get a scored breakdown.

The pipeline is deliberately split in two. Local math runs instantly: words per minute, filler word counts, duration. Then Whisper transcribes the audio and a GPT model scores clarity and confidence and writes contextual feedback — three strengths, three improvements, three practice tips per session. Everything rolls up into a weighted 0–100 score: clarity 30%, confidence 30%, pace 20%, filler penalty 20%.
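As a rough illustration of that roll-up, here is a minimal sketch of the weighted score. The weights come from the paragraph above; everything else is an assumption for the example, including the `SubScores` and `overall_score` names and the idea that each component, the filler penalty included, is already expressed on a 0–100 scale where higher is better.

```python
from dataclasses import dataclass

# Weights stated in the post: clarity 30%, confidence 30%, pace 20%, filler 20%.
WEIGHTS = {"clarity": 0.30, "confidence": 0.30, "pace": 0.20, "filler": 0.20}


@dataclass
class SubScores:
    """Four component scores, each assumed to be on a 0-100 scale."""
    clarity: float
    confidence: float
    pace: float
    filler: float  # assumption: 100 means no fillers, lower means a heavier penalty


def overall_score(s: SubScores) -> int:
    """Combine the components into one 0-100 number using the stated weights."""
    total = (
        WEIGHTS["clarity"] * s.clarity
        + WEIGHTS["confidence"] * s.confidence
        + WEIGHTS["pace"] * s.pace
        + WEIGHTS["filler"] * s.filler
    )
    # Clamp so out-of-range inputs cannot push the result outside 0-100.
    return round(max(0.0, min(100.0, total)))
```

With component scores of 80, 70, 60, and 50, the weighted sum is 24 + 21 + 12 + 10, so `overall_score(SubScores(80, 70, 60, 50))` comes out at 67. How pace and filler counts map onto their 0–100 components is not covered here and would need its own rule.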

This works. It costs somewhere between $0.15 and $0.50 in AI spend per analyzed session, it's fast enough that nobody complains, and the feedback is genuinely useful.

But it has a structural flaw that no amount of model quality fixes: the feedback arrives after the behavior is over. You learn you said "um" fourteen times in a recording you can't change. Post-hoc analysis tells you *that* you rush when nervous. It can't stop you mid-rush.

That gap is what the real-time coach — the next phase, designed in detail and now in front of me to build — is meant to close.

## The 200-millisecond problem

Here's the thing about a coaching cue: it's only useful while the behavior is still happening.

If you're speeding up and a "slow down" prompt appears two seconds later, you're already three sentences past the moment. Worse, a late cue actively damages trust — the tool feels like a backseat driver commenting on a turn you already made. In my testing of post-hoc tools generally, late feedback gets ignored fast, and ignored feedback might as well not exist.

So the design target for the live coach is an end-to-end loop under 200 milliseconds: audio leaves your mouth, gets streamed and analyzed, and a cue lands while the sentence is still in your mouth. That's not a nice-to-have number. It's the boundary between "coach" and "critic."
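One way to make that boundary concrete is a freshness check: a cue that cannot land inside the budget is dropped rather than shown late. The sketch below is hypothetical (the `Cue` type, `is_fresh`, and `deliver` are illustration names, not part of any existing codebase), and it assumes each cue carries the capture time of the audio that triggered it.

```python
from dataclasses import dataclass

BUDGET_MS = 200.0  # end-to-end target from the post


@dataclass(frozen=True)
class Cue:
    text: str             # e.g. "slow down"
    audio_time_ms: float  # when the triggering audio was captured


def is_fresh(cue: Cue, now_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    """True if the cue can still be shown inside the latency budget."""
    return (now_ms - cue.audio_time_ms) <= budget_ms


def deliver(cues: list[Cue], now_ms: float) -> list[str]:
    """Return only the cues worth showing; stale ones are dropped silently."""
    return [c.text for c in cues if is_fresh(c, now_ms)]
```

For example, at `now_ms=1300` a cue captured at 1000 ms is 300 ms old and gets dropped, while one captured at 1150 ms is 150 ms old and is delivered. Dropping a late cue is a design choice in this sketch: it follows the argument above that a late cue costs more trust than a missing one.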

The market agrees this is where the value is. Yoodli hit a $300M+ valuation in December 2025 on the strength of speech coaching, and Deepgram acquired Poised specifically for sub-300ms real-time coaching technology. The companies getting funded and acquired are the ones that solved the latency, not the ones with the prettiest report card.

## Budgeting the milliseconds

When your whole budget is 200ms, you account for every hop. The architecture I've designed runs streaming audio over a persistent WebSocket to the OpenAI Realtime API — request-response HTTP is disqualified before you start, because connection overhead alone would blow the budget.

The pipeline splits feedback into two channels with very different costs.

**Visual cues are the fast path.** A live transcript streams in with filler words highlighted as they happen, and a prompter system flashes short directives: slow down, speed up, on track, filler alert, be specific, more energy. Rendering a colored chip on screen costs almost nothing, so visual feedback gets the tightest loop. This is most of the coaching.
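To show how cheap the fast path can be, here is a minimal sketch of a cue detector that runs on timestamped words from a streaming transcript. The cue strings match the prompter directives above; the filler list, the 10-second window, the 3-second warm-up, and the words-per-minute thresholds are all placeholder assumptions, not values from the actual design.

```python
from collections import deque

FILLERS = {"um", "uh", "like", "basically"}  # hypothetical filler list
WINDOW_S = 10.0                              # hypothetical rolling window
WARMUP_S = 3.0                               # hypothetical minimum span before judging pace
SLOW_WPM, FAST_WPM = 110.0, 170.0            # hypothetical pace thresholds


class CueDetector:
    """Turns a stream of timestamped words into prompter cues using arithmetic only."""

    def __init__(self) -> None:
        self._times: deque[float] = deque()

    def add_word(self, word: str, t: float) -> str:
        """Feed one transcribed word (t in seconds) and get the current cue."""
        self._times.append(t)
        # Forget words that have fallen out of the rolling window.
        while self._times and t - self._times[0] > WINDOW_S:
            self._times.popleft()

        if word.lower().strip(".,!?") in FILLERS:
            return "filler alert"

        span = t - self._times[0]
        if span < WARMUP_S:
            return "on track"  # not enough signal yet to judge pace

        # N words in the window span N-1 gaps between them.
        wpm = (len(self._times) - 1) / span * 60.0
        if wpm > FAST_WPM:
            return "slow down"
        if wpm < SLOW_WPM:
            return "speed up"
        return "on track"
```

Each call does a deque append, a few comparisons, and one division, which is why this kind of check fits on the hot path without a model call.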

**Spoken cues are the expensive path.** Sometimes a voice in your ear beats a flash on your screen — especially when your eyes are on an audience. But text-to-speech adds a whole synthesis step to the loop, which is why the TTS vendor choice is a latency decision first and a voice-quality decision second. The two candidates in my design docs: Cartesia at roughly 40ms to first audio, and ElevenLabs Flash v2.5 at roughly 75ms. Both fit the budget. A standard TTS endpoint at 400ms+ does not, full stop, no matter how nice it sounds.

Notice what that math means: synthesis at 40–75ms still consumes between a fifth (40/200) and nearly two-fifths (75/200) of the entire budget. Everything else (capture, transport, analysis, the decision of whether to cue at all) has to fit in what's left. That's why the cue logic itself stays simple. Pace and filler detection are arithmetic on a streaming transcript, not a model call. You spend model time only where arithmetic can't go.
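The budget arithmetic is simple enough to write down. This sketch uses the time-to-first-audio figures quoted above and the 200ms target; the `remaining_budget` helper is a hypothetical name, and the split of the leftover time between capture, transport, and analysis is deliberately left open because the post does not specify it.

```python
BUDGET_MS = 200

# Time-to-first-audio figures quoted in the post.
TTS_FIRST_AUDIO_MS = {
    "Cartesia": 40,
    "ElevenLabs Flash v2.5": 75,
    "standard TTS endpoint": 400,
}


def remaining_budget(tts_ms: int, budget_ms: int = BUDGET_MS) -> tuple[float, int]:
    """Return (share of the budget used by synthesis, ms left for everything else)."""
    return tts_ms / budget_ms, budget_ms - tts_ms


for vendor, ms in TTS_FIRST_AUDIO_MS.items():
    share, left = remaining_budget(ms)
    verdict = "fits" if left > 0 else "over budget"
    print(f"{vendor}: {share:.1%} of budget, {left} ms left ({verdict})")
```

Running it shows Cartesia using 20.0% of the budget with 160 ms left, ElevenLabs Flash v2.5 using 37.5% with 125 ms left, and a 400ms endpoint using 200.0%, which is over budget before anything else has run.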

## The overlay problem: coaching your audience never sees

Latency is half the problem. The other half is where the coaching appears.

Picture the highest-stakes use case: you're presenting on a video call, screen shared. A coaching panel floating on your screen is now floating on *everyone's* screen. Every existing tool I evaluated either ignores this scenario or quietly becomes unusable in it. Feedback you can only receive in private practice never reaches the moments that matter most.

The design answer is a desktop overlay that's excluded from screen capture. On Windows, this is a real OS facility: a Tauri app can call `setContentProtected(true)`, which maps to `SetWindowDisplayAffinity` with `WDA_EXCLUDEFROMCAPTURE`. The overlay renders on your physical display but is removed from what Zoom, Teams, OBS, and Discord capture. You see "slow down." Your audience sees your slides.
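For readers who want to poke at the underlying OS call outside Tauri, here is a minimal sketch that calls `SetWindowDisplayAffinity` directly through `ctypes`. This is not the Tauri code path, just the same Win32 function reached from Python. The `exclude_from_capture` wrapper is a hypothetical name, and the sketch assumes you already have a handle to a top-level window owned by the calling process, which the Win32 function requires.

```python
import sys

# Flag value documented for SetWindowDisplayAffinity in the Win32 API.
WDA_EXCLUDEFROMCAPTURE = 0x00000011


def exclude_from_capture(hwnd: int) -> bool:
    """Ask Windows to leave the window `hwnd` out of screen capture.

    Returns True on success, False on failure or on non-Windows platforms.
    """
    if sys.platform != "win32":
        return False  # the facility only exists on Windows

    import ctypes
    from ctypes import wintypes

    user32 = ctypes.WinDLL("user32", use_last_error=True)
    user32.SetWindowDisplayAffinity.argtypes = [wintypes.HWND, wintypes.DWORD]
    user32.SetWindowDisplayAffinity.restype = wintypes.BOOL
    return bool(user32.SetWindowDisplayAffinity(hwnd, WDA_EXCLUDEFROMCAPTURE))
```

A `False` return is worth logging rather than ignoring, since a failed call means the overlay would be visible to whatever is capturing the screen.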

The feasibility research I finished this month is honest about the edges, and I think the honesty is the interesting part. Windows 10 is the best case, at about 70% device reliability across the configurations tested. Windows 11 is partial. macOS 15+ is broken by design: Apple removed the exclusion path. A Chrome extension version is flatly impossible inside the browser sandbox. So the proof of concept is scoped as Windows-first, one week, before any further commitment.

I'd rather build on a documented 70% than an assumed 100%. Most overlay-style products in this space don't tell you which platforms silently leak your coaching panel into the recording. Knowing exactly where the floor is *is* the engineering.

## Emotion is a signal the transcript throws away

One more piece of the live dashboard is worth calling out: an emotion tracker, planned on Hume AI.

A transcript is a lossy format. "I'm confident this will work" reads identically whether you said it with conviction or with an audible wobble. Whisper hands the language model clean text, and everything your voice was actually doing — strain, hesitation, flatness — is gone before scoring begins.

Emotion analysis puts that channel back. In the live dashboard design, it sits alongside the clarity gauge, confidence meter, filler word cloud, and pace graph — not as a gimmick widget, but because vocal tone is frequently the gap between what speakers think they projected and what listeners heard. People are reliably surprised by their own audio. Giving them an in-the-moment reading of *how* they sound, not just *what* they said, is the feedback a human coach provides by instinct and a transcript can't provide at all.

## The score is what makes people come back

Here's the lesson I didn't fully appreciate until I watched people use the scoring system: the number matters more than the prose.

The GPT-written feedback is richer. It's specific, it's contextual, it names exactly which answer rambled. But the 0–100 score is what people remember, and more importantly, it's what people try to beat. A 71 creates an itch that "you used several filler words" never does.

That's the real argument for gamification in a practice product, and it has nothing to do with badges. Communication skills only improve through repetition, and repetition is exactly what people skip — practicing a hard conversation alone in a room is awkward, and unstructured practice gives no sense of progress. The score restructures the activity. Each scenario has difficulty progression, every session lands in your history, and the progress tracking turns "I should practice more" into "I want to get my conflict-resolution score above 80."

The loop is the product: practice produces a score, the score produces a reason to practice again. Real-time coaching makes each session better. The score is what produces the next session.

## What I'd tell you to steal

If you're building anything real-time with AI, the transferable lessons from this project:

- **Set the latency budget before the architecture.** Ours is 200ms because that's the measured rhythm of human conversation. Your number may differ, but pick it first — it disqualifies entire architectures (request-response HTTP, standard TTS endpoints) before you waste a sprint on them.
- **Split feedback into fast and slow channels.** Visual cues from streaming arithmetic are nearly free; spoken cues cost 40–75ms of synthesis even with the fastest vendors (Cartesia ~40ms, ElevenLabs Flash ~75ms). Spend the expensive channel only where it beats the cheap one.
- **Keep model calls off the hot path.** Pace and filler detection are math on a transcript. Reserve the model for what math can't do, and let the deep analysis run post-session where seconds are free.
- **Treat delivery context as a feature.** The screen-capture-proof overlay is a UX decision, not an AI decision — and it's the most defensible thing in the design. Ask where your output appears, and who else can see it.
- **Write down where your approach fails.** Windows 10 at ~70%, macOS not at all, browser extension impossible. That paragraph of honesty is worth more than a roadmap of maybes, to investors and to your own planning.
- **Give users a number to beat.** Rich feedback informs; a score retains. If your product depends on repeated use, the scoring system is core architecture, not polish.

The model vendors will keep shaving milliseconds off synthesis and transcription. The durable work is everything around the model: the budget, the channels, the overlay, the score. That's the part that decides whether the coaching actually changes how someone speaks.