9 min read · John McBride

# When AI Can See Your Screen and Hear Your Voice: Realtime Multimodal in 2026

Realtime voice and vision APIs changed the interface contract. Latency budgets, privacy design, and the business workflows that light up first.

Tags: voice-ai, multimodal, realtime, ai-strategy, privacy

For most of the last three years, working with AI meant typing into a box and waiting. You composed a prompt, the model composed a reply, and the two of you took turns like chess players. That contract is gone.

The realtime APIs that matured over the past eighteen months — speech-to-speech models that listen and talk natively, vision models that accept a live stream of frames — support a different kind of interface. The assistant is present while you work. It hears you mid-sentence. It sees the document you're scrolling through. It can interrupt, and be interrupted, the way a colleague standing at your desk would.

I've been building with these APIs across several projects, from a voice-and-vision consultation tool for dense technical standards to a screen-aware browser extension to a realtime speech coach. This post covers what I've learned: where the latency budget actually goes, how to design for privacy when the AI can see everything, and which business workflows benefit first.

## The interface contract changed

The old pipeline for voice AI was a relay race: speech-to-text transcribes your audio, the text goes to a language model, the model's text reply goes to a text-to-speech engine. Each handoff adds delay and loses information. By the time the model "hears" you, your tone, hesitation, and emphasis are gone — flattened into a transcript.

Speech-to-speech models skip the relay. Audio in, audio out, one model. The practical differences are bigger than they sound:

- **Interruption works.** You can talk over the assistant and it stops, because it's processing your audio continuously, not waiting for a turn boundary.
- **Prosody survives.** The model can hear that you're uncertain, rushed, or asking a genuine question versus thinking out loud.
- **Latency drops** from multiple seconds to well under one — close enough to human conversational rhythm that the exchange stops feeling like a transaction.
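To make the interruption point concrete, here is a minimal sketch of barge-in using `asyncio`. It is an illustration under stated assumptions, not how any particular realtime API implements it: `speak` stands in for writing audio to a speaker, and the user's speech is simulated by a timer (`user_speaks_after`). Both names are hypothetical.

```python
import asyncio


async def speak(chunks, played, chunk_seconds=0.02):
    """Play a reply chunk by chunk; `played` records what was actually heard."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stands in for writing to the speaker
        played.append(chunk)


async def converse_with_barge_in(reply_chunks, user_speaks_after):
    """Start speaking, then cancel playback if the user starts talking first."""
    played = []
    playback = asyncio.create_task(speak(reply_chunks, played))
    # Hypothetical stand-in for "the microphone detected user speech".
    user_speech = asyncio.create_task(asyncio.sleep(user_speaks_after))

    done, _ = await asyncio.wait(
        {playback, user_speech}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done
    if interrupted:
        playback.cancel()  # stop mid-reply instead of finishing the turn
        try:
            await playback
        except asyncio.CancelledError:
            pass
    else:
        user_speech.cancel()
    return played, interrupted
```

The design point is that playback is a cancellable task running alongside listening, rather than a blocking call that has to finish before the next turn can start.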

Live vision is the same shift applied to sight. Instead of uploading a screenshot and asking a question about it, you share a stream — a document, a browser tab, a camera feed — and the model maintains context across frames. It knows what you were looking at thirty seconds ago and what changed since.

Put the two together and you get something genuinely new: an assistant that watches the same thing you're watching while you talk about it.

## What this looks like in real work

Three examples from my own work, at concept level.

**Live document consultation.** I built a voice-and-vision tool for navigating technical standards — the kind of dense, multi-hundred-page engineering documents where finding the right clause is half the job. You load the standard, then just talk: "Where does this define the test temperature range?" The assistant navigates to the section, highlights the relevant passage on screen, and explains it out loud while you both look at the same page. The document stays in view the whole time. You never leave it to go ask a chatbot somewhere else. That's the part people underestimate: the value isn't the answer, it's that the answer arrives *in context*, anchored to the exact paragraph, while your hands stay on the document.

**Screen-aware browser help.** I built a Chrome extension that can see the page you're on and talk you through it. The first use case was personal finance — navigating a bank's site to compare CD rates and complete a reinvestment, with the assistant reading the actual screens rather than guessing from a generic script. Bank portals are a hostile environment for written instructions ("click the third tab... no, they redesigned it") but a trivial one for an assistant that can see the current layout. The same pattern generalizes to any web workflow where the UI is the obstacle: insurance portals, government forms, enterprise software nobody was trained on.

**Realtime speech coaching.** My communication coaching product gives feedback while you practice speaking — pacing, filler words, clarity. This is the most latency-sensitive of the three, because coaching feedback that arrives two seconds late is feedback about a sentence you've already finished. The interaction only works when the loop is tight enough that a cue lands while the behavior is still happening.

## Where the latency budget goes

If you're scoping a realtime project, the latency budget is the first engineering conversation to have, because it determines your architecture.

Human conversation runs on a roughly 200-millisecond turn-taking gap. Linguists have measured this across languages; it's close to a universal. Anything under about half a second feels responsive. Past one second, people start repeating themselves or talking over the assistant, and the experience degrades fast.
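Those thresholds are easy to turn into a check on measured latency. The sketch below assumes you already collect time-to-first-audio samples in milliseconds; the 500 ms and 1000 ms cutoffs come from the paragraph above, while the label "noticeable" for the range in between and the function names are my own assumptions.

```python
import statistics

RESPONSIVE_MS = 500   # "under about half a second feels responsive"
DEGRADED_MS = 1000    # "past one second" the experience degrades


def classify(latency_ms):
    """Map one latency sample onto the thresholds described in the text."""
    if latency_ms < RESPONSIVE_MS:
        return "responsive"
    if latency_ms <= DEGRADED_MS:
        return "noticeable"  # assumed label for the in-between range
    return "degraded"


def summarize(samples_ms):
    """Summarize time-to-first-audio samples (milliseconds)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    p95_rank = (95 * n + 99) // 100  # nearest-rank percentile, integer ceiling
    p95 = ordered[p95_rank - 1]
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": p95,
        "p95_verdict": classify(p95),
        "share_over_one_second": sum(s > DEGRADED_MS for s in ordered) / n,
    }
```

Judging the tail (p95) rather than the average is a deliberate choice here: a conversation feels broken when the slow turns are slow, even if the typical turn is fine.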

Your budget gets spent in four places:

1. **Capture and transport.** Microphone buffering, audio encoding, and the network hop to the model. Use WebRTC or a persistent WebSocket here, not request-response HTTP. This is also why region matters: a round trip to a distant data center can eat a third of your budget before the model does anything.
2. **Voice activity detection.** The system has to decide you've finished a thought before responding. Too aggressive and it interrupts you mid-clause; too patient and it adds dead air. The newer APIs handle this server-side and let you tune it.
3. **Model time-to-first-audio.** Speech-to-speech models start emitting audio before they've finished "thinking" — like streaming tokens, but for sound. What matters is time to first audible syllable, not time to complete response.
4. **Tool calls.** The silent killer. The moment your assistant needs to query a database or call an API mid-conversation, you've added that system's latency to the middle of a sentence. The fix is architectural: pre-fetch what you can predict, run slow lookups asynchronously while the model keeps talking ("let me pull that up — so, while that loads..."), and keep anything conversational on the fast path.
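The tool-call fix in item 4 can be sketched with `asyncio`: start the slow lookup as a background task, keep talking, and deliver the result when it arrives. This is a rough illustration, not a real API integration. `slow_lookup`, the filler phrases, and the `events` list (standing in for speech output) are all hypothetical.

```python
import asyncio


async def slow_lookup(delay_seconds):
    """Hypothetical tool call, e.g. a database query or a third-party API."""
    await asyncio.sleep(delay_seconds)
    return "account summary"


async def respond(lookup_delay, filler_phrases, phrase_seconds=0.02):
    """Keep the conversation moving while a slow lookup runs in the background."""
    events = []
    # Start the slow call first so it overlaps with speech.
    lookup = asyncio.create_task(slow_lookup(lookup_delay))
    events.append("say: let me pull that up")
    for phrase in filler_phrases:
        if lookup.done():
            break  # result is ready, stop filling
        events.append(f"say: {phrase}")
        await asyncio.sleep(phrase_seconds)  # stands in for speaking the phrase
    result = await lookup  # waits only if the filler ran out first
    events.append(f"say: {result}")
    return events
```

Calling `asyncio.run(respond(0.05, ["so, while that loads...", "one more moment"]))` returns the spoken events in order, with the lookup result always last. The tool's latency overlaps with speech instead of being added to it.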

Vision has its own budget question: frame rate. Streaming full-motion video to a model is expensive and usually unnecessary. For document and screen work, one frame per second, or frames sent only when something changes, is plenty, and it cuts cost and bandwidth by roughly an order of magnitude relative to full-motion video. The model doesn't need to watch you scroll; it needs to see where you stopped.
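Here is one way the "frames only on change, at most one per second" policy might look. It is a sketch under simplifying assumptions: frames arrive as raw bytes, change is detected with an exact hash (so a blinking cursor counts as a change, where a real system would likely want a tolerance), and `FrameSampler` is a name I made up for illustration.

```python
import hashlib


class FrameSampler:
    """Forward a frame only if it changed and the minimum interval has passed."""

    def __init__(self, min_interval_seconds=1.0):
        self.min_interval_seconds = min_interval_seconds
        self._last_digest = None
        self._last_sent_at = None

    def should_send(self, frame_bytes, now):
        """`now` is a timestamp in seconds, e.g. from time.monotonic()."""
        digest = hashlib.sha256(frame_bytes).digest()
        if digest == self._last_digest:
            return False  # nothing changed on screen since the last sent frame
        if (self._last_sent_at is not None
                and now - self._last_sent_at < self.min_interval_seconds):
            return False  # changed, but too soon (for example, mid-scroll)
        self._last_digest = digest
        self._last_sent_at = now
        return True
```

Because a frame dropped for arriving too soon is not remembered as sent, the resting position after a scroll still goes out once the interval has elapsed.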

## Privacy is a design decision, not a disclaimer

An assistant that can see your screen and hear your room is a different risk profile from a chatbot, and your users know it. The trust question has to be answered in the architecture, not the privacy policy.

The principles I work from:

**Decide what never leaves the device.** Wake-word detection and voice activity detection can run locally, so no audio is transmitted until the user is actually addressing the assistant. For screen capture, the extension or app should capture the specific tab or window the user designated — never the full desktop — and only while a session is active.
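As a rough illustration of keeping silence on the device, the sketch below gates audio frames with a simple energy threshold before anything is transmitted. This is a deliberately naive stand-in: production voice activity detection usually relies on a trained model, and the threshold, the hangover length, and the `LocalSpeechGate` name are assumptions for the example. Frames are lists of 16-bit PCM sample values.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class LocalSpeechGate:
    """Decide on-device whether a frame may be transmitted at all."""

    def __init__(self, threshold=500.0, hangover_frames=3):
        self.threshold = threshold              # assumed energy cutoff
        self.hangover_frames = hangover_frames  # quiet frames kept after speech
        self._quiet_frames_left = 0

    def gate(self, frame):
        """Return the frame if it should be sent, otherwise None."""
        if rms(frame) >= self.threshold:
            self._quiet_frames_left = self.hangover_frames
            return frame
        if self._quiet_frames_left > 0:
            self._quiet_frames_left -= 1
            return frame  # brief pause inside an utterance
        return None  # silence never leaves the device
```

The hangover setting carries the same tradeoff as endpointing: too short and pauses inside a sentence get clipped, too long and more room audio is transmitted than necessary.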

**Redact before transmit.** On screens with predictable sensitive fields — account numbers, SSNs, patient identifiers — masking can happen client-side, before any frame is encoded. The model can help someone navigate a banking portal without ever receiving their account number. In my financial-navigation work this was a hard requirement I set on day one, and it shaped everything downstream.
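A minimal sketch of client-side masking, assuming the frame is a grayscale image held as a list of `bytearray` rows and that the sensitive regions are already known as `(x, y, width, height)` boxes. How those boxes are found (DOM selectors, field labels, a per-site template) is outside this example, and `mask_regions` is a hypothetical helper, not part of any library.

```python
def mask_regions(frame, regions, fill=0):
    """Return a copy of `frame` with each (x, y, width, height) box blanked.

    `frame` is a list of bytearray rows (one byte per grayscale pixel).
    Boxes that extend past the frame edges are clipped.
    """
    masked = [bytearray(row) for row in frame]  # never mutate the captured frame
    height = len(masked)
    for x, y, w, h in regions:
        for row_index in range(max(y, 0), min(y + h, height)):
            row = masked[row_index]
            start, stop = max(x, 0), min(x + w, len(row))
            if start < stop:
                row[start:stop] = bytes([fill]) * (stop - start)
    return masked
```

The masked copy is what gets encoded and sent, so the original pixels for those regions never enter the transport path.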

**Default to ephemeral.** Realtime sessions shouldn't be recorded unless the user explicitly opts in for a specific purpose (a training review, a compliance record). Session context can live in memory and die with the connection.
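One way to make "ephemeral by default" structural rather than a policy promise is to tie session context to a scope that wipes itself on exit. The sketch below is an assumption-laden illustration: `EphemeralSession` and the `archive` sink are hypothetical names, and a real system would also need to account for logs, caches, and anything the model provider retains.

```python
class EphemeralSession:
    """In-memory session context that is cleared when the session closes."""

    def __init__(self, archive=None):
        # `archive` is a hypothetical opt-in sink (e.g. a compliance record).
        # None, the default, means nothing outlives the session.
        self._turns = []
        self._archive = archive

    @property
    def turns(self):
        return list(self._turns)

    def add_turn(self, text):
        self._turns.append(text)
        if self._archive is not None:
            self._archive.append(text)  # only when the user opted in

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._turns.clear()  # context dies with the connection
        return False
```

Used as `with EphemeralSession() as session: ...`, the context is gone as soon as the block exits, including when it exits because of an error.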

**Make the indicator honest.** A persistent, unmistakable signal whenever the mic or screen is live. Users forgive a lot if they always know what's being shared; they forgive nothing if they find out later.

None of this is exotic. It's the same discipline as any data-handling design, applied earlier — at the capture point instead of the storage layer.

## Which workflows light up first

Not every workflow benefits from realtime multimodal. The ones that do share a shape: a person looking at something complicated, needing guidance *now*, where switching to a separate tool breaks the task. Four categories stand out:

**Support.** "Show me what you're seeing" is the oldest request in tech support. Screen-aware assistance answers it without a human agent on the line, and escalates to one with full visual context already established.

**Compliance and procedure walkthroughs.** Audits, safety checklists, regulated processes — anywhere a person works through a document or physical procedure step by step. A voice assistant that sees the form and reads the standard can confirm each step as it happens, instead of someone checking their own work from memory afterward.

**Training.** Realtime feedback compresses learning loops. This applies to my speech-coaching work, but equally to onboarding someone on unfamiliar software: the assistant watches them attempt the task and coaches in the moment, which beats a recorded video every time.

**Accessibility.** This may be the most important one. For users with low vision, motor constraints, or reading difficulty, a voice interface over a visual one isn't a convenience feature — it's the difference between usable and not. Realtime vision plus voice means any screen can be navigated by conversation. The same investment that builds a support tool builds an accessibility tool.

## How I'd start

If you're a CTO or founder evaluating this: pick one workflow where your users are already looking at something while wishing they could ask someone about it. A dense document, a confusing portal, a procedure. Build the narrowest possible version — one document type, one portal, one task — and instrument the latency from the first day, because a realtime product that misses its budget is worse than a turn-based one that doesn't pretend.

Then settle the privacy architecture before the demo, not after. What's captured, what's masked, what's stored, what the user sees about all of it. Those decisions are nearly free at the start and very expensive later.

The interface contract has changed. The teams that internalize that early will ship assistants that feel like colleagues; everyone else will ship chatbots with microphones.

---

*I build realtime and agentic AI systems in production — you can see related work on the [projects page](/projects), including the [Spark AI platform](/projects/spark-ai-platform) and [Verbal Victory](/projects/verbal-victory). If you're weighing a voice or vision workflow for your own team, [get in touch](/contact).*