Screen-Aware AI: Why the Browser Is the Best Place for an Assistant

Every chatbot has the same blind spot: it can't see what you're looking at. You're three screens deep into some web portal, stuck on a form that won't validate, and the assistant that's supposed to help is sitting in a different tab, waiting for you to describe your problem in words.

So you describe it. You type out what the page says, what you clicked, what happened. The chatbot answers from a memory of what that website looked like whenever its training data was collected. You go back, the advice doesn't match the screen, and you return to the chatbot tab to explain why. The two of you pass notes about a page only one of you can see.

That gap is why I built VIP, a voice-and-vision Chrome extension that's live in the Chrome Web Store. The premise is simple: the assistant joins you on the page. It sees what you see, you talk to it out loud, and it answers about the actual screen in front of you — not a generic version of it. Building it taught me that the browser is a far better home for an AI assistant than another chat tab, and this post is about why.

## The describe-your-screen tax

Watch a non-technical person try to get chatbot help with a website task and you'll see the real cost of the chat-tab model. It isn't the model's intelligence. It's the round trip.

The user has to translate a visual situation into prose, accurately, including details they don't know are relevant. Then they have to translate the chatbot's prose back into clicks on a layout the chatbot has never seen. Every redesign the website ships makes the chatbot's directions a little more wrong. "Click the Accounts tab on the left" is useless advice when the bank moved it into a hamburger menu last quarter.

People who live in software all day barely notice this tax because we're good at both translations. The people who most need help — older users, infrequent users, anyone facing an unfamiliar portal under time pressure — pay it in full. They're the ones who give up and call a phone line, or hand the laptop to a relative.

A screen-aware assistant deletes the tax. There's nothing to describe. The question becomes "what do I click?" and the answer can be "that one, the button under the balance" — because both of you are looking at it.

## What the browser already knows

Here's the part that surprised me when I started building: an extension doesn't have to work nearly as hard as a chatbot to understand a web page, because the browser hands it the page on a platter.

A chatbot given a screenshot is doing forensics — inferring structure from pixels. An extension's content script sits inside the page itself. It can read the DOM: every button's label, every form field's state, what's disabled, what's hidden, what error message just appeared. It knows the URL, so it knows which step of which flow you're on. When the page changes, it knows immediately, because it watched the change happen.
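To make "reading the DOM" concrete, here is a minimal sketch of the kind of snapshot a content script could build. It is not VIP's actual code. The names `ControlLike`, `ControlSummary`, `PageSnapshot`, `labelFor`, and `snapshotPage` are all hypothetical, and the labeling heuristic (aria-label, then placeholder, then visible text) is an assumption, not a complete accessible-name computation.

```typescript
// Minimal structural view of a control. Typed this way so the sketch
// runs outside a browser; in a content script, real DOM elements are
// expected to fit this shape.
interface ControlLike {
  tagName: string;
  textContent: string | null;
  disabled?: boolean;
  hidden?: boolean;
  getAttribute(name: string): string | null;
}

interface ControlSummary {
  kind: string; // lowercased tag name, e.g. "button"
  label: string; // best-effort human-readable label
  disabled: boolean;
  hidden: boolean;
}

interface PageSnapshot {
  url: string;
  controls: ControlSummary[];
}

// Assumed heuristic: prefer aria-label, then placeholder, then visible text.
function labelFor(el: ControlLike): string {
  const raw =
    el.getAttribute("aria-label") ||
    el.getAttribute("placeholder") ||
    el.textContent ||
    "";
  // Collapse runs of whitespace so labels compare cleanly.
  return raw.trim().replace(/\s+/g, " ");
}

// Summarize the URL plus the state of each control on the page.
function snapshotPage(url: string, controls: ArrayLike<ControlLike>): PageSnapshot {
  const summaries: ControlSummary[] = [];
  for (let i = 0; i < controls.length; i++) {
    const el = controls[i];
    summaries.push({
      kind: el.tagName.toLowerCase(),
      label: labelFor(el),
      disabled: el.disabled === true,
      hidden: el.hidden === true,
    });
  }
  return { url, controls: summaries };
}
```

In a content script, the call would look something like `snapshotPage(location.href, document.querySelectorAll("button, input, select, textarea"))`. The point is that the result is structured data about the page, not a guess reconstructed from pixels.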

That structural access turns vague guidance into precise guidance. The assistant doesn't say "look for something like a Continue button." It can find the button, confirm it exists on this version of the page, and point at it — literally, by highlighting the element. Pixels tell you what a page looks like. The DOM tells you what it *is*, and the extension gets both.
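Here is a rough sketch of "find the button, confirm it exists, and point at it." Again, this is illustrative and not VIP's implementation: `HighlightTarget`, `findByLabel`, and `highlight` are hypothetical names, the exact-text match is a simplifying assumption, and the outline styling is just one way to draw attention to an element.

```typescript
// Structural type so the sketch runs outside a browser; an HTMLElement
// in a content script is expected to fit this shape.
interface HighlightTarget {
  textContent: string | null;
  style: { outline: string; outlineOffset: string };
  scrollIntoView(options?: {
    block?: "start" | "center" | "end" | "nearest";
    behavior?: "auto" | "smooth";
  }): void;
}

// Return the first candidate whose visible text matches the label
// (case-insensitive, ignoring surrounding whitespace), or undefined
// if this version of the page has no such element.
function findByLabel<T extends HighlightTarget>(
  candidates: ArrayLike<T>,
  label: string
): T | undefined {
  const wanted = label.trim().toLowerCase();
  for (let i = 0; i < candidates.length; i++) {
    const text = (candidates[i].textContent ?? "").trim().toLowerCase();
    if (text === wanted) return candidates[i];
  }
  return undefined;
}

// Outline the element and scroll it into view. Returns an undo
// function that restores the element's previous outline styles.
function highlight(el: HighlightTarget, color: string = "#ff9800"): () => void {
  const previousOutline = el.style.outline;
  const previousOffset = el.style.outlineOffset;
  el.style.outline = `3px solid ${color}`;
  el.style.outlineOffset = "2px";
  el.scrollIntoView({ block: "center", behavior: "smooth" });
  return () => {
    el.style.outline = previousOutline;
    el.style.outlineOffset = previousOffset;
  };
}
```

Returning `undefined` when nothing matches is the important design choice: the assistant can say "I don't see a Continue button on this page" instead of sending the user to hunt for something that isn't there.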

Voice completes the loop. VIP is voice-and-vision because the moment your hands are busy with a form, typing questions into a sidebar is almost as bad as switching tabs. You talk; it watches; your hands stay on the task. That combination — see the page, speak the guidance, point at the element — is the whole product.

## A bank portal was the first real test

The first workflow I pointed VIP at was personal finance: a certificate of deposit maturing, and the question of what to do with the money — compare current rates, weigh the options, then actually complete the reinvestment inside the bank's website.

I picked it deliberately, because bank portals are the worst case for written instructions and the best case for screen-aware help. They're dense, they're redesigned constantly, the navigation labels are bank-specific jargon, and the cost of clicking the wrong thing feels high enough that cautious users freeze. No static help article survives contact with a live banking session.

VIP's job in that flow is to be the patient person looking over your shoulder. It reads the page you're actually on, tells you what the screen is showing in plain language, and walks you to the next step of *this* layout, not the layout from a help doc written two redesigns ago. When the portal throws a surprise, such as an interstitial offer or a re-authentication prompt, the assistant sees it too, so the guidance can adapt instead of derailing.

The pattern generalizes well past banking. Insurance claims, government benefit forms, airline rebooking, the enterprise software nobody got trained on — anywhere the user interface itself is the obstacle, an assistant that can see the interface beats one that can only hear about it.

## What an extension can do that a chatbot cannot

Stepping back from VIP specifically, the capability gap between a browser extension and a chat tab comes down to four things.

**It sees current state, not described state.** The assistant reasons about the page as it exists this second, including the error message that just appeared. No transcription by the user, no stale screenshots.

**It can point.** Guidance can be anchored to the page — highlight this field, scroll to this section. "Click the third link" becomes a glowing outline around the link. For anyone with low vision or low confidence, that's the difference between guidance and noise.

**It follows you.** A chat tab loses you the moment you navigate. An extension rides along across pages, so the assistant keeps the thread through a whole multi-step flow. It remembers that you're mid-reinvestment when the confirmation page loads.

**It meets you where the work is.** No context switch, no second window, no copy-paste shuttle. The help arrives inside the task. In my experience that placement matters more than raw model quality — a decent model in the page beats a brilliant model in another tab, because the brilliant model never gets an accurate picture of the problem.
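The "it follows you" point is mostly a state-keeping problem, and a small sketch shows its shape. Everything here is hypothetical (`FlowState`, `advanceFlow`, the flow name in the comments) and deliberately simplified: it treats each new URL within the same flow as one more step, which real portals will not always honor. The state would need to live somewhere that survives page loads. The extension's own session-scoped storage (for example `chrome.storage.session`) is one candidate, though that choice is my assumption for this sketch.

```typescript
// What the assistant remembers between page loads.
interface FlowState {
  flow: string; // e.g. "cd-reinvestment" (hypothetical flow name)
  step: number; // 1-based count of distinct pages visited in this flow
  lastUrl: string;
}

// Pure update function: given the previous state (if any) and the page
// that just loaded, return the state to persist.
function advanceFlow(
  prev: FlowState | undefined,
  flow: string,
  url: string
): FlowState {
  // No prior state, or the user started a different task: begin at step 1.
  if (!prev || prev.flow !== flow) {
    return { flow, step: 1, lastUrl: url };
  }
  // Same URL again (a reload, say): stay on the current step.
  if (prev.lastUrl === url) {
    return prev;
  }
  // A new page within the same flow: move one step forward.
  return { flow, step: prev.step + 1, lastUrl: url };
}
```

Keeping the update pure makes the persistence layer swappable: load the previous state when a page finishes loading, call `advanceFlow`, and write the result back.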

## The privacy line has to be drawn first

An assistant that can watch your screen — on a banking site, no less — is a serious trust proposition, and it deserves to be treated as one. The boundaries can't live in a privacy policy. They have to live in the architecture.

The rules I built to:

**Nothing is seen until you invoke it.** The assistant is dormant until the user deliberately activates it. No passive monitoring, no ambient capture. Chrome's permission model actually helps here: with the `activeTab` permission, an extension is granted access to the current tab only when the user invokes it, rather than to every page all the time.

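**The tab, never the desktop.** Scope is the page you asked for help with: not other tabs, not other windows, not your whole screen. Chrome's extension permission model makes that a real boundary instead of a promise, because an extension that requests only active-tab access is not granted access to other tabs, other windows, or the desktop.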

**Sensitive fields get masked before anything leaves the machine.** For predictable sensitive values (account numbers, balances, identifiers), redaction happens client-side. The assistant can guide you through a reinvestment without ever receiving the number of the account it's helping with. I treated that as a day-one requirement, and it shaped the design more than any feature did.

**The indicator is honest.** When the assistant can see or hear, the user can tell at a glance. Sessions are ephemeral by default; the context is discarded when you close the session.
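The masking rule above is the easiest one to sketch in code. This is an illustration under stated assumptions, not VIP's redaction logic: `FieldLike`, `redactField`, and `redactAll` are hypothetical names, and the two patterns (a keyword list for field labels, plus a catch-all for long digit runs) are deliberately crude. A real deployment would need per-site rules and review.

```typescript
// A form field or labeled value as captured from the page.
interface FieldLike {
  name: string;
  label: string;
  value: string;
}

// Assumed keyword list for fields that should never leave the machine.
const SENSITIVE_LABEL = /account|routing|balance|ssn|social security|card/i;

// Catch-all: 8 or more characters of digits, spaces, or hyphens that
// start and end with a digit. It errs toward over-redacting, so dates
// such as 2026-01-15 are masked too.
const LONG_DIGIT_RUN = /\d[\d\s-]{6,}\d/g;

const MASK = "[REDACTED]";

// Return a copy of the field that is safe to send off-device.
function redactField(field: FieldLike): FieldLike {
  if (SENSITIVE_LABEL.test(field.name) || SENSITIVE_LABEL.test(field.label)) {
    // Sensitive by label: drop the whole value.
    return { ...field, value: MASK };
  }
  // Otherwise keep the text but mask anything that looks like a long number.
  return { ...field, value: field.value.replace(LONG_DIGIT_RUN, MASK) };
}

// Redact every field before the snapshot is serialized.
function redactAll(fields: FieldLike[]): FieldLike[] {
  return fields.map(redactField);
}
```

The ordering matters more than the patterns: redaction runs on the client, before serialization, so the unmasked value is never part of any request.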

This is also the quiet argument for the browser over deeper OS-level integration. A browser extension's reach is legible — users can reason about what "this tab, while active" means. "An agent that can see my whole computer" is a much harder sentence to say yes to, and in 2026 users are right to hesitate.
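For readers who want to see what "this tab, while active" looks like in practice, here is a sketch of a narrowly scoped Manifest V3 declaration, written as a TypeScript object so it can sit next to a small audit check. It is not VIP's manifest: the name, version, and title are placeholders, and `ManifestSketch` and `broadGrants` are hypothetical helpers. The permission strings themselves (`activeTab`, `scripting`, `tabs`, `desktopCapture`, `<all_urls>`) are real Chrome values.

```typescript
// Only the manifest fields this sketch cares about.
interface ManifestSketch {
  manifest_version: 3;
  name: string;
  version: string;
  permissions: string[];
  host_permissions?: string[];
  action: { default_title: string };
}

// Placeholder manifest: access is requested only for the active tab,
// and only after the user invokes the extension.
const manifest: ManifestSketch = {
  manifest_version: 3,
  name: "Example screen-aware assistant", // placeholder name
  version: "0.1.0",
  permissions: ["activeTab", "scripting"],
  // No host_permissions: nothing is readable until the user acts.
  action: { default_title: "Ask about this page" },
};

// Hypothetical audit helper: list any requested grants that would
// widen the scope beyond the tab the user asked about.
function broadGrants(m: ManifestSketch): string[] {
  const broad = ["tabs", "desktopCapture", "<all_urls>"];
  const requested = m.permissions.concat(m.host_permissions ?? []);
  return requested.filter((p) => broad.indexOf(p) !== -1);
}
```

A check like `broadGrants(manifest).length === 0` could run in a build step, so that widening the extension's reach becomes a deliberate, reviewed change instead of an accident.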

## Where I'd start

If you run a product where users get stuck — a portal, a dashboard, a workflow with a support queue full of "where do I click" tickets — here's the practical version of everything above:

1. **Find the flow people abandon.** Your support transcripts already know which screens defeat users. Pick one flow, not ten.
2. **Put the help in the page, not beside it.** Whether that's an extension, an embedded assistant, or a guided overlay, the requirement is the same: it must see the live page state, not a description of it.
3. **Anchor guidance to elements.** If the assistant can't highlight what it's talking about, you've rebuilt the help article with extra steps.
4. **Settle the privacy architecture before the demo.** What's captured, when, what's masked, what's stored, and how the user can tell. These decisions are cheap on day one and brutal to retrofit.
5. **Add voice if hands are busy.** Form-heavy flows earn it; read-only flows may not.

The chat tab was a fine first home for AI — it proved people want to ask computers questions in plain language. But the questions people most need answered are about the screen they're already on. The browser is where that screen lives, and it's where the assistant belongs.

---

*VIP is live in the Chrome Web Store, and it's part of a broader bet on realtime voice-and-vision interfaces — you can see related work on the [projects page](/projects). If your users are getting lost in a portal or a workflow, [get in touch](/contact).*