Somewhere in the back room of this paper there's a "night editor" you can talk to. It isn't a chatbot calling an API in a data centre. It's a language model that downloads into your browser once and then runs entirely on your own hardware — no server, no key, nothing leaving your machine. I wanted to see how close that experience is to "good enough" in 2026. It's closer than most people think.
How it actually works
Three pieces make in-browser inference possible:
- WebGPU — the browser API that finally gives JavaScript real access to the GPU. This is the unlock. Without it you're stuck on WASM and CPU, which works but crawls.
- A compiled runtime — WebLLM (built on MLC/TVM) compiles model kernels to WebGPU shaders. Hugging Face's Transformers.js does a similar job via ONNX Runtime Web.
- A small, quantised model — the editor here is Qwen2.5 0.5B, 4-bit quantised, around 400 MB. It downloads once, the browser caches the weights, and subsequent visits skip the download and only have to load the model from the local cache.
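To make the WebGPU-or-fallback decision concrete, here is a minimal sketch of how a page might check for a usable GPU before committing to the download. The names pickBackend, NavigatorLike and GpuLike are hypothetical helpers of mine, not part of any library. The only real API assumed is navigator.gpu.requestAdapter(), which resolves to null when no suitable adapter exists.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
// Only the members used here are declared; these names are illustrative.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Decide which path to take: GPU shaders if an adapter is available,
// otherwise the slower WASM/CPU fallback.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "wasm"; // the browser exposes no WebGPU API at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // API present, but maybe no usable GPU
  } catch {
    return "wasm"; // treat any failure as "no GPU"
  }
}
```

In a real page you would pass the global navigator (cast to NavigatorLike if your TypeScript setup lacks WebGPU typings) and use the result to decide whether to offer the model at all.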
You call it with an OpenAI-shaped API (chat.completions.create, streaming and all), except the tokens are being generated a few centimetres from your eyes instead of in someone's cloud.
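Here is roughly what that OpenAI-shaped call looks like, as a sketch rather than this site's actual code. The ChatEngine, ChatMessage and ChatChunk interfaces are my own stripped-down stand-ins for the streaming shape, and askNightEditor and its system prompt are hypothetical. The function takes the engine as a parameter so the streaming logic stays independent of whichever runtime supplies it.

```typescript
// Hypothetical minimal types mirroring the OpenAI-style streaming shape.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface ChatChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}
interface ChatEngine {
  chat: {
    completions: {
      create(req: {
        messages: ChatMessage[];
        stream: true;
      }): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Stream a reply, calling onToken as each piece arrives,
// and return the full text once the stream ends.
async function askNightEditor(
  engine: ChatEngine,
  question: string,
  onToken: (piece: string) => void = () => {},
): Promise<string> {
  const stream = await engine.chat.completions.create({
    messages: [
      // Illustrative system prompt, not the one used on this site.
      { role: "system", content: "You are a terse night editor." },
      { role: "user", content: question },
    ],
    stream: true,
  });

  let reply = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (empty delta or empty choices); skip them.
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece) {
      reply += piece;
      onToken(piece);
    }
  }
  return reply;
}
```

With WebLLM, the engine would come from something like CreateMLCEngine with a model ID from the library's prebuilt list. I'm assuming the ID looks like "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", so check the current list before relying on it. The real engine type is richer than ChatEngine, so a cast may be needed.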
Why it's more than a gimmick
- Privacy by construction. The prompt never leaves the device. For anything sensitive — health notes, draft contracts, a journalling app — that's not a feature you bolt on, it's the architecture.
- No per-token cost, no rate limit. Once it's loaded it's yours. Great for toys, classrooms, and anything you don't want metered.
- Offline. Plane, train, dead Wi-Fi — it still answers.
- Latency floor. No network round-trip. Small models feel snappy.
Where it falls down
I'm not going to oversell it. A 0.5B model is small. It will confidently get facts wrong, lose the thread on long context, and it's nowhere near a frontier model. The first load is a few hundred megabytes, which is a lot to ask of a casual visitor. And it leans on WebGPU, so older browsers and weak GPUs are out.
The honest framing: client-side LLMs aren't a replacement for the big hosted models — they're a different tool. The right mental model is "a fast, private, free, slightly dim assistant that's always there," not "ChatGPT but local."
The bigger picture
This is the same shift that's been happening on phones — Gemini Nano in Chrome and on Pixels, Apple Intelligence on-device — just arriving on the open web where anyone can ship it without a backend. For a lot of small, private, latency-sensitive features, "the model runs in the tab" is about to be a perfectly reasonable answer.
Anyway — go wake the night editor and ask it something. It's not clever, but it's running on your computer, which is the whole point.