est. 2018 · szczecin, poland ● availability: open — 1 retainer + 1 short build vol. viii — no. 42 · may 2026
late edition

The Perliński Gazette

all the stack that's fit to ship
§ a4 · notes from the workshop · Technology · 2026.06.12

There's a language model in this newspaper (and it runs on your machine)


Somewhere in the back room of this paper there's a "night editor" you can talk to. It isn't a chatbot calling an API in a data centre. It's a language model that downloads into your browser once and then runs entirely on your own hardware — no server, no key, nothing leaving your machine. I wanted to see how close that experience is to "good enough" in 2026. It's closer than most people think.

How it actually works

Three pieces make in-browser inference possible:

  • WebGPU: the browser API that finally gives JavaScript real access to the GPU. This is the unlock. Without it you're stuck on WASM and CPU, which works but crawls.
  • A compiled runtime: WebLLM (built on MLC/TVM) compiles model kernels to WebGPU shaders. Hugging Face's Transformers.js does a similar job via ONNX Runtime Web.
  • A small, quantised model: the editor here is Qwen2.5 0.5B, 4-bit quantised, around 400 MB. It downloads once, the browser caches the weights, and later visits skip the download and only have to load the model from the local cache.
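
The WebGPU dependency in the list above can be checked before any weights are fetched. What follows is a minimal sketch, not this page's actual code: `pickBackend`, `Backend`, and the `NavigatorLike` types are hypothetical names, written by hand so the snippet compiles without WebGPU typings. `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points. Falling back to "wasm" is an illustrative choice.

```typescript
// Which inference path the page could take. "wasm" stands for the slow CPU fallback.
type Backend = "webgpu" | "wasm";

// Minimal structural types so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical helper: resolves to "webgpu" only when an adapter is actually available.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "wasm"; // the browser has no WebGPU API at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null adapter: API present, no usable GPU
  } catch {
    return "wasm"; // treat any failure as "no GPU"
  }
}
```

In a page this would be called as `pickBackend(navigator as NavigatorLike)`. Passing the navigator in as a parameter, rather than reading the global, keeps the check easy to exercise outside a browser.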

You call it with an OpenAI-shaped API (chat.completions.create, streaming and all), except the tokens are being generated a few centimetres from your eyes instead of in someone's cloud.
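
The call shape described above can be sketched as follows. This is a minimal sketch, not this page's actual code: `ChatEngine`, `ChatChunk`, `ChatMessage`, and `streamReply` are hypothetical names, written by hand to mirror the OpenAI-style streaming shape (`chat.completions.create` with `stream: true`, chunks carrying `choices[0].delta.content`). With WebLLM the engine would presumably come from `CreateMLCEngine` in `@mlc-ai/web-llm`; that step is left out so the snippet stays self-contained.

```typescript
// Hand-written stand-ins for the OpenAI-style types; not imported from any library.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface ChatChunk {
  choices: { delta: { content?: string | null } }[];
}
interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[]; stream: true }): Promise<AsyncIterable<ChatChunk>>;
    };
  };
}

// Hypothetical helper: sends one user prompt and hands each text fragment
// to onToken as it arrives. Resolves to the full reply once the stream ends.
async function streamReply(
  engine: ChatEngine,
  prompt: string,
  onToken: (piece: string) => void,
): Promise<string> {
  const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  let reply = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (for example the final one), so default to "".
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece) {
      reply += piece;
      onToken(piece);
    }
  }
  return reply;
}
```

In a page, `onToken` would typically append the fragment to the DOM, which is what makes the answer appear word by word instead of all at once.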

Why it's more than a gimmick

  • Privacy by construction. The prompt never leaves the device. For anything sensitive — health notes, draft contracts, a journalling app — that's not a feature you bolt on, it's the architecture.
  • No per-token cost, no rate limit. Once it's loaded it's yours. Great for toys, classrooms, and anything you don't want metered.
  • Offline. Plane, train, dead Wi-Fi — it still answers.
  • Latency floor. No network round-trip. Small models feel snappy.

Where it falls down

I'm not going to oversell it. A 0.5B model is small. It will confidently get facts wrong, lose the thread on long context, and it's nowhere near a frontier model. The first load is a few hundred megabytes, which is a lot to ask of a casual visitor. And it leans on WebGPU, so older browsers and weak GPUs are out.
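
One way to soften the first-load cost and the WebGPU requirement is to decide up front whether to offer the download at all. A minimal sketch under stated assumptions: `decideOffer`, `DeviceSnapshot`, `Offer`, and the 2x headroom factor are all hypothetical and are not how this page is known to behave. In a browser the inputs could be gathered from a WebGPU adapter check, from `navigator.storage.estimate()` (quota minus usage), and from `navigator.connection.saveData` where the browser exposes it.

```typescript
// Inputs gathered from the browser before offering the model download.
interface DeviceSnapshot {
  hasWebGpu: boolean; // result of a WebGPU adapter check
  saveData: boolean;  // the user asked the browser to reduce data usage
  freeBytes: number;  // storage quota minus usage, in bytes
}

// "offer": show the download button with its size.
// "warn":  show it, but flag the size for data-saver users.
// "hide":  do not offer the model on this device.
type Offer = "offer" | "warn" | "hide";

const MODEL_BYTES = 400 * 1024 * 1024; // roughly the 400 MB quoted in this article
const HEADROOM = 2; // assumption: want at least twice the model size free

// Hypothetical policy: pure function, so it can be checked without a browser.
function decideOffer(d: DeviceSnapshot, modelBytes: number = MODEL_BYTES): Offer {
  if (!d.hasWebGpu) return "hide"; // no GPU path, the model would crawl
  if (d.freeBytes < modelBytes * HEADROOM) return "hide"; // not enough room to cache
  if (d.saveData) return "warn"; // respect the data-saver preference
  return "offer";
}
```

Even in the "offer" case the download stays behind an explicit click, since a few hundred megabytes is too much to spend on a visitor's behalf without asking.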

The honest framing: client-side LLMs aren't a replacement for the big hosted models — they're a different tool. The right mental model is "a fast, private, free, slightly dim assistant that's always there," not "ChatGPT but local."

The bigger picture

This is the same shift that's been happening on phones (Gemini Nano in Chrome and on Pixels, Apple Intelligence on-device), just arriving on the open web where anyone can ship it without a backend. For a lot of small, private, latency-sensitive features, "the model runs in the tab" is about to be a perfectly reasonable answer.

Anyway — go wake the night editor and ask it something. It's not clever, but it's running on your computer, which is the whole point.