🜂 EMBERGLASS

A 3-billion-parameter mind, running inside a browser tab. No server. No install. No upload. Just a page.

~35 tokens/sec decode · live LoRA hot-swap · token-exact vs. the reference · 100% client-side WebGPU

Code & runtime: https://github.com/maceip/emberglass


What this is

Most "AI in the browser" is a thin client phoning home to someone else's GPU. This isn't that.

Emberglass is a hand-built inference engine that runs a fine-tuned Qwen2.5-3B reasoning model entirely on your own machine's GPU, from inside a single static web page — written from the metal up in raw WebGPU compute shaders. The model thinks for thousands of tokens, streams a verdict, and never sends a single byte off your device. You bring the weights; the page brings the engine.

And the part that shouldn't be possible at this speed: you can swap the model's personality at runtime. Load the base once, then hot-swap LoRA adapters live — no reload, no recompile, no re-quantization. The base weights never move. The output changes the instant you flip the adapter, and flips back bit-for-bit identically when you remove it.
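Why removing an adapter can restore the output exactly is easiest to see on the CPU. The sketch below assumes the standard LoRA formulation, y = Wx + (alpha / rank) · B(Ax), applied as a separate term at run time. The function names and the adapter object shape are hypothetical and are not Emberglass's API; the real engine does this work in WGSL kernels.

```javascript
// Dense matrix-vector product: W is row-major [rows x cols], x has length cols.
function matVec(W, rows, cols, x) {
  const y = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let acc = 0;
    for (let c = 0; c < cols; c++) acc += W[r * cols + c] * x[c];
    y[r] = acc;
  }
  return y;
}

// Linear layer with an optional LoRA adapter (hypothetical shape):
//   adapter = { A: [rank x inDim], B: [outDim x rank], rank, alpha }
// The base weights W are only read, never modified.
function linearWithLora(W, outDim, inDim, x, adapter) {
  const y = matVec(W, outDim, inDim, x);
  if (!adapter) return y;                     // no adapter: pure base path
  const { A, B, rank, alpha } = adapter;
  const down = matVec(A, rank, inDim, x);     // A x, length rank
  const up = matVec(B, outDim, rank, down);   // B (A x), length outDim
  const scale = alpha / rank;
  for (let i = 0; i < outDim; i++) y[i] += scale * up[i];
  return y;
}
```

Because W is never rewritten, calling the layer without an adapter takes exactly the same code path before and after a swap, which is consistent with the bit-for-bit restore described above.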

Why it's hard (and why it's fast)

A browser tab is the most hostile environment imaginable for a 3B-parameter model. No CUDA. No vendor kernels. A 5.4 GB weight shard won't even fit in a single JavaScript array. Every fast path that exists on a server is closed. So we closed the gap by hand:

  • Custom WGSL compute kernels for every op — the only way LoRA could become live and swappable instead of a baked-in constant.
  • int4 group-128 quantization that is numerically exact on the reference decode — half the memory, zero quality lost.
  • Split-K flash-style decode attention so it stays fast even at thousands of tokens of context.
  • Subgroup-reduction GEMV + a GPU-resident batched decode loop (argmax→embed stays on the GPU; one sync per batch).
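To make the int4 group-128 idea above concrete, here is a minimal CPU sketch. It assumes symmetric per-group scaling (one f32 scale per 128 weights, values in [-8, 7]) and two nibbles packed per byte, low nibble first. Those choices, and the function names, are assumptions for illustration; the actual Emberglass layout may use zero-points or a different packing order.

```javascript
const GROUP = 128; // weights per quantization group

// Quantize a Float32Array whose length is a multiple of GROUP.
// Returns packed nibbles (two int4 values per byte) and one f32 scale per group.
function quantizeInt4(weights) {
  if (weights.length % GROUP !== 0) {
    throw new Error("length must be a multiple of " + GROUP);
  }
  const nGroups = weights.length / GROUP;
  const packed = new Uint8Array(weights.length / 2);
  const scales = new Float32Array(nGroups);
  for (let g = 0; g < nGroups; g++) {
    let maxAbs = 0;
    for (let i = 0; i < GROUP; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[g * GROUP + i]));
    }
    scales[g] = maxAbs > 0 ? maxAbs / 7 : 1;   // map the largest magnitude to +/-7
    for (let i = 0; i < GROUP; i++) {
      const idx = g * GROUP + i;
      const q = Math.max(-8, Math.min(7, Math.round(weights[idx] / scales[g])));
      const nibble = q & 0xf;                  // two's-complement 4-bit value
      packed[idx >> 1] |= (idx & 1) ? (nibble << 4) : nibble;
    }
  }
  return { packed, scales };
}

// Inverse of quantizeInt4: unpack nibbles, sign-extend, multiply by the group scale.
function dequantizeInt4({ packed, scales }, length) {
  const out = new Float32Array(length);
  for (let idx = 0; idx < length; idx++) {
    const byte = packed[idx >> 1];
    const nibble = (idx & 1) ? (byte >> 4) : (byte & 0xf);
    const q = nibble >= 8 ? nibble - 16 : nibble;
    out[idx] = q * scales[Math.floor(idx / GROUP)];
  }
  return out;
}
```

Under these assumptions each group costs 64 bytes of nibbles plus a 4-byte scale, or 4.25 bits per weight. A GPU kernel would do the dequantize step inline while accumulating the dot product rather than materialising `out`.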

Every win was found by measuring — nanosecond GPU timestamp profiling — not guessing. 9 → 35 tok/s over one focused push.

Results

Decode speed     ~35 tok/s across a full multi-thousand-token reasoning generation
Correctness      argmax + every generated token exact vs the HuggingFace reference; bit-exact run-to-run
LoRA hot-swap    load base once · swap live · perfect restore on clear · no reload
Footprint        one static HTML page; weights supplied by the visitor (BYO-model)
Privacy          absolute; inference never leaves the device

Context window & prefill sizes

The base model — WeiboAI/VibeThinker-3B, a Qwen2.5-architecture 3B reasoning model (from Qwen2.5-Coder-3B) — supports 131072 (128K) positions with a 32K sliding window, and is built to think long: its generation config defaults to max_new_tokens=65536, and the authors suggest 60K–100K tokens for the hardest problems. So context length is a first-class concern here, not an afterthought.

The runtime exposes context + prefill as options:

const rt = new QwenWGPU(device, QWEN25_3B, { maxCtx: 8192, maxPrefillT: 8192 });
  • maxCtx — the context window (KV-cache length). Decode attention is split-K and prefill attention is flash / online-softmax (O(block) workgroup memory, not O(ctx)), so neither caps out at small sizes — context scales until you run out of VRAM.
  • maxPrefillT — the largest prompt processed in one batched (tiled-int4-GEMM) prefill pass. Longer prompts (or prefill while a LoRA adapter is active) fall back to the sequential path; clamped to maxCtx.
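Split-K decode and online-softmax prefill both rest on the same identity: attention over disjoint key ranges can be computed independently, each range keeping a running maximum and normaliser, and the partial results merged afterwards. The CPU sketch below shows that merge for a single query and a single head. All names are hypothetical and the real kernels are WGSL; this only illustrates the arithmetic.

```javascript
// Attention over keys [start, end) for one query vector q.
// Returns the running max m, the normaliser l = sum(exp(s - m)),
// and the unnormalised output acc = sum(exp(s - m) * v).
function partialAttention(q, K, V, start, end, headDim) {
  let m = -Infinity, l = 0;
  const acc = new Float64Array(headDim);
  const invSqrt = 1 / Math.sqrt(headDim);
  for (let t = start; t < end; t++) {
    let s = 0;
    for (let d = 0; d < headDim; d++) s += q[d] * K[t * headDim + d];
    s *= invSqrt;
    const mNew = Math.max(m, s);
    const rescale = Math.exp(m - mNew);   // 0 on the first key (m = -Infinity)
    const p = Math.exp(s - mNew);
    l = l * rescale + p;
    for (let d = 0; d < headDim; d++) {
      acc[d] = acc[d] * rescale + p * V[t * headDim + d];
    }
    m = mNew;
  }
  return { m, l, acc };
}

// Merge per-split partials into the final attention output.
function mergeSplits(parts, headDim) {
  let m = -Infinity;
  for (const p of parts) m = Math.max(m, p.m);
  let l = 0;
  const out = new Float64Array(headDim);
  for (const p of parts) {
    const w = Math.exp(p.m - m);          // bring each split onto the global max
    l += p.l * w;
    for (let d = 0; d < headDim; d++) out[d] += p.acc[d] * w;
  }
  for (let d = 0; d < headDim; d++) out[d] /= l;
  return out;
}

// Split the context (ctxLen >= 1) into numSplits ranges and merge the results.
function splitKAttention(q, K, V, ctxLen, headDim, numSplits) {
  const chunk = Math.ceil(ctxLen / numSplits);
  const parts = [];
  for (let s = 0; s < ctxLen; s += chunk) {
    parts.push(partialAttention(q, K, V, s, Math.min(ctxLen, s + chunk), headDim));
  }
  return mergeSplits(parts, headDim);
}
```

Each split only needs O(headDim) state regardless of context length, which is why working memory does not grow with `maxCtx`. The merged result matches a single-pass softmax up to floating-point rounding; summation order differs between split counts, so this sketch makes no bit-exactness claim across them.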

Defaults are 8192 / 8192 — ample for the bug-bounty triage adapter (its chain-of-thought runs a few thousand tokens) at a modest footprint. Raise them toward the base model's 128K as memory allows. The KV cache is the cost, and it grows linearly (~72 KB per token of context, f32, across all 36 layers):

context (maxCtx)           KV cache (f32)
8 192 (default)            ~0.6 GB
16 384                     ~1.2 GB
32 768 (sliding window)    ~2.4 GB
131 072 (max positions)    ~9.4 GB
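The per-token cost can be reproduced from the model shape. This sketch assumes the public Qwen2.5-3B configuration (36 layers, 2 key/value heads, head dimension 128) and 4-byte f32 entries. The helper name and option names are hypothetical and not part of the runtime.

```javascript
// Bytes of KV cache needed for a given context length.
// Default shape values are assumptions taken from the public Qwen2.5-3B config.
function kvCacheBytes(maxCtx, opts = {}) {
  const { layers = 36, kvHeads = 2, headDim = 128, bytesPerElem = 4 } = opts;
  const perToken = 2 /* K and V */ * layers * kvHeads * headDim * bytesPerElem;
  return maxCtx * perToken;
}

console.log(kvCacheBytes(1));                     // bytes per token of context
console.log(kvCacheBytes(8192) / 1e9);            // default window, in GB
console.log(kvCacheBytes(8192, { bytesPerElem: 2 }) / 1e9); // same window with f16 entries
```

With these assumed values one token costs 73,728 bytes (72 KiB), which matches the ~72 KB figure quoted above, and the cost scales linearly with `maxCtx`.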

On top of the KV cache come ~2 GB of int4/int8 weights and lazily-sized prefill scratch. Verified in-browser: batched prefill is bit-exact to the sequential path through ctx 1024; the runtime runs end-to-end at 4 096 / 8 192; and a maxCtx: 16384 build prefills a 9 000-token prompt and decodes past it. (KV is f32 today; storing it as f16 would halve these numbers.)

Note on weights

This page hosts no multi-GB weights. Emberglass is the engine; it is bring-your-own-model. Point it at a Qwen2.5-3B (or compatible) checkpoint served locally and it quantizes to int4 on the way to the GPU. Drag in a PEFT/MLX LoRA adapter to hot-swap a specialization live.

Run it

See https://github.com/maceip/emberglass. Requires a WebGPU browser exposing the subgroups feature. Built and validated on an Apple M5 Max.
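A pre-flight check along these lines can tell a visitor whether their browser qualifies before any weights are loaded. It uses the standard WebGPU calls `requestAdapter`, `adapter.features.has`, and `requestDevice`. The function name, and the choice to pass `navigator.gpu` in as an argument, are illustrative and not taken from the Emberglass source.

```javascript
// Resolves to a GPUDevice with the "subgroups" feature enabled, or rejects
// with a readable reason. Pass navigator.gpu (undefined without WebGPU).
async function requestSubgroupDevice(gpu) {
  if (!gpu) throw new Error("WebGPU is not available in this browser");
  const adapter = await gpu.requestAdapter();
  if (!adapter) throw new Error("No suitable GPU adapter found");
  if (!adapter.features.has("subgroups")) {
    throw new Error('GPU adapter does not expose the "subgroups" feature');
  }
  // Features must be requested explicitly or the device will not enable them.
  return adapter.requestDevice({ requiredFeatures: ["subgroups"] });
}
```

In a page this would be called as `const device = await requestSubgroupDevice(navigator.gpu);`, and the resulting device handed to the `QwenWGPU` constructor shown earlier.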


Built the hard way, on purpose. 🜂
