Running Local AI in a Chrome Extension with Transformers.js: What I Learned Building a Gemma 4 Assistant

I’ve been playing with running local models in the browser for a while, but the Chrome extension context is a different beast. When Hugging Face dropped their Transformers.js demo extension powered by Gemma 4 E2B, I had to dig into the source code to see how they handled the Manifest V3 constraints.

Turns out, the architecture decisions are more interesting than I expected. Here’s what I found useful.

The Big Picture: Three Runtimes, One Brain

The extension splits work across three Chrome runtime contexts, and the key insight is keeping the heavy lifting in the background service worker. The side panel and content script are deliberately thin.

The background service worker is the control plane. It hosts the model pipelines, manages the agent lifecycle, executes tools, and keeps the conversation history. All inference happens here.

The side panel is just the UI layer. It handles chat input/output, streaming updates, and setup controls. It sends typed events to the background and renders what comes back.

The content script is the page bridge. It extracts DOM content and applies highlights, but doesn’t touch models or agent logic.

This split avoids duplicate model loads across tabs, keeps the UI responsive, and respects Chrome’s security boundaries around DOM access. The conversation history lives in the background, not the UI, which means the side panel can be rebuilt without losing state.

Messaging: The Backbone of MV3 Extensions

Once you separate runtimes, messaging becomes everything. The extension uses typed enums for all communication, which I appreciate because debugging untyped message passing is a nightmare.

The side panel talks to the background for model initialization, text generation, and feature extraction. The background pushes download progress and message updates back to the side panel. When the agent needs page data, the background sends tasks to the content script.

The flow is straightforward: the side panel sends AGENT_GENERATE_TEXT, the background appends the message, runs inference, then emits MESSAGES_UPDATE. The side panel re-renders from the updated list. Because the background owns the list and the side panel only renders it, there is a single source of truth and the two can’t drift apart.
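Here is a minimal sketch of what a typed contract for that flow could look like. Only the names AGENT_GENERATE_TEXT and MESSAGES_UPDATE come from the extension; the payload shapes, the handleGenerateText function, and the runInference callback are my own assumptions for illustration, and the real source may structure this differently. A const object stands in for an enum here so the snippet stays plain, erasable TypeScript.

```typescript
// Message names come from the extension; the payload shapes are assumed.
const MessageType = {
  AGENT_GENERATE_TEXT: "AGENT_GENERATE_TEXT",
  MESSAGES_UPDATE: "MESSAGES_UPDATE",
} as const;

type ChatMessage = { role: "user" | "assistant"; content: string };

type GenerateTextRequest = {
  type: typeof MessageType.AGENT_GENERATE_TEXT;
  text: string;
};

type MessagesUpdate = {
  type: typeof MessageType.MESSAGES_UPDATE;
  messages: ChatMessage[];
};

// Hypothetical background-side state: the history lives here, not in the UI.
const history: ChatMessage[] = [];

// Hypothetical handler: append the user message, run inference, then build
// the update that the background would push to the side panel.
async function handleGenerateText(
  request: GenerateTextRequest,
  runInference: (messages: ChatMessage[]) => Promise<string>,
): Promise<MessagesUpdate> {
  history.push({ role: "user", content: request.text });
  const reply = await runInference(history);
  history.push({ role: "assistant", content: reply });
  // Send a copy so the receiver never holds a reference to background state.
  return { type: MessageType.MESSAGES_UPDATE, messages: [...history] };
}
```

In a real extension this handler would sit behind a chrome.runtime.onMessage listener in the background, and the returned update would go out through chrome.runtime.sendMessage. That wiring is left out so the sketch runs outside a browser.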

Model Loading: Two Pipelines, One Host

The extension uses two models with distinct roles:

Text generation: onnx-community/gemma-4-E2B-it-ONNX (q4f16 quantization)
Embeddings: onnx-community/all-MiniLM-L6-v2-ONNX (fp32)

Gemma handles reasoning and tool decisions, while MiniLM generates vector embeddings for semantic search in features like “ask this website” or “find in history”. Smart split.
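To show what the semantic search side amounts to, here is a small sketch of ranking page chunks against a query embedding. It assumes the vectors are already unit length, in which case the dot product equals cosine similarity. The EmbeddedChunk type and the topK function are hypothetical names of mine, not taken from the extension, and the toy two-dimensional vectors stand in for MiniLM's 384-dimensional output.

```typescript
// Hypothetical shape for a chunk of page text plus its embedding.
type EmbeddedChunk = { text: string; vector: number[] };

type ScoredChunk = { text: string; score: number };

// Dot product; equals cosine similarity when both vectors are unit length.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  return a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);
}

// Score every chunk against the query and keep the k best matches.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ text: chunk.text, score: dot(query, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy data: unit-length 2-D vectors instead of real embeddings.
const sampleChunks: EmbeddedChunk[] = [
  { text: "pricing page", vector: [0.6, 0.8] },
  { text: "privacy policy", vector: [0, 1] },
  { text: "install guide", vector: [1, 0] },
];

const bestMatches = topK([1, 0], sampleChunks, 2);
```

A linear scan like this is usually fine for one page's worth of chunks. A large browsing history might call for an index, but that goes beyond what I saw in the source.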

All inference runs in the background service worker. Text generation uses the pipeline API with consistent KV caching enabled by a custom DynamicCache class. Embeddings go through the feature-extraction pipeline with vector normalization.
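For anyone unsure what the normalization step buys you, here is a sketch of the post-processing a sentence-embedding pipeline typically does: average the per-token vectors, then scale the result to unit length. Mean pooling is my assumption (it is the usual choice for all-MiniLM-L6-v2), and the function names are hypothetical. The sketch also ignores padding tokens, which a real pipeline masks out.

```typescript
// Average token vectors into one sentence vector (assumes no padding tokens).
function meanPool(tokenVectors: number[][]): number[] {
  const first = tokenVectors[0];
  if (first === undefined) throw new Error("need at least one token vector");
  const sums = new Array<number>(first.length).fill(0);
  for (const vector of tokenVectors) {
    vector.forEach((value, i) => {
      sums[i] = (sums[i] ?? 0) + value;
    });
  }
  return sums.map((sum) => sum / tokenVectors.length);
}

// Scale a vector to unit length so dot products behave like cosine similarity.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.hypot(...vector);
  return norm === 0 ? vector.slice() : vector.map((value) => value / norm);
}

// Hypothetical helper combining both steps.
function toSentenceEmbedding(tokenVectors: number[][]): number[] {
  return l2Normalize(meanPool(tokenVectors));
}
```

With Transformers.js you would not write this by hand. The feature-extraction pipeline accepts pooling and normalize options, so a call along the lines of extractor(text, { pooling: "mean", normalize: true }) should give the same kind of vector.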

Because models are loaded from the background, artifacts are cached under the extension origin (chrome-extension://) rather than per-website origins. This gives one shared cache for the entire extension install, which is a nice bonus.

The Caching Gotcha

Model lifecycle is explicit: CHECK_MODELS inspects what’s cached and estimates remaining download size. INITIALIZE_MODELS downloads and initializes, emitting progress to the UI. Once loaded, pipeline instances are reused.
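The CHECK_MODELS step boils down to simple bookkeeping, sketched below. The ModelFile and ModelStatus types, the checkModels function, the file names, and the sizes are all hypothetical. I am also assuming the set of cached URLs has been collected beforehand, for example with Cache API lookups.

```typescript
// Hypothetical description of one model artifact and its download size.
type ModelFile = { url: string; sizeBytes: number };

type ModelStatus = { cachedBytes: number; remainingBytes: number; ready: boolean };

// Split the required files into cached and missing, and total up each side.
function checkModels(files: ModelFile[], cachedUrls: Set<string>): ModelStatus {
  let cachedBytes = 0;
  let remainingBytes = 0;
  for (const file of files) {
    if (cachedUrls.has(file.url)) {
      cachedBytes += file.sizeBytes;
    } else {
      remainingBytes += file.sizeBytes;
    }
  }
  // Ready only when every required file is already in the cache.
  const ready = files.every((file) => cachedUrls.has(file.url));
  return { cachedBytes, remainingBytes, ready };
}

// Made-up file names and sizes, purely for illustration.
const requiredFiles: ModelFile[] = [
  { url: "model/config.json", sizeBytes: 100 },
  { url: "model/weights.onnx", sizeBytes: 250 },
  { url: "model/tokenizer.json", sizeBytes: 50 },
];

const modelStatus = checkModels(
  requiredFiles,
  new Set(["model/config.json", "model/tokenizer.json"]),
);
```

The side panel can use remainingBytes to tell the user how much is left to download before they commit to it.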

But here’s the thing about Manifest V3 service workers: Chrome can shut them down when they go idle and start them again later. The model runtime state must be treated as recoverable. You can’t assume the pipeline will still be in memory when the user comes back to the extension.

The extension handles this by checking model status on initialization and re-initializing, downloading again only what is missing from the cache. It’s not elegant, but it works. If you’re building something similar, plan for this from the start.
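One way to plan for it is a lazy, memoized loader: nothing is held at startup, the first caller triggers the load, concurrent callers share the same promise, and a failed load clears the slot so the next call can try again. This is a sketch of the general pattern, not the extension's code; createLazy and its reset method are hypothetical names. When the service worker is restarted, the module is evaluated again and the slot starts out empty, which reset simulates here.

```typescript
type Loader<T> = () => Promise<T>;

type Lazy<T> = { get: () => Promise<T>; reset: () => void };

// Wrap an expensive async loader so it runs at most once per worker lifetime.
function createLazy<T>(load: Loader<T>): Lazy<T> {
  let pending: Promise<T> | null = null;
  return {
    get(): Promise<T> {
      if (pending !== null) return pending;
      const attempt = load().catch((error: unknown) => {
        pending = null; // a failed load must not be cached forever
        throw error;
      });
      pending = attempt;
      return attempt;
    },
    // Simulates the state loss that happens when the worker is shut down.
    reset(): void {
      pending = null;
    },
  };
}
```

In the extension, the loader passed to createLazy would presumably wrap the Transformers.js pipeline call for the Gemma model. Every message handler then awaits get() instead of assuming the pipeline is already in memory.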

What I’d Do Differently

A few things stood out as potential pain points:

No fallback for slow downloads. If the model download takes too long or fails midway, the UI just keeps showing the progress indicator. A retry mechanism or offline fallback would be nice.
Memory pressure. Running a 4-bit quantized Gemma model plus an embedding model in a service worker is fine for short sessions, but I wonder about memory leaks over extended use. The DynamicCache class helps, but I’d add explicit cleanup hooks.
Content script communication. The current pattern works, but extracting the full page DOM can be slow on complex pages. A streaming extraction approach might be better for large documents.
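For the download point, this is roughly the retry helper I have in mind: exponential backoff around any async task. The extension does not ship this; withRetry, RetryOptions, and the injectable sleep parameter are hypothetical and exist mostly so the helper is easy to test.

```typescript
type RetryOptions = {
  attempts: number; // total tries, including the first one
  baseDelayMs: number; // delay before the second try; doubles each time
  sleep?: (ms: number) => Promise<void>; // injectable for tests
};

// Run a task, retrying with exponential backoff until it succeeds
// or the attempt budget is used up.
async function withRetry<T>(task: () => Promise<T>, options: RetryOptions): Promise<T> {
  if (options.attempts < 1) throw new Error("attempts must be at least 1");
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < options.attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Back off before the next try, but not after the final failure.
      if (attempt < options.attempts - 1) {
        await sleep(options.baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Wrapping the model initialization in something like withRetry(initialize, { attempts: 3, baseDelayMs: 1000 }) would at least let the UI show a clear failure once the budget runs out.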

Bottom Line

This is a solid reference implementation for anyone wanting to run local AI in a Chrome extension. The architecture is clean, the messaging contract is well-defined, and the model caching strategy handles the MV3 quirks reasonably well.

If you’re building something similar, steal the messaging pattern and the model lifecycle management. Just be ready to handle service worker restarts and plan for memory management from day one.