Gemma 4 E2B IT q4f16_1 MLC (Experimental)

Experimental text-first MLC/WebLLM packaging of google/gemma-4-E2B-it in q4f16_1 for browser-local WebGPU and MLC-LLM runtimes.

This repository is a custom build from a local mlc-llm / TVM fork. It is not an official mlc-ai release, and runtime validation is still ongoing.

Model Details

  • Base model: google/gemma-4-E2B-it
  • Quantization: q4f16_1
  • Runtime target: webgpu
  • Model type: gemma4
  • Conversation template: gemma_instruction
  • Packaged context window: 4096
  • Sliding window size: 512
  • Prefill chunk size: 1024
  • Parameter shards: 42
  • Quantized parameter size: 2.453 GB
  • Total artifact size: 2.676 GB
  • WebGPU model library: 8.7 MB
  • Release manifest: release-manifest.json
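The context and chunking values above are carried in mlc-chat-config.json. A sketch of the relevant fragment, assuming standard MLC field names (the actual file contains additional fields):

{
  "model_type": "gemma4",
  "quantization": "q4f16_1",
  "conv_template": "gemma_instruction",
  "context_window_size": 4096,
  "sliding_window_size": 512,
  "prefill_chunk_size": 1024
}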

q4f16_1 is used here as the default browser-focused MLC target: 4-bit weights with FP16 runtime dtype.

Usage

Chat

mlc_llm chat HF://welcoma/gemma-4-E2B-it-q4f16_1-MLC

WebLLM Integration

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const repo = "https://huggingface.co/welcoma/gemma-4-E2B-it-q4f16_1-MLC";

const appConfig = {
  model_list: [
    {
      model: repo,
      model_id: "gemma-4-E2B-it-q4f16_1-MLC",
      model_lib: `${repo}/resolve/main/libs/gemma-4-E2B-it-q4f16_1-MLC-webgpu.wasm`,
      required_features: ["shader-f16"],
    },
  ],
};

const engine = await CreateMLCEngine("gemma-4-E2B-it-q4f16_1-MLC", {
  appConfig,
});
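Once the engine has resolved, prompts go through WebLLM's OpenAI-style chat.completions API. A minimal helper sketch (the temperature and token limit here are illustrative, not values from this package):

```javascript
// Send a single user prompt to a WebLLM engine and return the reply text.
// Works with any object exposing the OpenAI-style chat.completions.create API.
async function ask(engine, prompt) {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    temperature: 0.7,
    max_tokens: 256,
  });
  return reply.choices[0].message.content;
}

// Browser usage, after CreateMLCEngine has resolved:
// const text = await ask(engine, "Summarize WebGPU in one sentence.");
```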

Files

  • mlc-chat-config.json: MLC runtime configuration
  • tokenizer.json: fast tokenizer data
  • tokenizer_config.json: tokenizer settings
  • params_shard_*.bin: quantized parameter shards
  • tensor-cache.json: tensor metadata cache
  • release-manifest.json: file inventory with SHA-256 hashes
  • libs/gemma-4-E2B-it-q4f16_1-MLC-webgpu.wasm: compiled WebGPU model library
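The SHA-256 hashes in release-manifest.json can be checked after download. A Node.js sketch; the manifest's field names (files, path, sha256) are assumptions here, so adjust them to the actual schema:

```javascript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Compute the SHA-256 of a downloaded file as a lowercase hex string.
function sha256Hex(path) {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

// Verify every entry in the manifest; entries are assumed to look like
// { "path": "params_shard_0.bin", "sha256": "ab12..." }.
function verifyManifest(manifestPath, dir = ".") {
  const manifest = JSON.parse(readFileSync(manifestPath, "utf8"));
  return manifest.files.every(
    (f) => sha256Hex(`${dir}/${f.path}`) === f.sha256
  );
}
```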

Limitations

  • This package is experimental. Successful artifact conversion does not guarantee stable browser runtime behavior on every device.
  • This package is text-first. It does not expose Gemma 4 image or audio features in WebLLM.
  • The packaged context window is 4096, even though the base model advertises a much larger native context length.
  • This build currently depends on a Gemma 4 patchset on top of mlc-llm; upstream support is still in progress.
  • Browser use still requires WebGPU and shader-f16 support.
  • Built-in WebLLM registration and mlc-ai namespace publication are intentionally deferred until runtime stability is demonstrated.
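WebGPU and shader-f16 availability can be probed before attempting to load the model. A sketch, where the feature name matches the required_features entry used in the WebLLM configuration above:

```javascript
// Decide whether a GPU adapter can run this build: every required
// WebGPU feature (here just "shader-f16") must be in the adapter's set.
function supportsRequiredFeatures(adapterFeatures, required = ["shader-f16"]) {
  const available = new Set(adapterFeatures);
  return required.every((f) => available.has(f));
}

// Browser usage:
// const adapter = await navigator.gpu?.requestAdapter();
// const ok = adapter && supportsRequiredFeatures(adapter.features);
```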

License

This repository follows the same license and usage terms as the base model, google/gemma-4-E2B-it.
