prince-canuma committed · verified
Commit 5be8bc3 · 1 Parent(s): 733f466

Upload MLX bf16 drafter from google/gemma-4-E4B-it-assistant
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ ocdbt.process_0/d/34e15ebf6483f34716b5e7e56e8eb731 filter=lfs diff=lfs merge=lfs -text
+ ocdbt.process_0/d/55650261a9ee8547ae9c29ce7e2e2f7e filter=lfs diff=lfs merge=lfs -text
+ ocdbt.process_0/d/82cf1abb2269c124294bfd46f896cacf filter=lfs diff=lfs merge=lfs -text
+ ocdbt.process_0/d/f03474ce9d620e5ab887fd1d181487a4 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,180 @@
+ ---
+ base_model: google/gemma-4-E4B-it-assistant
+ library_name: mlx
+ license: gemma
+ pipeline_tag: text-generation
+ tags:
+ - mlx
+ - speculative-decoding
+ - mtp
+ - gemma
+ - drafter
+ ---
+ # mlx-community/gemma-4-E4B-it-assistant-bf16
+
+ This model was converted to MLX format from [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant) using mlx-vlm version **0.4.5**.
+ Refer to the [original model card](https://huggingface.co/google/gemma-4-E4B-it-assistant) for more details on the model.
+
+ ## Use with mlx
+
+ ```bash
+ pip install -U mlx-vlm
+ ```
+
+ ```bash
+ python -m mlx_vlm.generate --model mlx-community/gemma-4-E4B-it-assistant-bf16 --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>
+ ```
+
+ ---
+
+ # Gemma 4 Assistant Drafter (MTP)
+
+ MLX port of Google's Gemma 4 **Multi-Token Prediction (MTP)** drafter for
+ speculative decoding. Reference:
+ [ai.google.dev/gemma/docs/mtp](https://ai.google.dev/gemma/docs/mtp/mtp).
+
+ ## What it is
+
+ A small, 4-layer "assistant" model trained to draft several candidate tokens
+ per round; the full Gemma 4 target then verifies them in a single forward
+ pass. Accepted tokens advance; the first rejected token and everything after
+ it are discarded. Quality matches the target at temperature 0 (byte-identical
+ greedy output).
+
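+ For intuition, here is a minimal sketch of the greedy accept/reject rule for
+ one round (illustrative only; the names and shapes are not the mlx-vlm API):
+
+ ```python
+ def verify_greedy(draft_tokens: list[int], target_pred: list[int]) -> list[int]:
+     """Greedy verification of one speculative round.
+
+     target_pred[i] is the target's argmax at draft position i; it carries one
+     extra trailing entry, which supplies the "bonus" token when every draft
+     is accepted.
+     """
+     out = []
+     for i, tok in enumerate(draft_tokens):
+         if target_pred[i] != tok:
+             # First mismatch: keep the target's own token, drop the rest.
+             return out + [target_pred[i]]
+         out.append(tok)
+     # Every draft accepted: the verification pass still yields a bonus token.
+     return out + [target_pred[-1]]
+ ```
+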
+ The drafter is tightly coupled to the target's internals:
+
+ - **KV-cache sharing** — every drafter layer is `is_kv_shared_layer=True` and
+   reads K/V from the target's last full-attention and last sliding-attention
+   layers. The drafter has **no KV cache of its own**; its only recurrent
+   state is the target's last hidden state, projected through `post_projection`.
+ - **Cross-attention from a constant position** — the drafter's queries are
+   RoPE-rotated at the bonus token's absolute position and held constant
+   across all draft steps within a block.
+ - **Hidden+token concatenation** — the drafter input at each step is
+   `concat([target_embed(last_token), last_hidden_state], dim=-1)` of shape
+   `[B, 1, 2 * backbone_hidden_size]`, projected to the drafter hidden size by
+   `pre_projection` (see the sketch after this list).
+
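+ A minimal sketch of that input construction, with stand-in tensors and sizes
+ taken from this repo's `config.json` (`backbone_hidden_size=2560`, drafter
+ `hidden_size=256`); the `pre_projection` here is a plain `nn.Linear`, not the
+ repo's implementation:
+
+ ```python
+ import mlx.core as mx
+ import mlx.nn as nn
+
+ B, H_BACKBONE, H_DRAFT = 1, 2560, 256          # per this repo's config.json
+ pre_projection = nn.Linear(2 * H_BACKBONE, H_DRAFT)
+
+ # Stand-ins for the target's outputs at the bonus token:
+ tok_embed = mx.zeros((B, 1, H_BACKBONE))       # target_embed(last_token)
+ last_hidden_state = mx.zeros((B, 1, H_BACKBONE))
+
+ x = mx.concatenate([tok_embed, last_hidden_state], axis=-1)  # [B, 1, 2 * H_BACKBONE]
+ x = pre_projection(x)                          # [B, 1, H_DRAFT]
+ print(x.shape)                                 # (1, 1, 256)
+ ```
+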
+ ## Supported pairings
+
+ | Target | Drafter | LM head |
+ | ------------------------------------- | ---------------------------------------------------- | ---------------------- |
+ | `mlx-community/gemma-4-E2B-it-bf16` | `mlx-community/gemma-4-E2B-it-assistant-bf16` | centroid (sparse) |
+ | `mlx-community/gemma-4-E4B-it-bf16` | `mlx-community/gemma-4-E4B-it-assistant-bf16` | centroid (sparse) |
+ | `mlx-community/gemma-4-26B-A4B-it-bf16` | `mlx-community/gemma-4-26B-A4B-it-assistant-bf16` | tied dense |
+ | `mlx-community/gemma-4-31B-it-bf16` | `mlx-community/gemma-4-31B-it-assistant-bf16` | tied dense |
+
+ For the E2B / E4B drafters, `use_ordered_embeddings=True` and the LM head is a
+ **centroid-routed sparse softmax** (`MaskedEmbedder`): the drafter scores
+ 2048 token clusters, materialises the tokens of the top-K (default 32)
+ clusters (~4096 of 262144), and scatters those logits back into a full-vocab
+ tensor — non-selected positions are filled with `min(selected) - 1` so they
+ lose any argmax / sampling competition.
+
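+ A hypothetical sketch of the routing idea (not the `MaskedEmbedder` source;
+ it assumes ordered embeddings make each cluster a contiguous id range):
+
+ ```python
+ import mlx.core as mx
+
+ V, C, K = 262144, 2048, 32         # vocab, centroids, top-K clusters (per config)
+ PER_CLUSTER = V // C               # 128 tokens per cluster -> K * 128 = 4096
+
+ def sparse_logits(h, centroid_w, embed_w):
+     # h: [D] drafter hidden; centroid_w: [C, D]; embed_w: [V, D], cluster-ordered.
+     cluster_scores = centroid_w @ h                          # [C]
+     top = mx.argpartition(-cluster_scores, K)[:K]            # top-K cluster ids
+     token_ids = (top[:, None] * PER_CLUSTER
+                  + mx.arange(PER_CLUSTER)[None, :]).reshape(-1)  # [K * 128]
+     selected = embed_w[token_ids] @ h                        # sparse logits
+     # Scatter back; everything unselected sits just below the selected
+     # minimum, so it can never win argmax or sampling.
+     full = mx.full((V,), mx.min(selected) - 1)
+     full[token_ids] = selected
+     return full
+ ```
+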
+ ## Files
+
+ - `config.py` — `Gemma4AssistantConfig` (HF-compatible, flattened).
+ - `gemma4_assistant.py` — `Gemma4AssistantDraftModel` (forward, `bind`,
+   `set_shared_kv`, `draft_block`, `sanitize`).
+ - `masked_embedder.py` — centroid-routed sparse LM head for E2B / E4B.
+ - `masks.py` — bidirectional full / SWA masks for the drafter forward.
+ - `parity_check.py` — fake-target smoke test.
+
+ ## Usage
+
+ The drafter is auto-discovered via its HF `model_type == "gemma4_assistant"`;
+ just pass `--draft-model` and `--draft-kind mtp` to `mlx_vlm.generate`:
+
+ ```bash
+ uv run python -m mlx_vlm.generate \
+   --model mlx-community/gemma-4-31B-it-bf16 \
+   --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
+   --draft-kind mtp \
+   --draft-block-size 4 \
+   --prompt "Explain speculative decoding in 3 sentences." \
+   --max-tokens 256 --temp 0
+ ```
+
+ `--draft-block-size` is the number of speculatively drafted tokens per
+ round (Google calls this `num_assistant_tokens`). The first token of the
+ block is the most recently accepted bonus token, so the drafter actually
+ generates `block_size - 1` candidates each round; `--draft-block-size 4`
+ means one bonus token plus three drafted candidates.
+
+ Programmatic use:
+
+ ```python
+ from mlx_vlm.utils import load
+ from mlx_vlm.speculative.drafters import load_drafter
+ from mlx_vlm.generate import generate_step
+
+ model, processor = load("mlx-community/gemma-4-31B-it-bf16")
+ drafter = load_drafter("mlx-community/gemma-4-31B-it-assistant-bf16", kind="mtp")
+
+ # input_ids: token ids for the prompt, e.g. from the processor's tokenizer.
+ for tok, _ in generate_step(
+     input_ids, model, None, None,
+     max_tokens=256,
+     draft_model=drafter,
+     draft_kind="mtp",
+     draft_block_size=4,
+ ):
+     ...
+ ```
+
+ ## Performance
+
+ Measured on Apple Silicon (M3 Max, 96 GB RAM) with a 17-token prompt,
+ 64–96 max tokens, greedy decoding (`temp=0`); output is byte-identical to
+ the no-drafter baseline.
+
+ Best `block_size` per (target, batch):
+
+ | Target | Batch | Best block size | Total tok/s | Speedup vs. no drafter |
+ |---------|-------|-----------------|-------------|------------------------|
+ | 26B-A4B | 4 | 3 | 85.5 | **3.94×** |
+ | 26B-A4B | 8 | 3 | 165.1 | **1.55×** |
+ | 31B | 4 | 3 | 17.1 | **2.29×** |
+ | 31B | 8 | 2 | 21.4 | **1.41×** |
+ | E4B | 4 | 4 | 62.1 | **1.56×** |
+ | E4B | 8 | 2 | 115.9 | 1.07× |
+ | E4B | 16 | — | — | drafter slower (≤1.0×) |
+
+ The drafter is most attractive on large, slow targets (26B-A4B, 31B), where
+ target forward time dominates. On the small E4B target, the target forward
+ is already cheap, and at high batch sizes the drafter's per-step overhead
+ exceeds the speedup it buys.
+
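+ One way to read these numbers is a back-of-envelope round model (illustrative,
+ not code from this repo): each round costs `block_size - 1` drafter steps plus
+ one target verification pass, and yields one token per accepted draft plus the
+ bonus token:
+
+ ```python
+ def expected_speedup(t_target, t_draft, block_size, accept_rate):
+     """Rough per-round cost model for MTP speculative decoding.
+     t_target: target forward time, t_draft: drafter step time,
+     accept_rate: per-token chance a drafted token matches the target."""
+     drafts = block_size - 1
+     # A draft is only kept if every draft before it was kept too.
+     expected_accepted = sum(accept_rate ** (i + 1) for i in range(drafts))
+     tokens_per_round = expected_accepted + 1           # accepted + bonus
+     round_time = drafts * t_draft + t_target
+     return (tokens_per_round * t_target) / round_time  # baseline: 1 token / t_target
+
+ # E.g. a slow target (60 ms), cheap drafter (2 ms), block 4, 80% acceptance:
+ # expected_speedup(0.060, 0.002, 4, 0.8) ≈ 2.7x, drafter overhead barely matters.
+ # With a fast target (8 ms) the same drafter only gets ≈ 1.7x.
+ ```
+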
+ Reproduce with `scripts/mtp_batch_sweep.py`:
+
+ ```bash
+ uv run python scripts/mtp_batch_sweep.py \
+   --model mlx-community/gemma-4-26B-A4B-it-bf16 \
+   --drafter mlx-community/gemma-4-26B-A4B-it-assistant-bf16 \
+   --batch-sizes 4 8 --block-sizes 2 3 --max-tokens 64
+ ```
+
+ ## Smoke test
+
+ ```bash
+ uv run python -m mlx_vlm.speculative.drafters.gemma4_assistant.parity_check \
+   --drafter mlx-community/gemma-4-E4B-it-assistant-bf16
+ ```
+
+ Expect `forward OK: logits shape=(1, 1, 262144) ...` and
+ `draft_block OK: tokens shape=(1, 3) ...`. For drafters with the centroid
+ LM head, the parity check exercises `MaskedEmbedder` end-to-end (most
+ positions land at the sparse `min - 1` floor).
+
+ ## Caveats
+
+ - **Sampling.** Greedy (temp 0) is verified byte-identical. Stochastic
+   sampling works, but acceptance rates drop because drafter and target
+   draws diverge.
+ - **Multimodal prompts.** Image / audio prefill runs through the target
+   unchanged; speculative decoding only kicks in on the text-decode tail,
+   so multimodal works, but the drafter only ever sees text tokens.
+ - **Sliding-window masks.** The bidirectional SWA mask in `masks.py`
+   short-circuits to `None` when `kv_len <= sliding_window`, which is the
+   only regime `RotatingKVCache` ever produces (see the sketch after this
+   list). Long-prompt mask paths are effectively dead code today.
+ - **Batched generation.** Continuous-batching support is in
+   `_mtp_rounds_batch` (`mlx_vlm/generate.py`). For targets whose KV caches
+   don't implement `.filter()`, finished rows are kept in the batch and
+   simply stop emitting, so throughput doesn't shrink as rows retire.
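+
+ A plausible shape of the short-circuit described in the sliding-window caveat
+ (a sketch; not the actual `masks.py` source):
+
+ ```python
+ import mlx.core as mx
+
+ def bidirectional_swa_mask(q_len: int, kv_len: int, sliding_window: int):
+     # Short-circuit: if the whole KV fits inside the window, no mask is
+     # needed. RotatingKVCache keeps kv_len <= sliding_window, so in
+     # practice this branch is always taken.
+     if kv_len <= sliding_window:
+         return None
+     # Effectively dead code today: a bidirectional band mask.
+     q_pos = mx.arange(kv_len - q_len, kv_len)[:, None]
+     kv_pos = mx.arange(kv_len)[None, :]
+     return mx.abs(q_pos - kv_pos) < sliding_window
+ ```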
config.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "architectures": [
+     "Gemma4AssistantForCausalLM"
+   ],
+   "audio_token_id": 258881,
+   "backbone_hidden_size": 2560,
+   "boa_token_id": 256000,
+   "boi_token_id": 255999,
+   "centroid_intermediate_top_k": 32,
+   "dtype": "bfloat16",
+   "eoa_token_id": 258883,
+   "eoi_token_id": 258882,
+   "image_token_id": 258880,
+   "model_type": "gemma4_assistant",
+   "num_centroids": 2048,
+   "text_config": {
+     "_name_or_path": "",
+     "architectures": null,
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "attention_k_eq_v": false,
+     "bos_token_id": 2,
+     "chunk_size_feed_forward": 0,
+     "dtype": "bfloat16",
+     "enable_moe_block": false,
+     "eos_token_id": 1,
+     "final_logit_softcapping": null,
+     "global_head_dim": 512,
+     "head_dim": 256,
+     "hidden_activation": "gelu_pytorch_tanh",
+     "hidden_size": 256,
+     "hidden_size_per_layer_input": 0,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 2048,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_types": [
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention"
+     ],
+     "max_position_embeddings": 131072,
+     "model_type": "gemma4_text",
+     "moe_intermediate_size": null,
+     "num_attention_heads": 4,
+     "num_experts": null,
+     "num_global_key_value_heads": null,
+     "num_hidden_layers": 4,
+     "num_key_value_heads": 2,
+     "num_kv_shared_layers": 4,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "pad_token_id": 0,
+     "problem_type": null,
+     "return_dict": true,
+     "rms_norm_eps": 1e-06,
+     "rope_parameters": {
+       "full_attention": {
+         "partial_rotary_factor": 0.25,
+         "rope_theta": 1000000.0,
+         "rope_type": "proportional"
+       },
+       "sliding_attention": {
+         "rope_theta": 10000.0,
+         "rope_type": "default"
+       }
+     },
+     "sliding_window": 512,
+     "tie_word_embeddings": true,
+     "top_k_experts": null,
+     "use_bidirectional_attention": null,
+     "use_cache": true,
+     "use_double_wide_mlp": false,
+     "vocab_size": 262144,
+     "vocab_size_per_layer_input": 0
+   },
+   "tie_word_embeddings": true,
+   "transformers_version": "5.7.0.dev0",
+   "use_ordered_embeddings": true
+ }
generation_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "bos_token_id": 2,
+   "do_sample": true,
+   "eos_token_id": [
+     1,
+     106,
+     50
+   ],
+   "is_assistant": true,
+   "num_assistant_tokens": 6,
+   "num_assistant_tokens_schedule": "constant",
+   "pad_token_id": 0,
+   "temperature": 1.0,
+   "top_k": 64,
+   "top_p": 0.95,
+   "transformers_version": "5.7.0.dev0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:12875062fc25c51e8fa9b62abd2de7ad48b7d63f8559d5d604fbd5a3d6bcff16
+ size 159138208
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:75a6583c1a418e2bbd79c60d95d28e0f5bf549ad3f2990b5bdb5238c6c2bf70c
+ size 32169440
tokenizer_config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "audio_token": "<|audio|>",
+   "backend": "tokenizers",
+   "boa_token": "<|audio>",
+   "boi_token": "<|image>",
+   "bos_token": "<bos>",
+   "eoa_token": "<audio|>",
+   "eoc_token": "<channel|>",
+   "eoi_token": "<image|>",
+   "eos_token": "<eos>",
+   "eot_token": "<turn|>",
+   "escape_token": "<|\"|>",
+   "etc_token": "<tool_call|>",
+   "etd_token": "<tool|>",
+   "etr_token": "<tool_response|>",
+   "extra_special_tokens": [],
+   "image_token": "<|image|>",
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "padding_side": "left",
+   "soc_token": "<|channel>",
+   "sot_token": "<|turn>",
+   "stc_token": "<|tool_call>",
+   "std_token": "<|tool>",
+   "str_token": "<|tool_response>",
+   "think_token": "<|think|>",
+   "tokenizer_class": "GemmaTokenizer",
+   "unk_token": "<unk>"
+ }