davide221 committed on
Commit 9a280bb (verified) · 1 Parent(s): c4dff4c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ license: apache-2.0
+ library_name: gguf
+ base_model: z-lab/gemma-4-31B-it-DFlash
+ tags:
+ - gemma
+ - gemma-4
+ - dflash
+ - speculative-decoding
+ - block-diffusion
+ - draft-model
+ - gguf
+ - q8_0
+ ---
+
+ # gemma-4-31B-it DFlash Draft: Q8_0 GGUF
+
+ Q8_0 GGUF quantization of the [z-lab/gemma-4-31B-it-DFlash](https://huggingface.co/z-lab/gemma-4-31B-it-DFlash) draft model, produced for the [Lucebox dflash engine](https://github.com/Luce-Org/lucebox-hub) (speculative decoding for `google/gemma-4-31B-it`).
+
+ - **Source**: [z-lab/gemma-4-31B-it-DFlash](https://huggingface.co/z-lab/gemma-4-31B-it-DFlash) (BF16 safetensors)
+ - **Tool**: `quantize_gemma_dflash_q8.py` (parameterized variant of `dflash/scripts/quantize_draft_q8.py`)
+ - **Size**: 1.63 GB (53% of BF16)
+ - **Arch**: `gemma4-dflash-draft`
+ - **Layers**: 5
+ - **Hidden**: 5376
+ - **n_head / n_head_kv**: 64 / 8
+ - **head_dim**: 128
+ - **intermediate_size**: 10752
+ - **vocab_size**: 262144
+ - **rope_theta**: 1e6, **rms_eps**: 1e-6
+ - **sliding_window**: 2048, **final_logit_softcapping**: 30.0
+ - **DFlash**: `n_target_layers=60`, `target_layer_ids=[1,12,23,35,46,57]`, `block_size=16`, `mask_token_id=4`
+ - **Tensors**: projection weights → Q8_0, norm weights → F32 (precision-critical, tiny)
+
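The 53% figure in the **Size** bullet is consistent with how ggml stores Q8_0: weights are grouped into blocks of 32, and each block holds 32 int8 values plus one fp16 scale, so 34 bytes per 32 weights (8.5 bits per weight against 16 for BF16). The sketch below reproduces that arithmetic and round-trips a single block. It is an illustration in plain Python, not the code in `quantize_gemma_dflash_q8.py`; the helper names are made up for this example, and Python's `round` breaks ties differently from ggml's `roundf`.

```python
import struct

QK8_0 = 32  # weights per Q8_0 block in ggml


def q8_0_ratio_vs_bf16() -> float:
    """Storage ratio of Q8_0 relative to BF16 for a fully quantized tensor."""
    bytes_per_block = 2 + QK8_0              # one fp16 scale + 32 int8 quants
    bits_per_weight = bytes_per_block * 8 / QK8_0
    return bits_per_weight / 16              # BF16 uses 16 bits per weight


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(values):
    """Quantize one block of 32 floats to (fp16 scale, list of int8 quants)."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0
    quants = [round(v * inv) for v in values]  # each lies in [-127, 127]
    return to_fp16(scale), quants


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


print(q8_0_ratio_vs_bf16())  # 0.53125
```

The norm weights kept in F32 take slightly more space than this ratio predicts, but they are small enough that the overall file still lands near 53% of the BF16 size.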
+ ## Pairs with
+
+ - Target: [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (run as Q4_K_M GGUF via `unsloth/gemma-4-31B-it-GGUF` or `bartowski/google_gemma-4-31B-it-GGUF`)
+
+ ## Notes
+
+ - The dflash-specific singleton tensors `dflash.fc.weight` and `dflash.hidden_norm.weight` bridge the target's hidden states into the draft. Do **not** re-quantize this file with stock `llama-quantize`, because it strips these tensors. Use the script listed under **Tool** instead.
+ - The `lucebox-hub` loader does not yet support the `gemma4-dflash-draft` arch; adding it is the next step after PR #232 (gemma4 target adapter).
+
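Because a stripped file would be missing the two singleton tensors named in the first note, it can be worth checking for them after any conversion. The following is a minimal sketch, assuming the `gguf` Python package (gguf-py, the same package the conversion command puts on `PYTHONPATH`) and its `GGUFReader` class; the function names are illustrative and not part of `lucebox-hub`.

```python
REQUIRED_DFLASH_TENSORS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def missing_dflash_tensors(tensor_names, required=REQUIRED_DFLASH_TENSORS):
    """Return the required tensor names that are absent from tensor_names."""
    present = set(tensor_names)
    return [name for name in required if name not in present]


def check_gguf(path):
    """List the dflash singleton tensors missing from the GGUF file at path."""
    # Imported lazily so the pure helper above works without gguf-py installed.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return missing_dflash_tensors(t.name for t in reader.tensors)
```

An empty list from `check_gguf("gemma-4-31B-it-DFlash-q8_0.gguf")` would indicate that both bridge tensors are still present.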
+ ## Usage
+
+ ```bash
+ # With dflash_server (once gemma4-dflash-draft arch is wired in the loader)
+ dflash_server gemma-4-31B-it-Q4_K_M.gguf --draft gemma-4-31B-it-DFlash-q8_0.gguf
+ ```
+
+ ## Conversion command
+
+ ```bash
+ PYTHONPATH=lucebox-hub/dflash/deps/llama.cpp/gguf-py \
+   python3 quantize_gemma_dflash_q8.py \
+   gemma-4-31B-it-DFlash/ \
+   gemma-4-31B-it-DFlash-q8_0.gguf \
+   --name gemma-4-31B-it-DFlash-Q8_0
+ ```
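The README summarizes the script's per-tensor policy as projection weights to Q8_0 and norm weights to F32. The actual selection logic of `quantize_gemma_dflash_q8.py` is not shown here, so the sketch below is only one plausible way to express such a rule. It assumes norms can be recognized by name or by being one-dimensional, that shapes are given NumPy-style with the contiguous row last, and that tensors whose rows are not a multiple of the Q8_0 block size fall back to F32. The function name and the example shapes are hypothetical.

```python
QK8_0 = 32  # Q8_0 block size; row length must be a multiple of this


def choose_tensor_type(name: str, shape: tuple) -> str:
    """Hypothetical policy: keep norms and 1-D tensors in F32, quantize the rest."""
    if "norm" in name or len(shape) < 2:
        return "F32"    # small, precision-critical tensors
    if shape[-1] % QK8_0 != 0:
        return "F32"    # assumed fallback: row cannot be split into whole blocks
    return "Q8_0"       # 2-D projection weights


# Example with illustrative shapes (not read from the real checkpoint):
print(choose_tensor_type("dflash.hidden_norm.weight", (5376,)))  # F32
print(choose_tensor_type("dflash.fc.weight", (5376, 5376)))      # Q8_0
```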