gemma-4-31B-it DFlash Draft: Q8_0 GGUF

Q8_0 GGUF quantization of the z-lab/gemma-4-31B-it-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for google/gemma-4-31B-it).

  • Source: z-lab/gemma-4-31B-it-DFlash (BF16 safetensors)
  • Tool: quantize_gemma_dflash_q8.py (parameterized variant of dflash/scripts/quantize_draft_q8.py)
  • Size: 1.63 GB (53% of BF16)
  • Arch: gemma4-dflash-draft
  • Layers: 5
  • Hidden: 5376
  • n_head / n_head_kv: 64 / 8
  • head_dim: 128
  • intermediate_size: 10752
  • vocab_size: 262144
  • rope_theta: 1e6, rms_eps: 1e-6
  • sliding_window: 2048, final_logit_softcapping: 30.0
  • DFlash: n_target_layers=60, target_layer_ids=[1,12,23,35,46,57], block_size=16, mask_token_id=4
  • Tensors: projection weights → Q8_0, norm weights → F32 (precision-critical, tiny)
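
The "53% of BF16" figure follows from the Q8_0 block layout: 32 weights share one fp16 scale, so each block takes 34 bytes, or 8.5 bits per weight against 16 for BF16. The sketch below illustrates that layout in pure Python. It is a simplified illustration of the format, not the code used by quantize_gemma_dflash_q8.py, and the function names are hypothetical.

```python
import math
import struct

QK8_0 = 32  # number of weights per Q8_0 block


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0_block(values):
    """Quantize 32 floats into one fp16 scale plus 32 signed 8-bit codes."""
    if len(values) != QK8_0:
        raise ValueError("a Q8_0 block holds exactly 32 weights")
    amax = max(abs(v) for v in values)
    d = amax / 127.0                      # symmetric scale, no zero point
    inv_d = 1.0 / d if d else 0.0
    # Round half away from zero, as C's roundf does.
    codes = [int(math.copysign(math.floor(abs(v * inv_d) + 0.5), v)) for v in values]
    return to_fp16(d), codes              # the scale is stored as fp16


def dequantize_q8_0_block(scale, codes):
    """Recover approximate weights from one block."""
    return [scale * q for q in codes]


BYTES_PER_BLOCK = 2 + QK8_0                     # fp16 scale + 32 int8 codes
BITS_PER_WEIGHT = 8 * BYTES_PER_BLOCK / QK8_0   # 8.5
RATIO_VS_BF16 = BITS_PER_WEIGHT / 16            # 0.53125
```

The ratio ignores the F32 norm tensors and GGUF metadata, which are small next to the projection weights.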

Pairs with

  • Target: google/gemma-4-31B-it (run as Q4_K_M GGUF via unsloth/gemma-4-31B-it-GGUF or bartowski/google_gemma-4-31B-it-GGUF)

Notes

  • The dflash-specific singletons dflash.fc.weight and dflash.hidden_norm.weight bridge target hidden states into the draft. Do not re-quantize with stock llama-quantize, which strips these tensors. Use quantize_gemma_dflash_q8.py instead.
  • Loader support for gemma4-dflash-draft in lucebox-hub is not available yet; it is the next step after PR #232 (gemma4 target adapter).
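
To confirm that a GGUF file still carries the two singletons named in the first note, one option is to list its tensor names with the GGUFReader class from gguf-py. The sketch below assumes gguf-py is importable (for example via the PYTHONPATH used in the conversion command). The helper names are hypothetical and are not part of the Lucebox tooling.

```python
REQUIRED_SINGLETONS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def missing_dflash_tensors(tensor_names):
    """Return the required dflash tensors absent from tensor_names."""
    present = set(tensor_names)
    return [name for name in REQUIRED_SINGLETONS if name not in present]


def read_tensor_names(gguf_path):
    """List tensor names in a GGUF file using gguf-py's GGUFReader."""
    # Imported lazily so the pure check above works without gguf-py installed.
    from gguf import GGUFReader

    reader = GGUFReader(gguf_path)
    return [tensor.name for tensor in reader.tensors]


def check_file(gguf_path):
    """Raise if the file lacks either dflash singleton."""
    missing = missing_dflash_tensors(read_tensor_names(gguf_path))
    if missing:
        raise RuntimeError("missing dflash tensors: " + ", ".join(missing))
```

Calling check_file("gemma-4-31B-it-DFlash-q8_0.gguf") should return silently for a file produced by the script and raise for one whose dflash tensors were stripped.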

Usage

# With dflash_server (once the gemma4-dflash-draft arch is wired into the loader)
dflash_server gemma-4-31B-it-Q4_K_M.gguf --draft gemma-4-31B-it-DFlash-q8_0.gguf
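
For readers new to speculative decoding, the role of the --draft model can be pictured with a generic greedy draft-and-verify step. This is a textbook-style sketch, not the Lucebox engine's algorithm. The callables target_next and draft_block are hypothetical stand-ins for the two models, and a real engine would verify the whole block in one batched target pass rather than token by token.

```python
def speculative_step(target_next, draft_block, context, block_size=16):
    """Run one greedy draft-and-verify step and return the new tokens.

    target_next(tokens) -> the target model's greedy next token (stand-in).
    draft_block(tokens, n) -> n tokens proposed by the draft model (stand-in).
    """
    proposal = draft_block(context, block_size)
    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every drafted token matched, so the target contributes one more.
    accepted.append(target_next(context + accepted))
    return accepted
```

Because every returned token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the draft only changes how many tokens each step yields.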

Conversion command

PYTHONPATH=lucebox-hub/dflash/deps/llama.cpp/gguf-py \
python3 quantize_gemma_dflash_q8.py \
  gemma-4-31B-it-DFlash/ \
  gemma-4-31B-it-DFlash-q8_0.gguf \
  --name gemma-4-31B-it-DFlash-Q8_0