Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 SVDQuant

SVDQuant-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking. Quantized using NVIDIA ModelOpt 0.42.0 with SVDQuant (SVD decomposition + NVFP4) for maximum quality at 4-bit precision. Calibrated natively on NVIDIA B200 (Blackwell SM 12.0) for hardware-accurate FP4 scale factors.

See also: AWQ_FULL variant — same model quantized with AWQ_FULL instead of SVDQuant.

What is SVDQuant?

SVDQuant uses Singular Value Decomposition to separate weight matrices into two components before quantization:

  1. Outlier channels — high-magnitude weight channels that cause large quantization error are extracted into a low-rank BF16 residual matrix
  2. Cleaned weights — the remaining weights (with outliers removed) are quantized to NVFP4 (E2M1) with dramatically reduced quantization error

This produces higher quality than standard AWQ at the cost of a slightly larger model size (~20.9 GB vs ~20.5 GB) due to the low-rank residual matrices stored in BF16.
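
The E2M1 element format mentioned above (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) can represent only eight distinct magnitudes; the sketch below enumerates them. A per-block scale factor maps real weights onto this grid, which is why a single outlier channel is so costly: it stretches the scale and crushes the resolution left for every other weight.

```python
# Enumerate the positive magnitudes representable by FP4 E2M1
# (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1).
def e2m1_values():
    vals = set()
    for exp in range(4):          # 2 exponent bits
        for man in range(2):      # 1 mantissa bit
            if exp == 0:          # subnormal: man * 2^-1
                vals.add(man * 0.5)
            else:                 # normal: (1 + man/2) * 2^(exp - 1)
                vals.add((1 + man * 0.5) * 2 ** (exp - 1))
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```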

Original Weight Matrix W (BF16)
    |
    v
[SVD Decomposition]
    |
    ├── Low-rank residual R (BF16, rank=32) — captures outlier channels
    └── Cleaned weights W' = W - R
            |
            v
        [NVFP4 Quantization] — much lower error without outliers
            |
            v
        W'_quant (FP4 E2M1)

Inference: output = W'_quant @ x + R @ x

Model Details

| Property | Value |
|---|---|
| Base Model | Gemma 4 31B-it DECKARD HERETIC |
| Architecture | Gemma 4 (dense, 31B parameters) |
| Layers | 60 |
| Max Context | 131,072 tokens |
| Hidden Size | 5376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 SVDQuant (ModelOpt format) |
| SVD Low-Rank | 32 |
| Model Size | ~20.9 GB |
| Calibration Hardware | NVIDIA B200 (native Blackwell FP4) |

Quantization Details

Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
    |
    v
[NVFP4 SVDQuant on B200]
    - ModelOpt 0.42.0 with NVFP4_SVDQUANT_DEFAULT_CFG
    - Low-rank = 32 (BF16 residual matrices)
    - 2048 calibration samples (CNN DailyMail)
    - Native Blackwell FP4 hardware calibration (SM 12.0)
    - Excluded: vision tower, embed_vision, multi_modal_projector
    - Quantization time: ~69 minutes on B200
    |
    v
Gemma-4-31B-DECKARD-HERETIC-NVFP4-SVDQuant (~20.9 GB)

AWQ_FULL vs SVDQuant Comparison

| Metric | AWQ_FULL | SVDQuant |
|---|---|---|
| Technique | Channel scaling + clipping optimization | SVD decomposition + low-rank residual |
| Model Size | ~20.5 GB | ~20.9 GB |
| Quant Time | ~75 min | ~69 min |
| Quality | Excellent | Potentially higher (outliers preserved in BF16) |
| Speed | Slightly faster (smaller) | Slightly slower (low-rank matmul overhead) |
| Best For | Maximum throughput | Maximum quality |

Deployment

vLLM

vllm serve /path/to/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant \
  --served-model-name deckard-svdquant \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

Docker Compose

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=FLASHINFER
    command: >
      --model /models/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant
      --served-model-name deckard-svdquant
      --quantization modelopt
      --dtype auto
      --kv-cache-dtype fp8
      --max-model-len 65536
      --max-num-seqs 8
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
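
With either launch method, the server exposes the standard OpenAI-compatible API on port 8000. A minimal request body is sketched below; field names follow the OpenAI chat completions schema, and the model name matches the `--served-model-name` used above (the server must be running to actually POST it to `http://localhost:8000/v1/chat/completions`):

```python
import json

# Chat completions payload for the vLLM server configured above.
payload = {
    "model": "deckard-svdquant",    # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Summarize SVDQuant in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))
```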

DGX Spark Performance Estimates

| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |

Key Deployment Flags

| Flag | Purpose |
|---|---|
| `--quantization modelopt` | Required: tells vLLM to use the ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8` | Reduces KV cache memory by 2x for longer contexts |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks for thinking/reasoning display |
| `--tool-call-parser gemma4` | Enables native function calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks to avoid OOM |
| `--enable-prefix-caching` | Caches common prompt prefixes for faster responses |
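
The 2x saving from `--kv-cache-dtype fp8` can be sized from the model table above. The sketch below assumes head_dim = hidden_size / attention_heads = 5376 / 32 = 168 (an assumption for illustration; the model's actual head dimension may differ) and the 65,536-token `--max-model-len` used in the launch commands:

```python
layers, kv_heads = 60, 16
head_dim = 5376 // 32           # assumption: hidden_size / num_attention_heads
tokens = 65536                  # --max-model-len from the vLLM command

def kv_bytes(bytes_per_elem):
    # K and V, per layer, per KV head, per head dim, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

gib = 1024 ** 3
print(f"FP16 KV cache @ 64k ctx: {kv_bytes(2) / gib:.1f} GiB")
print(f"FP8  KV cache @ 64k ctx: {kv_bytes(1) / gib:.1f} GiB")  # exactly half
```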

Advanced Techniques

Native B200 Calibration

Quantized on NVIDIA B200 with native FP4 hardware instructions (SM 12.0). The SVDQuant calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors and SVD decomposition decisions than calibrating on non-FP4 hardware.

SVD Low-Rank Selection

The default low-rank of 32 was used, which balances quality preservation against model size overhead. Each quantized weight matrix carries a rank-32 BF16 residual (stored as two low-rank factors) that captures the most damaging outlier channels at full precision, while the remaining weights are safely quantized to FP4.
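
A rough arithmetic sketch of that overhead, assuming the residual is stored as two rank-32 BF16 factors and using the 5376 x 21,504 projection shape from the model table as an example (illustrative; scale-factor storage is ignored):

```python
d_in, d_out, rank = 5376, 21504, 32   # example projection shape from the table
bf16 = 2                              # bytes per BF16 element

# Rank-32 residual stored as two factors: (d_in x rank) and (rank x d_out)
residual_bytes = (d_in * rank + rank * d_out) * bf16
fp4_weight_bytes = d_in * d_out * 0.5  # 4 bits per element, scales ignored

print(f"rank-32 residual: {residual_bytes / 2**20:.2f} MiB")
print(f"FP4 weights:      {fp4_weight_bytes / 2**20:.2f} MiB")
print(f"overhead:         {100 * residual_bytes / fp4_weight_bytes:.1f}%")
```

A few percent per matrix is consistent with the ~0.4 GB gap between the AWQ_FULL (~20.5 GB) and SVDQuant (~20.9 GB) checkpoints.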

License

This model inherits the Gemma license from the base model.

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.
