Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 SVDQuant

SVDQuant-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking. Quantized using NVIDIA ModelOpt 0.42.0 with SVDQuant (SVD decomposition + NVFP4) for maximum quality at 4-bit precision. Calibrated natively on NVIDIA B200 (Blackwell SM 12.0) for hardware-accurate FP4 scale factors.

See also: AWQ_FULL variant — same model quantized with AWQ_FULL instead of SVDQuant.

What is SVDQuant?

SVDQuant uses Singular Value Decomposition to separate weight matrices into two components before quantization:

  1. Outlier channels — high-magnitude weight channels that cause large quantization error are extracted into a low-rank BF16 residual matrix
  2. Cleaned weights — the remaining weights (with outliers removed) are quantized to NVFP4 (E2M1) with dramatically reduced quantization error

This produces higher quality than standard AWQ at the cost of a slightly larger model size (~20.9 GB vs ~20.5 GB) due to the low-rank residual matrices stored in BF16.
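
The E2M1 element format mentioned above (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) can represent only eight distinct magnitudes; the sketch below enumerates them. A per-block scale factor maps real weights onto this grid, which is why a single outlier channel is so costly: it stretches the scale and crushes the resolution left for every other weight.

```python
# Enumerate the positive magnitudes representable by FP4 E2M1
# (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1).
def e2m1_values():
    vals = set()
    for exp in range(4):          # 2 exponent bits
        for man in range(2):      # 1 mantissa bit
            if exp == 0:          # subnormal: man * 2^-1
                vals.add(man * 0.5)
            else:                 # normal: (1 + man/2) * 2^(exp - 1)
                vals.add((1 + man * 0.5) * 2 ** (exp - 1))
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```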

Original Weight Matrix W (BF16)
    |
    v
[SVD Decomposition]
    |
    ├── Low-rank residual R (BF16, rank=32) — captures outlier channels
    └── Cleaned weights W' = W - R
            |
            v
        [NVFP4 Quantization] — much lower error without outliers
            |
            v
        W'_quant (FP4 E2M1)

Inference: output = W'_quant @ x + R @ x

Model Details

| Property | Value |
|---|---|
| Base Model | Gemma 4 31B-it DECKARD HERETIC |
| Architecture | Gemma 4 (dense, 31B parameters) |
| Layers | 60 |
| Max Context | 131,072 tokens |
| Hidden Size | 5376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 SVDQuant (ModelOpt format) |
| SVD Low-Rank | 32 |
| Model Size | ~20.9 GB |
| Calibration Hardware | NVIDIA B200 (native Blackwell FP4) |

Quantization Details

Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
    |
    v
[NVFP4 SVDQuant on B200]
    - ModelOpt 0.42.0 with NVFP4_SVDQUANT_DEFAULT_CFG
    - Low-rank = 32 (BF16 residual matrices)
    - 2048 calibration samples (CNN DailyMail)
    - Native Blackwell FP4 hardware calibration (SM 12.0)
    - Excluded: vision tower, embed_vision, multi_modal_projector
    - Quantization time: ~69 minutes on B200
    |
    v
Gemma-4-31B-DECKARD-HERETIC-NVFP4-SVDQuant (~20.9 GB)

AWQ_FULL vs SVDQuant Comparison

| Metric | AWQ_FULL | SVDQuant |
|---|---|---|
| Technique | Channel scaling + clipping optimization | SVD decomposition + low-rank residual |
| Model Size | ~20.5 GB | ~20.9 GB |
| Quant Time | ~75 min | ~69 min |
| Quality | Excellent | Potentially higher (outliers preserved in BF16) |
| Speed | Slightly faster (smaller) | Slightly slower (low-rank matmul overhead) |
| Best For | Maximum throughput | Maximum quality |

Deployment

vLLM

vllm serve /path/to/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant \
  --served-model-name deckard-svdquant \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

Docker Compose

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=FLASHINFER
    command: >
      --model /models/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant
      --served-model-name deckard-svdquant
      --quantization modelopt
      --dtype auto
      --kv-cache-dtype fp8
      --max-model-len 65536
      --max-num-seqs 8
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
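
With either launch method, the server exposes the standard OpenAI-compatible API on port 8000. A minimal request body is sketched below; field names follow the OpenAI chat completions schema, and the model name matches the `--served-model-name` used above (the server must be running to actually POST it to `http://localhost:8000/v1/chat/completions`):

```python
import json

# Chat completions payload for the vLLM server configured above.
payload = {
    "model": "deckard-svdquant",    # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Summarize SVDQuant in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))
```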

DGX Spark Performance Estimates

| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |

Key Deployment Flags

| Flag | Purpose |
|---|---|
| `--quantization modelopt` | Required: tells vLLM to use the ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8` | Reduces KV cache memory by 2x for longer contexts |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks for thinking/reasoning display |
| `--tool-call-parser gemma4` | Enables native function calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks to avoid OOM |
| `--enable-prefix-caching` | Caches common prompt prefixes for faster responses |
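
The 2x saving from `--kv-cache-dtype fp8` can be sized from the model table above. The sketch below assumes head_dim = hidden_size / attention_heads = 5376 / 32 = 168 (an assumption for illustration; the model's actual head dimension may differ) and the 65,536-token `--max-model-len` used in the launch commands:

```python
layers, kv_heads = 60, 16
head_dim = 5376 // 32           # assumption: hidden_size / num_attention_heads
tokens = 65536                  # --max-model-len from the vLLM command

def kv_bytes(bytes_per_elem):
    # K and V, per layer, per KV head, per head dim, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

gib = 1024 ** 3
print(f"FP16 KV cache @ 64k ctx: {kv_bytes(2) / gib:.1f} GiB")
print(f"FP8  KV cache @ 64k ctx: {kv_bytes(1) / gib:.1f} GiB")  # exactly half
```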

Advanced Techniques

Native B200 Calibration

Quantized on NVIDIA B200 with native FP4 hardware instructions (SM 12.0). The SVDQuant calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors and SVD decomposition decisions than calibrating on non-FP4 hardware.

SVD Low-Rank Selection

The default low-rank of 32 was used, which balances quality preservation against model size overhead. Each quantized weight matrix carries a rank-32 BF16 residual (stored as two low-rank factors) that captures the most damaging outlier channels at full precision, while the remaining weights are safely quantized to FP4.
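
A rough arithmetic sketch of that overhead, assuming the residual is stored as two rank-32 BF16 factors and using the 5376 x 21,504 projection shape from the model table as an example (illustrative; scale-factor storage is ignored):

```python
d_in, d_out, rank = 5376, 21504, 32   # example projection shape from the table
bf16 = 2                              # bytes per BF16 element

# Rank-32 residual stored as two factors: (d_in x rank) and (rank x d_out)
residual_bytes = (d_in * rank + rank * d_out) * bf16
fp4_weight_bytes = d_in * d_out * 0.5  # 4 bits per element, scales ignored

print(f"rank-32 residual: {residual_bytes / 2**20:.2f} MiB")
print(f"FP4 weights:      {fp4_weight_bytes / 2**20:.2f} MiB")
print(f"overhead:         {100 * residual_bytes / fp4_weight_bytes:.1f}%")
```

A few percent per matrix is consistent with the ~0.4 GB gap between the AWQ_FULL (~20.5 GB) and SVDQuant (~20.9 GB) checkpoints.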

License

This model inherits the Gemma license from the base model.

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.
