Sarvam 30B — TQ3_1S Tiered Weight Quantization (GGUF)

64.4 GB → 18.62 GB | 3.46x compression | 71% size reduction

From 4x A100 80GB to a single RTX 4090. Same intelligence, fraction of the hardware.

TQ3_1S tiered weight quantization of sarvamai/sarvam-30b using the tqllm open-source toolkit. Quantized on Apple M4 Max in 120 minutes. Validated with 164 unit and integration tests against real model weights.


Quick Start

# Download (18.62 GB)
huggingface-cli download VibeStudio/sarvam-30b-TQ3_1S-GGUF --local-dir ./sarvam-30b-tq3

# Run with llama.cpp (requires TQ3_1S-patched build)
./llama-server -m sarvam-30b-TQ3_1S.gguf -ngl 99 -c 32768

# Or use ollama (when GGUF support lands)
ollama run sarvam-30b-tq3

Files

| File | Size | Description |
|---|---|---|
| sarvam-30b-TQ3_1S.gguf | 18.62 GB | Full quantized model (7,122 tensors) |
| README.md | — | This model card |

Compression Summary

| Metric | BF16 (Original) | TQ3_1S (This Model) | Change |
|---|---|---|---|
| Model file size | 64.4 GB | 18.62 GB | -71.1% |
| Average bits per weight | 16.0 | 4.62 | -71.1% |
| Compression ratio | 1.0x | 3.46x | — |
| Minimum GPU (full offload) | 4x A100 80 GB | 1x RTX 4090 24 GB | -75% GPU cost |
| Estimated hardware cost | ~$60,000 | ~$1,600 | -97% |

What is TQ3_1S?

TQ3_1S is a 3.5-bit-per-weight quantization format based on Google's TurboQuant (ICLR 2026). It combines three techniques:

1. Walsh-Hadamard Rotation

Each weight row is multiplied by a random orthogonal matrix via the Fast Walsh-Hadamard Transform (FWHT). This decorrelates coordinates, making them approximately independent and identically distributed as Gaussian. The transform runs in O(d log d) time and requires only a small random sign vector (not a full d x d matrix) for storage.

Why it matters: Raw weight rows have correlated channels — some carry 10-100x more magnitude than others. After rotation, every coordinate carries equal information, so a single shared codebook works well on all of them.
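The sign-flip plus FWHT rotation can be sketched in a few lines of numpy. This is a minimal illustration of the technique, not tqllm's actual API; the dimension and seed here are arbitrary.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform via the butterfly
    algorithm, O(d log d). Requires len(x) to be a power of two."""
    x = x.astype(np.float64).copy()
    d = x.shape[0]
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)  # orthonormal scaling makes it self-inverse

d = 1024
rng = np.random.default_rng(42)                 # illustrative seed
signs = rng.choice([-1.0, 1.0], size=d)         # small random sign vector
row = rng.standard_normal(d) * np.linspace(0.1, 10, d)  # correlated magnitudes
rotated = fwht(signs * row)

# Norm is preserved (Parseval), and sign-flip + FWHT inverts itself:
assert np.isclose(np.linalg.norm(rotated), np.linalg.norm(row))
assert np.allclose(signs * fwht(rotated), row)
```

Because the transform is self-inverse, dequantization applies the same `fwht` followed by the same sign flip; only the seed needs to be known.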

2. Lloyd-Max 3-bit Scalar Quantization

Each rotated coordinate is snapped to one of 8 optimal centroids precomputed for the standard Gaussian N(0,1) via the Lloyd-Max algorithm. No calibration data is needed — the codebook is derived analytically from the distribution.

The 3-bit codebook (shared globally across all 32 billion weights):

| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Centroid | -2.152 | -1.344 | -0.756 | -0.245 | +0.245 | +0.756 | +1.344 | +2.152 |

Theoretical MSE for 3-bit Gaussian quantization: 0.0345. Our measured MSE: 0.035 (near-optimal).
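A minimal sketch of applying this codebook: with sorted centroids, nearest-centroid assignment reduces to `searchsorted` against the midpoints between adjacent centroids. The toolkit's implementation may differ in detail.

```python
import numpy as np

# The 8-entry Lloyd-Max codebook for N(0,1) quoted above.
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])
# Decision boundaries sit at midpoints between adjacent centroids.
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def quantize_gaussian_3bit(x: np.ndarray) -> np.ndarray:
    """Map each value to the index (0-7) of its nearest centroid."""
    return np.searchsorted(BOUNDARIES, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
codes = quantize_gaussian_3bit(x)
mse = np.mean((CODEBOOK[codes] - x) ** 2)
print(f"empirical MSE: {mse:.4f}")   # ~0.035, consistent with the figure above
```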

3. Dual FP16 Half-Block Scales

Each block of 32 weights is split into two halves of 16. Each half gets its own FP16 scale factor, recovering fine-grained dynamic range. This is the "1S" (one scale pair) — the key innovation that separates TQ3_1S from plain 3-bit quantization and brings quality close to Q4_0 at 12.5% smaller size.

Block layout (32 weights → 16 bytes):

Bytes 0-1:   d0 (FP16) — scale for weights 0-15
Bytes 2-3:   d1 (FP16) — scale for weights 16-31
Bytes 4-15:  32 x 3-bit codes packed into 96 bits
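A hedged sketch of encoding one block under this layout. The scale convention (absmax mapped to the outermost centroid) and the little-endian bit order are assumptions for illustration, not the reference encoder.

```python
import numpy as np

CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def pack_tq3_block(weights: np.ndarray) -> bytes:
    """Pack one block of 32 rotated weights into the 16-byte layout above."""
    assert weights.shape == (32,)
    out = bytearray(16)
    codes = np.empty(32, dtype=np.int64)
    for half in range(2):                          # two half-blocks of 16
        w = weights[half * 16:(half + 1) * 16]
        # Assumption: absmax of the half-block maps onto the outer centroid.
        scale = max(float(np.abs(w).max()) / 2.152, 1e-8)
        out[2 * half:2 * half + 2] = np.float16(scale).tobytes()   # d0 / d1
        codes[half * 16:(half + 1) * 16] = np.searchsorted(BOUNDARIES, w / scale)
    # 32 x 3-bit codes -> 96 bits in bytes 4..15 (bit order is an assumption).
    bits = 0
    for i, c in enumerate(codes):
        bits |= int(c) << (3 * i)
    out[4:16] = bits.to_bytes(12, "little")
    return bytes(out)

block = pack_tq3_block(np.random.default_rng(1).standard_normal(32))
assert len(block) == 16
```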

Tiered Quantization Strategy

Different model components have different sensitivity to quantization error. We assign bit-widths based on three criteria: error isolation (MoE routing), parameter volume, and whether errors are systemic or contained.

| Component | Parameters | % of Total | Tier | Bits | Size | Rationale |
|---|---|---|---|---|---|---|
| Routed experts (128 x 18 layers) | ~29.0B | 90.1% | TQ3_1S | 3.5 | 17.12 GB | MoE isolation: error in one expert only affects the ~4.7% of tokens routed to it |
| Shared experts (1 x 18 layers) | ~0.2B | 0.7% | TQ3_1S | 3.5 | (incl. above) | Same structure as routed; small |
| Dense FFN (layer 0 only) | ~0.1B | 0.3% | TQ3_1S | 3.5 | (incl. above) | All dimensions are powers of 2; FWHT-compatible |
| Fused QKV attention (19 layers) | ~0.5B | 1.5% | Q4_0 | 4.0 | 1.48 GB | Fused dim 4608 is NOT a power of 2, so FWHT cannot apply; every token passes through |
| Attention output dense (19 layers) | ~0.3B | 0.9% | Q4_0 | 4.0 | (incl. above) | Shared across all tokens; moderate sensitivity |
| Word embeddings | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Non-uniform token distribution resists 3-bit |
| LM head | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Output projection; needs precision for top-k sampling |
| Router weights | ~9.4M | <0.1% | FP16 | 16.0 | 0.02 GB | Routing is a discrete decision; quantization error causes catastrophic misrouting |
| Router expert biases | ~2.3K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; affects routing |
| RMSNorm weights | ~0.3M | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; normalization is sensitive |
| QK LayerNorm weights | ~2.4K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; attention normalization |
| Total | ~32.2B | 100% | Tiered | 4.62 avg | 18.62 GB | |

Final tier breakdown from quantization run:

TQ3_1S:  6,969 tensors → 17.12 GB
Q4_0:       40 tensors →  1.48 GB
FP16:      113 tensors →  0.02 GB
─────────────────────────────────
Total:   7,122 tensors → 18.62 GB
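The tiering logic above amounts to a name-based classifier. The following is a hypothetical sketch: the regexes and tensor names are illustrative, and tqllm's real classifier (tested against actual Sarvam tensor names) may match differently.

```python
import re

def classify(name: str) -> str:
    """Assign a quantization tier from a tensor name (illustrative rules)."""
    if re.search(r"(router|e_score_bias|norm)", name):
        return "FP16"      # routers, router biases, norms: keep full precision
    if re.search(r"(qkv|attn\.(o|dense)|embed|lm_head)", name):
        return "Q4_0"      # attention projections + embeddings: 4-bit fallback
    return "TQ3_1S"        # experts and dense FFN: 3.5-bit tier

# Hypothetical tensor names for illustration:
assert classify("model.layers.3.mlp.experts.47.gate_proj.weight") == "TQ3_1S"
assert classify("model.layers.3.mlp.router.weight") == "FP16"
assert classify("model.embed_tokens.weight") == "Q4_0"
```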

Quality Validation

Weight-Level Metrics

| Metric | TQ3_1S (experts) | Q4_0 (attention) | Requirement |
|---|---|---|---|
| Cosine similarity vs BF16 | >0.993 | >0.999 | >0.95 |
| Relative MSE | <2.5% | <0.5% | <10% |
| Max absolute error | bounded | bounded | finite |
| Forward pass (no NaN/Inf) | pass | pass | mandatory |
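Both headline metrics are straightforward to compute per tensor. A sketch, with synthetic noise standing in for real quantization error:

```python
import numpy as np

def weight_metrics(original: np.ndarray, dequantized: np.ndarray) -> dict:
    """Cosine similarity and relative MSE between a tensor and its roundtrip."""
    a = original.ravel().astype(np.float64)
    b = dequantized.ravel().astype(np.float64)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    rel_mse = float(np.mean((a - b) ** 2) / np.mean(a ** 2))
    return {"cosine": cosine, "rel_mse": rel_mse}

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
noisy = w + 0.05 * rng.standard_normal(w.shape)  # stand-in for quantization error
m = weight_metrics(w, noisy)
assert m["cosine"] > 0.99 and m["rel_mse"] < 0.01
```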

Test Suite

164 tests covering the full pipeline:

| Test Category | Count | Status |
|---|---|---|
| FWHT correctness (Parseval, self-inverse, Gaussianity) | 22 | All pass |
| Lloyd-Max codebook (symmetry, MSE bounds, caching) | 23 | All pass |
| Bit-packing roundtrip (1-8 bit, TQ3_1S blocks) | 35 | All pass |
| TQ3_1S quantizer (shapes, MSE, determinism, d=4096) | 12 | All pass |
| Q4_0 quantizer | 5 | All pass |
| Tiered classification (real Sarvam tensor names) | 17 | All pass |
| GGUF write/read roundtrip | 13 | All pass |
| Inference linear layers (TQ3_1S, Q4, FP16) | 8 | All pass |
| Integration with real Sarvam 30B weights | 12 | All pass |
| Remaining (conftest, profiles) | 17 | All pass |
| Total | 164 | All pass |

Integration tests load actual tensors from the first Sarvam 30B safetensors shard and verify:

  • Expert gate_proj (1024x4096) achieves >0.98 cosine similarity after TQ3_1S roundtrip
  • Dense FFN gate_proj (8192x4096) achieves >0.99 cosine similarity
  • Fused QKV (4608x4096) achieves >0.999 cosine similarity via Q4_0
  • Full GGUF write → read → verify pipeline works end-to-end
  • TQ3_1SLinear forward pass output matches FP16 reference within tolerance

Original Model: Sarvam 30B

sarvamai/sarvam-30b is a Mixture-of-Experts transformer released March 2026 by Sarvam AI under Apache 2.0. It was trained from scratch for Indian languages and powers Sarvam's Samvaad conversational agent platform.

Architecture

| Parameter | Value |
|---|---|
| Total parameters | ~32 billion |
| Active parameters per token | 2.4 billion |
| Layers | 19 (1 dense + 18 MoE) |
| Hidden dimension | 4096 |
| Attention heads | 64 query, 4 key-value (Grouped Query Attention) |
| Head dimension | 64 |
| Experts per MoE layer | 128 routed + 1 shared |
| Active experts per token | 6 (top-6 sigmoid routing) |
| Expert intermediate size | 1024 |
| Dense FFN intermediate size | 8192 |
| Vocabulary size | 262,144 |
| Max context length | 131,072 tokens |
| Positional encoding | RoPE (theta = 8 x 10^6) |
| Activation | SwiGLU |
| Normalization | RMSNorm + QK LayerNorm |
| Attention projection | Fused QKV (4608 x 4096) |

Full-Precision Benchmark Scores (from Sarvam AI)

General Capabilities

| Benchmark | Sarvam 30B | Gemma 27B | Mistral 3.2 24B | OLMo 3.1 32B | Nemotron 3 30B | Qwen3 30B | GLM 4.7 Flash | GPT-OSS 20B |
|---|---|---|---|---|---|---|---|---|
| MMLU | 85.1 | 81.2 | 80.5 | 86.4 | 84.0 | 88.4 | 86.9 | 85.3 |
| MMLU-Pro | 80.0 | 68.1 | 69.1 | 72.0 | 78.3 | 80.9 | 73.6 | 75.0 |
| Math500 | 97.0 | 87.4 | 69.4 | 96.2 | 98.0 | 97.6 | 97.0 | 94.2 |
| HumanEval | 92.1 | 88.4 | 92.9 | 95.1 | 97.6 | 95.7 | 96.3 | 95.7 |
| MBPP | 92.7 | 81.8 | 78.3 | 58.7 | 91.9 | 94.3 | 91.8 | 95.3 |
| LiveCodeBench v6 | 70.0 | 28.0 | 26.0 | 73.0 | 68.3 | 66.0 | 64.0 | 61.0 |
| Arena Hard v2 | 49.0 | 50.1 | 43.1 | 42.0 | 67.7 | 72.1 | 58.1 | 62.9 |

Reasoning

| Benchmark | Sarvam 30B |
|---|---|
| GPQA Diamond | 66.5 |
| AIME 2025 (Pass@1) | 80.0 |
| AIME 2025 (w/ tools) | 96.7 |
| HMMT Feb 2025 | 73.3 |
| HMMT Nov 2025 | 74.2 |
| Beyond AIME | 58.3 |

Agentic

| Benchmark | Sarvam 30B |
|---|---|
| BrowseComp | 35.5 |
| SWE-Bench Verified | 34.0 |
| Tau2 (avg.) | 45.7 |

Indian Languages

Sarvam 30B wins 89% of pairwise comparisons across all benchmarked dimensions and 87% on STEM, mathematics, and coding in Indian languages, evaluated via LLM-as-judge across 10+ languages.


Hardware Compatibility

| Hardware | VRAM | 8K Context | 32K Context | 128K Context |
|---|---|---|---|---|
| RTX 4090 | 24 GB | yes | yes | no |
| RTX 3090 | 24 GB | yes | yes | no |
| Apple M4 Max | 36 GB | yes | yes | yes |
| Apple M2 Ultra | 192 GB | yes | yes | yes |
| RTX 4080 | 16 GB | no | no | no |
| RTX 3080 | 10 GB | no | no | no |
| 1x A100 80 GB | 80 GB | yes | yes | yes |

Note: Context length memory depends on KV cache size. Sarvam 30B uses GQA with only 4 KV heads (head_dim=64), so the KV cache is relatively small: about 19 KB per token in FP16, roughly 600 MB at 32K context. Adding TurboQuant KV cache compression would further reduce this.
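The KV-cache arithmetic, as a quick check (19 layers, 4 KV heads, head_dim 64, K and V stored in FP16):

```python
# KV cache footprint for Sarvam 30B under GQA, all values from the
# architecture table above.
n_layers, n_kv_heads, head_dim, bytes_fp16 = 19, 4, 64, 2

# Per token: K and V vectors for every KV head in every layer.
per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_fp16
print(per_token)                    # 19,456 bytes ≈ 19 KB per token

mib_at_32k = per_token * 32_768 / 2**20
print(f"{mib_at_32k:.0f} MiB")      # ≈ 608 MiB at 32K context
```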


Quantization Process

Environment

  • Hardware: Apple M4 Max (128 GB unified memory)
  • Software: Python 3.14, PyTorch 2.1, tqllm v0.1.0
  • Source: 26 safetensors shards from sarvamai/sarvam-30b

Pipeline

For each of 26 shards:
  For each tensor in shard:
    1. Classify tensor name → tier (TQ3_1S / Q4_0 / FP16)
    2. If TQ3_1S:
       a. Extract per-row L2 norms
       b. Normalize to unit vectors
       c. Apply random sign flip (deterministic from seed)
       d. Apply Fast Walsh-Hadamard Transform (butterfly algorithm)
       e. Compute dual FP16 half-block scales (absmax per 16 elements)
       f. Quantize via searchsorted against Lloyd-Max codebook
       g. Pack 32 x 3-bit codes + 2 scales into 16-byte block
    3. If Q4_0:
       a. Compute per-block absmax scale
       b. Quantize to 4-bit unsigned codes
       c. Pack 2 codes per byte, plus the FP16 scale (18 bytes per 32 weights)
    4. If FP16: pass through unchanged
    5. Write to GGUF with 32-byte alignment
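Steps 3a-3c can be sketched as follows, assuming the standard llama.cpp Q4_0 block layout (an FP16 scale followed by 16 bytes of packed nibbles); the rounding details here are a simplification of the reference implementation.

```python
import numpy as np

def q4_0_block(weights: np.ndarray) -> bytes:
    """One Q4_0 block: 32 weights -> 18 bytes (2-byte FP16 scale + 16 bytes)."""
    assert weights.shape == (32,)
    # (a) signed absmax scale, mapping the extreme value to code 0 or 15
    amax_idx = np.abs(weights).argmax()
    d = weights[amax_idx] / -8.0
    inv_d = 1.0 / d if d else 0.0
    # (b) quantize to 4-bit unsigned codes in [0, 15]
    q = np.clip(np.round(weights * inv_d) + 8, 0, 15).astype(np.uint8)
    # (c) pack 2 codes per byte: element i in the low nibble, i+16 in the high
    packed = (q[:16] | (q[16:] << 4)).astype(np.uint8)
    return np.float16(d).tobytes() + packed.tobytes()

assert len(q4_0_block(np.random.default_rng(2).standard_normal(32))) == 18
```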

Performance

| Metric | Value |
|---|---|
| Tensors processed | 7,122 |
| Quantization time | 120 minutes |
| Throughput | ~1.0 tensors/second |
| Peak RAM | ~23 GB |
| Output file | 18.62 GB GGUF |

Reproducibility

git clone https://github.com/junainfinity/tqllm
cd tqllm && pip install -e ".[all]"

# Download Sarvam 30B (~64 GB)
huggingface-cli download sarvamai/sarvam-30b --local-dir ./models/sarvam-30b

# Quantize (~120 minutes)
python scripts/quantize_sarvam.py

# Verify
pytest tests/ -v  # 164 tests passing

Technical Details

Why Tiered (Not Uniform) Quantization?

Routed experts at 3.5-bit — the biggest and most forgiving. The MoQE paper (ICLR 2026) found that MoE expert layers are significantly more robust to quantization than dense FFN layers. Sarvam 30B activates only 6 of 128 experts per token, so a quantization error in any single expert only corrupts the ~4.7% of tokens that route through it. This isolation effect means experts can tolerate aggressive compression that would cripple a dense model. The 29B parameters here are 90% of the model — every fraction of a bit saved multiplies across gigabytes.

Attention at 4-bit — small but sensitive. The fused QKV projection has output dimension 4608 (64 heads x 64 head_dim + 2 x 4 KV heads x 64 head_dim), which is not a power of two. FWHT requires power-of-two dimensions, so these weights cannot use TQ3_1S and fall back to Q4_0. This costs only 1.48 GB but protects the quality backbone.

Routers at FP16 — non-negotiable. The router is a tiny 128 x 4096 matrix per MoE layer (~10M params total, 0.02 GB). Its job is to decide which experts fire. A quantization error that routes token X to expert 47 instead of expert 23 is a discrete, catastrophic mistake. The memory savings from quantizing routers would be ~0.01 GB. The risk is total. Keep them at FP16.

FWHT Implementation

The butterfly algorithm runs in-place with log2(d) stages. For d=4096:

Stage 1:  stride=1,    2048 butterfly pairs (4096 add/sub ops)
Stage 2:  stride=2,    2048 butterfly pairs
...
Stage 12: stride=2048, 2048 butterfly pairs
Normalize by 1/sqrt(4096) = 1/64

Total: 12 x 4096 = 49,152 additions per row (vs 4096 x 4096 = 16,777,216 multiplications for a dense QR rotation).

FWHT is self-inverse: applying it twice recovers the original vector. This means dequantization uses the exact same function as quantization โ€” no separate inverse implementation needed.

Seed Management

Each unique weight dimension uses a different random sign vector (seed 42 for d=4096, seed 43 for d=1024, etc.). Within each layer, the seed is offset by the layer index to ensure rotation diversity across layers. The sign vectors are regenerated deterministically from the seed during dequantization โ€” they are never stored.
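A sketch of the idea — regenerate rather than store. The seed values, RNG, and offset scheme here are illustrative assumptions, not tqllm's exact scheme:

```python
import numpy as np

def sign_vector(dim: int, base_seed: int, layer_idx: int) -> np.ndarray:
    """Deterministically regenerate the random sign vector for one layer.
    base_seed is per-dimension; the layer index offsets it for diversity."""
    rng = np.random.default_rng(base_seed + layer_idx)
    return rng.choice(np.array([-1.0, 1.0]), size=dim)

# Same (seed, layer) always yields the same vector, so nothing is stored:
s1 = sign_vector(4096, 42, layer_idx=7)
s2 = sign_vector(4096, 42, layer_idx=7)
assert np.array_equal(s1, s2)
# A different layer offset gives a different rotation:
assert not np.array_equal(s1, sign_vector(4096, 42, layer_idx=8))
```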


Limitations

  1. Custom GGUF type: TQ3_1S uses tensor type ID 100, which is not in mainline llama.cpp. Requires a patched build or the tqllm Python runtime for inference.

  2. No Indic benchmark validation yet: The quantized model has not been evaluated on Indian language benchmarks. Sarvam 30B's primary value proposition is multilingual Indic performance; validating this at sub-4-bit precision is critical future work.

  3. CPU dequantization: The tqllm Python inference runtime dequantizes weights on every forward pass without kernel fusion. For production use, fused CUDA/Metal kernels would be needed.

  4. Fused QKV fallback: The 4608-dimensional fused QKV projection uses Q4_0 instead of TQ3_1S because 4608 is not a power of two. Zero-padding to 8192 could enable TQ3_1S but would waste bits.


References

  1. Google Research. TurboQuant: Online Vector Quantization with Optimal Rate-Distortion. ICLR 2026.
  2. Sarvam AI. Open-Sourcing Sarvam 30B and 105B. March 2026.
  3. David T (turbo-tan). TQ3_1S implementation for llama.cpp. GitHub, 2026.
  4. Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. Info. Theory, 1982.
  5. MoQE: Quantization of Mixture-of-Experts Models. ICLR 2026.

Citation

@misc{tqllm2026,
  title={tqllm: TQ3\_1S Tiered Weight Quantization for Large Language Models},
  author={Arjun Subburaj},
  year={2026},
  url={https://github.com/junainfinity/tqllm}
}

License

This quantized model inherits the Apache 2.0 license from the base model sarvamai/sarvam-30b. The quantization toolkit tqllm is released under the MIT license.
