# Sarvam 30B – TQ3_1S Tiered Weight Quantization (GGUF)

**64.4 GB → 18.62 GB | 3.46x compression | 71% size reduction**
From 4x A100 80GB to a single RTX 4090. Same intelligence, fraction of the hardware.
TQ3_1S tiered weight quantization of sarvamai/sarvam-30b using the tqllm open-source toolkit. Quantized on Apple M4 Max in 120 minutes. Validated with 164 unit and integration tests against real model weights.
## Quick Start

```bash
# Download (18.62 GB)
huggingface-cli download VibeStudio/sarvam-30b-TQ3_1S-GGUF --local-dir ./sarvam-30b-tq3

# Run with llama.cpp (requires a TQ3_1S-patched build)
./llama-server -m sarvam-30b-TQ3_1S.gguf -ngl 99 -c 32768

# Or use ollama (when GGUF support lands)
ollama run sarvam-30b-tq3
```
## Files

| File | Size | Description |
|---|---|---|
| `sarvam-30b-TQ3_1S.gguf` | 18.62 GB | Full quantized model (7,122 tensors) |
| `README.md` | – | This model card |
## Compression Summary
| Metric | BF16 (Original) | TQ3_1S (This Model) | Change |
|---|---|---|---|
| Model file size | 64.4 GB | 18.62 GB | -71.1% |
| Average bits per weight | 16.0 | 4.62 | -71.1% |
| Compression ratio | 1.0x | 3.46x | – |
| Minimum GPU (full offload) | 4x A100 80 GB | 1x RTX 4090 24 GB | -75% (4 GPUs → 1) |
| Estimated hardware cost | ~$60,000 | ~$1,600 | -97% |
## What is TQ3_1S?
TQ3_1S is a 3.5-bit-per-weight quantization format based on Google's TurboQuant (ICLR 2026). It combines three techniques:
### 1. Walsh-Hadamard Rotation
Each weight row is multiplied by a random orthogonal matrix via the Fast Walsh-Hadamard Transform (FWHT). This decorrelates coordinates, making them approximately independent and identically distributed as Gaussian. The transform runs in O(d log d) time and requires only a small random sign vector (not a full d x d matrix) for storage.
**Why it matters:** Raw weight rows have correlated channels; some carry 10-100x more magnitude than others. After rotation, every coordinate carries equal information, so a single shared codebook works optimally on all of them.
### 2. Lloyd-Max 3-bit Scalar Quantization

Each rotated coordinate is snapped to one of 8 optimal centroids precomputed for the standard Gaussian N(0,1) via the Lloyd-Max algorithm. No calibration data is needed: the codebook is derived analytically from the distribution.
The 3-bit codebook (shared globally across all 32 billion weights):
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Centroid | -2.152 | -1.344 | -0.756 | -0.245 | +0.245 | +0.756 | +1.344 | +2.152 |
Theoretical MSE for 3-bit Gaussian quantization: 0.0345. Our measured MSE: 0.035 (near-optimal).
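As a concrete illustration, here is a minimal NumPy sketch (not the tqllm implementation itself) of nearest-centroid assignment against this codebook, using midpoint decision boundaries:

```python
import numpy as np

# The eight Lloyd-Max centroids for N(0,1) from the table above.
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152], dtype=np.float32)
# Nearest-centroid decision boundaries: midpoints between adjacent centroids.
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def quantize_gaussian(x: np.ndarray) -> np.ndarray:
    """Return the 3-bit code (0-7) of the nearest centroid for each value."""
    return np.searchsorted(BOUNDARIES, x).astype(np.uint8)

x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
codes = quantize_gaussian(x)
print(np.mean((CODEBOOK[codes] - x) ** 2))  # ~0.0345, matching the theoretical MSE
```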
### 3. Dual FP16 Half-Block Scales

Each block of 32 weights is split into two halves of 16. Each half gets its own FP16 scale factor, recovering fine-grained dynamic range. This is the "1S" (one scale pair): the key innovation that separates TQ3_1S from plain 3-bit quantization and brings quality close to Q4_0 at ~11% smaller size (16 vs 18 bytes per 32-weight block).
Block layout (32 weights → 16 bytes):

```
Bytes 0-1:  d0 (FP16) – scale for weights 0-15
Bytes 2-3:  d1 (FP16) – scale for weights 16-31
Bytes 4-15: 32 x 3-bit codes packed into 96 bits
```
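A sketch of how one such block could be assembled in NumPy. The scale convention (half-block absmax mapped onto the largest centroid) and the LSB-first bit order are assumptions for illustration, not the exact tqllm layout:

```python
import numpy as np

CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152], dtype=np.float32)
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def pack_tq3_block(w: np.ndarray) -> bytes:
    """Pack 32 rotated weights into a 16-byte block: two FP16
    half-block scales (bytes 0-3) + 32 x 3-bit codes (bytes 4-15)."""
    assert w.shape == (32,)
    scales = np.empty(2, dtype=np.float16)
    codes = np.empty(32, dtype=np.uint8)
    for h in range(2):                       # two halves of 16 weights
        half = w[16 * h:16 * h + 16]
        s = float(np.abs(half).max()) / float(CODEBOOK[-1])  # assumed convention
        s = s or 1.0                         # guard against an all-zero half
        scales[h] = np.float16(s)
        codes[16 * h:16 * h + 16] = np.searchsorted(BOUNDARIES, half / s)
    bits = 0
    for i, c in enumerate(codes):            # assumed LSB-first 3-bit packing
        bits |= int(c) << (3 * i)
    return scales.tobytes() + bits.to_bytes(12, "little")  # 4 + 12 = 16 bytes
```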
## Tiered Quantization Strategy
Different model components have different sensitivity to quantization error. We assign bit-widths based on three criteria: error isolation (MoE routing), parameter volume, and whether errors are systemic or contained.
| Component | Parameters | % of Total | Tier | Bits | Size | Rationale |
|---|---|---|---|---|---|---|
| Routed experts (128 x 18 layers) | ~29.0B | 90.1% | TQ3_1S | 3.5 | 17.12 GB | MoE isolation: error in one expert only affects the ~4.7% of tokens routed to it |
| Shared experts (1 x 18 layers) | ~0.2B | 0.7% | TQ3_1S | 3.5 | (incl. above) | Same structure as routed; small |
| Dense FFN (layer 0 only) | ~0.1B | 0.3% | TQ3_1S | 3.5 | (incl. above) | All dimensions are powers of 2; FWHT-compatible |
| Fused QKV attention (19 layers) | ~0.5B | 1.5% | Q4_0 | 4.0 | 1.48 GB | Fused dim 4608 is not a power of 2, so FWHT cannot be applied; every token passes through |
| Attention output dense (19 layers) | ~0.3B | 0.9% | Q4_0 | 4.0 | (incl. above) | Shared across all tokens; moderate sensitivity |
| Word embeddings | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Non-uniform token distribution resists 3-bit |
| LM head | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Output projection; needs precision for top-k sampling |
| Router weights | ~9.4M | <0.1% | FP16 | 16.0 | 0.02 GB | Routing is a discrete decision; quantization error causes catastrophic misrouting |
| Router expert biases | ~2.3K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; affects routing |
| RMSNorm weights | ~0.3M | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; normalisation is sensitive |
| QK LayerNorm weights | ~2.4K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; attention normalisation |
| Total | ~32.2B | 100% | Tiered | 4.62 avg | 18.62 GB | – |
Final tier breakdown from quantization run:
```
TQ3_1S:  6,969 tensors → 17.12 GB
Q4_0:       40 tensors →  1.48 GB
FP16:      113 tensors →  0.02 GB
---------------------------------
Total:   7,122 tensors → 18.62 GB
```
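The tier assignment itself reduces to pattern-matching on tensor names. A minimal sketch under assumed name patterns (the real tqllm rules match Sarvam's exact tensor naming):

```python
def classify_tensor(name: str) -> str:
    """Assign a quantization tier from a tensor name (illustrative patterns only)."""
    if any(k in name for k in ("router", "norm")):
        return "FP16"     # tiny, routing- or normalization-sensitive tensors
    if any(k in name for k in ("qkv", "attn", "embed", "lm_head")):
        return "Q4_0"     # non-power-of-two dims or precision-sensitive tensors
    return "TQ3_1S"       # expert / dense FFN weights: ~90% of all parameters
```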
## Quality Validation

### Weight-Level Metrics
| Metric | TQ3_1S (experts) | Q4_0 (attention) | Requirement |
|---|---|---|---|
| Cosine similarity vs BF16 | >0.993 | >0.999 | >0.95 |
| Relative MSE | <2.5% | <0.5% | <10% |
| Max absolute error | bounded | bounded | finite |
| Forward pass (no NaN/Inf) | pass | pass | mandatory |
### Test Suite
164 tests covering the full pipeline:
| Test Category | Count | Status |
|---|---|---|
| FWHT correctness (Parseval, self-inverse, Gaussianity) | 22 | All pass |
| Lloyd-Max codebook (symmetry, MSE bounds, caching) | 23 | All pass |
| Bit-packing roundtrip (1-8 bit, TQ3_1S blocks) | 35 | All pass |
| TQ3_1S quantizer (shapes, MSE, determinism, d=4096) | 12 | All pass |
| Q4_0 quantizer | 5 | All pass |
| Tiered classification (real Sarvam tensor names) | 17 | All pass |
| GGUF write/read roundtrip | 13 | All pass |
| Inference linear layers (TQ3_1S, Q4, FP16) | 8 | All pass |
| Integration with real Sarvam 30B weights | 12 | All pass |
| Remaining (conftest, profiles) | 17 | All pass |
| Total | 164 | All pass |
Integration tests load actual tensors from the first Sarvam 30B safetensors shard and verify:
- Expert `gate_proj` (1024x4096) achieves >0.98 cosine similarity after TQ3_1S roundtrip
- Dense FFN `gate_proj` (8192x4096) achieves >0.99 cosine similarity
- Fused QKV (4608x4096) achieves >0.999 cosine similarity via Q4_0
- Full GGUF write → read → verify pipeline works end-to-end
- `TQ3_1SLinear` forward pass output matches FP16 reference within tolerance
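The similarity metric in these checks is plain flattened cosine similarity. A sketch of the roundtrip check (the `quantize`/`dequantize` names are placeholders, not the tqllm API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Flattened cosine similarity used as the roundtrip quality metric."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# w_hat = dequantize(quantize(w))            # placeholder roundtrip
# assert cosine_similarity(w, w_hat) > 0.98  # threshold for expert tensors
```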
## Original Model: Sarvam 30B
sarvamai/sarvam-30b is a Mixture-of-Experts transformer released March 2026 by Sarvam AI under Apache 2.0. It was trained from scratch for Indian languages and powers Sarvam's Samvaad conversational agent platform.
### Architecture
| Parameter | Value |
|---|---|
| Total parameters | ~32 billion |
| Active parameters per token | 2.4 billion |
| Layers | 19 (1 dense + 18 MoE) |
| Hidden dimension | 4096 |
| Attention heads | 64 query, 4 key-value (Grouped Query Attention) |
| Head dimension | 64 |
| Experts per MoE layer | 128 routed + 1 shared |
| Active experts per token | 6 (top-6 sigmoid routing) |
| Expert intermediate size | 1024 |
| Dense FFN intermediate size | 8192 |
| Vocabulary size | 262,144 |
| Max context length | 131,072 tokens |
| Positional encoding | RoPE (theta = 8 x 10^6) |
| Activation | SwiGLU |
| Normalization | RMSNorm + QK LayerNorm |
| Attention projection | Fused QKV (4608 x 4096) |
### Full-Precision Benchmark Scores (from Sarvam AI)

#### General Capabilities
| Benchmark | Sarvam 30B | Gemma 27B | Mistral 3.2 24B | OLMo 3.1 32B | Nemotron 3 30B | Qwen3 30B | GLM 4.7 Flash | GPT-OSS 20B |
|---|---|---|---|---|---|---|---|---|
| MMLU | 85.1 | 81.2 | 80.5 | 86.4 | 84.0 | 88.4 | 86.9 | 85.3 |
| MMLU-Pro | 80.0 | 68.1 | 69.1 | 72.0 | 78.3 | 80.9 | 73.6 | 75.0 |
| Math500 | 97.0 | 87.4 | 69.4 | 96.2 | 98.0 | 97.6 | 97.0 | 94.2 |
| HumanEval | 92.1 | 88.4 | 92.9 | 95.1 | 97.6 | 95.7 | 96.3 | 95.7 |
| MBPP | 92.7 | 81.8 | 78.3 | 58.7 | 91.9 | 94.3 | 91.8 | 95.3 |
| LiveCodeBench v6 | 70.0 | 28.0 | 26.0 | 73.0 | 68.3 | 66.0 | 64.0 | 61.0 |
| Arena Hard v2 | 49.0 | 50.1 | 43.1 | 42.0 | 67.7 | 72.1 | 58.1 | 62.9 |
#### Reasoning
| Benchmark | Sarvam 30B |
|---|---|
| GPQA Diamond | 66.5 |
| AIME 2025 (Pass@1) | 80.0 |
| AIME 2025 (w/ tools) | 96.7 |
| HMMT Feb 2025 | 73.3 |
| HMMT Nov 2025 | 74.2 |
| Beyond AIME | 58.3 |
#### Agentic
| Benchmark | Sarvam 30B |
|---|---|
| BrowseComp | 35.5 |
| SWE-Bench Verified | 34.0 |
| Tau2 (avg.) | 45.7 |
#### Indian Languages
Sarvam 30B wins 89% of pairwise comparisons across all benchmarked dimensions and 87% on STEM, mathematics, and coding in Indian languages, evaluated via LLM-as-judge across 10+ languages.
## Hardware Compatibility
| Hardware | VRAM | 8K Context | 32K Context | 128K Context |
|---|---|---|---|---|
| RTX 4090 | 24 GB | yes | yes | no |
| RTX 3090 | 24 GB | yes | yes | no |
| Apple M4 Max | 36 GB | yes | yes | yes |
| Apple M2 Ultra | 192 GB | yes | yes | yes |
| RTX 4080 | 16 GB | no | no | no |
| RTX 3080 | 10 GB | no | no | no |
| 1x A100 80 GB | 80 GB | yes | yes | yes |
Note: Context-length memory depends on KV cache size. Sarvam 30B uses GQA with only 4 KV heads (head_dim = 64), so the KV cache is comparatively small: roughly 600 MB at 32K context in FP16, versus ~9.7 GB if all 64 heads were cached. Adding TurboQuant KV cache compression would further reduce this.
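The figure follows directly from the architecture table; a back-of-the-envelope check:

```python
# FP16 KV cache for Sarvam 30B: 19 layers, 4 KV heads, head_dim 64.
n_layers, n_kv_heads, head_dim, fp16_bytes = 19, 4, 64, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes  # K and V planes
print(per_token)                      # 19,456 bytes ≈ 19 KB per token
print(per_token * 32_768 / 2**20)     # ≈ 608 MiB at 32K context
```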
## Quantization Process

### Environment

- Hardware: Apple M4 Max (128 GB unified memory)
- Software: Python 3.14, PyTorch 2.1, tqllm v0.1.0
- Source: 26 safetensors shards from `sarvamai/sarvam-30b`
### Pipeline

```
For each of 26 shards:
  For each tensor in shard:
    1. Classify tensor name → tier (TQ3_1S / Q4_0 / FP16)
    2. If TQ3_1S:
       a. Extract per-row L2 norms
       b. Normalize to unit vectors
       c. Apply random sign flip (deterministic from seed)
       d. Apply Fast Walsh-Hadamard Transform (butterfly algorithm)
       e. Compute dual FP16 half-block scales (absmax per 16 elements)
       f. Quantize via searchsorted against Lloyd-Max codebook
       g. Pack 32 x 3-bit codes + 2 scales into 16-byte block
    3. If Q4_0:
       a. Compute per-block absmax scale
       b. Quantize to 4-bit unsigned codes
       c. Pack 2 codes per byte (18 bytes per 32 weights)
    4. If FP16: pass through unchanged
    5. Write to GGUF with 32-byte alignment
```
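The TQ3_1S branch (step 2) is sketched piecewise in the sections above. For the Q4_0 branch (step 3), a compact NumPy sketch; llama.cpp's real Q4_0 uses a slightly different scale sign convention and nibble order, so treat this as illustrative:

```python
import numpy as np

def quantize_q4_0_block(w: np.ndarray) -> bytes:
    """One Q4_0-style block: 32 weights -> 2-byte FP16 scale
    + 16 bytes of packed 4-bit codes = 18 bytes per 32 weights."""
    assert w.shape == (32,)
    scale = float(np.abs(w).max()) / 8.0 or 1.0                    # absmax block scale
    q = np.clip(np.round(w / scale).astype(np.int32) + 8, 0, 15)   # unsigned codes 0..15
    packed = (q[:16] | (q[16:] << 4)).astype(np.uint8)             # 2 codes per byte
    return np.float16(scale).tobytes() + packed.tobytes()
```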
### Performance
| Metric | Value |
|---|---|
| Tensors processed | 7,122 |
| Quantization time | 120 minutes |
| Throughput | ~1.0 tensors/second |
| Peak RAM | ~23 GB |
| Output file | 18.62 GB GGUF |
### Reproducibility

```bash
git clone https://github.com/junainfinity/tqllm
cd tqllm && pip install -e ".[all]"

# Download Sarvam 30B (~64 GB)
huggingface-cli download sarvamai/sarvam-30b --local-dir ./models/sarvam-30b

# Quantize (~120 minutes)
python scripts/quantize_sarvam.py

# Verify
pytest tests/ -v  # 164 tests passing
```
## Technical Details

### Why Tiered (Not Uniform) Quantization?

**Routed experts at 3.5-bit: the biggest and most forgiving.** The MoQE paper (ICLR 2026) found that MoE expert layers are significantly more robust to quantization than dense FFN layers. Sarvam 30B activates only 6 of 128 experts per token, so a quantization error in any single expert only corrupts the ~4.7% of tokens that route through it. This isolation effect means experts can tolerate aggressive compression that would cripple a dense model. The 29B parameters here are 90% of the model, so every fraction of a bit saved multiplies across gigabytes.
**Attention at 4-bit: small but sensitive.** The fused QKV projection has output dimension 4608 (64 heads x 64 head_dim + 2 x 4 KV heads x 64 head_dim), which is not a power of two. FWHT requires power-of-two dimensions, so these weights cannot use TQ3_1S and fall back to Q4_0. This costs only 1.48 GB but protects the quality backbone.
**Routers at FP16: non-negotiable.** The router is a tiny 128 x 4096 matrix per MoE layer (~10M params total, 0.02 GB). Its job is to decide which experts fire. A quantization error that routes token X to expert 47 instead of expert 23 is a discrete, catastrophic mistake. The memory savings from quantizing routers would be ~0.01 GB; the risk is total. Keep them at FP16.
### FWHT Implementation

The butterfly algorithm runs in-place with log2(d) stages. For d=4096:

```
Stage  1: stride=1,    2048 butterfly pairs → 4096 add/sub ops
Stage  2: stride=2,    2048 butterfly pairs → 4096 add/sub ops
...
Stage 12: stride=2048, 2048 butterfly pairs → 4096 add/sub ops
Normalize by 1/sqrt(4096) = 1/64
```
Total: 12 x 4096 = 49,152 additions per row (vs 16,777,216 multiplications for dense QR rotation).
FWHT is self-inverse: applying it twice recovers the original vector. This means dequantization uses the exact same function as quantization โ no separate inverse implementation needed.
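A minimal NumPy version of the butterfly, vectorized over the last axis (equivalent in effect to the tqllm kernel, not necessarily in code):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Normalized fast Walsh-Hadamard transform over the last axis.
    log2(d) butterfly stages; self-inverse because (H/sqrt(d))^2 = I."""
    x = x.astype(np.float32).copy()
    d = x.shape[-1]
    assert d & (d - 1) == 0, "FWHT needs a power-of-two dimension"
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = x[..., start:start + h].copy()
            b = x[..., start + h:start + 2 * h]
            x[..., start:start + h] = a + b          # butterfly: sum
            x[..., start + h:start + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return x / np.sqrt(d)

v = np.random.default_rng(42).standard_normal(4096).astype(np.float32)
assert np.allclose(fwht(fwht(v)), v, atol=1e-3)  # self-inverse roundtrip
```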
### Seed Management

Each unique weight dimension uses a different random sign vector (seed 42 for d=4096, seed 43 for d=1024, etc.). Within each layer, the seed is offset by the layer index to ensure rotation diversity across layers. The sign vectors are regenerated deterministically from the seed during dequantization; they are never stored.
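A sketch of the regeneration, with assumed seed arithmetic (base seed per dimension plus a layer offset, as described above):

```python
import numpy as np

def sign_vector(dim: int, base_seed: int, layer_idx: int) -> np.ndarray:
    """Deterministic ±1 sign flips, regenerated identically at dequantization."""
    rng = np.random.default_rng(base_seed + layer_idx)  # assumed offset scheme
    return rng.integers(0, 2, size=dim).astype(np.float32) * 2.0 - 1.0

signs = sign_vector(4096, base_seed=42, layer_idx=7)    # never stored in the GGUF
```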
## Limitations

- **Custom GGUF type:** TQ3_1S uses tensor type ID 100, which is not in mainline llama.cpp. Inference requires a patched build or the tqllm Python runtime.
- **No Indic benchmark validation yet:** The quantized model has not been evaluated on Indian-language benchmarks. Sarvam 30B's primary value proposition is multilingual Indic performance; validating this at sub-4-bit precision is critical future work.
- **CPU dequantization:** The tqllm Python inference runtime dequantizes weights on every forward pass without kernel fusion. For production use, fused CUDA/Metal kernels would be needed.
- **Fused QKV fallback:** The 4608-dimensional fused QKV projection uses Q4_0 instead of TQ3_1S because 4608 is not a power of two. Zero-padding to 8192 could enable TQ3_1S but would waste bits.
## References
- Google Research. TurboQuant: Online Vector Quantization with Optimal Rate-Distortion. ICLR 2026.
- Sarvam AI. Open-Sourcing Sarvam 30B and 105B. March 2026.
- David T (turbo-tan). TQ3_1S implementation for llama.cpp. GitHub, 2026.
- Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. Info. Theory, 1982.
- MoQE: Quantization of Mixture-of-Experts Models. ICLR 2026.
## Citation

```bibtex
@misc{tqllm2026,
  title={tqllm: TQ3\_1S Tiered Weight Quantization for Large Language Models},
  author={Arjun Subburaj},
  year={2026},
  url={https://github.com/junainfinity/tqllm}
}
```
## License

This quantized model inherits the Apache 2.0 license from the base model `sarvamai/sarvam-30b`. The quantization toolkit tqllm is released under the MIT license.