Sarvam 30B — TQ3_1S Tiered Weight Quantization (GGUF)

64.4 GB → 18.62 GB | 3.46x compression | 71% size reduction

From 4x A100 80GB to a single RTX 4090. Same intelligence, fraction of the hardware.

TQ3_1S tiered weight quantization of sarvamai/sarvam-30b using the tqllm open-source toolkit. Quantized on Apple M4 Max in 120 minutes. Validated with 164 unit and integration tests against real model weights.


Quick Start

# Download (18.62 GB)
huggingface-cli download VibeStudio/sarvam-30b-TQ3_1S-GGUF --local-dir ./sarvam-30b-tq3

# Run with llama.cpp (requires TQ3_1S-patched build)
./llama-server -m sarvam-30b-TQ3_1S.gguf -ngl 99 -c 32768

# Or use ollama (when GGUF support lands)
ollama run sarvam-30b-tq3

Files

| File | Size | Description |
|---|---|---|
| sarvam-30b-TQ3_1S.gguf | 18.62 GB | Full quantized model (7,122 tensors) |
| README.md | — | This model card |

Compression Summary

| Metric | BF16 (Original) | TQ3_1S (This Model) | Change |
|---|---|---|---|
| Model file size | 64.4 GB | 18.62 GB | -71.1% |
| Average bits per weight | 16.0 | 4.62 | -71.1% |
| Compression ratio | 1.0x | 3.46x | — |
| Minimum GPU (full offload) | 4x A100 80 GB | 1x RTX 4090 24 GB | -75% GPU cost |
| Estimated hardware cost | ~$60,000 | ~$1,600 | -97% |

What is TQ3_1S?

TQ3_1S is a 3.5-bit-per-weight quantization format based on Google's TurboQuant (ICLR 2026). It combines three techniques:

1. Walsh-Hadamard Rotation

Each weight row is multiplied by a random orthogonal matrix via the Fast Walsh-Hadamard Transform (FWHT). This decorrelates coordinates, making them approximately independent and identically distributed as Gaussian. The transform runs in O(d log d) time and requires only a small random sign vector (not a full d x d matrix) for storage.

Why it matters: Raw weight rows have correlated channels — some carry 10-100x more magnitude than others. After rotation, every coordinate carries equal information, so a single shared codebook works well on all of them.
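The sign-flip plus FWHT rotation can be sketched in a few lines of numpy. This is a minimal illustration of the technique, not tqllm's actual API; the dimension and seed here are arbitrary.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform via the butterfly
    algorithm, O(d log d). Requires len(x) to be a power of two."""
    x = x.astype(np.float64).copy()
    d = x.shape[0]
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)  # orthonormal scaling makes it self-inverse

d = 1024
rng = np.random.default_rng(42)                 # illustrative seed
signs = rng.choice([-1.0, 1.0], size=d)         # small random sign vector
row = rng.standard_normal(d) * np.linspace(0.1, 10, d)  # correlated magnitudes
rotated = fwht(signs * row)

# Norm is preserved (Parseval), and sign-flip + FWHT inverts itself:
assert np.isclose(np.linalg.norm(rotated), np.linalg.norm(row))
assert np.allclose(signs * fwht(rotated), row)
```

Because the transform is self-inverse, dequantization applies the same `fwht` followed by the same sign flip; only the seed needs to be known.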

2. Lloyd-Max 3-bit Scalar Quantization

Each rotated coordinate is snapped to one of 8 optimal centroids precomputed for the standard Gaussian N(0,1) via the Lloyd-Max algorithm. No calibration data is needed — the codebook is derived analytically from the distribution.

The 3-bit codebook (shared globally across all 32 billion weights):

| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Centroid | -2.152 | -1.344 | -0.756 | -0.245 | +0.245 | +0.756 | +1.344 | +2.152 |

Theoretical MSE for 3-bit Gaussian quantization: 0.0345. Our measured MSE: 0.035 (near-optimal).
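A minimal sketch of applying this codebook: with sorted centroids, nearest-centroid assignment reduces to `searchsorted` against the midpoints between adjacent centroids. The toolkit's implementation may differ in detail.

```python
import numpy as np

# The 8-entry Lloyd-Max codebook for N(0,1) quoted above.
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])
# Decision boundaries sit at midpoints between adjacent centroids.
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def quantize_gaussian_3bit(x: np.ndarray) -> np.ndarray:
    """Map each value to the index (0-7) of its nearest centroid."""
    return np.searchsorted(BOUNDARIES, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
codes = quantize_gaussian_3bit(x)
mse = np.mean((CODEBOOK[codes] - x) ** 2)
print(f"empirical MSE: {mse:.4f}")   # ~0.035, consistent with the figure above
```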

3. Dual FP16 Half-Block Scales

Each block of 32 weights is split into two halves of 16. Each half gets its own FP16 scale factor, recovering fine-grained dynamic range. This is the "1S" (one scale pair) — the key innovation that separates TQ3_1S from plain 3-bit quantization and brings quality close to Q4_0 at 12.5% smaller size.

Block layout (32 weights → 16 bytes):

Bytes 0-1:   d0 (FP16) — scale for weights 0-15
Bytes 2-3:   d1 (FP16) — scale for weights 16-31
Bytes 4-15:  32 x 3-bit codes packed into 96 bits
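A hedged sketch of encoding one block under this layout. The scale convention (absmax mapped to the outermost centroid) and the little-endian bit order are assumptions for illustration, not the reference encoder.

```python
import numpy as np

CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])
BOUNDARIES = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2

def pack_tq3_block(weights: np.ndarray) -> bytes:
    """Pack one block of 32 rotated weights into the 16-byte layout above."""
    assert weights.shape == (32,)
    out = bytearray(16)
    codes = np.empty(32, dtype=np.int64)
    for half in range(2):                          # two half-blocks of 16
        w = weights[half * 16:(half + 1) * 16]
        # Assumption: absmax of the half-block maps onto the outer centroid.
        scale = max(float(np.abs(w).max()) / 2.152, 1e-8)
        out[2 * half:2 * half + 2] = np.float16(scale).tobytes()   # d0 / d1
        codes[half * 16:(half + 1) * 16] = np.searchsorted(BOUNDARIES, w / scale)
    # 32 x 3-bit codes -> 96 bits in bytes 4..15 (bit order is an assumption).
    bits = 0
    for i, c in enumerate(codes):
        bits |= int(c) << (3 * i)
    out[4:16] = bits.to_bytes(12, "little")
    return bytes(out)

block = pack_tq3_block(np.random.default_rng(1).standard_normal(32))
assert len(block) == 16
```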

Tiered Quantization Strategy

Different model components have different sensitivity to quantization error. We assign bit-widths based on three criteria: error isolation (MoE routing), parameter volume, and whether errors are systemic or contained.

| Component | Parameters | % of Total | Tier | Bits | Size | Rationale |
|---|---|---|---|---|---|---|
| Routed experts (128 x 18 layers) | ~29.0B | 90.1% | TQ3_1S | 3.5 | 17.12 GB | MoE isolation: error in one expert only affects the ~4.7% of tokens routed to it |
| Shared experts (1 x 18 layers) | ~0.2B | 0.7% | TQ3_1S | 3.5 | (incl. above) | Same structure as routed; small |
| Dense FFN (layer 0 only) | ~0.1B | 0.3% | TQ3_1S | 3.5 | (incl. above) | All dimensions are powers of 2; FWHT-compatible |
| Fused QKV attention (19 layers) | ~0.5B | 1.5% | Q4_0 | 4.0 | 1.48 GB | Fused dim 4608 is NOT a power of 2, so FWHT cannot apply; every token passes through |
| Attention output dense (19 layers) | ~0.3B | 0.9% | Q4_0 | 4.0 | (incl. above) | Shared across all tokens; moderate sensitivity |
| Word embeddings | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Non-uniform token distribution resists 3-bit |
| LM head | ~1.07B | 3.3% | Q4_0 | 4.0 | (incl. above) | Output projection; needs precision for top-k sampling |
| Router weights | ~9.4M | <0.1% | FP16 | 16.0 | 0.02 GB | Routing is a discrete decision; quantization error causes catastrophic misrouting |
| Router expert biases | ~2.3K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; affects routing |
| RMSNorm weights | ~0.3M | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; normalization is sensitive |
| QK LayerNorm weights | ~2.4K | <0.01% | FP16 | 16.0 | (incl. above) | Tiny; attention normalization |
| Total | ~32.2B | 100% | Tiered | 4.62 avg | 18.62 GB | |

Final tier breakdown from quantization run:

TQ3_1S:  6,969 tensors → 17.12 GB
Q4_0:       40 tensors →  1.48 GB
FP16:      113 tensors →  0.02 GB
─────────────────────────────────
Total:   7,122 tensors → 18.62 GB
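The tiering logic above amounts to a name-based classifier. The following is a hypothetical sketch: the regexes and tensor names are illustrative, and tqllm's real classifier (tested against actual Sarvam tensor names) may match differently.

```python
import re

def classify(name: str) -> str:
    """Assign a quantization tier from a tensor name (illustrative rules)."""
    if re.search(r"(router|e_score_bias|norm)", name):
        return "FP16"      # routers, router biases, norms: keep full precision
    if re.search(r"(qkv|attn\.(o|dense)|embed|lm_head)", name):
        return "Q4_0"      # attention projections + embeddings: 4-bit fallback
    return "TQ3_1S"        # experts and dense FFN: 3.5-bit tier

# Hypothetical tensor names for illustration:
assert classify("model.layers.3.mlp.experts.47.gate_proj.weight") == "TQ3_1S"
assert classify("model.layers.3.mlp.router.weight") == "FP16"
assert classify("model.embed_tokens.weight") == "Q4_0"
```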

Quality Validation

Weight-Level Metrics

| Metric | TQ3_1S (experts) | Q4_0 (attention) | Requirement |
|---|---|---|---|
| Cosine similarity vs BF16 | >0.993 | >0.999 | >0.95 |
| Relative MSE | <2.5% | <0.5% | <10% |
| Max absolute error | bounded | bounded | finite |
| Forward pass (no NaN/Inf) | pass | pass | mandatory |
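Both headline metrics are straightforward to compute per tensor. A sketch, with synthetic noise standing in for real quantization error:

```python
import numpy as np

def weight_metrics(original: np.ndarray, dequantized: np.ndarray) -> dict:
    """Cosine similarity and relative MSE between a tensor and its roundtrip."""
    a = original.ravel().astype(np.float64)
    b = dequantized.ravel().astype(np.float64)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    rel_mse = float(np.mean((a - b) ** 2) / np.mean(a ** 2))
    return {"cosine": cosine, "rel_mse": rel_mse}

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
noisy = w + 0.05 * rng.standard_normal(w.shape)  # stand-in for quantization error
m = weight_metrics(w, noisy)
assert m["cosine"] > 0.99 and m["rel_mse"] < 0.01
```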

Test Suite

164 tests covering the full pipeline:

| Test Category | Count | Status |
|---|---|---|
| FWHT correctness (Parseval, self-inverse, Gaussianity) | 22 | All pass |
| Lloyd-Max codebook (symmetry, MSE bounds, caching) | 23 | All pass |
| Bit-packing roundtrip (1-8 bit, TQ3_1S blocks) | 35 | All pass |
| TQ3_1S quantizer (shapes, MSE, determinism, d=4096) | 12 | All pass |
| Q4_0 quantizer | 5 | All pass |
| Tiered classification (real Sarvam tensor names) | 17 | All pass |
| GGUF write/read roundtrip | 13 | All pass |
| Inference linear layers (TQ3_1S, Q4, FP16) | 8 | All pass |
| Integration with real Sarvam 30B weights | 12 | All pass |
| Remaining (conftest, profiles) | 17 | All pass |
| Total | 164 | All pass |

Integration tests load actual tensors from the first Sarvam 30B safetensors shard and verify:

  • Expert gate_proj (1024x4096) achieves >0.98 cosine similarity after TQ3_1S roundtrip
  • Dense FFN gate_proj (8192x4096) achieves >0.99 cosine similarity
  • Fused QKV (4608x4096) achieves >0.999 cosine similarity via Q4_0
  • Full GGUF write → read → verify pipeline works end-to-end
  • TQ3_1SLinear forward pass output matches FP16 reference within tolerance

Original Model: Sarvam 30B

sarvamai/sarvam-30b is a Mixture-of-Experts transformer released March 2026 by Sarvam AI under Apache 2.0. It was trained from scratch for Indian languages and powers Sarvam's Samvaad conversational agent platform.

Architecture

| Parameter | Value |
|---|---|
| Total parameters | ~32 billion |
| Active parameters per token | 2.4 billion |
| Layers | 19 (1 dense + 18 MoE) |
| Hidden dimension | 4096 |
| Attention heads | 64 query, 4 key-value (Grouped Query Attention) |
| Head dimension | 64 |
| Experts per MoE layer | 128 routed + 1 shared |
| Active experts per token | 6 (top-6 sigmoid routing) |
| Expert intermediate size | 1024 |
| Dense FFN intermediate size | 8192 |
| Vocabulary size | 262,144 |
| Max context length | 131,072 tokens |
| Positional encoding | RoPE (theta = 8 x 10^6) |
| Activation | SwiGLU |
| Normalization | RMSNorm + QK LayerNorm |
| Attention projection | Fused QKV (4608 x 4096) |

Full-Precision Benchmark Scores (from Sarvam AI)

General Capabilities

| Benchmark | Sarvam 30B | Gemma 27B | Mistral 3.2 24B | OLMo 3.1 32B | Nemotron 3 30B | Qwen3 30B | GLM 4.7 Flash | GPT-OSS 20B |
|---|---|---|---|---|---|---|---|---|
| MMLU | 85.1 | 81.2 | 80.5 | 86.4 | 84.0 | 88.4 | 86.9 | 85.3 |
| MMLU-Pro | 80.0 | 68.1 | 69.1 | 72.0 | 78.3 | 80.9 | 73.6 | 75.0 |
| Math500 | 97.0 | 87.4 | 69.4 | 96.2 | 98.0 | 97.6 | 97.0 | 94.2 |
| HumanEval | 92.1 | 88.4 | 92.9 | 95.1 | 97.6 | 95.7 | 96.3 | 95.7 |
| MBPP | 92.7 | 81.8 | 78.3 | 58.7 | 91.9 | 94.3 | 91.8 | 95.3 |
| LiveCodeBench v6 | 70.0 | 28.0 | 26.0 | 73.0 | 68.3 | 66.0 | 64.0 | 61.0 |
| Arena Hard v2 | 49.0 | 50.1 | 43.1 | 42.0 | 67.7 | 72.1 | 58.1 | 62.9 |

Reasoning

| Benchmark | Sarvam 30B |
|---|---|
| GPQA Diamond | 66.5 |
| AIME 2025 (Pass@1) | 80.0 |
| AIME 2025 (w/ tools) | 96.7 |
| HMMT Feb 2025 | 73.3 |
| HMMT Nov 2025 | 74.2 |
| Beyond AIME | 58.3 |

Agentic

| Benchmark | Sarvam 30B |
|---|---|
| BrowseComp | 35.5 |
| SWE-Bench Verified | 34.0 |
| Tau2 (avg.) | 45.7 |

Indian Languages

Sarvam 30B wins 89% of pairwise comparisons across all benchmarked dimensions and 87% on STEM, mathematics, and coding in Indian languages, evaluated via LLM-as-judge across 10+ languages.


Hardware Compatibility

| Hardware | VRAM | 8K Context | 32K Context | 128K Context |
|---|---|---|---|---|
| RTX 4090 | 24 GB | yes | yes | no |
| RTX 3090 | 24 GB | yes | yes | no |
| Apple M4 Max | 36 GB | yes | yes | yes |
| Apple M2 Ultra | 192 GB | yes | yes | yes |
| RTX 4080 | 16 GB | no | no | no |
| RTX 3080 | 10 GB | no | no | no |
| 1x A100 80 GB | 80 GB | yes | yes | yes |

Note: Context length memory depends on KV cache size. Sarvam 30B uses GQA with only 4 KV heads (head_dim=64), so the KV cache is relatively small: about 19 KB per token in FP16, roughly 600 MB at 32K context. Adding TurboQuant KV cache compression would further reduce this.
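The KV-cache arithmetic, as a quick check (19 layers, 4 KV heads, head_dim 64, K and V stored in FP16):

```python
# KV cache footprint for Sarvam 30B under GQA, all values from the
# architecture table above.
n_layers, n_kv_heads, head_dim, bytes_fp16 = 19, 4, 64, 2

# Per token: K and V vectors for every KV head in every layer.
per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_fp16
print(per_token)                    # 19,456 bytes ≈ 19 KB per token

mib_at_32k = per_token * 32_768 / 2**20
print(f"{mib_at_32k:.0f} MiB")      # ≈ 608 MiB at 32K context
```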


Quantization Process

Environment

  • Hardware: Apple M4 Max (128 GB unified memory)
  • Software: Python 3.14, PyTorch 2.1, tqllm v0.1.0
  • Source: 26 safetensors shards from sarvamai/sarvam-30b

Pipeline

For each of 26 shards:
  For each tensor in shard:
    1. Classify tensor name → tier (TQ3_1S / Q4_0 / FP16)
    2. If TQ3_1S:
       a. Extract per-row L2 norms
       b. Normalize to unit vectors
       c. Apply random sign flip (deterministic from seed)
       d. Apply Fast Walsh-Hadamard Transform (butterfly algorithm)
       e. Compute dual FP16 half-block scales (absmax per 16 elements)
       f. Quantize via searchsorted against Lloyd-Max codebook
       g. Pack 32 x 3-bit codes + 2 scales into 16-byte block
    3. If Q4_0:
       a. Compute per-block absmax scale
       b. Quantize to 4-bit unsigned codes
       c. Pack 2 codes per byte, plus the FP16 scale (18 bytes per 32 weights)
    4. If FP16: pass through unchanged
    5. Write to GGUF with 32-byte alignment
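Steps 3a-3c can be sketched as follows, assuming the standard llama.cpp Q4_0 block layout (an FP16 scale followed by 16 bytes of packed nibbles); the rounding details here are a simplification of the reference implementation.

```python
import numpy as np

def q4_0_block(weights: np.ndarray) -> bytes:
    """One Q4_0 block: 32 weights -> 18 bytes (2-byte FP16 scale + 16 bytes)."""
    assert weights.shape == (32,)
    # (a) signed absmax scale, mapping the extreme value to code 0 or 15
    amax_idx = np.abs(weights).argmax()
    d = weights[amax_idx] / -8.0
    inv_d = 1.0 / d if d else 0.0
    # (b) quantize to 4-bit unsigned codes in [0, 15]
    q = np.clip(np.round(weights * inv_d) + 8, 0, 15).astype(np.uint8)
    # (c) pack 2 codes per byte: element i in the low nibble, i+16 in the high
    packed = (q[:16] | (q[16:] << 4)).astype(np.uint8)
    return np.float16(d).tobytes() + packed.tobytes()

assert len(q4_0_block(np.random.default_rng(2).standard_normal(32))) == 18
```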

Performance

| Metric | Value |
|---|---|
| Tensors processed | 7,122 |
| Quantization time | 120 minutes |
| Throughput | ~1.0 tensors/second |
| Peak RAM | ~23 GB |
| Output file | 18.62 GB GGUF |

Reproducibility

git clone https://github.com/junainfinity/tqllm
cd tqllm && pip install -e ".[all]"

# Download Sarvam 30B (~64 GB)
huggingface-cli download sarvamai/sarvam-30b --local-dir ./models/sarvam-30b

# Quantize (~120 minutes)
python scripts/quantize_sarvam.py

# Verify
pytest tests/ -v  # 164 tests passing

Technical Details

Why Tiered (Not Uniform) Quantization?

Routed experts at 3.5-bit — the biggest and most forgiving. The MoQE paper (ICLR 2026) found that MoE expert layers are significantly more robust to quantization than dense FFN layers. Sarvam 30B activates only 6 of 128 experts per token, so a quantization error in any single expert only corrupts the ~4.7% of tokens that route through it. This isolation effect means experts can tolerate aggressive compression that would cripple a dense model. The 29B parameters here are 90% of the model — every fraction of a bit saved multiplies across gigabytes.

Attention at 4-bit — small but sensitive. The fused QKV projection has output dimension 4608 (64 heads x 64 head_dim + 2 x 4 KV heads x 64 head_dim), which is not a power of two. FWHT requires power-of-two dimensions, so these weights cannot use TQ3_1S and fall back to Q4_0. This costs only 1.48 GB but protects the quality backbone.

Routers at FP16 — non-negotiable. The router is a tiny 128 x 4096 matrix per MoE layer (~10M params total, 0.02 GB). Its job is to decide which experts fire. A quantization error that routes token X to expert 47 instead of expert 23 is a discrete, catastrophic mistake. The memory savings from quantizing routers would be ~0.01 GB. The risk is total. Keep them at FP16.

FWHT Implementation

The butterfly algorithm runs in-place with log2(d) stages. For d=4096:

Stage 1:  stride=1,    2048 butterfly pairs (4096 add/sub ops)
Stage 2:  stride=2,    2048 butterfly pairs
...
Stage 12: stride=2048, 2048 butterfly pairs
Normalize by 1/sqrt(4096) = 1/64

Total: 12 x 4096 = 49,152 additions per row (vs 4096 x 4096 = 16,777,216 multiplications for a dense QR rotation).

FWHT is self-inverse: applying it twice recovers the original vector. This means dequantization uses the exact same function as quantization โ€” no separate inverse implementation needed.

Seed Management

Each unique weight dimension uses a different random sign vector (seed 42 for d=4096, seed 43 for d=1024, etc.). Within each layer, the seed is offset by the layer index to ensure rotation diversity across layers. The sign vectors are regenerated deterministically from the seed during dequantization โ€” they are never stored.
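A sketch of the idea — regenerate rather than store. The seed values, RNG, and offset scheme here are illustrative assumptions, not tqllm's exact scheme:

```python
import numpy as np

def sign_vector(dim: int, base_seed: int, layer_idx: int) -> np.ndarray:
    """Deterministically regenerate the random sign vector for one layer.
    base_seed is per-dimension; the layer index offsets it for diversity."""
    rng = np.random.default_rng(base_seed + layer_idx)
    return rng.choice(np.array([-1.0, 1.0]), size=dim)

# Same (seed, layer) always yields the same vector, so nothing is stored:
s1 = sign_vector(4096, 42, layer_idx=7)
s2 = sign_vector(4096, 42, layer_idx=7)
assert np.array_equal(s1, s2)
# A different layer offset gives a different rotation:
assert not np.array_equal(s1, sign_vector(4096, 42, layer_idx=8))
```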


Limitations

  1. Custom GGUF type: TQ3_1S uses tensor type ID 100, which is not in mainline llama.cpp. Requires a patched build or the tqllm Python runtime for inference.

  2. No Indic benchmark validation yet: The quantized model has not been evaluated on Indian language benchmarks. Sarvam 30B's primary value proposition is multilingual Indic performance; validating this at sub-4-bit precision is critical future work.

  3. CPU dequantization: The tqllm Python inference runtime dequantizes weights on every forward pass without kernel fusion. For production use, fused CUDA/Metal kernels would be needed.

  4. Fused QKV fallback: The 4608-dimensional fused QKV projection uses Q4_0 instead of TQ3_1S because 4608 is not a power of two. Zero-padding to 8192 could enable TQ3_1S but would waste bits.


References

  1. Google Research. TurboQuant: Online Vector Quantization with Optimal Rate-Distortion. ICLR 2026.
  2. Sarvam AI. Open-Sourcing Sarvam 30B and 105B. March 2026.
  3. David T (turbo-tan). TQ3_1S implementation for llama.cpp. GitHub, 2026.
  4. Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. Info. Theory, 1982.
  5. MoQE: Quantization of Mixture-of-Experts Models. ICLR 2026.

Citation

@misc{tqllm2026,
  title={tqllm: TQ3\_1S Tiered Weight Quantization for Large Language Models},
  author={Arjun Subburaj},
  year={2026},
  url={https://github.com/junainfinity/tqllm}
}

License

This quantized model inherits the Apache 2.0 license from the base model sarvamai/sarvam-30b. The quantization toolkit tqllm is released under the MIT license.
