Sarvam-30B — IQ2_M GGUF (Indic Calibrated)
This is an IQ2_M quantization of sarvamai/sarvam-30b, produced using llama.cpp with a multilingual Indic calibration dataset.
What makes this different
The official Sarvam GGUF release provides Q4_K_M only. This repo adds:
- IQ2_M quantization: approximately 10 GB vs 19.6 GB for Q4_K_M
- Indic imatrix calibration: the importance matrix was computed from text spanning 22 Indian languages, including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Gujarati, Marathi, Punjabi, Urdu, and Odia. The aim is to preserve the weights that matter for Indic text at roughly 2-bit precision better than a generic English calibration set would.
- imatrix file included: reusable for producing other quant levels
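As a rough sketch of how the included imatrix could be reused for another quant level, the helper below wraps `llama-quantize`. The function name, the `LLAMA_QUANTIZE` variable, and the output file name are illustrative assumptions, not part of this repo. `--allow-requantize` is only needed when the source GGUF is already quantized; it can be dropped for an F16 or BF16 source.

```shell
# Hypothetical wrapper around llama-quantize; adjust the binary path to your build.
LLAMA_QUANTIZE="${LLAMA_QUANTIZE:-./llama-quantize}"

# Usage: requantize SOURCE_GGUF IMATRIX_FILE QUANT_TYPE
requantize() {
  src="$1"
  imatrix="$2"
  qtype="$3"
  # Output name is an assumption that mirrors this repo's naming scheme.
  out="sarvam-30b-${qtype}-indic.gguf"
  "$LLAMA_QUANTIZE" --allow-requantize --imatrix "$imatrix" "$src" "$out" "$qtype"
}
```

For example, `requantize SOURCE.gguf ./sarvam-30b-IQ2_M/sarvam-30b-indic.imatrix IQ3_XXS` would write `sarvam-30b-IQ3_XXS-indic.gguf`, where `SOURCE.gguf` stands for whichever source model you have locally.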
Files
| File | Size | Description |
|---|---|---|
| sarvam-30b-IQ2_M-indic.gguf | ~10 GB | IQ2_M quantized model |
| sarvam-30b-indic.imatrix | ~82 MB | Indic calibration imatrix |
| indic_calibration_final.txt | ~67 MB | Indic calibration data |
Usage with llama.cpp
# Download
huggingface-cli download deepak-p-yadav/sarvam-30b-IQ2_M-indic \
--local-dir ./sarvam-30b-IQ2_M
# Run server
./llama-server \
-m ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf \
--port 8080 -n 2000
# Run CLI
./llama-cli \
-m ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf \
-n 2000 --no-warmup
Important: reasoning model behaviour
Sarvam-30B is a reasoning model. Before answering, it emits a [Start thinking]
chain, typically 500-2000 tokens long. This is expected behaviour, not a bug.
Set -n to at least 2000 so that generation reaches the final answer.
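When talking to `llama-server` over HTTP, the same token budget can also be set per request. The sketch below assumes the server from the usage section is listening on port 8080 and uses llama-server's OpenAI-compatible `/v1/chat/completions` endpoint with its `max_tokens` field. The helper names and the `SERVER_URL` variable are hypothetical.

```shell
# Hypothetical helpers; port 8080 matches the llama-server command in the usage section.
SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Escape backslashes and double quotes so a single-line string is safe inside JSON.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a chat request body; max_tokens defaults to 2000 to leave room for the thinking chain.
chat_payload() {
  question="$1"
  max_tokens="${2:-2000}"
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%s}' \
    "$(json_escape "$question")" "$max_tokens"
}

# Send the request; needs curl and a running llama-server.
ask() {
  curl -s "$SERVER_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "${2:-2000}")"
}
```

Calling `ask "भारत के बारे में बताओ" 3000` would then request up to 3000 tokens. Note that `json_escape` only handles single-line input; multi-line prompts would need a proper JSON encoder.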
Prompt format
[@BOS@]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
YOUR QUESTION HERE<|im_end|>
<|im_start|>assistant
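If you need to assemble this prompt by hand, for instance for a raw completion request, a small helper along these lines may be useful. It is a sketch that reproduces the layout shown above exactly, including the line break after `[@BOS@]`; the function name `build_prompt` is made up for illustration, and depending on your llama.cpp build the chat template or BOS token may already be applied for you.

```shell
# Hypothetical helper: prints the prompt layout shown above.
# Usage: build_prompt SYSTEM_MESSAGE USER_QUESTION
build_prompt() {
  system_msg="$1"
  user_msg="$2"
  # %s placeholders keep the message text from being interpreted as printf escapes.
  printf '[@BOS@]\n<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' \
    "$system_msg" "$user_msg"
}
```

For example, `build_prompt "You are a helpful assistant." "भारत के बारे में बताओ"` prints the template with both fields filled in.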
Quantization details
- Base model: sarvamai/sarvam-30b (Q4_K_M GGUF as source)
- Tool: llama.cpp build b8651
- Method: IQ2_M with importance matrix (--allow-requantize)
- Calibration: 128 chunks, Indic multilingual text, 22 languages
- Architecture: bailingmoe2 (19 layers, 128 experts, top-6 routing)
- Platform: Produced on Windows x64, CPU only
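The calibration step listed above can be approximated with `llama-imatrix`. The following is a minimal sketch, not the exact command used for this repo: the function name and the `LLAMA_IMATRIX` variable are assumptions, and the source GGUF path is left to the caller. `--chunks` caps the number of calibration chunks processed, matching the 128 chunks stated above.

```shell
# Hypothetical wrapper around llama-imatrix; adjust the binary path to your build.
LLAMA_IMATRIX="${LLAMA_IMATRIX:-./llama-imatrix}"

# Usage: compute_imatrix SOURCE_GGUF CALIBRATION_TXT OUTPUT_IMATRIX [CHUNKS]
compute_imatrix() {
  src="$1"
  calib="$2"
  out="$3"
  chunks="${4:-128}"   # default matches the chunk count listed in this section
  "$LLAMA_IMATRIX" -m "$src" -f "$calib" -o "$out" --chunks "$chunks"
}
```

With the files from this repo, a call might look like `compute_imatrix SOURCE.gguf indic_calibration_final.txt sarvam-30b-indic.imatrix`, where `SOURCE.gguf` is a placeholder for the source model file.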
Hardware requirements
| Setup | Works? | Notes |
|---|---|---|
| 16 GB RAM, no GPU | Yes | ~6-8 t/s, slow but functional |
| 32 GB RAM, no GPU | Yes | ~8-10 t/s |
| Any NVIDIA GPU | Yes | Add -ngl 99 to offload all layers to the GPU; lower the value if the ~10 GB model does not fit in VRAM |
Sample output
Prompt: भारत के बारे में बताओ ("Tell me about India")
Response (after thinking chain):
भारत एक विशाल और विविधतापूर्ण देश है जो दक्षिणी एशिया में स्थित है... ("India is a vast and diverse country located in South Asia...")
Usage with Ollama
Create a Modelfile:
FROM ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf
TEMPLATE """{{ if .System }}<|start_of_turn|><|system|>
{{ .System }}<|end_of_turn|>
{{ end }}{{ if .Prompt }}<|start_of_turn|><|user|>
{{ .Prompt }}<|end_of_turn|>
{{ end }}<|start_of_turn|><|assistant|>
"""
PARAMETER stop "<|end_of_turn|>"
PARAMETER stop "<|start_of_turn|>"
PARAMETER num_predict 3000
SYSTEM "You are a helpful Indic multilingual assistant. Answer directly in the language the user writes in."
Then run:
ollama create sarvam-30b-indic -f Modelfile
ollama run sarvam-30b-indic "भारत के बारे में बताओ"
Quality notes
This model was requantized from Sarvam's official Q4_K_M GGUF using
--allow-requantize, so any quantization error already present in Q4_K_M
carries over. Qualitative testing shows coherent Hindi, Tamil, and Telugu
output on simple factual prompts. IQ2_M introduces some quality degradation
compared to Q4_K_M, most noticeable on complex multi-step reasoning tasks.
For maximum quality, use the official Q4_K_M from sarvamai/sarvam-30b-gguf.
Citation
@misc{sarvam_sovereign_models,
title={Introducing Sarvam's Sovereign Models},
author={{Sarvam Foundation Models Team}},
year={2026},
url={https://www.sarvam.ai/blogs/sarvam-30b-105b}
}
License
This quantization is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0),
inherited from the original [sarvamai/sarvam-30b](https://huggingface.co/sarvamai/sarvam-30b) model.