Sarvam-30B — IQ2_M GGUF (Indic Calibrated)

This is an IQ2_M quantization of sarvamai/sarvam-30b, produced using llama.cpp with a multilingual Indic calibration dataset.

What makes this different

The official Sarvam GGUF release provides Q4_K_M only. This repo adds:

IQ2_M quantization: approximately 10 GB vs 19.6 GB for Q4_K_M
Indic imatrix calibration: the importance matrix is computed from text spanning 22 Indian languages, including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Gujarati, Marathi, Punjabi, Urdu, and Odia. The aim is to preserve the weights that matter most for Indic text at roughly 2-bit precision better than a generic English calibration would.
imatrix file included: reusable for producing other quant levels
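
Reusing the included imatrix for another quant level might look like the sketch below. `llama-quantize` and its `--imatrix` and `--allow-requantize` options are part of llama.cpp; the source GGUF filename, the output filename pattern, and the `IQ3_XXS` target are assumptions chosen for illustration. The helper `quantize_cmd` is hypothetical and only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: print a llama-quantize command that reuses the
# Indic imatrix from this repo for a different quant type.
SRC="./sarvam-30b-Q4_K_M.gguf"          # assumption: local name of the source GGUF
IMATRIX="./sarvam-30b-indic.imatrix"    # imatrix file shipped in this repo

quantize_cmd() {
  # $1 = target quant type, e.g. IQ3_XXS or IQ2_XS
  printf './llama-quantize --allow-requantize --imatrix %s %s ./sarvam-30b-%s-indic.gguf %s\n' \
    "$IMATRIX" "$SRC" "$1" "$1"
}

quantize_cmd IQ3_XXS
```

Piping the printed line to `sh` (for example `quantize_cmd IQ3_XXS | sh`) would run it once the paths match your setup.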

Files

File	Size	Description
`sarvam-30b-IQ2_M-indic.gguf`	~10 GB	IQ2_M quantized model
`sarvam-30b-indic.imatrix`	~82 MB	Indic calibration imatrix
`indic_calibration_final.txt`	~67 MB	Indic calibration data

Usage with llama.cpp

# Download
huggingface-cli download deepak-p-yadav/sarvam-30b-IQ2_M-indic \
  --local-dir ./sarvam-30b-IQ2_M

# Run server
./llama-server \
  -m ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf \
  --port 8080 -n 2000

# Run CLI
./llama-cli \
  -m ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf \
  -n 2000 --no-warmup
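
Once `llama-server` is running, it can be queried over its OpenAI-compatible `/v1/chat/completions` endpoint. The following is a minimal sketch: the helper names `chat_payload` and `ask` are hypothetical, port 8080 comes from the server command above, and `max_tokens` of 2000 simply mirrors the `-n 2000` setting.

```shell
# Build a JSON body for llama-server's OpenAI-compatible chat endpoint.
# Only backslashes and double quotes are escaped, so this sketch assumes
# a single-line prompt without control characters.
chat_payload() {
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":2000}\n' "$escaped"
}

# Send a prompt to the server started with --port 8080.
ask() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1")"
}

# Example (requires the server to be running):
#   ask "भारत के बारे में बताओ"
```

With the chat endpoint the server applies the chat template stored in the GGUF, so no manual prompt formatting is needed.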

Important: reasoning model behaviour

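Sarvam-30B is a reasoning model. It produces a [Start thinking] chain before answering, typically 500-2000 tokens long. This is expected behaviour, not a bug. Set -n to at least 2000 so that generation is not cut off before the final answer; prompts that trigger a chain at the upper end of that range need a larger budget (the Ollama Modelfile below uses 3000).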
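
When capturing output in a script, the thinking chain can be filtered out so that only the final answer remains. The sketch below assumes the chain is closed by a line containing `[End thinking]`; that closing marker is not stated in this card, so check what your llama.cpp build actually prints. The helper name `strip_thinking` is hypothetical.

```shell
# Print only the text that follows the thinking chain.
# Assumption: the chain ends with a line containing "[End thinking]".
# If no such line appears (e.g. -n was too small), nothing is printed.
strip_thinking() {
  awk 'seen { print } /\[End thinking\]/ { seen = 1 }'
}

# Demonstration on canned text rather than real model output.
printf '%s\n' '[Start thinking]' 'step 1' '[End thinking]' 'final answer' | strip_thinking
```

An empty result is a useful signal that the token budget ran out before the answer began.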

Prompt format

[@BOS@]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
YOUR QUESTION HERE<|im_end|>
<|im_start|>assistant
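
For raw completion calls where the template is not applied automatically, the prompt can be assembled in a script. This sketch copies the layout shown above literally, including the `[@BOS@]` line and the line breaks; the helper name `build_prompt` is hypothetical.

```shell
# Assemble a raw prompt in the format shown above.
# $1 = system message, $2 = user message.
build_prompt() {
  printf '[@BOS@]\n<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1" "$2"
}

build_prompt "You are a helpful assistant." "भारत के बारे में बताओ"
```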

Quantization details

Base model: sarvamai/sarvam-30b (Q4_K_M GGUF as source)
Tool: llama.cpp build b8651
Method: IQ2_M with importance matrix (--allow-requantize)
Calibration: 128 chunks, Indic multilingual text, 22 languages
Architecture: bailingmoe2 (19 layers, 128 experts, top-6 routing)
Platform: Produced on Windows x64, CPU only
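
The details above imply a two-step pipeline: compute the importance matrix over the calibration text, then quantize with it. The exact commands used are not given in this card, so the following is a sketch built from llama.cpp's `llama-imatrix` and `llama-quantize` tools. The source filename in `SRC` is an assumption, and the hypothetical helpers only print the commands rather than running them.

```shell
# Sketch of the two steps implied above; prints the commands instead of
# running them. SRC is an assumed filename for the official Q4_K_M GGUF.
SRC="./sarvam-30b-Q4_K_M.gguf"

imatrix_cmd() {
  # -f: calibration text, -o: output imatrix, --chunks: number of chunks to process
  printf './llama-imatrix -m %s -f ./indic_calibration_final.txt -o ./sarvam-30b-indic.imatrix --chunks 128\n' "$SRC"
}

iq2m_cmd() {
  # --allow-requantize is needed because the source is already quantized
  printf './llama-quantize --allow-requantize --imatrix ./sarvam-30b-indic.imatrix %s ./sarvam-30b-IQ2_M-indic.gguf IQ2_M\n' "$SRC"
}

imatrix_cmd
iq2m_cmd
```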

Hardware requirements

Setup	Works?	Notes
16 GB RAM, no GPU	Yes	~6-8 t/s, slow but functional
32 GB RAM, no GPU	Yes	~8-10 t/s
Any NVIDIA GPU	Yes	Add `-ngl 99` for GPU offload; lower the value if the ~10 GB model does not fit in VRAM
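
Combined with the 2000-token budget recommended earlier, these throughput figures translate into noticeable wait times on CPU. A small sketch of the arithmetic, using a hypothetical helper `gen_seconds`:

```shell
# Estimate seconds needed to generate a given number of tokens.
# $1 = token count, $2 = tokens per second.
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", n / tps }'
}

gen_seconds 2000 6    # slow end of the 16 GB range
gen_seconds 2000 10   # fast end of the 32 GB range
```

A full 2000-token response therefore takes roughly three to six minutes on the CPU-only setups listed.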

Sample output

Prompt: भारत के बारे में बताओ

Response (after thinking chain):

भारत एक विशाल और विविधतापूर्ण देश है जो दक्षिणी एशिया में स्थित है...

Usage with Ollama

Create a Modelfile whose FROM line points at the GGUF downloaded above:

FROM ./sarvam-30b-IQ2_M/sarvam-30b-IQ2_M-indic.gguf

TEMPLATE """{{ if .System }}<|start_of_turn|><|system|>
{{ .System }}<|end_of_turn|>
{{ end }}{{ if .Prompt }}<|start_of_turn|><|user|>
{{ .Prompt }}<|end_of_turn|>
{{ end }}<|start_of_turn|><|assistant|>
"""

PARAMETER stop "<|end_of_turn|>"
PARAMETER stop "<|start_of_turn|>"
PARAMETER num_predict 3000
SYSTEM "You are a helpful Indic multilingual assistant. Answer directly in the language the user writes in."

Then run:

ollama create sarvam-30b-indic -f Modelfile
ollama run sarvam-30b-indic "भारत के बारे में बताओ"

Quality notes

Quantized from Sarvam's official Q4_K_M using --allow-requantize, so
quantization error from the Q4_K_M step carries over into this file.
Qualitative testing shows coherent Hindi, Tamil, and Telugu output on
simple factual prompts. The IQ2_M compression introduces some quality
degradation compared to Q4_K_M, most noticeable on complex multi-step
reasoning tasks. For maximum quality, use the official Q4_K_M from
sarvamai/sarvam-30b-gguf.

Citation

@misc{sarvam_sovereign_models,
  title={Introducing Sarvam's Sovereign Models},
  author={{Sarvam Foundation Models Team}},
  year={2026},
  url={https://www.sarvam.ai/blogs/sarvam-30b-105b}
}

License

This quantization is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0),
inherited from the original [sarvamai/sarvam-30b](https://huggingface.co/sarvamai/sarvam-30b) model.
