!!! This is the GGUF version of Sarvam-105B !!!
Download the original weights here!
## Index
- Introduction
- Architecture
- Benchmarks
  - Knowledge & Coding
  - Reasoning & Math
  - Agentic
- Inference
- Footnote
- Citation
## Introduction
Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of demanding tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.
Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.
## Architecture
The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (`q_head_dim=192`, split into RoPE and noPE components; `v_head_dim=128`) and a large `head_dim` of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This approach improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context window). It pairs an `intermediate_size` of 16384 with a `moe_intermediate_size` of 2048 and top-8 routing over 128 experts, which increases per-token active capacity while keeping activation cost manageable. The model has one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
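The hyperparameters above can be collected into a small summary sketch. The field names below mirror common Hugging Face config conventions and are assumptions for illustration, not the model's actual config file:

```python
# Hypothetical summary of the hyperparameters described above; field names
# follow common Hugging Face config conventions and are assumptions.
config = {
    "hidden_size": 4096,
    "q_head_dim": 192,              # split into RoPE and noPE components
    "v_head_dim": 128,
    "head_dim": 576,
    "intermediate_size": 16384,     # dense FFN width
    "moe_intermediate_size": 2048,  # per-expert FFN width
    "n_routed_experts": 128,
    "num_experts_per_tok": 8,       # top-8 routing
    "n_shared_experts": 1,
    "routed_scaling_factor": 2.5,
    "rope_scaling": {"type": "yarn", "factor": 40},
    "max_position_embeddings": 131072,  # 128K context
}

# Per-token expert FFN width actually activated: 8 routed + 1 shared expert.
active_expert_ffn = (
    config["num_experts_per_tok"] + config["n_shared_experts"]
) * config["moe_intermediate_size"]
print(active_expert_ffn)  # 9 * 2048 = 18432
```

This makes concrete how the MoE design keeps activation cost manageable: only 9 of the 129 expert FFNs run per token.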
## Benchmarks
### Knowledge & Coding
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |
### Reasoning & Math
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |
### Agentic
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |
See the Footnote section for evaluation details.
## Inference
### Clone and build

```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
```
### Download the model (all shards)

```shell
huggingface-cli download sarvam/sarvam-105-gguf --local-dir sarvam-105b-gguf
```
### Run interactive chat

```shell
./build/bin/llama-cli \
  -m sarvam-105b-gguf/sarvam-105b-Q4_K_M.gguf-00001-of-00009.gguf \
  -c 4096 \
  -n 512 \
  -p "You are a helpful assistant." \
  --conversation
```
### OpenAI-compatible API server

```shell
./build/bin/llama-server \
  -m sarvam-105b-gguf/sarvam-105b-Q4_K_M.gguf-00001-of-00009.gguf \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
```
Then query it:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.8,
    "max_tokens": 512
  }'
```
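The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl example; the server URL assumes the `llama-server` command above is running locally on port 8080:

```python
# Minimal stdlib client for the OpenAI-compatible llama-server endpoint.
# Assumes the server started above is reachable at localhost:8080.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.8,
    "max_tokens": 512,
}

def chat(url="http://localhost:8080/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Because the server speaks the OpenAI chat-completions schema, any OpenAI-compatible client library should also work by pointing its base URL at the server.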
Note:
- Use `-ngl -1` to enable GPU acceleration (if available)
- Omit `-ngl` to run on CPU
## Footnote
- General settings: all benchmarks are evaluated with a maximum context length of 65,536 tokens.
- Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
- Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
- Writing Bench: responses generated using the official Writing-Bench parameters `temperature=0.7, top_p=0.8, top_k=20, max_length=16000`; scoring performed using the official Writing-Bench critic model with `temperature=1.0, top_p=0.95, max_length=2048`.
- Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): evaluated with `temperature=0.5, top_p=1.0, max_new_tokens=32768`.
## Citation

```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```