
!!! This is the GGUF version of Sarvam-105B !!!

Download the original weights here!

Index

  1. Introduction
  2. Architecture
  3. Benchmarks
    • Knowledge & Coding
    • Reasoning & Math
    • Agentic
  4. Inference
  5. Footnote
  6. Citation

Introduction

Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.

Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and NoPE components; v_head_dim=128) and a large overall head dimension of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This design improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context window). A dense intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model also uses one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
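The routing scheme described above (top-8 of 128 routed experts, one always-active shared expert, routed scaling factor 2.5) can be sketched as follows. This is an illustrative toy, not the actual Sarvam-105B implementation: the tensor shapes, the softmax-then-top-k selection, and the renormalization step are assumptions, and load balancing is omitted entirely.

```python
import numpy as np

def moe_route(hidden, router_w, experts, shared_expert,
              top_k=8, routed_scaling_factor=2.5):
    """Toy top-k MoE routing plus one shared expert (illustrative only)."""
    # hidden: (tokens, d); router_w: (d, num_experts); experts: list of callables
    logits = hidden @ router_w
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax over all 128 experts
    idx = np.argsort(probs, axis=-1)[:, -top_k:]    # top-8 expert ids per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        w = probs[t, idx[t]]
        w = w / w.sum()                             # renormalize over the top-k
        for weight, e in zip(w, idx[t]):
            out[t] += weight * experts[e](hidden[t])
    # scale the routed path, then add the always-active shared expert
    return routed_scaling_factor * out + shared_expert(hidden)
```

Each token activates only 8 of the 128 routed experts plus the shared expert, which is how per-token compute stays near 10.3B active parameters despite the much larger total parameter count.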

Benchmarks

Knowledge & Coding

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
| --- | --- | --- | --- | --- |
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |
Reasoning & Math

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
| --- | --- | --- | --- | --- |
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |
Agentic

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
| --- | --- | --- | --- | --- |
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |

See footnote for evaluation details.

Inference

Clone and build

```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
```

Download the model (all shards)

```shell
huggingface-cli download sarvamai/sarvam-105b-gguf --local-dir sarvam-105b-gguf
```

Run interactive chat

```shell
./build/bin/llama-cli \
  -m sarvam-105b-gguf/sarvam-105b-Q4_K_M.gguf-00001-of-00009.gguf \
  -c 4096 \
  -n 512 \
  -p "You are a helpful assistant." \
  --conversation
```

OpenAI-compatible API server

```shell
./build/bin/llama-server \
  -m sarvam-105b-gguf/sarvam-105b-Q4_K_M.gguf-00001-of-00009.gguf \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080
```

Then query it:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.8,
    "max_tokens": 512
  }'
```
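The same endpoint can also be queried from Python using only the standard library. This is a minimal sketch, assuming a llama-server instance is listening on localhost:8080 as launched above; the helper names are our own, not part of any official client:

```python
import json
import urllib.request

def build_chat_request(content, temperature=0.8, max_tokens=512):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(content, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload to a running llama-server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain quantum computing in simple terms."))
```

Because the server exposes an OpenAI-compatible API, any OpenAI SDK pointed at this base URL should work as well.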

Note:

  • Use -ngl 99 (or any value at least equal to the model's layer count) to offload all layers to the GPU, if one is available
  • Omit -ngl (or pass -ngl 0) to run entirely on CPU

Footnote

  • General settings: All benchmarks are evaluated with a maximum context length of 65,536 tokens.
  • Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Writing Bench: Responses generated using official Writing-Bench parameters: temperature=0.7, top_p=0.8, top_k=20, max_length=16000. Scoring performed using the official Writing-Bench critic model with: temperature=1.0, top_p=0.95, max_length=2048.
  • Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): Evaluated with temperature=0.5, top_p=1.0, max_new_tokens=32768.

Citation

```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```