Sarvam-30B GPTQ 4-bit

GPTQ 4-bit quantized version of sarvamai/sarvam-30b, produced with llm-compressor.

  • Base model: sarvamai/sarvam-30b (MoE, 32B total / 2.4B active, 262k vocab)
  • Quantization: GPTQ W4A16, compressed-tensors format
  • VRAM: ~10GB (fits on 1x L40S / A100 / RTX 4090)
  • Calibration: WikiText, 128 samples, seq_len=1024 (see the sketch after this list)
  • Compatible with: vLLM (native compressed-tensors support)
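
For reference, the quantization should be reproducible with a recipe along these lines. This is a minimal sketch, assuming llm-compressor's GPTQModifier/oneshot API and that WikiText is available through its dataset registry; exact argument names may vary by llm-compressor version, and this is not the exact script used.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "sarvamai/sarvam-30b"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# W4A16 GPTQ on Linear layers; lm_head stays in higher precision.
# MoE router/gate layers may also need to be ignored; omitted here.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="wikitext",
    recipe=recipe,
    max_seq_length=1024,          # calibration seq_len, per the card
    num_calibration_samples=128,  # calibration samples, per the card
    output_dir="sarvam-30b-gptq-4bit",
)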

Usage with vLLM

vllm serve mira-iitjmu/sarvam-30b-gptq-4bit \
    --host 0.0.0.0 --port 8082 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
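
Once the server is up, vLLM exposes an OpenAI-compatible API on the configured port. A minimal query sketch using the openai Python client (the prompt is illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="EMPTY")  # vLLM does not check the key by default

response = client.chat.completions.create(
    model="mira-iitjmu/sarvam-30b-gptq-4bit",
    messages=[{"role": "user", "content": "Summarize GPTQ quantization in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)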

Usage with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is needed for the custom Sarvam model/tokenizer code;
# device_map="auto" spreads the 4-bit weights across available devices.
tokenizer = AutoTokenizer.from_pretrained("mira-iitjmu/sarvam-30b-gptq-4bit", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("mira-iitjmu/sarvam-30b-gptq-4bit", trust_remote_code=True, device_map="auto")
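
After loading, generation follows the standard transformers API (loading this compressed-tensors checkpoint may also require the compressed-tensors package to be installed). A quick sketch, with an illustrative prompt:

inputs = tokenizer("Namaste! Introduce yourself in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))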

Part of the Svara project by mira-iitjmu.

Safetensors: 6B params (tensor types I64, I32, BF16)