# Sarvam-30B GPTQ 4-bit

GPTQ 4-bit quantized version of sarvamai/sarvam-30b, produced with llm-compressor.

- Base model: sarvamai/sarvam-30b (MoE, 32B total / 2.4B active parameters, 262k vocab)
- Quantization: GPTQ W4A16, compressed-tensors format (see the sketch below)
- VRAM: ~10 GB (fits on a single L40S / A100 / RTX 4090)
- Calibration: WikiText, 128 samples, seq_len=1024
- Compatible with: vLLM (native compressed-tensors support)
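For reference, the settings above correspond to a standard llm-compressor one-shot GPTQ run. This is a minimal sketch, assuming the stock `oneshot` API (in older llm-compressor versions it lives under `llmcompressor.transformers`); the exact recipe used for this checkpoint, e.g. whether any MoE router/gate layers were excluded, is not recorded in this card.

```python
# Sketch of the quantization flow, assumptions noted above.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16: 4-bit weights, 16-bit activations; lm_head kept in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="sarvamai/sarvam-30b",
    dataset="wikitext",            # calibration set named in the card; assumed
                                   # to be a registered dataset name, otherwise
                                   # pass a tokenized datasets.Dataset instead
    recipe=recipe,
    max_seq_length=1024,           # seq_len=1024 per the card
    num_calibration_samples=128,   # 128 samples per the card
    output_dir="sarvam-30b-gptq-4bit",
)
```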
## Usage with vLLM

```bash
vllm serve mira-iitjmu/sarvam-30b-gptq-4bit \
  --host 0.0.0.0 --port 8082 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code
```
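Once the server is up, it exposes vLLM's OpenAI-compatible API. A quick smoke test with the `openai` client (the port matches the command above; the prompt is a placeholder, and this assumes the model ships a chat template, otherwise use `client.completions.create`):

```python
from openai import OpenAI

# Point the client at the local vLLM server; api_key is required by the
# client but not checked by a default vLLM deployment.
client = OpenAI(base_url="http://localhost:8082/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="mira-iitjmu/sarvam-30b-gptq-4bit",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```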
## Usage with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mira-iitjmu/sarvam-30b-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# compressed-tensors checkpoints load through from_pretrained directly
# (requires the compressed-tensors package); device_map="auto" spreads
# the weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)
```
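A minimal generation call to round out the snippet above. The prompt is a placeholder; this sketch uses plain text completion, since whether the checkpoint expects a chat template is not specified here:

```python
# Tokenize a prompt and generate a short continuation.
inputs = tokenizer("Write one sentence about the Himalayas.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```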
Part of the Svara project by mira-iitjmu.