
AWQ? Autoround? Any ~int4 for vllm?

#12 opened by JoeSmith245

Could we get some official ~Q4 quants for vLLM, for those of us on Ampere without FP8/FP4 support?

I'm interested in the process to quantize a model like this to AWQ format.

There's an llm-compressor 4-bit AWQ recipe for the 30B model that could be adapted for this 120B Super version:
https://huggingface.co/stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ/blob/main/recipe.yaml
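For anyone wanting to try the adaptation themselves, here is a minimal sketch of what such a recipe might look like. This is an assumption based on llm-compressor's typical AWQ recipe layout, not the contents of the linked file; the modifier name, `ignore` patterns, and group size would need to be checked against the 120B model's actual module names (MoE expert layers in particular often need to be excluded or handled specially):

```yaml
# Hypothetical llm-compressor recipe sketch for 4-bit AWQ (not the linked file).
# Layer names in `ignore` are placeholders and must match the target model.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 128
            strategy: group
```

The recipe is then applied with llm-compressor's `oneshot` entry point together with a small calibration dataset; with a 120B model you'd also need enough CPU/GPU memory to hold the full-precision weights during calibration.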

Update: cyankiwi came through on this, for anyone interested: https://huggingface.co/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
