# Mistral-Small-24B-Instruct-2501-AWQ
A 4-bit AWQ quantization of mistralai/Mistral-Small-24B-Instruct-2501, produced for efficient inference on consumer and prosumer GPUs with vLLM.
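As a rough back-of-envelope for why 4-bit weights fit on such GPUs (assuming ~24 billion quantized parameters; the real checkpoint differs slightly since lm_head stays in 16-bit):

```python
# Back-of-envelope weight-memory estimate for a 4-bit, group-128 checkpoint.
# "params" is an assumption (~24e9); embeddings, norms, and the full-precision
# lm_head are ignored here.
params = 24e9
bits_per_weight = 4
group_size = 128
# Per group of 128 weights: one fp16 scale (2 bytes) plus one zero-point
# (budgeted at 2 bytes to be conservative).
overhead_bits = (2 + 2) * 8 / group_size      # extra bits amortized per weight
total_bytes = params * (bits_per_weight + overhead_bits) / 8
print(f"~{total_bytes / 1e9:.1f} GB of quantized weights")
```

That lands around 12.8 GB of weights, versus roughly 48 GB for the bf16 original, before activations and KV cache are added on top.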
## Quantization details
| Field | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16_ASYM |
| Group size | 128 |
| Ignored layers | lm_head (kept at full precision) |
| Format | compressed-tensors (pack-quantized) |
| Tool | llmcompressor 0.6.0 |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) |
| Calibration samples | 256 |
| Max sequence length | 512 tokens |
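The settings above correspond to an llmcompressor one-shot run. A hedged sketch follows, modeled on llmcompressor's published AWQ examples; the exact recipe used for this checkpoint is not included in the repo, and argument names may differ between llmcompressor versions:

```python
# Sketch of reproducing the quantization with llmcompressor (assumption:
# this mirrors the library's AWQ examples; verify names against your version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "mistralai/Mistral-Small-24B-Instruct-2501"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 256 calibration samples from ultrachat_200k, rendered with the chat template
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:256]")
ds = ds.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})

# W4A16_ASYM on all Linear layers, keeping lm_head at full precision
recipe = AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
model.save_pretrained("Mistral-Small-24B-Instruct-2501-AWQ", save_compressed=True)
tokenizer.save_pretrained("Mistral-Small-24B-Instruct-2501-AWQ")
```

Running this requires a GPU with enough memory to hold the bf16 model during calibration.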
The weights are saved in the compressed-tensors format, which vLLM supports natively; no separate `autoawq` package is needed.

**W4A16_ASYM** means weights are stored as 4-bit integers while activations remain in 16-bit (fp16/bf16) during inference. Groups of 128 consecutive weights share a single scale and zero-point, and asymmetric quantization gives each group an independent zero-point, which covers skewed weight distributions better than a symmetric scheme.
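To make the scheme concrete, here is a minimal NumPy sketch of asymmetric group quantization. It is illustrative only: real kernels pack two 4-bit values per byte, and the stored scale and zero-point formats differ.

```python
import numpy as np

def quantize_group_asym(w, bits=4):
    """Asymmetric quantization of one weight group: a single scale and
    zero-point are shared by every weight in the group."""
    qmax = 2**bits - 1                       # 15 for 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax           # full range maps onto [0, 15]
    zero_point = np.round(-w_min / scale)    # integer that represents zero
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize_group(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# One group of 128 weights with a skewed (non-zero-mean) distribution,
# where an asymmetric zero-point helps most
rng = np.random.default_rng(0)
w = rng.normal(loc=0.02, scale=0.01, size=128).astype(np.float32)

q, scale, zp = quantize_group_asym(w)
w_hat = dequantize_group(q, scale, zp)
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")
```

The per-weight reconstruction error is bounded by roughly half a quantization step, which is why smaller group sizes (tighter min/max per group) trade metadata overhead for accuracy.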
## Usage

Serve with vLLM:

```shell
vllm serve dark-side-of-the-code/Mistral-Small-24B-Instruct-2501-AWQ --dtype auto
```

Then query it through the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="dark-side-of-the-code/Mistral-Small-24B-Instruct-2501-AWQ",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
## Limitations
- Quantization introduces a small accuracy degradation compared to the bf16 base model.
- This model inherits all limitations and intended-use restrictions from the base model. Refer to the base model card for details.
## License

Mistral Small is released under the Mistral Research License (MRL-0.1). These quantized weights are a derivative work; verify that your intended use complies with the license before use.
## Model tree

Base model: mistralai/Mistral-Small-24B-Base-2501