Devstral-24B-AWQ / README.md
mattbucci's picture
Vision tested and working
b68f4c9 verified
metadata
base_model: mistralai/Devstral-Small-2507
tags:
  - awq
  - 4-bit
  - rdna4
  - gfx1201
  - rocm
  - sglang
  - quantized
license: apache-2.0

Devstral-24B AWQ 4-bit

AWQ 4-bit quantization of Devstral Small 24B optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

Base model mistralai/Devstral-Small-2507
Architecture Dense
Parameters 24B
Layers 40
Context 32K (tested), 393K (max)
Quantization AWQ 4-bit, group_size=128

Performance (2x AMD Radeon AI PRO R9700, TP=2)

  • Decode speed: 37 tok/s single-user on 2x R9700
  • Launch: scripts/launch.sh devstral

Notes

GPTQ-calibrated with 128 samples. BOS token removed from chat template (fixes <unk> output). Text-only warmup to avoid radix cache pollution from vision tokens.

Known Limitations

  • Vision: WORKING. Vision tower weights preserved in original precision (modules_to_not_convert includes vision_tower, multi_modal_projector). Tested: correctly identifies a red square image.

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh devstral

See the RDNA4 Inference Repository for full setup instructions, patches, and benchmarks.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.