Exceptional Stability and Performance: Quantized Devstral Small 2 (IQ4_XS) at 10K Context

#2
by Kamil21322 - opened

I would like to share my experience and sincere appreciation for the quantized Devstral Small 2 (IQ4_XS) model.

The performance has been outstanding. At a 10K context length, the model consistently runs at approximately 30 tokens per second, which is highly impressive for a quantized configuration. More importantly, it maintains coherence and logical consistency as the context fills. It does not degrade into irrelevant or nonsensical output, which is often a concern with extended contexts.

I also experimented with setting the KV cache to Q5_1 and increasing the context length further. Even under these conditions, the model preserved its stability. I conducted multiple tests across different scenarios (which I won’t be sharing here as they are part of my own projects), and the results were consistently strong. The reliability and balance between efficiency and quality are genuinely remarkable.
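For anyone wondering why a 5-bit KV cache leaves room for a longer context, here is a rough back-of-the-envelope sketch. It assumes a ggml-style runtime, where the 5-bit cache type is written `q5_1` and stores 32 values in a 24-byte block (about 6 bits per value). The layer count, KV head count, and head dimension below are hypothetical placeholders, not Devstral's confirmed configuration, so treat the numbers as illustrative only.

```python
# Approximate bits per cached value for a few ggml block formats
# (32 values per block plus per-block scale/offset metadata).
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,  # 34 bytes per 32 values
    "q5_1": 6.0,  # 24 bytes per 32 values
    "q4_0": 4.5,  # 18 bytes per 32 values
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate the K+V cache size in bytes for a completely filled context."""
    # Each token stores one K and one V vector per layer and per KV head.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * values_per_token * BITS_PER_VALUE[cache_type] / 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q5_1"):
    size = kv_cache_bytes(10_240, cache_type=cache_type, **shape)
    print(f"{cache_type}: {size / 2**20:.0f} MiB")
```

With these placeholder values, the cache at a 10K context drops from 1600 MiB at f16 to 600 MiB at q5_1, and the freed memory can go toward a longer context window.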

When the context limit is fully reached, the model naturally stops upon receiving a new prompt, which is expected behavior. Up until that limit, however, it performs flawlessly.

Your quantized models clearly reflect high-level optimization and engineering excellence. The balance between speed, memory efficiency, and output quality is extremely well executed.

I am also eagerly looking forward to testing the newly released Qwen 3.5 27B and 35B models. If they follow the same level of optimization and stability, they will be absolutely impressive.

My sincere congratulations to the entire team. Your work is truly commendable.

Try Qwen3-Coder-30B-A3B-Instruct-Q3_K_S-2.69bpw.gguf too! <3

Byteshape quantization works great; they do BF16-quality quantization, but the models themselves don't meet my needs. :(
