# Qwen3.5-122B-A10B-FP8-CacheReady
FP8-quantized Qwen3.5-122B with routing canonicalization for deterministic MoE expert selection.
This is the FP8 variant of dystrio/Qwen3.5-122B-A10B-CacheReady, built directly from Qwen/Qwen3.5-122B-A10B-FP8 (Qwen's official FP8 checkpoint).
Only the router (gate) weight matrices were modified. In Qwen's FP8 checkpoint, gate weights are already stored in bf16 precision (Qwen excludes them from quantization). All expert weights, attention weights, embeddings, and quantization scales are byte-for-byte identical to Qwen/Qwen3.5-122B-A10B-FP8. Approximately 45% of experts fell into equivalence groups across the model.
## What CacheReady does
MoE routers choose top-k experts per token. When multiple experts produce near-identical outputs (equivalence groups), the router can flip between them due to tiny numerical differences from quantization, batch shape, or execution order. CacheReady canonicalizes router weights so equivalent experts receive identical routing scores, making expert selection deterministic by construction.
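A toy sketch of the idea (illustrative only — the 2-expert equivalence group, weight values, and helper names are assumptions for demonstration, not the released patch):

```python
import numpy as np

def top1(scores):
    # Stable argmax: on exact score ties, the lower expert index wins.
    return int(np.argsort(-scores, kind="stable")[0])

# Toy router: experts 0 and 1 form an "equivalence group" (they compute
# near-identical functions, so either may legitimately be selected);
# expert 2 is distinct. Scales are exaggerated to make the flip visible.
gate = np.array([
    [2.0, 1.0],   # expert 0
    [1.0, 2.0],   # expert 1 -- interchangeable with expert 0
    [0.1, 0.1],   # expert 2
])

# The "same" token under two tiny numerical perturbations
# (e.g. a different batch shape or execution order).
x_a = np.array([1.0 + 1e-6, 1.0])
x_b = np.array([1.0, 1.0 + 1e-6])
print(top1(gate @ x_a), top1(gate @ x_b))    # 0 vs 1: the router flips

# Canonicalization: replace every gate row in the group with the group
# mean, so the equivalent experts' scores tie exactly and the stable
# tie-break always selects the same expert.
canon = gate.copy()
group = [0, 1]
canon[group] = canon[group].mean(axis=0)
print(top1(canon @ x_a), top1(canon @ x_b))  # 0 vs 0: deterministic
```

Because the canonicalized rows are bitwise identical, the tied scores are bitwise identical under any perturbation of the input, and selection within the group can no longer flip.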
## Benchmark results
All benchmarks were run on 2x NVIDIA H100 NVL 96GB with vLLM 0.18.0, `enforce_eager=True`, `tensor_parallel_size=2`.
### Routing determinism
| Model | Texts | Determinism (FP8) |
|---|---|---|
| Original FP8 | 20 | 100% |
| FP8-CacheReady | 20 | 100% |
Quantized routing stability also verified at 100% agreement for both models.
### Prefix caching throughput (FP8, 2x H100 NVL, TP=2)
| Model | Workload | Without Cache | With Cache | Cache Effect |
|---|---|---|---|---|
| Original FP8 | Shared prefix | 812 tok/s | 819 tok/s | 1.01x |
| Original FP8 | Unique prefix | 840 tok/s | 832 tok/s | 0.99x |
| FP8-CacheReady | Shared prefix | 834 tok/s | 806 tok/s | 0.97x |
| FP8-CacheReady | Unique prefix | 820 tok/s | 804 tok/s | 0.98x |
On this hardware configuration (2x H100 NVL, TP=2), the original FP8 model does not exhibit the prefix caching degradation observed in the bf16 variant. Both models show roughly neutral cache behavior.
### Comparison with bf16 CacheReady (4x H100 PCIe, TP=4)
The bf16 CacheReady variant showed a dramatic difference on 4x H100 PCIe with TP=4:
| Model | Shared-prefix cache effect |
|---|---|
| Original bf16 (TP=4) | 0.65x (caching hurts — 35% slower) |
| bf16 CacheReady (TP=4) | 1.31x (caching helps — 31% faster) |
The routing instability that causes prefix caching failures depends on the serving configuration. Higher tensor parallelism (TP=4), PCIe interconnect (vs NVLink), and different batch patterns can trigger expert selection flips that invalidate cached KV states. At TP=2 on NVLink, the problem doesn't manifest — but it may on larger deployments.
## When does CacheReady matter?
The patch has zero quality cost and zero performance cost. It is most likely to help when:
- Higher TP counts (TP=4, TP=8) — more expert partition boundaries, more routing divergence opportunities
- PCIe interconnect — less deterministic execution order than NVLink
- Mixed-precision serving — nvfp4 quantization amplifies routing noise more than fp8
- Multi-tenant / high-concurrency — varying batch shapes across requests with shared prefixes
If your serving config already produces stable routing (as our 2x H100 NVL TP=2 bench shows), CacheReady is a free safety net. If your config triggers instability (as the bf16 TP=4 PCIe bench shows), CacheReady fixes it.
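The shared-prefix vs unique-prefix workloads in the tables above can be reproduced with a simple prompt generator; a minimal sketch (the prompt shapes and function name are assumptions, not the exact benchmark harness):

```python
def make_prompts(n, prefix_tokens, suffix_tokens, shared=True):
    """Build n prompts that either share one long prefix (cache-friendly)
    or each carry a unique prefix (cache-hostile)."""
    prompts = []
    for i in range(n):
        head = "ctx " * prefix_tokens if shared else f"ctx{i} " * prefix_tokens
        tail = f"question {i}: " + "pad " * suffix_tokens
        prompts.append(head + tail)
    return prompts

shared = make_prompts(4, prefix_tokens=6, suffix_tokens=2, shared=True)
unique = make_prompts(4, prefix_tokens=6, suffix_tokens=2, shared=False)

# All shared-prefix prompts open with the same tokens, so the server's
# prefix cache can reuse their KV blocks; the unique set defeats it.
assert len({p[:20] for p in shared}) == 1
assert len({p[:20] for p in unique}) == 4
```

Sending each set through the server with `--enable-prefix-caching` on and off, and comparing tok/s, reproduces the "Cache Effect" column above.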
## Quality preservation
No measurable quality change. The same routing canonicalization patch from the bf16 CacheReady variant applies directly because Qwen stores gate weights in bf16 precision even in FP8 checkpoints. The patch only modifies router weight rows for experts that produce near-identical outputs.
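The equivalence groups themselves can be found by probing experts and grouping those whose outputs agree within tolerance. A toy sketch (the synthetic weights, tolerance, and greedy grouping are illustrative assumptions, not the released pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy expert weights: experts 0 and 2 are near-identical copies.
W = [rng.standard_normal((d, d)) for _ in range(4)]
W[2] = W[0] + 1e-9 * rng.standard_normal((d, d))

def expert_out(i, x):
    return np.maximum(W[i] @ x, 0.0)   # simple ReLU expert

# Probe each expert with random inputs; greedily group experts whose
# outputs agree on every probe within a small tolerance.
probes = rng.standard_normal((8, d))
groups = []
for i in range(len(W)):
    for g in groups:
        rep = g[0]
        if all(np.allclose(expert_out(i, x), expert_out(rep, x), atol=1e-6)
               for x in probes):
            g.append(i)
            break
    else:
        groups.append([i])

print(groups)   # experts 0 and 2 land in the same group
```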
## Usage
Drop-in replacement for Qwen/Qwen3.5-122B-A10B-FP8. No code changes needed.
### With vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
  --model dystrio/Qwen3.5-122B-A10B-FP8-CacheReady \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --trust-remote-code
```
### With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Qwen3.5-122B-A10B-FP8-CacheReady",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen3.5-122B-A10B-FP8-CacheReady")
```
## Compatibility
- vLLM >= 0.17 (native FP8 support)
- SGLang
- Any framework that loads FP8 safetensors checkpoints
## Related Models
| Variant | Precision | Size | Shared-prefix improvement | Link |
|---|---|---|---|---|
| CacheReady (bf16) | bf16 | ~240 GB | 0.65x → 1.31x (TP=4 PCIe) | dystrio/Qwen3.5-122B-A10B-CacheReady |
| CacheReady (FP8) | FP8 | ~120 GB | neutral at TP=2 NVLink | this model |
## Citation

```bibtex
@misc{dystrio_cacheready_fp8_2026,
  title={Routing Canonicalization for Deterministic MoE Inference (FP8)},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio/Qwen3.5-122B-A10B-FP8-CacheReady}
}
```