Qwen3.5-122B-A10B-FP8-CacheReady

FP8-quantized Qwen3.5-122B with routing canonicalization for deterministic MoE expert selection.

This is the FP8 variant of dystrio/Qwen3.5-122B-A10B-CacheReady, built directly from Qwen/Qwen3.5-122B-A10B-FP8 (Qwen's official FP8 checkpoint).

Only router (gate) weight matrices were modified. In Qwen's FP8 checkpoint, gate weights are already stored in bf16 precision (Qwen excludes them from quantization). All expert weights, attention weights, embeddings, and quantization scales are byte-for-byte identical to Qwen/Qwen3.5-122B-A10B-FP8. Roughly 45% of experts fell into equivalence groups across the model.
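
The byte-for-byte claim can be checked directly by diffing the two checkpoints' state dicts. Here is a minimal sketch using plain numpy arrays in place of loaded safetensors shards; the tensor names and the `changed_tensors` helper are illustrative assumptions, not part of the release:

```python
import numpy as np

def changed_tensors(sd_a: dict, sd_b: dict) -> list[str]:
    """Return names of tensors whose values differ between two state dicts."""
    return [k for k in sd_a if not np.array_equal(sd_a[k], sd_b[k])]

# Toy state dicts standing in for the original and CacheReady checkpoints:
# only a router (gate) weight differs; everything else is identical.
original = {
    "layers.0.mlp.gate.weight": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "layers.0.mlp.experts.0.w1": np.array([[0.5, 0.5]]),
}
patched = {
    "layers.0.mlp.gate.weight": np.array([[1.0, 0.0], [0.5, 0.5]]),
    "layers.0.mlp.experts.0.w1": np.array([[0.5, 0.5]]),
}

print(changed_tensors(original, patched))  # ['layers.0.mlp.gate.weight']
```

On the real checkpoints the same comparison would be run over every shard; for this model it should list only `*.gate.weight` tensors.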

What CacheReady does

MoE routers choose top-k experts per token. When multiple experts produce near-identical outputs (equivalence groups), the router can flip between them due to tiny numerical differences from quantization, batch shape, or execution order. CacheReady canonicalizes router weights so equivalent experts receive identical routing scores, making expert selection deterministic by construction.
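
In weight terms, canonicalization can be as simple as giving every expert in an equivalence group the same gate row, so their routing scores are bit-identical for any input. A minimal sketch of the idea; the `canonicalize_router` helper and the mean-row choice are illustrative assumptions, not the exact patch:

```python
import numpy as np

def canonicalize_router(gate_w: np.ndarray, groups: list[list[int]]) -> np.ndarray:
    """Replace the gate rows of each equivalence group with the group's
    mean row, so grouped experts receive identical routing scores."""
    out = gate_w.copy()
    for group in groups:
        out[group] = out[group].mean(axis=0)
    return out

# Toy router: 4 experts, hidden size 3; experts 1 and 2 are near-identical,
# differing only by quantization-scale noise.
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0 + 1e-6, 0.0],
              [0.0, 0.0, 1.0]])
w_canon = canonicalize_router(w, groups=[[1, 2]])

x = np.random.default_rng(0).standard_normal(3)
scores = w_canon @ x
assert scores[1] == scores[2]  # equal by construction, for every input
```

With identical rows, a top-k tie between grouped experts can no longer flip on sub-epsilon noise; whichever expert the tie-break picks, it picks it the same way every time.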

Benchmark results

All benchmarks: 2x NVIDIA H100 NVL 96GB, vLLM 0.18.0, enforce_eager=True, tensor_parallel_size=2.

Routing determinism

| Model | Texts | Determinism (FP8) |
|---|---|---|
| Original FP8 | 20 | 100% |
| FP8-CacheReady | 20 | 100% |

Quantized routing stability also verified at 100% agreement for both models.
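
Determinism here means per-token expert selections match exactly across repeated runs. Assuming the top-k expert IDs per token have been captured (e.g. via a forward hook on the router), agreement can be scored with a small helper like this; the names are illustrative:

```python
def routing_agreement(run_a: list[tuple], run_b: list[tuple]) -> float:
    """Fraction of token positions whose selected expert sets match exactly."""
    assert len(run_a) == len(run_b)
    matches = sum(set(a) == set(b) for a, b in zip(run_a, run_b))
    return matches / len(run_a)

# Two runs over 4 tokens with top-2 routing; one token flips an expert.
run_1 = [(0, 3), (1, 2), (5, 7), (0, 1)]
run_2 = [(0, 3), (1, 2), (5, 6), (0, 1)]
print(routing_agreement(run_1, run_1))  # 1.0  -> fully deterministic
print(routing_agreement(run_1, run_2))  # 0.75 -> one flipped selection
```

The 100% figures in the table above correspond to this metric returning 1.0 across all 20 texts.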

Prefix caching throughput (FP8, 2x H100 NVL, TP=2)

| Model | Workload | Without Cache | With Cache | Cache Effect |
|---|---|---|---|---|
| Original FP8 | Shared prefix | 812 tok/s | 819 tok/s | 1.01x |
| Original FP8 | Unique prefix | 840 tok/s | 832 tok/s | 0.99x |
| FP8-CacheReady | Shared prefix | 834 tok/s | 806 tok/s | 0.97x |
| FP8-CacheReady | Unique prefix | 820 tok/s | 804 tok/s | 0.98x |

On this hardware configuration (2x H100 NVL, TP=2), the original FP8 model does not exhibit the prefix caching degradation observed in the bf16 variant. Both models show roughly neutral cache behavior.

Comparison with bf16 CacheReady (4x H100 PCIe, TP=4)

The bf16 CacheReady variant showed a dramatic difference on 4x H100 PCIe with TP=4:

| Model | Shared-prefix cache effect |
|---|---|
| Original bf16 (TP=4) | 0.65x (caching hurts: 35% slower) |
| bf16 CacheReady (TP=4) | 1.31x (caching helps: 31% faster) |

The routing instability that causes prefix caching failures depends on the serving configuration. Higher tensor parallelism (TP=4), PCIe interconnect (vs NVLink), and different batch patterns can trigger expert selection flips that invalidate cached KV states. At TP=2 on NVLink, the problem doesn't manifest — but it may on larger deployments.
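
The interaction with prefix caching can be illustrated with a toy simulation: a cached shared prefix is only fully reusable when every prefix token routes to the same experts as before, so even a small per-token flip probability collapses the whole-prefix hit rate for long prefixes. This is a deliberately simplified model of the mechanism described above, not a measurement:

```python
import random

def prefix_hit_rate(prefix_len: int, flip_prob: float,
                    trials: int = 10_000, seed: int = 0) -> float:
    """Probability that an entire cached prefix is reusable when each token's
    routing flips independently with probability flip_prob."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() >= flip_prob for _ in range(prefix_len))
        for _ in range(trials)
    )
    return hits / trials

print(prefix_hit_rate(512, 0.0))    # 1.0: deterministic routing always reuses
print(prefix_hit_rate(512, 0.001))  # ~0.6: even 0.1% per-token flips lose ~40% of prefixes
```

Canonicalized routing pins the per-token flip probability at zero, which is why the fix holds regardless of TP degree or interconnect.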

When does CacheReady matter?

The patch has zero quality cost and zero performance cost. It is most likely to help when:

  • Higher TP counts (TP=4, TP=8) — more expert partition boundaries, more routing divergence opportunities
  • PCIe interconnect — less deterministic execution order than NVLink
  • Mixed-precision serving — nvfp4 quantization amplifies routing noise more than fp8
  • Multi-tenant / high-concurrency — varying batch shapes across requests with shared prefixes

If your serving config already produces stable routing (as our 2x H100 NVL TP=2 bench shows), CacheReady is a free safety net. If your config triggers instability (as the bf16 TP=4 PCIe bench shows), CacheReady fixes it.

Quality preservation

No measurable quality change. The same routing canonicalization patch from the bf16 CacheReady variant applies directly because Qwen stores gate weights in bf16 precision even in FP8 checkpoints. The patch only modifies router weight rows for experts that produce near-identical outputs.

Usage

Drop-in replacement for Qwen/Qwen3.5-122B-A10B-FP8. No code changes needed.

With vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
    --model dystrio/Qwen3.5-122B-A10B-FP8-CacheReady \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --trust-remote-code
```

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Qwen3.5-122B-A10B-FP8-CacheReady",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen3.5-122B-A10B-FP8-CacheReady")
```

Compatibility

  • vLLM >= 0.17 (native FP8 support)
  • SGLang
  • Any framework that loads FP8 safetensors checkpoints

Related Models

| Variant | Precision | Size | Shared-prefix improvement | Link |
|---|---|---|---|---|
| CacheReady (bf16) | bf16 | ~240 GB | 0.65x → 1.31x (TP=4 PCIe) | dystrio/Qwen3.5-122B-A10B-CacheReady |
| CacheReady (FP8) | FP8 | ~120 GB | neutral at TP=2 NVLink | this model |

Citation

```bibtex
@misc{dystrio_cacheready_fp8_2026,
  title={Routing Canonicalization for Deterministic MoE Inference (FP8)},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio/Qwen3.5-122B-A10B-FP8-CacheReady}
}
```