# Qwen3.5-122B-A10B-FP8-CacheReady
FP8-quantized Qwen3.5-122B with routing canonicalization for deterministic MoE expert selection.
This is the FP8 variant of dystrio/Qwen3.5-122B-A10B-CacheReady, built directly from Qwen/Qwen3.5-122B-A10B-FP8 (Qwen's official FP8 checkpoint).
Only the router (gate) weight matrices were modified. In Qwen's FP8 checkpoint, gate weights are already stored in bf16 precision (Qwen excludes them from quantization). All expert weights, attention weights, embeddings, and quantization scales are byte-for-byte identical to Qwen/Qwen3.5-122B-A10B-FP8. Approximately 45% of experts fell into equivalence groups across the model.
## What CacheReady does
MoE routers choose top-k experts per token. When multiple experts produce near-identical outputs (equivalence groups), the router can flip between them due to tiny numerical differences from quantization, batch shape, or execution order. CacheReady canonicalizes router weights so equivalent experts receive identical routing scores, making expert selection deterministic by construction.
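A toy sketch of the idea (illustrative only — the 2-expert equivalence group, weight values, and helper names are assumptions for demonstration, not the released patch):

```python
import numpy as np

def top1(scores):
    # Stable argmax: on exact score ties, the lower expert index wins.
    return int(np.argsort(-scores, kind="stable")[0])

# Toy router: experts 0 and 1 form an "equivalence group" (they compute
# near-identical functions, so either may legitimately be selected);
# expert 2 is distinct. Scales are exaggerated to make the flip visible.
gate = np.array([
    [2.0, 1.0],   # expert 0
    [1.0, 2.0],   # expert 1 -- interchangeable with expert 0
    [0.1, 0.1],   # expert 2
])

# The "same" token under two tiny numerical perturbations
# (e.g. a different batch shape or execution order).
x_a = np.array([1.0 + 1e-6, 1.0])
x_b = np.array([1.0, 1.0 + 1e-6])
print(top1(gate @ x_a), top1(gate @ x_b))    # 0 vs 1: the router flips

# Canonicalization: replace every gate row in the group with the group
# mean, so the equivalent experts' scores tie exactly and the stable
# tie-break always selects the same expert.
canon = gate.copy()
group = [0, 1]
canon[group] = canon[group].mean(axis=0)
print(top1(canon @ x_a), top1(canon @ x_b))  # 0 vs 0: deterministic
```

Because the canonicalized rows are bitwise identical, the tied scores are bitwise identical under any perturbation of the input, and selection within the group can no longer flip.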
## Benchmark results
All benchmarks were run on 2x NVIDIA H100 NVL 96GB with vLLM 0.18.0, `enforce_eager=True`, `tensor_parallel_size=2`.
### Routing determinism
| Model | Texts | Determinism (FP8) |
|---|---|---|
| Original FP8 | 20 | 100% |
| FP8-CacheReady | 20 | 100% |
Quantized routing stability also verified at 100% agreement for both models.
### Prefix caching throughput (FP8, 2x H100 NVL, TP=2)
| Model | Workload | Without Cache | With Cache | Cache Effect |
|---|---|---|---|---|
| Original FP8 | Shared prefix | 812 tok/s | 819 tok/s | 1.01x |
| Original FP8 | Unique prefix | 840 tok/s | 832 tok/s | 0.99x |
| FP8-CacheReady | Shared prefix | 834 tok/s | 806 tok/s | 0.97x |
| FP8-CacheReady | Unique prefix | 820 tok/s | 804 tok/s | 0.98x |
On this hardware configuration (2x H100 NVL, TP=2), the original FP8 model does not exhibit the prefix caching degradation observed in the bf16 variant. Both models show roughly neutral cache behavior.
### Comparison with bf16 CacheReady (4x H100 PCIe, TP=4)
The bf16 CacheReady variant showed a dramatic difference on 4x H100 PCIe with TP=4:
| Model | Shared-prefix cache effect |
|---|---|
| Original bf16 (TP=4) | 0.65x (caching hurts — 35% slower) |
| bf16 CacheReady (TP=4) | 1.31x (caching helps — 31% faster) |
The routing instability that causes prefix caching failures depends on the serving configuration. Higher tensor parallelism (TP=4), PCIe interconnect (vs NVLink), and different batch patterns can trigger expert selection flips that invalidate cached KV states. At TP=2 on NVLink, the problem doesn't manifest — but it may on larger deployments.
## When does CacheReady matter?
The patch has zero quality cost and zero performance cost. It is most likely to help when:
- Higher TP counts (TP=4, TP=8) — more expert partition boundaries, more routing divergence opportunities
- PCIe interconnect — less deterministic execution order than NVLink
- Mixed-precision serving — nvfp4 quantization amplifies routing noise more than fp8
- Multi-tenant / high-concurrency — varying batch shapes across requests with shared prefixes
If your serving config already produces stable routing (as our 2x H100 NVL TP=2 bench shows), CacheReady is a free safety net. If your config triggers instability (as the bf16 TP=4 PCIe bench shows), CacheReady fixes it.
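The shared-prefix vs unique-prefix workloads in the tables above can be reproduced with a simple prompt generator; a minimal sketch (the prompt shapes and function name are assumptions, not the exact benchmark harness):

```python
def make_prompts(n, prefix_tokens, suffix_tokens, shared=True):
    """Build n prompts that either share one long prefix (cache-friendly)
    or each carry a unique prefix (cache-hostile)."""
    prompts = []
    for i in range(n):
        head = "ctx " * prefix_tokens if shared else f"ctx{i} " * prefix_tokens
        tail = f"question {i}: " + "pad " * suffix_tokens
        prompts.append(head + tail)
    return prompts

shared = make_prompts(4, prefix_tokens=6, suffix_tokens=2, shared=True)
unique = make_prompts(4, prefix_tokens=6, suffix_tokens=2, shared=False)

# All shared-prefix prompts open with the same tokens, so the server's
# prefix cache can reuse their KV blocks; the unique set defeats it.
assert len({p[:20] for p in shared}) == 1
assert len({p[:20] for p in unique}) == 4
```

Sending each set through the server with `--enable-prefix-caching` on and off, and comparing tok/s, reproduces the "Cache Effect" column above.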
## Quality preservation
No measurable quality change. The same routing canonicalization patch from the bf16 CacheReady variant applies directly because Qwen stores gate weights in bf16 precision even in FP8 checkpoints. The patch only modifies router weight rows for experts that produce near-identical outputs.
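The equivalence groups themselves can be found by probing experts and grouping those whose outputs agree within tolerance. A toy sketch (the synthetic weights, tolerance, and greedy grouping are illustrative assumptions, not the released pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy expert weights: experts 0 and 2 are near-identical copies.
W = [rng.standard_normal((d, d)) for _ in range(4)]
W[2] = W[0] + 1e-9 * rng.standard_normal((d, d))

def expert_out(i, x):
    return np.maximum(W[i] @ x, 0.0)   # simple ReLU expert

# Probe each expert with random inputs; greedily group experts whose
# outputs agree on every probe within a small tolerance.
probes = rng.standard_normal((8, d))
groups = []
for i in range(len(W)):
    for g in groups:
        rep = g[0]
        if all(np.allclose(expert_out(i, x), expert_out(rep, x), atol=1e-6)
               for x in probes):
            g.append(i)
            break
    else:
        groups.append([i])

print(groups)   # experts 0 and 2 land in the same group
```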
## Usage
Drop-in replacement for Qwen/Qwen3.5-122B-A10B-FP8. No code changes needed.
### With vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
  --model dystrio/Qwen3.5-122B-A10B-FP8-CacheReady \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --trust-remote-code
```
### With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Qwen3.5-122B-A10B-FP8-CacheReady",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen3.5-122B-A10B-FP8-CacheReady")
```
## Compatibility
- vLLM >= 0.17 (native FP8 support)
- SGLang
- Any framework that loads FP8 safetensors checkpoints
## Related Models
| Variant | Precision | Size | Shared-prefix improvement | Link |
|---|---|---|---|---|
| CacheReady (bf16) | bf16 | ~240 GB | 0.65x → 1.31x (TP=4 PCIe) | dystrio/Qwen3.5-122B-A10B-CacheReady |
| CacheReady (FP8) | FP8 | ~120 GB | neutral at TP=2 NVLink | this model |
## Citation

```bibtex
@misc{dystrio_cacheready_fp8_2026,
  title={Routing Canonicalization for Deterministic MoE Inference (FP8)},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio/Qwen3.5-122B-A10B-FP8-CacheReady}
}
```