# Qwen3.5-122B-A10B-NVFP4

## Model Overview

- Model Architecture: Qwen3NextForCausalLM
  - Input: Text
  - Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Release Date:
- Version: 1.0
- Model Developers: Red Hat

Quantized version of Qwen/Qwen3.5-122B-A10B.
## Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/Qwen3.5-122B-A10B to the FP4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
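As a rough sanity check, the savings can be estimated from the parameter count alone (a back-of-the-envelope sketch; real checkpoints also store quantization scales and unquantized modules, so actual sizes differ somewhat):

```python
# Back-of-the-envelope estimate of memory savings from 16-bit -> FP4 weights.
# The ~122B parameter count is taken from the model name; treat the result
# as an approximation only.
NUM_PARAMS = 122e9

bf16_bytes = NUM_PARAMS * 2    # 16 bits = 2 bytes per parameter
fp4_bytes = NUM_PARAMS * 0.5   # 4 bits = 0.5 bytes per parameter

reduction = 1 - fp4_bytes / bf16_bytes
print(f"BF16: {bf16_bytes / 1e9:.0f} GB, FP4: {fp4_bytes / 1e9:.0f} GB")
print(f"Reduction: {reduction:.0%}")
```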
## Deployment

### Use with vLLM

This model can be deployed efficiently using vLLM.

- Text-Only: Skip the vision encoder to free up memory for additional KV cache:

  ```shell
  vllm serve RedHatAI/Qwen3.5-122B-A10B-NVFP4 --reasoning-parser qwen3 --language-model-only --moe_backend flashinfer_cutlass
  ```

- Multimodal (Text + Image): Serve with full vision support:

  ```shell
  vllm serve RedHatAI/Qwen3.5-122B-A10B-NVFP4 --reasoning-parser qwen3 --moe_backend flashinfer_cutlass
  ```

- Tool Calling: Enable tool-use support:

  ```shell
  vllm serve RedHatAI/Qwen3.5-122B-A10B-NVFP4 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --moe_backend flashinfer_cutlass
  ```

- Multi-Token Prediction (MTP): Enable speculative decoding:

  ```shell
  vllm serve RedHatAI/Qwen3.5-122B-A10B-NVFP4 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --moe_backend flashinfer_cutlass
  ```
Send requests to the server:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Qwen3.5-122B-A10B-NVFP4"
messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
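When the server is launched with `--enable-auto-tool-choice`, tool definitions can be passed through the same OpenAI-compatible API. The sketch below shows the shape of such a request; `get_weather` and its parameters are hypothetical examples, not part of this model or server:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
# "get_weather" and its "city" parameter are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# With a running server, this list is passed alongside `model` and
# `messages`, e.g.:
# client.chat.completions.create(model=model, messages=messages, tools=tools)
print(tools[0]["function"]["name"])
```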
## Creation

This model was quantized using the llm-compressor library as shown below.

**Creation details**
```python
import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NOTE: This example requires transformers >= v5
MODEL_ID = "Qwen/Qwen3.5-122B-A10B"

# Load model.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# No need to include MTP layers, as they are not loaded
# through Qwen3_5MoeForConditionalGeneration.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

# Load and preprocess the calibration dataset.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.select_columns(["messages"])
ds = ds.shuffle(seed=42)

def preprocess_function(example):
    messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
        for m in example["messages"]
    ]
    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )

ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)

def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

# Apply quantization.
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration.
# Save them as-is from the original checkpoint into the quantized output.
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
```
## Evaluation

The model was evaluated on IFEval, MMLU-Pro, and GSM8k Platinum using lm-evaluation-harness, and on reasoning tasks (AIME 2025, MATH-500, GPQA Diamond) using lighteval. vLLM was used for all evaluations.
**Evaluation details**

lm-evaluation-harness:

```shell
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/Qwen3.5-122B-A10B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"

lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/Qwen3.5-122B-A10B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"

lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Qwen3.5-122B-A10B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"
```
lighteval, with the following `lighteval_model_arguments.yaml`:

```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Qwen3.5-122B-A10B-NVFP4"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 2400
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 64000
    top_p: 0.95
    top_k: 20
    min_p: 0.0
    presence_penalty: 1.5
    repetition_penalty: 1.0
    seed: 5678
```

```shell
lighteval endpoint litellm lighteval_model_arguments.yaml \
  "aime25|0,math_500|0,gpqa:diamond|0"
```
## Accuracy
| Benchmark | Qwen3.5-122B-A10B | Qwen3.5-122B-A10B-NVFP4 (this model) | Recovery (%) |
|---|---|---|---|
| GSM8k Platinum (0-shot) | 95.59 | 95.37 | 99.77 |
| MMLU-Pro (0-shot) | 86.96 | 86.62 | 99.61 |
| IfEval (0-shot) | 93.80 | 93.32 | 99.49 |
| AIME 2025 | 92.92 | 91.66 | 98.65 |
| GPQA diamond | 87.54 | 86.70 | 99.04 |
| Math 500 | 84.73 | 84.80 | 100.08 |
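The recovery column is the quantized model's score expressed as a percentage of the baseline score. For GSM8k Platinum, for example:

```python
# Recovery (%) = quantized score / baseline score * 100, as used in the
# accuracy table above.
baseline = 95.59   # Qwen3.5-122B-A10B, GSM8k Platinum (0-shot)
quantized = 95.37  # this NVFP4 model

recovery = quantized / baseline * 100
print(f"{recovery:.2f}")  # 99.77
```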