--- base_model: - Qwen/Qwen3.6-35B-A3B tags: - qwen - nvfp4 - vllm - compressed-tensors name: RedHatAI/Qwen3.6-35B-A3B-NVFP4 --- # NVFP4 Quantized RedHatAI/Qwen3.6-35B-A3B-NVFP4 This is a preliminary version (and subject to change) of NVFP4 quantized [Qwen/Qwen3.6-35B-A3B ](https://huggingface.co/Qwen/Qwen3.6-35B-A3B ) model. The model has both weights and activations quantized to NVFP4 format with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor). It is compatible and tested against vllm main. Deploy it with: `vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 --reasoning-parser qwen3 --moe_backend flashinfer_cutlass` # Creation Script: Run this script with LLM Compressor main and latest transformers.
```python import torch from compressed_tensors.utils import save_mtp_tensors_to_checkpoint from datasets import load_dataset from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier # NOTE: This example requires transformers >= v5 MODEL_ID = "Qwen/Qwen3.6-35B-A3B" # Load model. model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto") processor = AutoProcessor.from_pretrained(MODEL_ID) # No need to include mtp layers as they are not loaded # through Qwen3_5MoeForConditionalGeneration recipe = QuantizationModifier( targets="Linear", scheme="NVFP4", ignore=[ "re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$", "re:.*linear_attn.*", ], ) NUM_CALIBRATION_SAMPLES = 256 MAX_SEQUENCE_LENGTH = 4096 ds = load_dataset( "HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]", ) ds = ds.select_columns(["messages"]) ds = ds.shuffle(seed=42) def preprocess_function(example): messages = [ {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]} for m in example["messages"] ] return processor.apply_chat_template( messages, tokenize=True, return_dict=True, add_generation_prompt=False, processor_kwargs={ "return_tensors": "pt", "padding": False, "truncation": True, "max_length": MAX_SEQUENCE_LENGTH, "add_special_tokens": False, }, ) ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names) def data_collator(batch): assert len(batch) == 1 return {key: torch.tensor(value) for key, value in batch[0].items()} # Apply quantization. oneshot( model=model, recipe=recipe, dataset=ds, max_seq_length=MAX_SEQUENCE_LENGTH, num_calibration_samples=NUM_CALIBRATION_SAMPLES, moe_calibrate_all_experts=True, data_collator=data_collator, ) # Save to disk in compressed-tensors format. SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4" model.save_pretrained(SAVE_DIR) processor.save_pretrained(SAVE_DIR) # MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration # Save them as-is from the original checkpoint into the quantized output. save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR) ```
# Preliminary Evaluations 1) GSM8K Platinum: ``` lm_eval --model local-chat-completions \ --tasks gsm8k_platinum_cot_llama \ --model_args "model=RedHatAI/Qwen3.6-35B-A3B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \ --num_fewshot 0 \ --apply_chat_template \ --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678" ``` Recovery: | | Qwen/Qwen3.6-35B-A3B | RedHatAI/Qwen3.6-35B-A3B-NVFP4
(this model) | | -------- | :--------------------: | :------------------------------------: | | Accuracy | 95.62 | 96.28 | | Recovery | \- | 100.69% | **Note**: More rigorous evaluations are currently in progress and will be available soon.