| --- |
| base_model: |
| - Qwen/Qwen3.6-35B-A3B |
| tags: |
| - qwen |
| - nvfp4 |
| - vllm |
| - compressed-tensors |
| name: RedHatAI/Qwen3.6-35B-A3B-NVFP4 |
| --- |
| |
| # NVFP4 Quantized RedHatAI/Qwen3.6-35B-A3B-NVFP4 |
|
|
| This is a preliminary version (and subject to change) of NVFP4 quantized [Qwen/Qwen3.6-35B-A3B ](https://huggingface.co/Qwen/Qwen3.6-35B-A3B ) model. |
| The model has both weights and activations quantized to NVFP4 format with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor). |
|
|
| It is compatible and tested against vllm main. Deploy it with: `vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 --reasoning-parser qwen3 --moe_backend flashinfer_cutlass` |
|
|
| # Creation Script: |
|
|
| Run this script with LLM Compressor main and latest transformers. |
|
|
| <details> |
| |
| ```python |
| import torch |
| from compressed_tensors.utils import save_mtp_tensors_to_checkpoint |
| from datasets import load_dataset |
| from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration |
| |
| from llmcompressor import oneshot |
| from llmcompressor.modifiers.quantization import QuantizationModifier |
| |
| # NOTE: This example requires transformers >= v5 |
| |
| MODEL_ID = "Qwen/Qwen3.6-35B-A3B" |
| |
| # Load model. |
| model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto") |
| processor = AutoProcessor.from_pretrained(MODEL_ID) |
| |
| # No need to include mtp layers as they are not loaded |
| # through Qwen3_5MoeForConditionalGeneration |
| recipe = QuantizationModifier( |
| targets="Linear", |
| scheme="NVFP4", |
| ignore=[ |
| "re:.*lm_head", |
| "re:visual.*", |
| "re:model.visual.*", |
| "re:.*mlp.gate$", |
| "re:.*embed_tokens$", |
| "re:.*shared_expert_gate$", |
| "re:.*linear_attn.*", |
| ], |
| ) |
| |
| NUM_CALIBRATION_SAMPLES = 256 |
| MAX_SEQUENCE_LENGTH = 4096 |
| |
| ds = load_dataset( |
| "HuggingFaceH4/ultrachat_200k", |
| split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]", |
| ) |
| ds = ds.select_columns(["messages"]) |
| ds = ds.shuffle(seed=42) |
| |
| |
| def preprocess_function(example): |
| messages = [ |
| {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]} |
| for m in example["messages"] |
| ] |
| return processor.apply_chat_template( |
| messages, |
| tokenize=True, |
| return_dict=True, |
| add_generation_prompt=False, |
| processor_kwargs={ |
| "return_tensors": "pt", |
| "padding": False, |
| "truncation": True, |
| "max_length": MAX_SEQUENCE_LENGTH, |
| "add_special_tokens": False, |
| }, |
| ) |
| |
| |
| ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names) |
| |
| |
| def data_collator(batch): |
| assert len(batch) == 1 |
| return {key: torch.tensor(value) for key, value in batch[0].items()} |
| |
| |
| # Apply quantization. |
| oneshot( |
| model=model, |
| recipe=recipe, |
| dataset=ds, |
| max_seq_length=MAX_SEQUENCE_LENGTH, |
| num_calibration_samples=NUM_CALIBRATION_SAMPLES, |
| moe_calibrate_all_experts=True, |
| data_collator=data_collator, |
| ) |
| |
| # Save to disk in compressed-tensors format. |
| SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4" |
| model.save_pretrained(SAVE_DIR) |
| processor.save_pretrained(SAVE_DIR) |
| |
| # MTP layers are excluded from the model through Qwen3_5MoeForConditionalGeneration |
| # Save them as-is from the original checkpoint into the quantized output. |
| save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR) |
| |
| ``` |
| </details> |
|
|
|
|
| # Preliminary Evaluations |
|
|
| 1) GSM8K Platinum: |
| ``` |
| lm_eval --model local-chat-completions \ |
| --tasks gsm8k_platinum_cot_llama \ |
| --model_args "model=RedHatAI/Qwen3.6-35B-A3B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \ |
| --num_fewshot 0 \ |
| --apply_chat_template \ |
| --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678" |
| |
| |
| ``` |
|
|
| Recovery: |
|
|
| | | Qwen/Qwen3.6-35B-A3B | RedHatAI/Qwen3.6-35B-A3B-NVFP4<br> (this model) | |
| | -------- | :--------------------: | :------------------------------------: | |
| | Accuracy | 95.62 | 96.28 | |
| | Recovery | \- | 100.69% | |
|
|
|
|
| **Note**: More rigorous evaluations are currently in progress and will be available soon. |
|
|
|
|
|
|