---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- rejection-fine-tuning
- self-distillation
- qwen
- qwen3.6
- moe
- deltanet
- linear-attention
- code-generation
- coding
- lora-merged
- bf16
base_model: Qwen/Qwen3.6-35B-A3B
pipeline_tag: text-generation
model-index:
- name: Qwen3.6-35B-A3B-RFT
  results:
  - task:
      type: text-generation
    dataset:
      name: Self-generated coding dataset (RFT, filtered)
      type: custom
    metrics:
    - name: Train Loss
      type: train_loss
      value: 0.523
    - name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
      type: avg_sample_pass_rate
      value: 0.985
---

# Qwen3.6-35B-A3B-RFT

A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.

## Method (RFT, Not Pure SSD)

Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv 2604.01193) but differs in a critical way:

- **SSD (the paper)**: Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.

This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
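
For concreteness, the sketch below shows the general shape of execution-based rejection filtering. It is illustrative only, not the exact harness used for this model: the sample format (solution code plus assert-style test snippets) and the helper names are assumptions.

```python
# Minimal sketch of execution-based rejection filtering (illustrative only; not the
# exact harness used for this model). Each sample is assumed to carry the generated
# solution code plus a list of assert-style test snippets; only samples whose tests
# all pass are kept for fine-tuning.
import multiprocessing


def _run_candidate(code, tests, result_queue):
    try:
        namespace = {}
        exec(code, namespace)          # define the candidate solution
        for test in tests:
            exec(test, namespace)      # assert-style checks raise on failure
        result_queue.put(True)
    except Exception:
        result_queue.put(False)


def passes_tests(code, tests, timeout_s=5.0):
    """Run the candidate in a subprocess so hangs and crashes are contained."""
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_candidate, args=(code, tests, result_queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        return False
    return not result_queue.empty() and result_queue.get()


def filter_samples(samples):
    """The 'rejection' step: keep only verified-correct generations."""
    return [s for s in samples if passes_tests(s["code"], s["tests"])]
```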

## Model Details

| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |

> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.

## Training Details

### Method

1. Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8 (sampling settings sketched below)
2. Filtered for correctness (execution + test pass) -- 1,796 samples survived
3. Split into 1,616 train / 180 validation
4. Fine-tuned with LoRA, then merged adapter into base weights
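
As a rough sketch of step 1 (not the exact generation script), candidate solutions are sampled from the base model with the settings above. The prompt list and per-prompt sample count here are placeholders.

```python
# Sketch of step 1: high-temperature sampling from the base model (illustrative).
# The prompt list and per-prompt sample count are placeholders, not the real dataset.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "Qwen/Qwen3.6-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

prompts = ["Write a Python function that returns the n-th Fibonacci number."]  # placeholder

generations = []
for prompt in prompts:
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.6,            # high temperature for diverse candidates
        top_k=20,
        top_p=0.8,
        max_new_tokens=1024,
        num_return_sequences=4,     # placeholder; 2,000 total samples were generated
    )
    for seq in outputs:
        completion = seq[inputs["input_ids"].shape[1]:]
        generations.append(tokenizer.decode(completion, skip_special_tokens=True))
```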

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |

The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
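
A minimal PEFT `LoraConfig` matching the table above might look like the following. This is a reconstruction from the reported hyperparameters, not the exact training configuration.

```python
# Reconstruction of the LoRA setup from the table above (illustrative, not the exact config).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
import torch

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        # standard attention / MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # Gated DeltaNet linear-attention projections
        "in_proj_qkv", "in_proj_z", "out_proj",
    ],
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # should report roughly 19.2M trainable parameters
```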

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
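
Expressed as Hugging Face `TrainingArguments`, these settings would look roughly like the following. This is a reconstruction, not the original script; the output directory and the `adamw_bnb_8bit` optimizer string (8-bit AdamW via bitsandbytes) are assumptions.

```python
# Rough TrainingArguments equivalent of the table above (reconstruction, not the original script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3.6-35b-a3b-rft-lora",   # placeholder path
    optim="adamw_bnb_8bit",                  # 8-bit AdamW via bitsandbytes (assumed variant)
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    max_steps=150,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,           # effective batch size 32
    weight_decay=0.01,
    bf16=True,
    seed=42,
    logging_steps=10,
)
# The 2,048-token max sequence length is applied at tokenization / trainer level,
# not through TrainingArguments.
```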

### Training Results

| Metric | Value |
|--------|-------|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |

### Merge

Adapter merged into base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading required at inference time.
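
The merge follows the standard PEFT pattern; the adapter path below is a placeholder.

```python
# Standard PEFT merge pattern (the adapter path here is a placeholder).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",                 # merging does not require a GPU
)
model = PeftModel.from_pretrained(base, "path/to/rft-lora-adapter")  # placeholder adapter path
merged = model.merge_and_unload()     # folds the LoRA deltas into the base weights

merged.save_pretrained("Qwen3.6-35B-A3B-RFT")
AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B").save_pretrained("Qwen3.6-35B-A3B-RFT")
```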

## Evaluation

Tested as a 6-bit MLX quantization on Mac Studio M4 Max (128GB) against the base model (unsloth 4-bit quantization). 13 coding problems, 10 samples each at temp=0.7:

| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|-------------------|-------------|----------------|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |

Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.
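
The `avg_sample_pass_rate` metric in the model card metadata is simply passed samples over total samples. A small sketch of that bookkeeping, with placeholder per-problem counts standing in for the real results:

```python
# How the avg_sample_pass_rate headline metric is computed (sketch; the counts
# below are placeholders, not the actual per-problem results).
per_problem_passes = {               # problem_id -> number of passing samples out of 10
    "two_sum": 10,
    "expression_evaluator": 9,
    # ... remaining problems ...
}
samples_per_problem = 10

total_passed = sum(per_problem_passes.values())
total_samples = samples_per_problem * len(per_problem_passes)
avg_sample_pass_rate = total_passed / total_samples
print(f"avg_sample_pass_rate = {avg_sample_pass_rate:.3f}")  # 128/130 = 0.985 in the run reported above
```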

| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |

**Important caveats**:

- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. The higher-precision 6-bit quantization preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than to the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.

## How to Use

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With MLX (Apple Silicon)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)
```

Or quantize first for faster inference:

```bash
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
    --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
    --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
    -q --q-bits 6
```

**Note**: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
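
A small snippet for applying that edit to a local copy of the converted model; the directory name is a placeholder, so adjust it to wherever your `config.json` lives.

```python
# Patch model_type in a local config.json for mlx-lm compatibility
# (the directory below is a placeholder for your local model path).
import json
from pathlib import Path

cfg_path = Path("Qwen3.6-35B-A3B-RFT-6bit") / "config.json"
cfg = json.loads(cfg_path.read_text())
if cfg.get("model_type") == "qwen3_5_moe_text":
    cfg["model_type"] = "qwen3_5_moe"
    cfg_path.write_text(json.dumps(cfg, indent=2))
    print("Patched model_type to qwen3_5_moe")
```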

### With llama.cpp / GGUF

Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:

```bash
# Clone llama.cpp and convert
python convert_hf_to_gguf.py shaneMattner/Qwen3.6-35B-A3B-RFT --outtype bf16

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```

## Limitations

- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.

## Architecture Notes

Qwen3.6-35B-A3B uses a hybrid architecture:

- **Mixture of Experts (MoE)**: 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- **Gated DeltaNet linear attention**: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing
- **262K context window**: Supports up to 262,144 tokens
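
If you want to confirm these settings against the shipped configuration rather than this card, the config can be inspected directly. Field names for the MoE and DeltaNet settings vary by architecture, so the snippet simply lists whatever keys are present instead of assuming their names.

```python
# Inspect the shipped config instead of relying on this card
# (MoE / DeltaNet field names vary by architecture, so just list the keys).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")
print(cfg.model_type)  # expected: qwen3_5_moe (see Architecture note above)
for key, value in sorted(cfg.to_dict().items()):
    print(f"{key}: {value}")
```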

## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```

### Related Work

- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).

## License

Apache 2.0 (same as the base model [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B))