---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- rejection-fine-tuning
- self-distillation
- qwen
- qwen3.6
- moe
- deltanet
- linear-attention
- code-generation
- coding
- lora-merged
- bf16
base_model: Qwen/Qwen3.6-35B-A3B
pipeline_tag: text-generation
model-index:
- name: Qwen3.6-35B-A3B-RFT
  results:
  - task:
      type: text-generation
    dataset:
      name: Self-generated coding dataset (RFT, filtered)
      type: custom
    metrics:
    - name: Train Loss
      type: train_loss
      value: 0.523
    - name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
      type: avg_sample_pass_rate
      value: 0.985
---
# Qwen3.6-35B-A3B-RFT
A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.
## Method (RFT, Not Pure SSD)
Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv:2604.01193) but differs in a critical way:
- **SSD (the paper)**: Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.
This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
## Model Details
| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |
> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
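These details can be sanity-checked from the config alone. A quick sketch -- it assumes a transformers version that recognizes this model type, and on some versions these attributes may sit under a nested text config:
```python
from transformers import AutoConfig

# Inspect the merged model's config without downloading the weights
cfg = AutoConfig.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")
print(cfg.model_type)               # expected: qwen3_5_moe (see note above)
print(cfg.num_hidden_layers)        # expected: 40
print(cfg.max_position_embeddings)  # expected: 262144
```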
## Training Details
### Method
1. Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8
2. Filtered for correctness (execution + test pass) -- 1,796 samples survived
3. Split into 1,616 train / 180 validation
4. Fine-tuned with LoRA, then merged adapter into base weights
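The data-generation scripts are not published with this card; the following is a minimal sketch of steps 1-3, where `samples` is a hypothetical list standing in for the actual self-generated pool:
```python
import random
import subprocess
import sys
import tempfile

def passes_tests(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Execution-based verification: run a candidate solution together with
    its unit tests in a fresh interpreter; keep it only if everything passes.
    (In practice this should run inside a sandbox.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# `samples` is a hypothetical list of dicts produced in step 1 by sampling
# the base model at temperature=1.6, top_k=20, top_p=0.8.
kept = [s for s in samples if passes_tests(s["solution"], s["tests"])]  # step 2

random.seed(42)
random.shuffle(kept)
train, val = kept[:1616], kept[1616:]  # step 3: 1,616 train / 180 validation
```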
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |
The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
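The exact training script is not published; in PEFT terms, the configuration above corresponds to something like:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # full-attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP / expert projections
        "in_proj_qkv", "in_proj_z", "out_proj",  # Gated DeltaNet layers
    ],
    task_type="CAUSAL_LM",
)
```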
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
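Expressed as transformers `TrainingArguments`, the table amounts to roughly the following sketch (the actual script is not published; `adamw_bnb_8bit` is assumed as the 8-bit AdamW implementation, and the output path is hypothetical):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3.6-rft",        # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,               # 6% of steps
    max_steps=150,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_bnb_8bit",          # 8-bit AdamW via bitsandbytes
    seed=42,
)
```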
### Training Results
| Metric | Value |
|--------|-------|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |
### Merge
Adapter merged into base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading required at inference time.
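For reference, the merge amounts to the following (the adapter directory shown is a hypothetical local path; the adapter itself is not published separately):
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "lora-adapter")  # hypothetical adapter dir
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("Qwen3.6-35B-A3B-RFT")
```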
## Evaluation
The merged model was tested as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB), against the base model as a 4-bit quantization (unsloth). 13 coding problems, 10 samples each at temp=0.7:
| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|-------------------|-------------|----------------|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |
Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.
| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |
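The evaluation harness itself is not published; it amounts to sampling each problem repeatedly and scoring with the same execution-based check used for data filtering. A rough sketch, assuming a recent mlx-lm where sampling parameters go through `make_sampler`, with hypothetical `problems`, `extract_code`, and `passes_tests` helpers (the last as sketched in the Training Details section):
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("Qwen3.6-35B-A3B-RFT-6bit")
sampler = make_sampler(temp=0.7)

pass_rates = {}
for name, (prompt, tests) in problems.items():  # hypothetical problem set
    passed = 0
    for _ in range(10):                         # 10 samples per problem
        out = generate(model, tokenizer, prompt=prompt, max_tokens=1024, sampler=sampler)
        passed += passes_tests(extract_code(out), tests)  # hypothetical helpers
    pass_rates[name] = passed / 10

print(sum(pass_rates.values()) / len(pass_rates))  # avg_sample_pass_rate
```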
**Important caveats**:
- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. A higher bit-width preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
## How to Use
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With MLX (Apple Silicon)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")

response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)
```
Or quantize first for faster inference:
```bash
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
    --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
    --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
    -q --q-bits 6
```
**Note**: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
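For example, after downloading the model locally, the field can be patched like this (adjust the path to your local copy):
```python
import json

cfg_path = "Qwen3.6-35B-A3B-RFT/config.json"  # local download of this repo
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["model_type"] = "qwen3_5_moe"             # was: qwen3_5_moe_text
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```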
### With llama.cpp / GGUF
Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:
```bash
# Clone llama.cpp and download the model locally, then convert
huggingface-cli download shaneMattner/Qwen3.6-35B-A3B-RFT --local-dir Qwen3.6-35B-A3B-RFT
python convert_hf_to_gguf.py Qwen3.6-35B-A3B-RFT --outtype bf16 --outfile Qwen3.6-35B-A3B-RFT-bf16.gguf

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```
## Limitations
- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.
## Architecture Notes
Qwen3.6-35B-A3B uses a hybrid architecture:
- **Mixture of Experts (MoE)**: 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- **Gated DeltaNet linear attention**: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing
- **262K context window**: Supports up to 262,144 tokens
## Citation
If you use this model, please cite:
```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```
### Related Work
- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).
## License
Apache 2.0 (same as the base model [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B))