---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - rejection-fine-tuning
  - self-distillation
  - qwen
  - qwen3.6
  - moe
  - deltanet
  - linear-attention
  - code-generation
  - coding
  - lora-merged
  - bf16
base_model: Qwen/Qwen3.6-35B-A3B
pipeline_tag: text-generation
model-index:
  - name: Qwen3.6-35B-A3B-RFT
    results:
      - task:
          type: text-generation
        dataset:
          name: Self-generated coding dataset (RFT, filtered)
          type: custom
        metrics:
          - name: Train Loss
            type: train_loss
            value: 0.523
          - name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
            type: avg_sample_pass_rate
            value: 0.985
---

# Qwen3.6-35B-A3B-RFT

A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.

## Method (RFT, Not Pure SSD)

Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv:2604.01193) but differs in a critical way:

- **SSD (the paper)**: Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.

This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.

## Model Details

| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |

> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.

## Training Details

### Method

1. Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8
2. Filtered for correctness (execution + test pass; see the sketch after this list) -- 1,796 samples survived
3. Split into 1,616 train / 180 validation
4. Fine-tuned with LoRA, then merged adapter into base weights
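
A minimal, self-contained sketch of the rejection step in steps 1-2. The actual sandboxing, problem format, and test harness are not published, so `passes_tests` and the demo data below are illustrative:

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execution-based verification: run the candidate against its tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hangs count as rejections
    finally:
        os.unlink(path)

# Demo: a correct sample survives the rejection step, an incorrect one does not.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
kept = [s for s in [good, bad] if passes_tests(s, tests)]
print(len(kept))  # 1
```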

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |

The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
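
In PEFT terms, the table above corresponds roughly to the following `LoraConfig`. The exact training script is not published, so treat this as a sketch:

```python
from peft import LoraConfig

# Mirrors the table above: standard attention/MLP projections plus
# Qwen3.6's Gated DeltaNet projections (in_proj_qkv, in_proj_z, out_proj).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "in_proj_qkv", "in_proj_z", "out_proj",
    ],
    task_type="CAUSAL_LM",
)
```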

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
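
Expressed as Hugging Face `TrainingArguments`, the table above looks approximately like this. The actual trainer stack for this run is not published, and the 2,048-token max sequence length is enforced at tokenization time rather than here:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen36-rft",        # illustrative path
    optim="adamw_bnb_8bit",         # AdamW 8-bit via bitsandbytes
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    max_steps=150,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 4 * 8 = 32
    weight_decay=0.01,
    bf16=True,
    seed=42,
)
```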

### Training Results

| Metric | Value |
|--------|-------|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |

### Merge

Adapter merged into base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading required at inference time.
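
The merge itself is a few lines of PEFT. A sketch, with an illustrative adapter path (the full bf16 model must fit in memory, roughly 70 GB):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", torch_dtype=torch.bfloat16, device_map="cpu"
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # illustrative path
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("Qwen3.6-35B-A3B-RFT")
AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B").save_pretrained("Qwen3.6-35B-A3B-RFT")
```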

## Evaluation

Evaluated as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB) against the base model (unsloth 4-bit quantization), on 13 coding problems with 10 samples each at temp=0.7:

| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|-------------------|-------------|----------------|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |

Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.

| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |
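
For context, the pass rates above come from a tally of roughly this shape. `sample_solution` and `run_tests` stand in for the model call and the execution-based checker; neither is the published harness:

```python
def avg_sample_pass_rate(problems, sample_solution, run_tests, n_samples=10):
    """Fraction of sampled solutions that pass their tests (13 problems x 10 samples here)."""
    passed = total = 0
    for problem in problems:
        for _ in range(n_samples):
            code = sample_solution(problem, temperature=0.7)
            passed += run_tests(code, problem["tests"])  # bool counts as 0/1
            total += 1
    return passed / total  # e.g. 128/130 ~= 0.985 for the merged model
```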

**Important caveats**:

- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. A higher bit width preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than to the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.

## How to Use

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With MLX (Apple Silicon)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)
```

Or quantize first for faster inference:

```bash
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
    --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
    --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
    -q --q-bits 6
```

**Note**: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
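
A small script to apply that patch to a local copy (the directory name is illustrative):

```python
import json

path = "Qwen3.6-35B-A3B-RFT/config.json"  # local model download (illustrative)
with open(path) as f:
    cfg = json.load(f)
if cfg.get("model_type") == "qwen3_5_moe_text":
    cfg["model_type"] = "qwen3_5_moe"
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```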

### With llama.cpp / GGUF

Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:

```bash
# Clone llama.cpp, download the model to a local directory, then convert
huggingface-cli download shaneMattner/Qwen3.6-35B-A3B-RFT --local-dir Qwen3.6-35B-A3B-RFT
python convert_hf_to_gguf.py Qwen3.6-35B-A3B-RFT --outtype bf16

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```

## Limitations

- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.

## Architecture Notes

Qwen3.6-35B-A3B uses a hybrid architecture:
- **Mixture of Experts (MoE)**: 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- **Gated DeltaNet linear attention**: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing
- **262K context window**: Supports up to 262,144 tokens

## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```

### Related Work

- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).

## License

Apache 2.0 (same as the base model [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B))