# Qwen3.6-35B-A3B-RFT
A fine-tuned version of Qwen/Qwen3.6-35B-A3B using Rejection Fine-Tuning (RFT) on self-generated data, inspired by the Simple Self-Distillation (SSD) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.
## Method (RFT, Not Pure SSD)

Our method is inspired by the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv 2604.01193) but differs in a critical way:
- SSD (the paper): Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- Our method: We generated samples at high temperature, then filtered for correctness using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.
This makes our method Rejection Fine-Tuning (RFT) -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
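The correctness gate that separates RFT from pure SSD is just "execute and compare". A minimal sketch of the idea -- the `passes_tests` helper and the toy candidates below are hypothetical stand-ins for the real sampler and test harness:

```python
def passes_tests(code: str, tests: list, fn_name: str) -> bool:
    """Execute a candidate solution and check it against unit tests."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # run the candidate's definitions
        fn = namespace[fn_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                   # any crash counts as a rejection

# Toy stand-ins for high-temperature model samples on "add two numbers".
candidates = [
    "def add(a, b):\n    return a + b",   # correct -> kept
    "def add(a, b):\n    return a - b",   # wrong   -> rejected
    "def add(a, b):\n    return a * b",   # wrong   -> rejected
]
tests = [((1, 2), 3), ((0, 5), 5)]

kept = [c for c in candidates if passes_tests(c, tests, "add")]
print(f"{len(kept)}/{len(candidates)} samples survived filtering")  # 1/3
```

Training then proceeds as plain SFT on the surviving samples only.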
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |
Architecture note: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
## Training Details

### Method
- Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8
- Filtered for correctness (execution + test pass) -- 1,796 samples survived
- Split into 1,616 train / 180 validation
- Fine-tuned with LoRA, then merged adapter into base weights
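The 1,796 -> 1,616/180 split corresponds to holding out roughly 10% for validation. A deterministic sketch of such a split -- the 10% fraction and the seed here are assumptions for illustration, not recorded pipeline settings:

```python
import random

def split_dataset(samples, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then carve off a validation slice."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

samples = list(range(1796))            # stand-ins for the verified solutions
train, val = split_dataset(samples)
print(len(train), len(val))            # 1616 180
```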
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |
The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
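The trainable-parameter figure can be sanity-checked by hand: a rank-r LoRA adapter on a (d_out x d_in) linear layer adds r * (d_in + d_out) weights for its two low-rank factors. The layer dimension below is an illustrative stand-in, not a real Qwen3.6 shape:

```python
def lora_params(d_out: int, d_in: int, r: int = 16) -> int:
    """Weights added by one rank-r LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# e.g. a hypothetical 2048x2048 projection at rank 16:
print(lora_params(2048, 2048))                 # 65536

# The reported total, 19.2M of 34.66B parameters, is ~0.055%:
print(round(19.2e6 / 34.66e9 * 100, 3))        # 0.055
```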
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
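Putting the schedule numbers together: 6% warmup over 150 steps is 9 warmup steps, then cosine decay from the 2e-4 peak. A self-contained sketch of that schedule (the exact warmup rounding inside the training framework may differ slightly):

```python
import math

def lr_at(step, max_steps=150, peak=2e-4, warmup_frac=0.06):
    """Linear warmup followed by cosine decay to zero."""
    warmup = int(max_steps * warmup_frac)       # 9 steps here
    if step < warmup:
        return peak * (step + 1) / warmup       # linear warmup
    progress = (step - warmup) / (max_steps - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(f"{lr_at(0):.1e}")     # early warmup, well below peak
print(f"{lr_at(9):.1e}")     # 2.0e-04 -- peak right after warmup
print(f"{lr_at(149):.1e}")   # near zero at the final step
```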
### Training Results
| Metric | Value |
|---|---|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |
### Merge

The LoRA adapter was merged into the base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading is required at inference time.
## Evaluation

The merged model was tested as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB), against the base model as an Unsloth 4-bit quantization: 13 coding problems, 10 samples each at temp=0.7.
| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|---|---|---|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| Overall | 126/130 (97%) | 128/130 (98%) |
Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.
| Metric | Value |
|---|---|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |
Important caveats:
- Quantization confound: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. Higher bit-width preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than to the RFT training. A controlled comparison at matched quantization has not been run.
- Statistical significance: The difference of 2/130 samples is not statistically significant (p ≈ 0.68, two-sided Fisher's exact test on the pooled pass/fail counts). These results are within noise at this sample size.
- Temp=0 behavior: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
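The significance check is easy to reproduce with only the standard library. The helper below is a generic two-sided Fisher's exact test over the pooled 2x2 pass/fail table, written for this card rather than taken from the evaluation harness:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    def p_table(x):                    # hypergeometric prob. of cell a == x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Base: 126 pass / 4 fail; merged: 128 pass / 2 fail.
print(round(fisher_exact_two_sided(126, 4, 128, 2), 2))   # 0.68
```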
## How to Use

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With MLX (Apple Silicon)

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)
```
Or quantize first for faster inference:
```shell
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
  --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
  --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
  -q --q-bits 6
```
Note: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
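That tweak can be scripted against a local snapshot of the model; the helper and the example path below are illustrative, not part of the release:

```python
import json

def fix_model_type(cfg_path):
    """Patch config.json's model_type for mlx-lm compatibility."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") == "qwen3_5_moe_text":
        cfg["model_type"] = "qwen3_5_moe"
        with open(cfg_path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg["model_type"]

# e.g. fix_model_type("Qwen3.6-35B-A3B-RFT/config.json")
```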
### With llama.cpp / GGUF

Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:

```shell
# Clone llama.cpp and convert
python convert_hf_to_gguf.py shaneMattner/Qwen3.6-35B-A3B-RFT --outtype bf16

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```
## Limitations
- Coding-focused: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- Bounded by base model: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- Small training set: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- Eval coverage: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- Quantization confound: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- DeltaNet targeting: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.
## Architecture Notes
Qwen3.6-35B-A3B uses a hybrid architecture:
- Mixture of Experts (MoE): 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- Gated DeltaNet linear attention: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing
- 262K context window: Supports up to 262,144 tokens
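The 30/10 layer split follows from the interleaving pattern: with one full-attention layer per block of four, 40 layers yield 10 full-attention and 30 linear-attention layers. A one-liner sketch (the exact placement of the full-attention layer within each block of four is an assumption):

```python
# Assumed pattern: every 4th layer is full attention, the rest Gated DeltaNet.
layers = ["full" if (i + 1) % 4 == 0 else "linear" for i in range(40)]
print(layers.count("linear"), layers.count("full"))   # 30 10
```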
## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```
## Related Work
- Qwen3.6-35B-A3B -- Base model by Qwen team
- LoRA: Low-Rank Adaptation of Large Language Models -- Hu et al., 2021
- Embarrassingly Simple Self-Distillation Improves Code Generation -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).
## License
Apache 2.0 (same as the base model Qwen/Qwen3.6-35B-A3B)