Qwen3.6-35B-A3B-RFT

A fine-tuned version of Qwen/Qwen3.6-35B-A3B using Rejection Fine-Tuning (RFT) on self-generated data, inspired by the Simple Self-Distillation (SSD) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.

Method (RFT, Not Pure SSD)

Our method is inspired by the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv:2604.01193) but differs in a critical way:

  • SSD (the paper): Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
  • Our method: We generated samples at high temperature, then filtered for correctness using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.

This makes our method Rejection Fine-Tuning (RFT) -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
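
The execution-based verification amounts to running each candidate solution against its unit tests and keeping only the samples that pass. A minimal sketch of that rejection filter (the sample format, helper name, and subprocess setup are illustrative, not the exact pipeline used for this model):

import subprocess
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a candidate solution plus its unit tests in a fresh subprocess.
    Returns True only if the combined script exits cleanly within the timeout."""
    script = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Rejection step: keep only the generations whose code passes the tests.
samples = [
    {"code": "def add(a, b):\n    return a + b", "tests": "assert add(2, 3) == 5"},
]
kept = [s for s in samples if passes_tests(s["code"], s["tests"])]
print(f"{len(kept)}/{len(samples)} samples passed")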

Model Details

Property Value
Architecture Qwen3.5 MoE with Gated DeltaNet linear attention (see note below)
Total parameters 34.66B
Active parameters ~3B (Mixture of Experts, 256 experts, 8 active per token)
Hidden layers 40 (30 linear attention + 10 full attention)
Precision bfloat16
Model size on disk ~64 GB
Context length 262,144 tokens
License Apache 2.0

Architecture note: The HuggingFace config reports model_type: qwen3_5_moe -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.

Training Details

Method

  1. Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8 (see the sampling sketch below)
  2. Filtered for correctness (execution + test pass) -- 1,796 samples survived
  3. Split into 1,616 train / 180 validation
  4. Fine-tuned with LoRA, then merged adapter into base weights
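
The high-temperature sampling in step 1 maps directly onto standard transformers generate() arguments. A rough sketch with the parameters listed above (the prompt and token budget are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model used for self-generation in step 1.
base_id = "Qwen/Qwen3.6-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# High-temperature sampling to obtain diverse candidate solutions.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.6,
    top_k=20,
    top_p=0.8,
    max_new_tokens=1024,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))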

LoRA Configuration

Parameter Value
Rank (r) 16
Alpha 16
Dropout 0.0
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj
Trainable parameters 19.2M / 34.66B (0.055%)

The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
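
In PEFT terms, the table above corresponds roughly to the following LoraConfig (a sketch; the task_type and the surrounding model loading are assumed, not taken from the training script):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=[
        # Standard transformer attention / MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # Gated DeltaNet linear-attention projections
        "in_proj_qkv", "in_proj_z", "out_proj",
    ],
)

# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.6-35B-A3B", torch_dtype=torch.bfloat16)
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()  # ~19.2M trainable parameters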

Training Hyperparameters

Parameter Value
Optimizer AdamW 8-bit
Learning rate 2e-4 (cosine schedule)
Warmup 6% of steps
Max steps 150
Batch size 4
Gradient accumulation 8 (effective batch = 32)
Max sequence length 2,048
Weight decay 0.01
Precision bfloat16 (no quantization during training)
Seed 42
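
Expressed as transformers TrainingArguments, the table above corresponds roughly to the following (the exact optimizer string and output directory are assumptions; sequences were truncated to 2,048 tokens during preprocessing):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen36-rft-lora",
    optim="adamw_bnb_8bit",            # 8-bit AdamW via bitsandbytes
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,                 # 6% of steps
    max_steps=150,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size 32
    weight_decay=0.01,
    bf16=True,
    seed=42,
)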

Training Results

Metric Value
Final train loss 0.523
Eval loss 0.482 (at step 150)
Token accuracy 85.9%
Training time 78 min
Peak GPU memory 64.7 GB
Hardware NVIDIA H200 (Modal cloud)
Estimated cost ~$6.20

Merge

Adapter merged into base weights using PeftModel.merge_and_unload() from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading required at inference time.
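
A minimal sketch of that merge step (the adapter path and output directory are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3.6-35B-A3B"
adapter_dir = "path/to/lora-adapter"   # placeholder for the trained LoRA checkpoint

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="cpu")
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()

merged.save_pretrained("Qwen3.6-35B-A3B-RFT", safe_serialization=True)
AutoTokenizer.from_pretrained(base_id).save_pretrained("Qwen3.6-35B-A3B-RFT")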

Evaluation

The merged model was evaluated as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB) against the base model (Unsloth 4-bit quantization), on 13 coding problems with 10 samples each at temp=0.7:

Problem difficulty Base (4-bit) Merged (6-bit)
Easy (5 problems) 50/50 (100%) 50/50 (100%)
Hard (8 problems) 76/80 (95%) 78/80 (98%)
Overall 126/130 (97%) 128/130 (98%)

Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.

Metric Value
Inference speed (6-bit MLX) 78.9 tok/s average
Base model speed (4-bit MLX) 86.7 tok/s average

Important caveats:

  • Quantization confound: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. The higher bit-width preserves more of the original weights' precision, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than the RFT training. A controlled comparison at matched quantization has not been run.
  • Statistical significance: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
  • Temp=0 behavior: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.

How to Use

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With MLX (Apple Silicon)

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)

Or quantize first for faster inference:

# Convert to 6-bit MLX format
python -m mlx_lm.convert \
    --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
    --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
    -q --q-bits 6

Note: If you encounter errors related to model_type, you may need to change "model_type": "qwen3_5_moe_text" to "model_type": "qwen3_5_moe" in config.json for mlx-lm compatibility.
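
If you hit that error, the field can be patched in place; a small sketch assuming the 6-bit conversion above lives in the local directory Qwen3.6-35B-A3B-RFT-6bit:

import json, pathlib

config_path = pathlib.Path("Qwen3.6-35B-A3B-RFT-6bit") / "config.json"
config = json.loads(config_path.read_text())

# mlx-lm expects the base architecture name rather than the *_text variant.
if config.get("model_type") == "qwen3_5_moe_text":
    config["model_type"] = "qwen3_5_moe"
    config_path.write_text(json.dumps(config, indent=2))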

With llama.cpp / GGUF

Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:

# Download the model to a local directory first (convert_hf_to_gguf.py expects
# a local model path, not a Hub repo ID), then convert
python convert_hf_to_gguf.py ./Qwen3.6-35B-A3B-RFT --outtype bf16 \
    --outfile Qwen3.6-35B-A3B-RFT-bf16.gguf

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M

Limitations

  • Coding-focused: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
  • Bounded by base model: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
  • Small training set: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
  • Eval coverage: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
  • Quantization confound: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
  • DeltaNet targeting: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.

Architecture Notes

Qwen3.6-35B-A3B uses a hybrid architecture:

  • Mixture of Experts (MoE): 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
  • Gated DeltaNet linear attention: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing; the layer interleaving is sketched after this list
  • 262K context window: Supports up to 262,144 tokens
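
A rough illustration of the hybrid layer interleaving described above (the exact position of the full-attention layer within each block of four is an assumption):

NUM_LAYERS = 40

# Assume the last layer in every block of four uses full attention;
# the other three use Gated DeltaNet linear attention.
layer_types = ["full_attention" if (i + 1) % 4 == 0 else "linear_attention"
               for i in range(NUM_LAYERS)]

print(layer_types.count("linear_attention"))  # 30
print(layer_types.count("full_attention"))    # 10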

Citation

If you use this model, please cite:

@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}

License

Apache 2.0 (same as the base model Qwen/Qwen3.6-35B-A3B)
