# Qwen3.6-35B-A3B-RFT
A fine-tuned version of Qwen/Qwen3.6-35B-A3B using Rejection Fine-Tuning (RFT) on self-generated data, inspired by the Simple Self-Distillation (SSD) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.
## Method (RFT, Not Pure SSD)

Our method is inspired by the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv 2604.01193) but differs in a critical way:
- SSD (the paper): Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- Our method: We generated samples at high temperature, then filtered for correctness using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.
This makes our method Rejection Fine-Tuning (RFT) -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
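The correctness gate that separates RFT from pure SSD is just "execute and compare". A minimal sketch of the idea -- the `passes_tests` helper and the toy candidates below are hypothetical stand-ins for the real sampler and test harness:

```python
def passes_tests(code: str, tests: list, fn_name: str) -> bool:
    """Execute a candidate solution and check it against unit tests."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # run the candidate's definitions
        fn = namespace[fn_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                   # any crash counts as a rejection

# Toy stand-ins for high-temperature model samples on "add two numbers".
candidates = [
    "def add(a, b):\n    return a + b",   # correct -> kept
    "def add(a, b):\n    return a - b",   # wrong   -> rejected
    "def add(a, b):\n    return a * b",   # wrong   -> rejected
]
tests = [((1, 2), 3), ((0, 5), 5)]

kept = [c for c in candidates if passes_tests(c, tests, "add")]
print(f"{len(kept)}/{len(candidates)} samples survived filtering")  # 1/3
```

Training then proceeds as plain SFT on the surviving samples only.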
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |
Architecture note: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
## Training Details

### Method
- Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8
- Filtered for correctness (execution + test pass) -- 1,796 samples survived
- Split into 1,616 train / 180 validation
- Fine-tuned with LoRA, then merged adapter into base weights
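The 1,796 -> 1,616/180 split corresponds to holding out roughly 10% for validation. A deterministic sketch of such a split -- the 10% fraction and the seed here are assumptions for illustration, not recorded pipeline settings:

```python
import random

def split_dataset(samples, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then carve off a validation slice."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

samples = list(range(1796))            # stand-ins for the verified solutions
train, val = split_dataset(samples)
print(len(train), len(val))            # 1616 180
```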
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |
The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
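The trainable-parameter figure can be sanity-checked by hand: a rank-r LoRA adapter on a (d_out x d_in) linear layer adds r * (d_in + d_out) weights for its two low-rank factors. The layer dimension below is an illustrative stand-in, not a real Qwen3.6 shape:

```python
def lora_params(d_out: int, d_in: int, r: int = 16) -> int:
    """Weights added by one rank-r LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# e.g. a hypothetical 2048x2048 projection at rank 16:
print(lora_params(2048, 2048))                 # 65536

# The reported total, 19.2M of 34.66B parameters, is ~0.055%:
print(round(19.2e6 / 34.66e9 * 100, 3))        # 0.055
```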
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
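Putting the schedule numbers together: 6% warmup over 150 steps is 9 warmup steps, then cosine decay from the 2e-4 peak. A self-contained sketch of that schedule (the exact warmup rounding inside the training framework may differ slightly):

```python
import math

def lr_at(step, max_steps=150, peak=2e-4, warmup_frac=0.06):
    """Linear warmup followed by cosine decay to zero."""
    warmup = int(max_steps * warmup_frac)       # 9 steps here
    if step < warmup:
        return peak * (step + 1) / warmup       # linear warmup
    progress = (step - warmup) / (max_steps - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(f"{lr_at(0):.1e}")     # early warmup, well below peak
print(f"{lr_at(9):.1e}")     # 2.0e-04 -- peak right after warmup
print(f"{lr_at(149):.1e}")   # near zero at the final step
```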
### Training Results
| Metric | Value |
|---|---|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |
### Merge

The LoRA adapter was merged into the base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading is required at inference time.
## Evaluation

The merged model was tested as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB), against the base model as an Unsloth 4-bit quantization: 13 coding problems, 10 samples each at temp=0.7.
| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|---|---|---|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| Overall | 126/130 (97%) | 128/130 (98%) |
Biggest improvement on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.
| Metric | Value |
|---|---|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |
Important caveats:
- Quantization confound: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. Higher bit-width preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than to the RFT training. A controlled comparison at matched quantization has not been run.
- Statistical significance: The difference of 2/130 samples is not statistically significant (p ≈ 0.68, two-sided Fisher's exact test on the pooled pass/fail counts). These results are within noise at this sample size.
- Temp=0 behavior: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
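The significance check is easy to reproduce with only the standard library. The helper below is a generic two-sided Fisher's exact test over the pooled 2x2 pass/fail table, written for this card rather than taken from the evaluation harness:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    def p_table(x):                    # hypergeometric prob. of cell a == x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Base: 126 pass / 4 fail; merged: 128 pass / 2 fail.
print(round(fisher_exact_two_sided(126, 4, 128, 2), 2))   # 0.68
```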
## How to Use

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "shaneMattner/Qwen3.6-35B-A3B-RFT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")

messages = [
    {"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With MLX (Apple Silicon)

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to merge two sorted lists.",
    max_tokens=512,
)
print(response)
```
Or quantize first for faster inference:
```shell
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
  --hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
  --mlx-path Qwen3.6-35B-A3B-RFT-6bit \
  -q --q-bits 6
```
Note: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
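That tweak can be scripted against a local snapshot of the model; the helper and the example path below are illustrative, not part of the release:

```python
import json

def fix_model_type(cfg_path):
    """Patch config.json's model_type for mlx-lm compatibility."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") == "qwen3_5_moe_text":
        cfg["model_type"] = "qwen3_5_moe"
        with open(cfg_path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg["model_type"]

# e.g. fix_model_type("Qwen3.6-35B-A3B-RFT/config.json")
```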
### With llama.cpp / GGUF

Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:

```shell
# Clone llama.cpp and convert
python convert_hf_to_gguf.py shaneMattner/Qwen3.6-35B-A3B-RFT --outtype bf16

# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```
## Limitations
- Coding-focused: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- Bounded by base model: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- Small training set: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- Eval coverage: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- Quantization confound: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- DeltaNet targeting: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.
## Architecture Notes
Qwen3.6-35B-A3B uses a hybrid architecture:
- Mixture of Experts (MoE): 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- Gated DeltaNet linear attention: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing
- 262K context window: Supports up to 262,144 tokens
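The 30/10 layer split follows from the interleaving pattern: with one full-attention layer per block of four, 40 layers yield 10 full-attention and 30 linear-attention layers. A one-liner sketch (the exact placement of the full-attention layer within each block of four is an assumption):

```python
# Assumed pattern: every 4th layer is full attention, the rest Gated DeltaNet.
layers = ["full" if (i + 1) % 4 == 0 else "linear" for i in range(40)]
print(layers.count("linear"), layers.count("full"))   # 30 10
```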
## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```
## Related Work
- Qwen3.6-35B-A3B -- Base model by Qwen team
- LoRA: Low-Rank Adaptation of Large Language Models -- Hu et al., 2021
- Embarrassingly Simple Self-Distillation Improves Code Generation -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).
## License
Apache 2.0 (same as the base model Qwen/Qwen3.6-35B-A3B)