---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- rejection-fine-tuning
- self-distillation
- qwen
- qwen3.6
- moe
- deltanet
- linear-attention
- code-generation
- coding
- lora-merged
- bf16
base_model: Qwen/Qwen3.6-35B-A3B
pipeline_tag: text-generation
model-index:
- name: Qwen3.6-35B-A3B-RFT
results:
- task:
type: text-generation
dataset:
name: Self-generated coding dataset (RFT, filtered)
type: custom
metrics:
- name: Train Loss
type: train_loss
value: 0.523
- name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
type: avg_sample_pass_rate
value: 0.985
---
# Qwen3.6-35B-A3B-RFT
A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.
## Method (RFT, Not Pure SSD)
Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv:2604.01193) but differs in a critical way:
- **SSD (the paper)**: Generates samples from the model and trains on **all** of them -- correct and incorrect -- with **no** filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.
This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
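In code, the difference is a single filtering step. A conceptual sketch (`samples` and `passes_tests` are hypothetical placeholders here; a concrete verifier is sketched under Training Details below):

```python
ssd_dataset = samples                                   # SSD: train on everything, unfiltered
rft_dataset = [s for s in samples if passes_tests(s)]   # RFT: keep only verified-correct outputs
```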
## Model Details
| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Precision | bfloat16 |
| Model size on disk | ~64 GB |
| Context length | 262,144 tokens |
| License | Apache 2.0 |
> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
## Training Details
### Method
1. Generated 2,000 coding solutions from the base model at temp=1.6, top_k=20, top_p=0.8
2. Filtered for correctness (execution + test pass) -- 1,796 samples survived (see the sketch after this list)
3. Split into 1,616 train / 180 validation
4. Fine-tuned with LoRA, then merged adapter into base weights
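A minimal sketch of the execution-based filter in step 2, assuming each sample carries its solution code and a runnable test snippet (the actual harness and field names are not published here):

```python
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a generated solution plus its tests in a subprocess; pass = exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

samples = [  # toy stand-in for the 2,000 self-generated samples
    {"solution": "def add(a, b):\n    return a + b", "tests": "assert add(2, 3) == 5"},
]
kept = [s for s in samples if passes_tests(s["solution"], s["tests"])]
print(f"{len(kept)}/{len(samples)} passed")  # in this run: 1,796/2,000
```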
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_z, out_proj |
| Trainable parameters | 19.2M / 34.66B (0.055%) |
The target modules include both standard transformer attention/MLP layers and Qwen3.6's DeltaNet linear attention layers (in_proj_qkv, in_proj_z, out_proj).
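The table above maps directly onto a PEFT `LoraConfig`. A minimal sketch using the standard `peft` API, with module names taken from the table:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        # standard attention / MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # Qwen3.6 Gated DeltaNet linear-attention projections
        "in_proj_qkv", "in_proj_z", "out_proj",
    ],
    task_type="CAUSAL_LM",
)
```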
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 6% of steps |
| Max steps | 150 |
| Batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Max sequence length | 2,048 |
| Weight decay | 0.01 |
| Precision | bfloat16 (no quantization during training) |
| Seed | 42 |
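These hyperparameters correspond roughly to the following `transformers` training arguments. A sketch, not the exact training script: the trainer and optimizer string used in this run are not published, and `adamw_bnb_8bit` is one common 8-bit AdamW option (max sequence length is set on the trainer, not here):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3.6-rft",
    optim="adamw_bnb_8bit",            # 8-bit AdamW (bitsandbytes backend)
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,                 # 6% of steps
    max_steps=150,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch = 32
    weight_decay=0.01,
    bf16=True,
    seed=42,
)
```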
### Training Results
| Metric | Value |
|--------|-------|
| Final train loss | 0.523 |
| Eval loss | 0.482 (at step 150) |
| Token accuracy | 85.9% |
| Training time | 78 min |
| Peak GPU memory | 64.7 GB |
| Hardware | NVIDIA H200 (Modal cloud) |
| Estimated cost | ~$6.20 |
### Merge
Adapter merged into base weights using `PeftModel.merge_and_unload()` from PEFT 0.19.1. The result is a standard HuggingFace model -- no adapter loading required at inference time.
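A minimal sketch of the merge step (the PEFT calls are standard; the adapter path is a hypothetical placeholder):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # hypothetical adapter path
merged = model.merge_and_unload()   # folds the LoRA deltas into the base weights
merged.save_pretrained("Qwen3.6-35B-A3B-RFT")
AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B").save_pretrained("Qwen3.6-35B-A3B-RFT")
```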
## Evaluation
The merged model was tested as a 6-bit MLX quantization on a Mac Studio M4 Max (128 GB) against the base model (an unsloth 4-bit quantization), on 13 coding problems with 10 samples each at temp=0.7:
| Problem difficulty | Base (4-bit) | Merged (6-bit) |
|-------------------|-------------|----------------|
| Easy (5 problems) | 50/50 (100%) | 50/50 (100%) |
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |
The biggest improvement came on the hardest problem (an expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.
| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |
**Important caveats**:
- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. Higher-precision quantization (6-bit vs 4-bit) preserves more of the original weights, so some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
## How to Use
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"shaneMattner/Qwen3.6-35B-A3B-RFT",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-RFT")
messages = [
{"role": "user", "content": "Write a Python function to merge two sorted lists into one sorted list."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With MLX (Apple Silicon)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate
model, tokenizer = load("shaneMattner/Qwen3.6-35B-A3B-RFT")
response = generate(
model,
tokenizer,
prompt="Write a Python function to merge two sorted lists.",
max_tokens=512,
)
print(response)
```
Or quantize first for faster inference:
```bash
# Convert to 6-bit MLX format
python -m mlx_lm.convert \
--hf-path shaneMattner/Qwen3.6-35B-A3B-RFT \
--mlx-path Qwen3.6-35B-A3B-RFT-6bit \
-q --q-bits 6
```
**Note**: If you encounter errors related to `model_type`, you may need to change `"model_type": "qwen3_5_moe_text"` to `"model_type": "qwen3_5_moe"` in `config.json` for mlx-lm compatibility.
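A small sketch of that patch applied to a local copy of the converted model (the path is a placeholder):

```python
import json
import pathlib

cfg_path = pathlib.Path("Qwen3.6-35B-A3B-RFT-6bit/config.json")  # placeholder local path
cfg = json.loads(cfg_path.read_text())
if cfg.get("model_type") == "qwen3_5_moe_text":
    cfg["model_type"] = "qwen3_5_moe"       # value mlx-lm expects
    cfg_path.write_text(json.dumps(cfg, indent=2))
```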
### With llama.cpp / GGUF
Convert to GGUF for use with llama.cpp, Ollama, or other GGUF-compatible tools:
```bash
# Download the model locally (convert_hf_to_gguf.py expects a local directory),
# then clone llama.cpp and convert
huggingface-cli download shaneMattner/Qwen3.6-35B-A3B-RFT --local-dir Qwen3.6-35B-A3B-RFT
python convert_hf_to_gguf.py Qwen3.6-35B-A3B-RFT --outtype bf16 --outfile Qwen3.6-35B-A3B-RFT-bf16.gguf
# Quantize to desired format
./llama-quantize Qwen3.6-35B-A3B-RFT-bf16.gguf Qwen3.6-35B-A3B-RFT-Q4_K_M.gguf Q4_K_M
```
## Limitations
- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations.
## Architecture Notes
Qwen3.6-35B-A3B uses a hybrid architecture:
- **Mixture of Experts (MoE)**: 256 experts with 8 active per token, keeping active compute at ~3B parameters despite 34.66B total
- **Gated DeltaNet linear attention**: 30 of 40 layers use linear attention (every 4th layer uses full attention), enabling efficient long-context processing; see the sketch after this list
- **262K context window**: Supports up to 262,144 tokens
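As an illustration of the hybrid layout, assuming full attention falls on every 4th layer as described above (the exact layer indices are not confirmed here):

```python
# 40 layers: linear attention everywhere except every 4th layer
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(40)
]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))  # 30 10
```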
## Citation
If you use this model, please cite:
```bibtex
@misc{mattner2026qwen36rft,
title={Qwen3.6-35B-A3B-RFT: Rejection Fine-Tuned Qwen3.6 for Coding},
author={Shane Mattner},
year={2026},
url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-RFT}
}
```
### Related Work
- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).
## License
Apache 2.0 (same as the base model [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B))