license: apache-2.0
library_name: transformers
tags:
- rejection-fine-tuning
- self-distillation
- qwen
- qwen3.6
- moe

- task:
    type: text-generation
  dataset:
    name: Self-generated coding dataset (RFT, filtered)
    type: custom
  metrics:
  - name: Train Loss
    type: train_loss
    value: 0.523
  - name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
    type: avg_sample_pass_rate
    value: 0.985
---

# Qwen3.6-35B-A3B-SSD

A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.

> **Note on the repo name**: The repo is named "SSD" because the project started as an SSD replication, but our method deviates from pure SSD in a key way (see below). We kept the name for continuity.

## What We Actually Did (RFT, Not Pure SSD)

Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv 2604.01193) but differs in a critical way:

- **SSD (the paper)**: Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.

This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
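The filtering step above (2,000 generations -> 1,796 kept) can be pictured as a small rejection loop. This is a minimal sketch, assuming a hypothetical `generate_solutions` sampling helper and per-problem unit tests; the actual generation and verification harness is not reproduced here:

```python
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a candidate solution together with its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_rft_dataset(problems, generate_solutions, n_samples=10, temperature=1.0):
    """Rejection filtering: keep only self-generated samples that pass their tests."""
    kept = []
    for problem in problems:
        # High-temperature sampling from the base model (generate_solutions is hypothetical).
        for code in generate_solutions(problem["prompt"], n=n_samples, temperature=temperature):
            if passes_tests(code, problem["tests"]):
                kept.append({"prompt": problem["prompt"], "completion": code})
    return kept
```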

## Model Details

| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Context length | 262,144 tokens |
| License | Apache 2.0 |

> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
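To confirm the reported architecture tag locally (assumes a transformers release that already ships the `qwen3_5_moe` architecture):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-SSD")
print(config.model_type)  # expected: "qwen3_5_moe", per the note above
```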

## Training Details

### Method

## Evaluation

Tested as a 6-bit MLX quantization on Mac Studio M4 Max (128GB) against the base model:

| Difficulty | Base (4-bit MLX) | RFT-merged (6-bit MLX) |
|------------|------------------|------------------------|
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |

The biggest improvement came on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.

| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |

**Important caveats**:

- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. Higher-bit quantization preserves more model information. Some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
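The `avg_sample_pass_rate` metric in the metadata above is assumed here to be the mean per-problem pass fraction over 13 problems with 10 samples each; with equal samples per problem it equals total passed / total samples. A small sketch with illustrative per-problem counts (not the actual eval log):

```python
def avg_sample_pass_rate(passed_per_problem, samples_per_problem=10):
    """Mean pass fraction across problems."""
    rates = [p / samples_per_problem for p in passed_per_problem]
    return sum(rates) / len(rates)

# Illustrative: 11 problems at 10/10 and 2 at 9/10 gives 128/130 overall.
print(round(avg_sample_pass_rate([10] * 11 + [9] * 2), 3))  # 0.985
```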

## How to Use
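A minimal `transformers` loading sketch (assumes a transformers build that supports this architecture; the generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shaneMattner/Qwen3.6-35B-A3B-SSD"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For GGUF, the bf16 weights can be converted with llama.cpp's converter: `python convert_hf_to_gguf.py shaneMattner/Qwen3.6-35B-A3B-SSD --outtype bf16`.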

## Limitations

- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations (see the sketch below).
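A hypothetical sketch of an extended `peft` LoRA target list covering those gating layers (hyperparameters and the other module names are illustrative; the original training configuration is not reproduced here):

```python
from peft import LoraConfig

# Illustrative values -- not the configuration used to train this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (illustrative)
        "in_proj_a", "in_proj_b",                # DeltaNet gating layers (the proposed addition)
    ],
    task_type="CAUSAL_LM",
)
```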

## Architecture Notes

Qwen3.6-35B-A3B uses a hybrid architecture: 30 Gated DeltaNet linear attention layers and 10 full attention layers across its 40 hidden layers (see the Model Details table above).

## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-SSD: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-SSD}
}
```

## References

- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).

## License