license: apache-2.0
library_name: transformers
tags:
- rejection-fine-tuning
- self-distillation
- qwen
- qwen3.6
- moe

- task:
    type: text-generation
  dataset:
    name: Self-generated coding dataset (RFT, filtered)
    type: custom
  metrics:
  - name: Train Loss
    type: train_loss
    value: 0.523
  - name: avg_sample_pass_rate (temp=0.7, 13 problems, 10 samples each)
    type: avg_sample_pass_rate
    value: 0.985
---

# Qwen3.6-35B-A3B-SSD

A fine-tuned version of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) using **Rejection Fine-Tuning (RFT) on self-generated data**, inspired by the [Simple Self-Distillation (SSD)](https://arxiv.org/abs/2604.01193) paper. The LoRA adapter has been merged into the base weights -- this is a standard bf16 model ready for direct use or quantization.

> **Note on the repo name**: The repo is named "SSD" because the project started as an SSD replication, but our method deviates from pure SSD in a key way (see below). We kept the name for continuity.

## What We Actually Did (RFT, Not Pure SSD)

Our method is **inspired by** the SSD paper ("Embarrassingly Simple Self-Distillation Improves Code Generation", arXiv 2604.01193) but differs in a critical way:

- **SSD (the paper)**: Generates samples from the model and trains on ALL of them -- correct and incorrect -- with NO filtering. That is the paper's key insight: unfiltered self-generated data still improves pass@k.
- **Our method**: We generated samples at high temperature, then **filtered for correctness** using execution-based verification (2,000 generated, 1,796 passed tests). We trained only on correct outputs.

This makes our method **Rejection Fine-Tuning (RFT)** -- also known as rejection sampling + SFT or on-policy distillation. RFT is a well-established technique. The difference matters: SSD's claim is that filtering is unnecessary; we used filtering, so we cannot validate or invalidate that claim.
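The filtering step above (2,000 generations -> 1,796 kept) can be pictured as a small rejection loop. This is a minimal sketch, assuming a hypothetical `generate_solutions` sampling helper and per-problem unit tests; the actual generation and verification harness is not reproduced here:

```python
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run a candidate solution together with its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_rft_dataset(problems, generate_solutions, n_samples=10, temperature=1.0):
    """Rejection filtering: keep only self-generated samples that pass their tests."""
    kept = []
    for problem in problems:
        # High-temperature sampling from the base model (generate_solutions is hypothetical).
        for code in generate_solutions(problem["prompt"], n=n_samples, temperature=temperature):
            if passes_tests(code, problem["tests"]):
                kept.append({"prompt": problem["prompt"], "completion": code})
    return kept
```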

## Model Details

| Property | Value |
|----------|-------|
| Architecture | Qwen3.5 MoE with Gated DeltaNet linear attention (see note below) |
| Total parameters | 34.66B |
| Active parameters | ~3B (Mixture of Experts, 256 experts, 8 active per token) |
| Hidden layers | 40 (30 linear attention + 10 full attention) |
| Context length | 262,144 tokens |
| License | Apache 2.0 |

> **Architecture note**: The HuggingFace config reports `model_type: qwen3_5_moe` -- Qwen3.6 is built on the Qwen3.5 MoE architecture with the addition of Gated DeltaNet linear attention layers.
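To confirm the reported architecture tag locally (assumes a transformers release that already ships the `qwen3_5_moe` architecture):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("shaneMattner/Qwen3.6-35B-A3B-SSD")
print(config.model_type)  # expected: "qwen3_5_moe", per the note above
```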

## Training Details

### Method

## Evaluation

Tested as a 6-bit MLX quantization on Mac Studio M4 Max (128GB) against the base model:

| Difficulty | Base (4-bit MLX) | RFT-merged (6-bit MLX) |
|------------|------------------|------------------------|
| Hard (8 problems) | 76/80 (95%) | 78/80 (98%) |
| **Overall** | **126/130 (97%)** | **128/130 (98%)** |

The biggest improvement came on the hardest problem (expression evaluator with operator precedence and parentheses): base 7/10 -> merged 9/10.

| Metric | Value |
|--------|-------|
| Inference speed (6-bit MLX) | 78.9 tok/s average |
| Base model speed (4-bit MLX) | 86.7 tok/s average |

**Important caveats**:

- **Quantization confound**: The base model was tested at 4-bit quantization while the merged model was tested at 6-bit. Higher-bit quantization preserves more model information. Some or all of the quality difference (128/130 vs 126/130) may be attributable to quantization level rather than the RFT training. A controlled comparison at matched quantization has not been run.
- **Statistical significance**: The difference of 2/130 samples is not statistically significant (p ~= 0.28, Fisher's exact test). These results are within noise at this sample size.
- **Temp=0 behavior**: At temp=0, the merged model is expected to behave very similarly to the base model, though weights differ due to the LoRA merge. We have not formally tested temp=0 equivalence.
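The `avg_sample_pass_rate` metric in the metadata above is assumed here to be the mean per-problem pass fraction over 13 problems with 10 samples each; with equal samples per problem it equals total passed / total samples. A small sketch with illustrative per-problem counts (not the actual eval log):

```python
def avg_sample_pass_rate(passed_per_problem, samples_per_problem=10):
    """Mean pass fraction across problems."""
    rates = [p / samples_per_problem for p in passed_per_problem]
    return sum(rates) / len(rates)

# Illustrative: 11 problems at 10/10 and 2 at 9/10 gives 128/130 overall.
print(round(avg_sample_pass_rate([10] * 11 + [9] * 2), 3))  # 0.985
```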

## How to Use
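A minimal `transformers` loading sketch (assumes a transformers build that supports this architecture; the generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shaneMattner/Qwen3.6-35B-A3B-SSD"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For GGUF, the bf16 weights can be converted with llama.cpp's converter: `python convert_hf_to_gguf.py shaneMattner/Qwen3.6-35B-A3B-SSD --outtype bf16`.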

## Limitations

- **Coding-focused**: Fine-tuned exclusively on Python coding tasks. General instruction following may not improve (or may slightly regress) compared to the base model.
- **Bounded by base model**: Self-distillation cannot exceed the base model's capability ceiling -- it improves sampling consistency, not peak ability.
- **Small training set**: 1,616 samples is a proof-of-concept. Larger datasets with more diverse problems would likely yield stronger results.
- **Eval coverage**: Tested on 13 coding problems only. Broader benchmarks (HumanEval, MBPP, etc.) have not been run. Results are not statistically significant at this sample size.
- **Quantization confound**: Base and merged models were evaluated at different quantization levels (4-bit vs 6-bit), confounding the quality comparison.
- **DeltaNet targeting**: The in_proj_a and in_proj_b DeltaNet gating layers were not included in LoRA targets -- adding them may improve results in future iterations (see the sketch below).
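A hypothetical sketch of an extended `peft` LoRA target list covering those gating layers (hyperparameters and the other module names are illustrative; the original training configuration is not reproduced here):

```python
from peft import LoraConfig

# Illustrative values -- not the configuration used to train this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (illustrative)
        "in_proj_a", "in_proj_b",                # DeltaNet gating layers (the proposed addition)
    ],
    task_type="CAUSAL_LM",
)
```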

## Architecture Notes

Qwen3.6-35B-A3B uses a hybrid architecture: 30 Gated DeltaNet linear attention layers and 10 full attention layers across its 40 hidden layers (see the Model Details table above).

## Citation

If you use this model, please cite:

```bibtex
@misc{mattner2026qwen36rft,
  title={Qwen3.6-35B-A3B-SSD: Rejection Fine-Tuned Qwen3.6 for Coding},
  author={Shane Mattner},
  year={2026},
  url={https://huggingface.co/shaneMattner/Qwen3.6-35B-A3B-SSD}
}
```

## References

- [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) -- Base model by Qwen team
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) -- Hu et al., 2021
- [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193) -- The SSD paper that inspired this work. Our method deviates from SSD by adding execution-based correctness filtering (making it RFT rather than pure SSD).

## License