Qwen3.5-10.5B-Frankenmerge-Opus-4.6-Distill
| Category | Base (Qwen3.5-9B-Base-Q8_0) | Frankenmodel | Δ |
|---|---|---|---|
| Factual Knowledge | 85.0% B | 85.0% B | = |
| Reasoning | 88.0% B | 60.0% C | ↓ −28.0% |
| Coding | 56.0% D | 80.0% B | ↑ +24.0% |
| Instruction Following | 100.0% A | 30.0% F | ↓ −70.0% |
| Language | 100.0% A | 70.0% C | ↓ −30.0% |
| Safety Calibration | 66.7% C | 66.7% C | = |
| Overall | 82.4% B | 65.6% C | ↓ −16.8% |
Method: Layer surgery on Qwen3.5-9B-Base-Q8_0 followed by fine-tuning.
Benchmarks run at temperature=0, seed=42.
Coding capability improved significantly (+24 points), at the cost of instruction following and language performance.
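Temperature 0 amounts to greedy decoding: dividing the logits by a temperature that approaches zero pushes the softmax toward a one-hot argmax, which is why repeated runs are deterministic. A small standalone illustration (not part of any eval harness):

```python
import math

def softmax_with_temperature(logits, t):
    # Divide logits by temperature before softmax; t → 0 sharpens toward argmax.
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_with_temperature([1.0, 2.0, 0.5], t=0.01)
# At very low temperature nearly all probability mass sits on the argmax
# token, so decoding reduces to deterministic greedy selection.
assert max(probs) > 0.999
assert probs.index(max(probs)) == 1
```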
A DIY frankenmerge of Qwen3.5-9B with duplicated reasoning layers, then fine-tuned on high-quality reasoning data. 36 layers instead of 32. ~10.5B parameters. Text-only, thinking mode supported.
What this is
I took llmfan46/Qwen3.5-9B-ultra-heretic (an abliterated Qwen3.5-9B), duplicated layers 24-27 to give it an extra reasoning block, then trained it sequentially on two datasets to make the new layers earn their keep.
The original 9B has 32 layers arranged as 8 blocks of DeltaNet × 3 + Attention × 1. After surgery, it has 36 layers: 9 complete blocks. The duplicated block starts as an exact copy but diverges during training, giving the model more depth for complex reasoning without changing the model's input/output interface.
After the merge, two rounds of SFT with high-rank LoRA (r=128, alpha=256):
- Stage 1: Jackrong/Qwen3.5-reasoning-700x (633 examples) at LR 2e-4. Reasoning distillation from Qwen3.5-27B. Gets the frankenmerge coherent and stabilizes the duplicated layers.
- Stage 2: nohurry/Opus-4.6-Reasoning-3000x-filtered (2,326 examples after filtering) at LR 5e-5. Claude Opus 4.6 reasoning traces. Strengthens the model's actual problem-solving ability.
Why frankenmerge + train?
David Noel Ng's RYS work showed you can top the Open LLM Leaderboard by duplicating middle "reasoning" layers of a model without changing a single weight. The idea: early layers handle input encoding, late layers handle output decoding, and the middle layers do the actual thinking. Give the model more layers to think with, and it thinks better.
RockTalk/Qwen3.5-9B-Franken-L24-27 applied this to Qwen3.5-9B and showed improvements without any post-training. A Reddit post on layer surgery explored similar ideas.
Then I saw Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which showed that distilling structured reasoning from Claude Opus into Qwen3.5 massively reduces the overthinking/looping problem and makes the model more coherent and autonomous.
So the logic was: frankenmerge for extra capacity, then train the new capacity on high-quality reasoning data. Layer surgery gives you the architecture; SFT teaches the duplicated layers what to do with themselves.
The surgery, specifically
Qwen3.5-9B's 32 layers follow a repeating pattern:
```
Block 0: layers  0-3   (DeltaNet, DeltaNet, DeltaNet, Attention)
Block 1: layers  4-7   (DeltaNet, DeltaNet, DeltaNet, Attention)
...
Block 6: layers 24-27  (DeltaNet, DeltaNet, DeltaNet, Attention)  ← duplicated
Block 7: layers 28-31  (DeltaNet, DeltaNet, DeltaNet, Attention)
```

After surgery:

```
Blocks 0-6: layers  0-27  (original, unchanged)
Block 6':   layers 28-31  (deep copy of layers 24-27)
Block 7:    layers 32-35  (original layers 28-31, shifted)
```
The copy is done with copy.deepcopy in PyTorch from clean bf16 weights. No quantization artifacts, no weight key remapping hacks.
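The splice itself is simple index arithmetic on the layer list. A minimal sketch of the operation, with plain Python dicts standing in for the actual `nn.Module` layers (in the real model, `copy.deepcopy` clones the module's weights as well):

```python
import copy

# Stand-in for the 32-layer stack; each element represents one decoder layer.
layers = [{"idx": i} for i in range(32)]

# Deep-copy block 6 (layers 24-27) and splice the copies in right after it.
dup_block = [copy.deepcopy(layers[i]) for i in range(24, 28)]
new_layers = layers[:28] + dup_block + layers[28:]

assert len(new_layers) == 36             # 9 complete blocks of 4
assert new_layers[28] == {"idx": 24}     # copy of layer 24...
assert new_layers[28] is not layers[24]  # ...but a distinct object, free to diverge
assert new_layers[32] == {"idx": 28}     # original block 7, shifted to 32-35
```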
Training details
|  | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | Qwen3.5-reasoning-700x | Opus-4.6-Reasoning-3000x-filtered |
| Examples | 633 | 2326 |
| Learning rate | 2e-4 | 5e-5 |
| Schedule | Cosine | Cosine |
| Epochs | 1 | 1 |
| Effective batch | 8 | 8 |
| LoRA rank | 128 | 128 |
| LoRA alpha | 256 | 256 |
| RSLoRA | Yes | Yes |
| Precision | bf16 | bf16 |
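RSLoRA changes only the adapter's scaling factor: standard LoRA scales the update by alpha/r, while rank-stabilized LoRA uses alpha/√r, which keeps the update magnitude from collapsing at high ranks like r=128. With the values from the table:

```python
import math

r, alpha = 128, 256

standard_scale = alpha / r           # classic LoRA scaling
rslora_scale = alpha / math.sqrt(r)  # rank-stabilized (RSLoRA) scaling

assert standard_scale == 2.0
assert abs(rslora_scale - 22.627) < 1e-3  # much larger effective scale at r=128
```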
Trained on a single GPU using Unsloth, with response-only masking (instruction tokens labeled -100 so the loss ignores them). Training was sequential: Stage 1 completed fully before Stage 2 began, and the LoRA adapters accumulate both stages.
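Response-only masking is the standard trick of setting the label of every prompt token to -100, the ignore index of PyTorch's cross-entropy loss, so only the assistant's tokens contribute gradient. A minimal sketch of the idea (hypothetical helper, not Unsloth's actual API):

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips targets with this value

def mask_prompt_tokens(input_ids, prompt_len):
    """Copy input_ids as labels, blanking out the instruction prefix."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# 5-token sequence where the first 2 tokens are the instruction:
labels = mask_prompt_tokens([101, 7, 8, 9, 102], prompt_len=2)
assert labels == [-100, -100, 8, 9, 102]
```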
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # sampling must be enabled for temperature/top_p/top_k to take effect
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Acknowledgments
This model wouldn't exist without the work of:
- David Noel Ng (dnhkng) for the RYS research proving layer duplication works, and for writing such a clear explanation of the "LLM neuroanatomy" concept
- RockTalk for demonstrating the frankenmerge on Qwen3.5-9B specifically (even though the weights turned out to be 4-bit under the hood, the idea was sound)
- Jackrong for both the Opus-distilled model showing how well reasoning distillation works on Qwen3.5, and for the Qwen3.5-reasoning-700x dataset
- nohurry for the filtered Opus 4.6 reasoning dataset
- llmfan46 for the ultra-heretic abliteration, which gave me a clean, uncensored base to build on
- r/LocalLLaMA for the collective insanity that makes all of this happen
- The Qwen team at Alibaba for the base Qwen3.5 architecture
- Unsloth for making training on a single GPU actually feasible
License
Apache 2.0, same as the base Qwen3.5 model.