# Qwen3.5-10.5B-Frankenmerge-Opus-4.6-Distill

| Category | Base (Qwen3.5-9B-Base-Q8_0) | Frankenmodel | Δ |
|---|---|---|---|
| Factual Knowledge | 85.0% (B) | 85.0% (B) | = |
| Reasoning | 88.0% (B) | 60.0% (C) | ↓ −28.0% |
| Coding | 56.0% (D) | 80.0% (B) | ↑ +24.0% |
| Instruction Following | 100.0% (A) | 30.0% (F) | ↓ −70.0% |
| Language | 100.0% (A) | 70.0% (C) | ↓ −30.0% |
| Safety Calibration | 66.7% (C) | 66.7% (C) | = |
| **Overall** | 82.4% (B) | 65.6% (C) | ↓ −16.8% |

Method: layer surgery on Qwen3.5-9B-Base-Q8_0 followed by fine-tuning.
Benchmarks were run at temperature=0, seed=42.
Coding improved significantly (+24%) at the cost of instruction following and language.


A DIY frankenmerge of Qwen3.5-9B with duplicated reasoning layers, then fine-tuned on high-quality reasoning data. 36 layers instead of 32. ~10.5B parameters. Text-only, thinking mode supported.

## What this is

I took llmfan46/Qwen3.5-9B-ultra-heretic (an abliterated Qwen3.5-9B), duplicated layers 24-27 to give it an extra reasoning block, then trained it sequentially on two datasets to make the new layers earn their keep.

The original 9B has 32 layers arranged as 8 blocks of DeltaNet × 3 + Attention × 1. After surgery, it has 36 layers: 9 complete blocks. The duplicated block starts as an exact copy but diverges during training, giving the model more depth for complex reasoning without changing anything about the input/output behavior.
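The block layout above can be sketched in a few lines (a minimal illustration; the layer-type names come from this description, not from the model config):

```python
# The repeating 4-layer block, as described above.
PATTERN = ["DeltaNet", "DeltaNet", "DeltaNet", "Attention"]

layers = PATTERN * 8                   # 8 blocks -> 32 layers before surgery
assert len(layers) == 32
assert layers.count("Attention") == 8  # one attention layer per block
```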

After the merge, two rounds of SFT with high-rank LoRA (r=128, alpha=256):

  1. **Stage 1:** Jackrong/Qwen3.5-reasoning-700x (633 examples) at LR 2e-4. Reasoning distillation from Qwen3.5-27B; gets the frankenmerge coherent and stabilizes the duplicated layers.
  2. **Stage 2:** nohurry/Opus-4.6-Reasoning-3000x-filtered (2,326 examples after filtering) at LR 5e-5. Claude Opus 4.6 reasoning traces; strengthens the model's actual problem-solving ability.
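RSLoRA matters at this rank: standard LoRA scales the adapter update by alpha/r, while rank-stabilized LoRA uses alpha/sqrt(r), so high-rank adapters are not under-scaled. A quick sketch with the r=128, alpha=256 values used here:

```python
import math

# Standard LoRA scales the adapter update by alpha / r; rank-stabilized
# LoRA (RSLoRA) uses alpha / sqrt(r) so high ranks aren't under-scaled.
r, alpha = 128, 256
standard_scale = alpha / r           # 2.0
rslora_scale = alpha / math.sqrt(r)  # ~22.63, a much stronger update
```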

## Why frankenmerge + train?

David Noel Ng's RYS work showed you can top the Open LLM Leaderboard by duplicating middle "reasoning" layers of a model without changing a single weight. The idea: early layers handle input encoding, late layers handle output decoding, and the middle layers do the actual thinking. Give the model more layers to think with, it thinks better.

RockTalk/Qwen3.5-9B-Franken-L24-27 applied this to Qwen3.5-9B and showed improvements without any post-training. A reddit post on layer surgery explored similar ideas.

Then I saw Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which showed that distilling structured reasoning from Claude Opus into Qwen3.5 massively reduces the overthinking/looping problem and makes the model more coherent and autonomous.

So the logic was: frankenmerge for extra capacity, then train the new capacity on high-quality reasoning data. Layer surgery gives you the architecture; SFT teaches the duplicated layers what to do with themselves.

## The surgery, specifically

Qwen3.5-9B's 32 layers follow a repeating pattern:

```
Block 0: layers  0- 3  (DeltaNet, DeltaNet, DeltaNet, Attention)
Block 1: layers  4- 7  (DeltaNet, DeltaNet, DeltaNet, Attention)
...
Block 6: layers 24-27  (DeltaNet, DeltaNet, DeltaNet, Attention)  ← duplicated
Block 7: layers 28-31  (DeltaNet, DeltaNet, DeltaNet, Attention)
```

After surgery:

```
Blocks 0-6: layers  0-27  (original, unchanged)
Block 6':   layers 28-31  (deep copy of layers 24-27)
Block 7:    layers 32-35  (original layers 28-31, shifted)
```

The copy is done with copy.deepcopy in PyTorch from clean bf16 weights. No quantization artifacts, no weight key remapping hacks.
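The splice itself is simple list surgery. A minimal sketch using plain Python lists as stand-ins for the model's layer list (on a real checkpoint this would operate on something like `model.model.layers`, an `nn.ModuleList` — the module path is an assumption — and the spliced layers would also need their indices renumbered in the state dict):

```python
import copy

def duplicate_block(layers, start, end):
    """Deep-copy layers[start:end+1] and splice the copy in right after it."""
    block = [copy.deepcopy(layer) for layer in layers[start : end + 1]]
    return layers[: end + 1] + block + layers[end + 1 :]

layers = list(range(32))                  # stand-ins for the 32 original layers
merged = duplicate_block(layers, 24, 27)  # duplicate block 6 (layers 24-27)
assert len(merged) == 36
assert merged[28:32] == [24, 25, 26, 27]  # block 6': the deep copy
assert merged[32:] == [28, 29, 30, 31]    # old block 7, shifted to 32-35
```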

## Training details

|  | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | Qwen3.5-reasoning-700x | Opus-4.6-Reasoning-3000x-filtered |
| Examples | 633 | 2326 |
| Learning rate | 2e-4 | 5e-5 |
| Schedule | Cosine | Cosine |
| Epochs | 1 | 1 |
| Effective batch | 8 | 8 |
| LoRA rank | 128 | 128 |
| LoRA alpha | 256 | 256 |
| RSLoRA | Yes | Yes |
| Precision | bf16 | bf16 |

Trained on a single GPU using Unsloth, with response-only masking (instruction tokens set to -100 in the labels). Training was sequential: Stage 1 completed fully before Stage 2 began, and the LoRA adapters accumulated both stages.
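Response-only masking boils down to one idea: label positions set to -100 are skipped by the cross-entropy loss, so the model is only trained to produce the response tokens. A simplified illustration (Unsloth provides a helper for this; the sketch just shows the mechanism):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels, masking out the instruction/prompt tokens."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

ids = [101, 9, 7, 42, 42, 42]  # 3 prompt tokens + 3 response tokens
assert mask_prompt(ids, 3) == [-100, -100, -100, 42, 42, 42]
```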

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # required for temperature/top_p/top_k to take effect
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Acknowledgments

This model wouldn't exist without the work of:

  • David Noel Ng (dnhkng) for the RYS research proving layer duplication works, and for writing such a clear explanation of the "LLM neuroanatomy" concept
  • RockTalk for demonstrating the frankenmerge on Qwen3.5-9B specifically (even though the weights turned out to be 4-bit under the hood, the idea was sound)
  • Jackrong for both the Opus-distilled model showing how well reasoning distillation works on Qwen3.5, and for the Qwen3.5-reasoning-700x dataset
  • nohurry for the filtered Opus 4.6 reasoning dataset
  • llmfan46 for the ultra-heretic abliteration, which gave me a clean, uncensored base to build on
  • r/LocalLLaMA for the collective insanity that makes all of this happen
  • The Qwen team at Alibaba for the base Qwen3.5 architecture
  • Unsloth for making training on a single GPU actually feasible

## License

Apache 2.0, same as the base Qwen3.5 model.
