Qwen3.5-10B-Frankenmerge-Opus-4.6-Distill

Category               Base (Qwen3.5-9B-Base-Q8_0)   Frankenmerge-Opus-4.6-Distill   Δ
Factual Knowledge      85.0%  (B)                    85.0%  (B)                      =
Reasoning              88.0%  (B)                    60.0%  (C)                      ↓ −28.0%
Coding                 56.0%  (D)                    80.0%  (B)                      ↑ +24.0%
Instruction Following  100.0% (A)                    30.0%  (F)                      ↓ −70.0%
Language               100.0% (A)                    70.0%  (C)                      ↓ −30.0%
Safety Calibration     66.7%  (C)                    66.7%  (C)                      =
Overall                82.4%  (B)                    65.6%  (C)                      ↓ −16.8%

Method: Layer surgery on Qwen3.5-9B-Base-Q8_0 followed by fine-tuning.
Benchmarks run at temperature=0, seed=42.
Coding improved significantly (+24%) at the cost of instruction following (−70%) and language tasks (−30%).


This model was converted to GGUF format using Unsloth.

Example usage:

  • For text-only LLMs: llama-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja
  • For multimodal models: llama-mtmd-cli -hf JackBinary/Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill-GGUF --jinja

Available Model files:

  • Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q6_K.gguf
  • Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q8_0.gguf
  • Qwen-3.5-10B-Frankenmerge-Opus-4.6-Distill.Q4_K_M.gguf

A DIY frankenmerge of Qwen3.5-9B with duplicated reasoning layers, then fine-tuned on high-quality reasoning data. 36 layers instead of 32. ~10B parameters. Text-only, thinking mode supported.

What this is

I took llmfan46/Qwen3.5-9B-ultra-heretic (an abliterated Qwen3.5-9B), duplicated layers 24-27 to give it an extra reasoning block, then trained it sequentially on two datasets to make the new layers earn their keep.

The original 9B has 32 layers arranged as 8 blocks of DeltaNet × 3 + Attention × 1. After surgery, it has 36 layers: 9 complete blocks. The duplicated block starts as an exact copy but diverges during training, giving the model more depth for complex reasoning while leaving the embeddings, output head, and tokenizer untouched.
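The layer arithmetic can be sanity-checked in a few lines. The 3-DeltaNet + 1-Attention block pattern is taken from the description above; the helper name is mine:

```python
def layer_kind(idx: int) -> str:
    # Assumed repeating pattern: each 4-layer block is 3x DeltaNet + 1x Attention.
    return "Attention" if idx % 4 == 3 else "DeltaNet"

pre = [layer_kind(i) for i in range(32)]   # original stack: 8 blocks
post = [layer_kind(i) for i in range(36)]  # duplicating a full block preserves the pattern
print(pre.count("Attention"), post.count("Attention"))  # 8 9
```

Because a complete 4-layer block is duplicated (not an arbitrary slice), the attention layer stays in the same position within every block.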

After the merge, two rounds of SFT with high-rank LoRA (r=128, alpha=256):

  1. Stage 1: Jackrong/Qwen3.5-reasoning-700x (633 examples) at LR 2e-4. Reasoning distillation from Qwen3.5-27B. Gets the frankenmerge coherent and stabilizes the duplicated layers.
  2. Stage 2: nohurry/Opus-4.6-Reasoning-3000x-filtered (2326 examples after filtering) at LR 5e-5. Claude Opus 4.6 reasoning traces. Strengthens the model's actual problem-solving ability.
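A side note on why rank-stabilized LoRA matters at r=128: standard LoRA scales the adapter update by α/r, while RSLoRA uses α/√r, which keeps the update magnitude from vanishing at high rank. A quick sketch with the values used here:

```python
import math

r, alpha = 128, 256
standard_scaling = alpha / r            # classic LoRA: 256/128 = 2.0
rslora_scaling = alpha / math.sqrt(r)   # RSLoRA: 256/sqrt(128), roughly 11x larger
print(standard_scaling, round(rslora_scaling, 1))
```

This is why the same α that would be tame at low rank still produces meaningful updates at rank 128.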

Why frankenmerge + train?

David Noel Ng's RYS work showed you can top the Open LLM Leaderboard by duplicating the middle "reasoning" layers of a model without changing a single weight. The idea: early layers handle input encoding, late layers handle output decoding, and the middle layers do the actual thinking. Give the model more layers to think with, and it thinks better.

RockTalk/Qwen3.5-9B-Franken-L24-27 applied this to Qwen3.5-9B and showed improvements without any post-training. A reddit post on layer surgery explored similar ideas.

Then I saw Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which showed that distilling structured reasoning from Claude Opus into Qwen3.5 massively reduces the overthinking/looping problem and makes the model more coherent and autonomous.

So the logic was: frankenmerge for extra capacity, then train the new capacity on high-quality reasoning data. Layer surgery gives you the architecture; SFT teaches the duplicated layers what to do with themselves.

The surgery, specifically

Qwen3.5-9B's 32 layers follow a repeating pattern:

Block 0: layers  0- 3  (DeltaNet, DeltaNet, DeltaNet, Attention)
Block 1: layers  4- 7  (DeltaNet, DeltaNet, DeltaNet, Attention)
...
Block 6: layers 24-27  (DeltaNet, DeltaNet, DeltaNet, Attention)  ← duplicated
Block 7: layers 28-31  (DeltaNet, DeltaNet, DeltaNet, Attention)

After surgery:

Blocks 0-6: layers  0-27  (original, unchanged)
Block 6':  layers 28-31  (deep copy of layers 24-27)
Block 7:   layers 32-35  (original layers 28-31, shifted)

The copy is done with copy.deepcopy in PyTorch from clean bf16 weights. No quantization artifacts, no weight key remapping hacks.
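A minimal sketch of the splice, using a toy stack of linear layers in place of the real decoder blocks (the helper name and toy dimensions are mine; the actual surgery applies the same slicing to the model's decoder layer list):

```python
import copy
import torch
import torch.nn as nn

def duplicate_block(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Deep-copy layers[start..end] and splice the copy in right after the original block."""
    block_copy = [copy.deepcopy(layer) for layer in layers[start : end + 1]]
    return nn.ModuleList(list(layers[: end + 1]) + block_copy + list(layers[end + 1 :]))

# Toy stand-in: 32 linear layers play the role of the 32 decoder layers.
stack = nn.ModuleList(nn.Linear(8, 8) for _ in range(32))
grown = duplicate_block(stack, 24, 27)  # duplicate block 6 (layers 24-27)
print(len(grown))  # 36
```

The copies start with identical weights but are independent modules, so they can diverge freely during the SFT stages.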

Training details

                 Stage 1                  Stage 2
Dataset          Qwen3.5-reasoning-700x   Opus-4.6-Reasoning-3000x-filtered
Examples         633                      2326
Learning rate    2e-4                     5e-5
Schedule         Cosine                   Cosine
Epochs           1                        1
Effective batch  8                        8
LoRA rank        128                      128
LoRA alpha       256                      256
RSLoRA           Yes                      Yes
Precision        bf16                     bf16

Trained on a single GPU using Unsloth. Response-only masking (instruction tokens masked with -100). Sequential training: Stage 1 completes fully before Stage 2 begins, and the LoRA adapters accumulate both stages.
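Response-only masking boils down to setting every label before the assistant response to -100, the index that PyTorch's cross-entropy loss ignores. A simplified illustration (the helper name and toy token IDs are mine):

```python
IGNORE_INDEX = -100  # tokens labeled -100 are skipped by the cross-entropy loss

def mask_prompt(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Toy sequence: 3 instruction tokens followed by 3 response tokens.
print(mask_prompt([11, 12, 13, 7, 8, 9], prompt_len=3))  # [-100, -100, -100, 7, 8, 9]
```

The effect is that gradients flow only from response tokens, so the model learns to answer rather than to reproduce prompts.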

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen3.5-9B-Franken-L24-27-Reasoning",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Acknowledgments

This model wouldn't exist without the work of:

  • David Noel Ng (dnhkng) for the RYS research proving layer duplication works, and for writing such a clear explanation of the "LLM neuroanatomy" concept
  • RockTalk for demonstrating the frankenmerge on Qwen3.5-9B specifically (even though the weights turned out to be 4-bit under the hood, the idea was sound)
  • Jackrong for both the Opus-distilled model showing how well reasoning distillation works on Qwen3.5, and for the Qwen3.5-reasoning-700x dataset
  • nohurry for the filtered Opus 4.6 reasoning dataset
  • llmfan46 for the ultra-heretic abliteration, which gave me a clean, uncensored base to build on
  • r/LocalLLaMA for the collective insanity that makes all of this happen
  • The Qwen team at Alibaba for the base Qwen3.5 architecture
  • Unsloth for making training on a single GPU actually feasible

License

Apache 2.0, same as the base Qwen3.5 model.
