Qwen3.5-9B – RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-9B-Instruct produced by RYS layer duplication: no training, no weight changes, just hidden states routed through a specific circuit twice.

Based on David Ng's RYS method.


Files

File                                   Layers  Size
Qwen3.5-9B-UD-Q4_K_XL.gguf             32      5.6 GiB
Qwen3.5-9B-rys_22-25-UD-Q4_K_XL.gguf   36      ~6.3 GiB
Qwen3.5-9B-rys_6-9_eq-UD-Q4_K_XL.gguf  36      ~6.3 GiB

Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.

Model                   Math   EQ    Reasoning
Base (32 layers)        0.563  5.5   0.235
rys_22-25 (36 layers)   0.188  0.0   0.529
rys_6-9_eq (36 layers)  0.369  24.0  0.353
  • Math: Ng's partial-credit scoring on a small GSM8K sample
  • EQ: EQ-Bench-style emotional intelligence score (0–100)
  • Reasoning: fraction correct across causal, date, logic, navigation, and GSM8K probes

rys_22-25 more than doubles baseline reasoning (0.529 vs 0.235), with GSM8K rising from 0.2 to 0.8, at a steep cost to math (0.563 → 0.188) and EQ. rys_6-9_eq raises EQ from 5.5 to 24.0 and also improves reasoning above baseline, at a smaller cost to math.


What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

Normal:    0 → … → 21 → 22 → 23 → 24 → 25 → 26 → … → 31
rys_22-25: 0 → … → 21 → 22 → 23 → 24 → 25
                                          → 22 → 23 → 24 → 25 → 26 → … → 31

The model processes the circuit twice, without any weight changes or fine-tuning.
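Conceptually, the duplication is just a change to the layer execution order. A minimal sketch of that routing (the function name is illustrative; the real surgery patches the GGUF layer path rather than a runtime loop):

```python
def rys_layer_order(n_layers, start, end):
    """Layer execution order with the block [start, end] run twice.
    The weights are shared between both passes, so only the routing
    changes, not the parameter count."""
    block = list(range(start, end + 1))
    return list(range(start)) + block + block + list(range(end + 1, n_layers))

# rys_22-25 on the 32-layer base: layers 22-25 run twice -> 36 steps
order = rys_layer_order(32, 22, 25)
assert len(order) == 36
assert order[22:30] == [22, 23, 24, 25, 22, 23, 24, 25]
```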


Hybrid Mamba/attention architecture constraint

Qwen3.5-9B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Gated DeltaNet SSM everywhere else. The architecture repeats 8 times:

3 × (DeltaNet → FFN) → 1 × (Attention → FFN)

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

  • Block size 4 → 32 + 4 = 36 layers (36 ÷ 4 = 9 ✓)
  • Block size 8 → 32 + 8 = 40 layers (40 ÷ 4 = 10 ✓)
  • Block size 3 → 32 + 3 = 35 layers (35 ÷ 4 = 8.75 ✗ → crash)
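Both the repeating pattern and the divisibility rule are easy to encode. A small sketch (helper names are illustrative, not part of any tool):

```python
FULL_ATTENTION_INTERVAL = 4  # from the Qwen3.5-9B config

def layer_kind(i):
    """Every 4th layer is full attention; the rest are Gated DeltaNet,
    following the 3 x (DeltaNet -> FFN) -> 1 x (Attention -> FFN) pattern."""
    if i % FULL_ATTENTION_INTERVAL == FULL_ATTENTION_INTERVAL - 1:
        return "attention"
    return "deltanet"

def block_size_ok(base_layers, block_size):
    """Layer surgery is only safe when the new total stays divisible
    by the attention interval, preserving the SSM/attention pattern."""
    return (base_layers + block_size) % FULL_ATTENTION_INTERVAL == 0

assert block_size_ok(32, 4) and block_size_ok(32, 8)
assert not block_size_ok(32, 3)  # 35 layers -> pattern breaks -> crash
```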

How the circuit was found

A two-pass sweep over the 32-layer model:

Pass 1 – 8-layer blocks, stride 4, layers 0–28:

  • (8, 16) identified as hot zone: reasoning=0.412

Pass 2 – 4-layer blocks, stride 1, layers 4–28:

  • (22, 26) best reasoning: reasoning=0.529, GSM8K=0.80
  • (6, 10) best EQ: EQ=24.0, reasoning=0.353

Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
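The two passes amount to enumerating candidate blocks before patching and scoring each one. A sketch of that enumeration (the exact endpoint handling is an assumption inferred from the block coordinates above):

```python
def sweep_blocks(block_size, stride, lo, hi):
    """Candidate (start, end) blocks for one sweep pass, where end is
    exclusive and hi is the largest allowed end layer."""
    return [(s, s + block_size) for s in range(lo, hi - block_size + 1, stride)]

pass1 = sweep_blocks(8, 4, 0, 28)  # coarse: 8-layer blocks, stride 4
pass2 = sweep_blocks(4, 1, 4, 28)  # fine:   4-layer blocks, stride 1
# (8, 16) appears in pass1; (22, 26) and (6, 10) appear in pass2
```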


Usage

llama.cpp / llama-server

# Best reasoning
llama-server -m Qwen3.5-9B-rys_22-25-UD-Q4_K_XL.gguf -ngl 99 --port 8080

# Best EQ
llama-server -m Qwen3.5-9B-rys_6-9_eq-UD-Q4_K_XL.gguf -ngl 99 --port 8080

Thinking mode

Qwen3.5 defaults to thinking mode (<think>…</think>). Add /no_think to the system prompt for fast, direct answers:

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user",   "content": "Your question here"}
]
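llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so the payload can be built and POSTed directly; a minimal sketch (build_request is an illustrative helper, not part of any library):

```python
import json

def build_request(prompt, thinking=False):
    """OpenAI-style chat payload for llama-server. With thinking=False,
    the /no_think tag is injected as the system prompt, skipping the
    <think>...</think> phase."""
    messages = []
    if not thinking:
        messages.append({"role": "system", "content": "/no_think"})
    messages.append({"role": "user", "content": prompt})
    return json.dumps({"messages": messages})

# POST the result to http://localhost:8080/v1/chat/completions
# (port 8080 matches the llama-server commands above).
```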

VRAM requirements

Model weights are ~6.3 GiB (Q4_K_XL, 36 layers); budget roughly 7–8 GiB of VRAM once the KV cache is included. llama.cpp can also run the model entirely on CPU, just more slowly.


Credits

RYS layer-duplication method: David Ng.

License

Apache 2.0 (inherited from base model)
