Qwen3.5-2B: RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-2B-Instruct produced by RYS layer duplication: no training, no weight changes, just routing hidden states through a specific circuit twice.

Based on David Ng's RYS method.


Files

File                                           Layers  Size
Qwen3.5-2B-UD-Q4_K_XL.gguf                     24      1.34 GiB
Qwen3.5-2B-rys_8-11-UD-Q4_K_XL.gguf            28      1.39 GiB
Qwen3.5-2B-rys_7-10_reasoning-UD-Q4_K_XL.gguf  28      1.38 GiB

Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.

Model                           Math   EQ    Reasoning
Base (24 layers)                0.188  0.0   0.118
rys_8-11 (28 layers)            0.125  11.3  0.176
rys_7-10_reasoning (28 layers)  0.062  0.0   0.294
  • Math: Ng's partial-credit scoring on a small GSM8K sample
  • EQ: EQ-Bench-style emotional intelligence score (0–100)
  • Reasoning: fraction correct across causal, date, logic, navigation, and GSM8K probes

rys_8-11 shows the best combined improvement: EQ rises from 0 to 11.3 and reasoning improves from 0.118 to 0.176. rys_7-10_reasoning achieves the highest reasoning score (0.294 vs 0.118 baseline) but at the cost of math and EQ.
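The pooled reasoning fraction could be computed along these lines. This is a minimal sketch; the actual probe harness and its per-category sample sizes are not published here, and the example outcomes below are made up for illustration.

```python
def reasoning_score(results):
    """Pooled fraction correct across all probe categories."""
    outcomes = [ok for probes in results.values() for ok in probes]
    return sum(outcomes) / len(outcomes)

# Hypothetical probe outcomes, illustration only:
example = {
    "causal":     [True, False],
    "date":       [True, True],
    "logic":      [False, True],
    "navigation": [True, False],
    "gsm8k":      [False, False],
}
print(reasoning_score(example))  # 0.5
```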


What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

Normal:   0 → … → 7 → 8 → 9 → 10 → 11 → 12 → … → 23
rys_8-11: 0 → … → 7 → 8 → 9 → 10 → 11
                      → 8 → 9 → 10 → 11 → 12 → … → 23

The model processes the circuit twice, without any weight changes or fine-tuning.
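In list-of-indices terms, the duplicated forward pass just re-walks a span of the layer order. A minimal sketch (the function name is mine, not llama.cpp's; weights are shared, so an index simply appears twice):

```python
def rys_layer_order(n_layers, dup_start, dup_end):
    """Forward-pass layer indices with layers dup_start..dup_end (inclusive) run twice."""
    return list(range(dup_end + 1)) + list(range(dup_start, n_layers))

order = rys_layer_order(24, 8, 11)
print(len(order))  # 28 -- the layer count of the rys_8-11 model
print(order[:16])  # [0, 1, ..., 11, 8, 9, 10, 11]
```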


Hybrid Mamba/attention architecture constraint

Qwen3.5-2B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Gated DeltaNet SSM everywhere else. The architecture repeats 6 times:

3 × (DeltaNet → FFN) → 1 × (Attention → FFN)

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

  • Block size 4 → 24 + 4 = 28 layers (28 ÷ 4 = 7 ✓)
  • Block size 8 → 24 + 8 = 32 layers (32 ÷ 4 = 8 ✓)
  • Block size 3 → 24 + 3 = 27 layers (27 ÷ 4 = 6.75 ✗ → crash)

rys_8-11 duplicates layers 8–11, one complete DeltaNet+Attention unit. rys_7-10_reasoning duplicates layers 7–10, spanning the boundary between two attention intervals.
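Both constraints are easy to check mechanically. A sketch, assuming 0-indexed layers with full attention at indices 3, 7, 11, … (consistent with layer 11 being the attention layer of the 8–11 unit):

```python
ATTN_INTERVAL = 4  # full_attention_interval from the model config

def layer_kind(i):
    # Full attention every 4th layer (0-indexed: 3, 7, 11, ...); Gated DeltaNet otherwise.
    return "attention" if i % ATTN_INTERVAL == ATTN_INTERVAL - 1 else "deltanet"

def surgery_is_valid(n_layers, block_size):
    # Total layer count after duplication must stay divisible by the attention interval.
    return (n_layers + block_size) % ATTN_INTERVAL == 0

print([layer_kind(i) for i in range(8, 12)])  # ['deltanet', 'deltanet', 'deltanet', 'attention']
print(surgery_is_valid(24, 4))  # True  -> 28 layers
print(surgery_is_valid(24, 3))  # False -> 27 layers would crash
```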


How the circuit was found

A two-pass sweep over the 24-layer model:

Pass 1 (8-layer blocks, stride 4, layers 0–16):

  • (4, 12) identified as the hot zone: EQ=11.76, reasoning=0.176

Pass 2 (4-layer blocks, stride 1, layers 4–16):

  • (8, 12) best combined: EQ=11.33, reasoning=0.176
  • (7, 11) best reasoning: reasoning=0.294

Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
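The candidate spans for each pass can be enumerated as half-open (start, end) windows, matching the (4, 12) / (8, 12) notation above. A sketch of the enumeration only; the patching and scoring steps are specific to the author's tooling:

```python
def sweep_windows(block, stride, lo, hi):
    """Half-open (start, end) duplication spans of a given block size within [lo, hi]."""
    return [(s, s + block) for s in range(lo, hi - block + 1, stride)]

pass1 = sweep_windows(block=8, stride=4, lo=0, hi=16)
pass2 = sweep_windows(block=4, stride=1, lo=4, hi=16)
print(pass1)  # [(0, 8), (4, 12), (8, 16)] -- pass 1 candidates, including the (4, 12) hot zone
```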


Usage

llama.cpp / llama-server

# Best combined (EQ + reasoning)
llama-server -m Qwen3.5-2B-rys_8-11-UD-Q4_K_XL.gguf -ngl 99 --port 8080

# Best reasoning
llama-server -m Qwen3.5-2B-rys_7-10_reasoning-UD-Q4_K_XL.gguf -ngl 99 --port 8080

Thinking mode

Qwen3.5 defaults to thinking mode (<think>…</think>). Add /no_think to the system prompt for fast, direct answers:

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user",   "content": "Your question here"}
]
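Wrapped in a full request against llama-server's OpenAI-compatible endpoint (assuming the server is running on port 8080 as in the commands above; the model name field is illustrative, since llama-server serves whichever GGUF it was launched with):

```python
import json
import urllib.request

payload = {
    "model": "qwen3.5-2b-rys",  # illustrative; llama-server uses the loaded GGUF
    "messages": [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Your question here"},
    ],
}

def ask(url="http://localhost:8080/v1/chat/completions"):
    # POST the chat request and return the assistant's reply text.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```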

VRAM requirements

Model weights are ~1.4 GiB (Q4_K_XL, 28 layers). Runs on any modern GPU or CPU.


Credits

RYS layer-duplication method by David Ng.

License

Apache 2.0 (inherited from base model)
