Qwen3.5-4B - RYS Layer Surgery (GGUF)

Two modified versions of Qwen3.5-4B-Instruct produced by RYS layer duplication: no training, no weight changes, just routing hidden states through a specific circuit twice.

Based on David Ng's RYS method.


Files

File                                     Layers  Size
Qwen3.5-4B-UD-Q4_K_XL.gguf               32      2.8 GiB
Qwen3.5-4B-rys_27-30-UD-Q4_K_XL.gguf     36      3.0 GiB
Qwen3.5-4B-rys_28-31_eq-UD-Q4_K_XL.gguf  36      3.0 GiB

Probe scores

Scores from an internal sweep benchmark run during circuit search. Sample sizes are small, so treat these as directional indicators, not definitive benchmarks.

Model                     Math   EQ    Reasoning
Base (32 layers)          0.375  0.0   0.529
rys_27-30 (36 layers)     0.188  0.0   0.588
rys_28-31_eq (36 layers)  0.000  80.7  0.000

  • Math: Ng's partial-credit scoring on a small GSM8K sample
  • EQ: EQ-Bench-style emotional intelligence score (0-100)
  • Reasoning: fraction correct across causal, date, logic, navigation, and GSM8K probes

rys_27-30 improves reasoning above the already-strong baseline (0.588 vs 0.529), with GSM8K rising from 0.2 to 0.8, though its math probe score drops (0.188 vs 0.375). rys_28-31_eq achieves EQ=80.7, the highest EQ score observed across all model sizes in this sweep, but math and reasoning both collapse to zero.


What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:

Normal:     0 → … → 26 → 27 → 28 → 29 → 30 → 31
rys_27-30:  0 → … → 26 → 27 → 28 → 29 → 30
                          → 27 → 28 → 29 → 30 → 31

The model processes the circuit twice, without any weight changes or fine-tuning.
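The diagram above amounts to a reordering of layer indices. A minimal sketch of that routing, assuming a hypothetical `rys_route` helper (this is illustrative Python, not llama.cpp's actual graph-building code):

```python
def rys_route(n_layers: int, start: int, end: int) -> list[int]:
    """Layer visit order for an RYS duplication of the half-open block
    [start, end): run up through the end of the circuit, replay the
    circuit, then continue with the remaining layers. The weights are
    shared; only the visit order changes."""
    base = list(range(n_layers))
    return base[:end] + base[start:end] + base[end:]

order = rys_route(32, 27, 31)   # duplicate layers 27-30 inclusive
print(len(order))               # 36 layers in the forward pass
print(order[27:])               # [27, 28, 29, 30, 27, 28, 29, 30, 31]
```

The 36-layer GGUF files above simply bake this visit order into the layer path, which is why they are only ~0.2 GiB larger than the base file.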


Hybrid Mamba/attention architecture constraint

Qwen3.5-4B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Gated DeltaNet SSM everywhere else. The architecture repeats 8 times:

3 × (DeltaNet → FFN) → 1 × (Attention → FFN)

This creates a hard constraint on layer surgery: the total layer count must remain divisible by 4.

  • Block size 4 → 32 + 4 = 36 layers (36 ÷ 4 = 9 ✓)
  • Block size 8 → 32 + 8 = 40 layers (40 ÷ 4 = 10 ✓)
  • Block size 3 → 32 + 3 = 35 layers (35 ÷ 4 = 8.75 ✗ → crash)
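The constraint can be checked mechanically. A small sketch that builds the repeating layer pattern and validates candidate block sizes (the layer-type labels and helper names are illustrative, not llama.cpp identifiers):

```python
# Qwen3.5-4B hybrid layout: 3 DeltaNet layers, then 1 full-attention
# layer, repeated 8 times (full_attention_interval = 4).
FULL_ATTENTION_INTERVAL = 4

def layer_types(n_layers: int) -> list[str]:
    """Every 4th layer (indices 3, 7, 11, ...) is full attention."""
    return [
        "attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "deltanet"
        for i in range(n_layers)
    ]

def surgery_is_valid(n_layers: int, block_size: int) -> bool:
    """An RYS duplication preserves the repeating pattern only if the
    new total layer count stays divisible by the attention interval."""
    return (n_layers + block_size) % FULL_ATTENTION_INTERVAL == 0

print(layer_types(32)[:4])        # ['deltanet', 'deltanet', 'deltanet', 'attention']
print(surgery_is_valid(32, 4))    # True  -> 36 layers
print(surgery_is_valid(32, 8))    # True  -> 40 layers
print(surgery_is_valid(32, 3))    # False -> 35 layers, crash
```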

How the circuit was found

A two-pass sweep over the 32-layer model:

Pass 1 - 8-layer blocks, stride 4, layers 0-32:

  • (12, 20) best combined: EQ=58.4, reasoning=0.412
  • (24, 32) best EQ: EQ=63.1

Pass 2 - 4-layer blocks, stride 1, layers 8-32:

  • (28, 32) best EQ: EQ=80.7 (highest across all model sizes in this sweep)
  • (27, 31) best reasoning: reasoning=0.588 (exceeds baseline of 0.529), GSM8K=0.80

Each configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.
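The candidate sets for both passes are small enough to enumerate exhaustively. A sketch of the block enumeration only (`candidate_blocks` is a hypothetical helper; the scoring step needs a running llama-server and is omitted):

```python
def candidate_blocks(lo: int, hi: int, size: int, stride: int) -> list[tuple[int, int]]:
    """All (start, end) duplication candidates of a given block size
    that fit inside the layer range [lo, hi]."""
    return [(s, s + size) for s in range(lo, hi - size + 1, stride)]

# Pass 1: 8-layer blocks, stride 4, over layers 0-32
pass1 = candidate_blocks(0, 32, 8, 4)
# Pass 2: 4-layer blocks, stride 1, over layers 8-32
pass2 = candidate_blocks(8, 32, 4, 1)

print(len(pass1))           # 7 configurations to score
print(len(pass2))           # 21 configurations to score
print((28, 32) in pass2)    # True  (best EQ)
print((27, 31) in pass2)    # True  (best reasoning)
```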


Usage

llama.cpp / llama-server

# Best reasoning
llama-server -m Qwen3.5-4B-rys_27-30-UD-Q4_K_XL.gguf -ngl 99 --port 8080

# Best EQ
llama-server -m Qwen3.5-4B-rys_28-31_eq-UD-Q4_K_XL.gguf -ngl 99 --port 8080

Thinking mode

Qwen3.5 defaults to thinking mode (<think>…</think>). Add /no_think to the system prompt for fast, direct answers:

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user",   "content": "Your question here"}
]
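With llama-server running, the messages above go to its OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch of the request (the `model` value is an assumption for illustration; the actual POST is left commented out so the payload can be inspected offline):

```python
import json

# Request payload for llama-server's OpenAI-compatible chat endpoint.
# "/no_think" in the system prompt disables Qwen3.5's thinking mode.
payload = {
    "model": "Qwen3.5-4B-rys_27-30-UD-Q4_K_XL",  # assumed name; server ignores or echoes it
    "messages": [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Your question here"},
    ],
    "temperature": 0.7,
}

print(json.dumps(payload, indent=2))

# To actually send it (requires llama-server on port 8080, as launched above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```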

VRAM requirements

Model weights are ~3.0 GiB (Q4_K_XL, 36 layers). Runs on any modern GPU or CPU.


Credits

License

Apache 2.0 (inherited from base model)
