YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
GPT-OSS-20B with Attention Repeat (Layers 19-20)
Applying the RYS (Repeat Your Steps) method to an MoE (Mixture-of-Experts) model. Achieves +13.3pp improvement on BBH-style reasoning benchmarks with no training and no weight changes -- only a forward pass modification at inference time.
Key Insight: In MoE Models, Repeat the Dense Part (Attention), Not the Sparse Part (Experts)
The RYS method was originally discovered on dense models (Devstral-24B, Qwen2.5-32B), where duplicating specific transformer layers boosts reasoning ability.
We tested three repetition strategies on GPT-OSS-20B, an MoE model with 32 experts per layer (top-4 routing):
| Method | Description | Result |
|---|---|---|
| Expert-only repeat | Repeat only the MoE block (Router + Experts) | No effect or degradation |
| Full-block repeat | Repeat both Attention and MoE | Severe degradation (-20pp) |
| Attention-only repeat | Repeat only the Attention (dense) component | +13.3pp improvement |
Why Attention Repeat Works
Normal: x -> [Attention] -> [MoE: Router -> Top-4/32 Experts] -> out
Modified: x -> [Attention] -> [Attention (2nd pass)] -> [MoE: Router -> Top-4/32 Experts] -> out
- Attention is the dense component -- all parameters are always active for every token
- Expert FFN is the sparse component -- only 4 out of 32 experts are active per token
- The core mechanism of RYS is amplifying dense reasoning circuits; repeating sparse components has no effect
- By running Attention twice, the hidden state passed to the MoE Router is more refined, enabling better expert selection
Benchmark Results
Evaluated on 15 BBH-style reasoning questions covering logical deduction, tracking shuffles, arithmetic, causal judgement, navigation, boolean satisfiability, table reasoning, disambiguation, and Dyck languages.
Sweep Results
Configuration Score Delta vs baseline
--------------------------------------------------
attn-L19-20 12/15 +13.3pp *** BEST
attn-L18-20 11/15 +6.7pp
baseline 10/15 0.0pp
attn-L18-19 10/15 0.0pp
attn-L17 9/15 -6.7pp
attn-L19-21 9/15 -6.7pp
attn-L16-20 8/15 -13.3pp
expert-repeat-L10-12 9/15 -6.7pp
full-block-L10-12 7/15 -20.0pp
Key Observations
- Only layers 19-20 are effective -- shifting by even one layer eliminates the benefit (consistent with the original RYS findings)
- Expert (sparse) repeat is ineffective -- the same Router selects the same Experts, so no new information is introduced
- Full-block repeat is harmful -- re-running Attention with stale KV cache causes decoherence
- Too wide a range is also harmful -- repeating 5 layers (L16-20) yields -13.3pp, worse than baseline
Usage
With the loader script (recommended)
from load_model import load_attn_repeat_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load_attn_repeat_model()
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt="Your question here", max_tokens=512, sampler=sampler)
print(response)
Manual patching on any GPT-OSS-20B MLX model
import mlx.nn as nn
from mlx_lm import load, generate
class AttentionRepeatWrapper(nn.Module):
def __init__(self, block):
super().__init__()
self.block = block
def __call__(self, x, mask, cache=None):
# First attention pass
residual = x
h = self.block.input_layernorm(x)
h = self.block.self_attn(h, mask, cache)
x = residual + h
# Second attention pass
residual = x
h = self.block.input_layernorm(x)
h = self.block.self_attn(h, mask, cache)
x = residual + h
# Single MoE pass
residual = x
h = self.block.post_attention_layernorm(x)
h = self.block.mlp(h)
x = residual + h
return x
model, tokenizer = load("mlx-community/gpt-oss-20b-MXFP4-Q4")
for idx in [19, 20]:
model.model.layers[idx] = AttentionRepeatWrapper(model.model.layers[idx])
Model Details
- Base model: openai/gpt-oss-20b (via mlx-community/gpt-oss-20b-MXFP4-Q4)
- Architecture: MoE Transformer, 24 layers, 32 experts/layer, top-4 routing
- Total params: ~21B (3.6B active per token)
- Modification: Layers 19 and 20 have their attention block repeated (no weight changes)
- Quantization: MXFP4 (experts) + Q4 affine (attention)
- License: Apache 2.0 (same as base model)
Method
Based on llm-circuit-finder (RYS method), adapted for MoE architectures. The key adaptation is the discovery that in MoE models, the dense component (Attention) should be targeted for repetition rather than the sparse component (Experts).
This suggests that the RYS effect is fundamentally about amplifying dense reasoning circuits, not about increasing model depth in general.
Citation
@misc{gpt-oss-20b-attn-repeat,
title={Attention Repeat for MoE Models: Applying RYS Method to GPT-OSS-20B},
author={shi3z},
year={2026},
note={Based on RYS method by David Ng and llm-circuit-finder by alainnothere}
}
- Downloads last month
- 143