GPT-OSS-20B with Attention Repeat (Layers 19-20)

Applying the RYS (Repeat Your Steps) method to an MoE (Mixture-of-Experts) model. Achieves +13.3pp improvement on BBH-style reasoning benchmarks with no training and no weight changes -- only a forward pass modification at inference time.

Key Insight: In MoE Models, Repeat the Dense Part (Attention), Not the Sparse Part (Experts)

The RYS method was originally discovered on dense models (Devstral-24B, Qwen2.5-32B), where duplicating specific transformer layers boosts reasoning ability.

We tested three repetition strategies on GPT-OSS-20B, an MoE model with 32 experts per layer (top-4 routing):

Method	Description	Result
Expert-only repeat	Repeat only the MoE block (Router + Experts)	No effect or degradation
Full-block repeat	Repeat both Attention and MoE	Severe degradation (-20pp)
Attention-only repeat	Repeat only the Attention (dense) component	+13.3pp improvement

Why Attention Repeat Works

Normal:   x -> [Attention] -> [MoE: Router -> Top-4/32 Experts] -> out
Modified: x -> [Attention] -> [Attention (2nd pass)] -> [MoE: Router -> Top-4/32 Experts] -> out

Attention is the dense component -- all parameters are always active for every token
Expert FFN is the sparse component -- only 4 out of 32 experts are active per token
The core mechanism of RYS is amplifying dense reasoning circuits; repeating sparse components has no effect
By running Attention twice, the hidden state passed to the MoE Router is more refined, enabling better expert selection

Benchmark Results

Evaluated on 15 BBH-style reasoning questions covering logical deduction, tracking shuffles, arithmetic, causal judgement, navigation, boolean satisfiability, table reasoning, disambiguation, and Dyck languages.

Sweep Results

Configuration         Score     Delta vs baseline
--------------------------------------------------
attn-L19-20           12/15    +13.3pp  *** BEST
attn-L18-20           11/15    +6.7pp
baseline              10/15     0.0pp
attn-L18-19           10/15     0.0pp
attn-L17               9/15    -6.7pp
attn-L19-21            9/15    -6.7pp
attn-L16-20            8/15   -13.3pp
expert-repeat-L10-12   9/15    -6.7pp
full-block-L10-12      7/15   -20.0pp

Key Observations

Only layers 19-20 are effective -- shifting by even one layer eliminates the benefit (consistent with the original RYS findings)
Expert (sparse) repeat is ineffective -- the same Router selects the same Experts, so no new information is introduced
Full-block repeat is harmful -- re-running Attention with stale KV cache causes decoherence
Too wide a range is also harmful -- repeating 5 layers (L16-20) yields -13.3pp, worse than baseline

Usage

With the loader script (recommended)

from load_model import load_attn_repeat_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_attn_repeat_model()
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt="Your question here", max_tokens=512, sampler=sampler)
print(response)

Manual patching on any GPT-OSS-20B MLX model

import mlx.nn as nn
from mlx_lm import load, generate

class AttentionRepeatWrapper(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def __call__(self, x, mask, cache=None):
        # First attention pass
        residual = x
        h = self.block.input_layernorm(x)
        h = self.block.self_attn(h, mask, cache)
        x = residual + h
        # Second attention pass
        residual = x
        h = self.block.input_layernorm(x)
        h = self.block.self_attn(h, mask, cache)
        x = residual + h
        # Single MoE pass
        residual = x
        h = self.block.post_attention_layernorm(x)
        h = self.block.mlp(h)
        x = residual + h
        return x

model, tokenizer = load("mlx-community/gpt-oss-20b-MXFP4-Q4")
for idx in [19, 20]:
    model.model.layers[idx] = AttentionRepeatWrapper(model.model.layers[idx])

Model Details

Base model: openai/gpt-oss-20b (via mlx-community/gpt-oss-20b-MXFP4-Q4)
Architecture: MoE Transformer, 24 layers, 32 experts/layer, top-4 routing
Total params: ~21B (3.6B active per token)
Modification: Layers 19 and 20 have their attention block repeated (no weight changes)
Quantization: MXFP4 (experts) + Q4 affine (attention)
License: Apache 2.0 (same as base model)

Method

Based on llm-circuit-finder (RYS method), adapted for MoE architectures. The key adaptation is the discovery that in MoE models, the dense component (Attention) should be targeted for repetition rather than the sparse component (Experts).

This suggests that the RYS effect is fundamentally about amplifying dense reasoning circuits, not about increasing model depth in general.

Citation

@misc{gpt-oss-20b-attn-repeat,
  title={Attention Repeat for MoE Models: Applying RYS Method to GPT-OSS-20B},
  author={shi3z},
  year={2026},
  note={Based on RYS method by David Ng and llm-circuit-finder by alainnothere}
}

Downloads last month: 143

Safetensors

Model size

21B params

Tensor type

BF16

U32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support