# Bi-Induct 1B Balanced
This repository contains the Bi-Induct 1B Balanced checkpoint from *Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning*.
This release corresponds to the 1B setting in the paper and is a research checkpoint intended for studying matched-compute pretraining, induction-style curricula, and in-context learning behavior. It is not instruction-tuned, alignment-tuned, or safety-tuned.
## Variant
Bi-Induct balanced curriculum. Each synthetic injection chooses forward-copy or backward-copy with equal probability.
## Model overview

- Architecture: decoder-only Transformer
- Positional encoding: RoPE (`theta=10000`)
- Normalization: pre-norm residual blocks
- MLP: SwiGLU
- Attention: grouped-query / grouped key-value attention
- Precision: bfloat16 training
- Context length: 1024
- Embeddings: untied input/output embeddings
## Model specification
| Field | Value |
|---|---|
| Parameters (paper label) | 1B |
| Layers | 30 |
| Hidden size | 1,536 |
| Intermediate / MLP size | 6,144 |
| Head dimension | 64 |
| Attention heads | 24 |
| KV heads | 6 |
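As a quick consistency check, the head dimension and grouped-query ratio follow directly from the table above. The variable names below are illustrative, not from the released code:

```python
# Sanity-check the 1B spec: head dim and GQA group size follow from the table.
hidden_size = 1536
num_heads = 24
num_kv_heads = 6

head_dim = hidden_size // num_heads
assert head_dim == 64  # matches the "Head dimension" row

queries_per_kv_head = num_heads // num_kv_heads
print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
# → head_dim=64, queries per KV head=4
```

With 6 KV heads serving 24 query heads, each KV head is shared by 4 query heads, which is what shrinks the KV cache relative to full multi-head attention.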
## Training data

All checkpoints in this family were pretrained on the deduplicated Pile in streaming/shuffled mode. A stable MD5-based hash was used to carve out a fixed held-out evaluation slice, reserving 0.2% of the corpus (roughly 0.4B tokens) for evaluation. Sequences were truncated to 1,024 tokens.
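A hash-based split like this can be sketched as follows. This is an illustrative reconstruction: the exact hash key, bucketing scheme, and document identifiers used by the authors are not specified here, only that the split is stable and MD5-based.

```python
import hashlib

def in_eval_slice(doc_id: str, eval_fraction: float = 0.002) -> bool:
    """Deterministically assign ~0.2% of documents to a held-out slice.

    Illustrative sketch of a stable MD5-based split: hash the document id,
    map the first 8 hex digits to [0, 1], and reserve the low bucket.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < eval_fraction

# The same document always lands on the same side of the split.
assert in_eval_slice("doc-123") == in_eval_slice("doc-123")
```

Because the assignment depends only on the hash of the document id, the evaluation slice stays fixed across streaming/shuffled epochs and across runs.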
For the Bi-Induct variants, synthetic snippets were interleaved on top of the natural stream:
- Induction: `[S || SEP || S]`
- Anti-Induction: `[S || SEP || reverse(S)]`
- Balanced: each injection randomly chooses induction or anti-induction
The main cross-scale experiments used span length L = 20 and initial mix ratio m0 = 50%, linearly annealed to zero over the full training budget.
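The injection scheme and the linear anneal above can be sketched in a few lines. The separator token and function names are assumptions for illustration; only the `[S || SEP || S]` / `[S || SEP || reverse(S)]` construction, the 50/50 balanced choice, and the linear anneal from m0 come from the text.

```python
import random

SEP = ["<sep>"]  # separator token; the actual token is an assumption

def make_injection(span, p_forward=0.5):
    """Build one synthetic snippet: [S || SEP || S] (induction) or
    [S || SEP || reverse(S)] (anti-induction), chosen with equal
    probability under the balanced curriculum."""
    if random.random() < p_forward:
        return span + SEP + span                   # forward copy (induction)
    return span + SEP + list(reversed(span))       # backward copy (anti-induction)

def mix_ratio(step, total_steps, m0=0.5):
    """Initial mix ratio m0 = 50%, linearly annealed to zero over the
    full training budget."""
    return m0 * max(0.0, 1.0 - step / total_steps)
```

At each training step, `mix_ratio(step, total_steps)` gives the fraction of the batch drawn from `make_injection` rather than the natural stream; by the end of training the curriculum is pure natural text.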
## Training recipe

- Optimizer: AdamW (`beta1=0.9`, `beta2=0.999`, weight decay `0.1`)
- Learning rate: peak `1e-3`
- Schedule: `3%` linear warmup, then cosine decay
- Update size: `2^16` tokens per update
- Token budget: approximately `20N` tokens, following the Chinchilla-style rule of thumb
- Comparison protocol: iso-FLOPs across curricula at each scale
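A minimal sketch of the stated schedule (3% linear warmup to a peak of 1e-3, then cosine decay). Decaying all the way to zero is an assumption; the recipe does not state a floor learning rate.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_frac=0.03):
    """Learning rate at a given step: 3% linear warmup to peak_lr,
    then cosine decay (assumed here to decay to zero)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With a `20N` budget at 1B parameters (~20B tokens) and `2^16` tokens per update, `total_steps` would be on the order of `20e9 / 65536 ≈ 305k` updates.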
## Evaluation summary for the 1B family

The table below summarizes the main results at this scale. Standard LM benchmarks are evaluated 3-shot, and the Todd et al. function-style probes are evaluated 10-shot with HITS@1.
| Variant | Standard LM ICL composite ↑ | Todd-style ICL composite ↑ | Held-out PPL ↓ |
|---|---|---|---|
| Baseline | 24.2 ± 0.5 | 20.0 ± 1.3 | 14.1 |
| Induction | 23.9 ± 0.5 | 15.2 ± 1.1 | 14.9 |
| Anti-Induction | 23.6 ± 0.4 | 14.7 ± 1.2 | 14.9 |
| Balanced | 24.3 ± 0.3 | 14.9 ± 1.1 | 14.9 |
This checkpoint: Balanced.
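For reference, HITS@1 over the function-style probes is top-1 exact-match accuracy: the fraction of probes where the model's single best completion equals the target. A minimal sketch:

```python
def hits_at_1(predictions, targets):
    """HITS@1: fraction of probes whose top-1 prediction exactly
    matches the target."""
    assert len(predictions) == len(targets) and targets
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# e.g. three "choose middle of 3" probes, two answered correctly
assert hits_at_1(["b", "e", "x"], ["b", "e", "h"]) == 2 / 3
```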
## Benchmarks included

### Standard LM benchmarks
- MMLU
- Winogrande
- CommonSenseQA
- PIQA
- HellaSwag
- TriviaQA-Wiki
- BBH (CoT)
- OpenBookQA
- ARC-Challenge
- GPQA
- GSM-8K
- MathQA
- BoolQ
- LAMBADA
### Todd et al. function-style probes
- alphabetically first 3
- alphabetically first 5
- alphabetically last 3
- alphabetically last 5
- capitalize
- capitalize first letter
- capitalize last letter
- choose first of 3
- choose first of 5
- choose last of 3
- choose last of 5
- choose middle of 3
- choose middle of 5
- lowercase first letter
- lowercase last letter
- next capital letter
- next item
- prev item
- word length
## Example usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "MohammedSabry/biinduct-1b-balanced"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Limitations
- These are research checkpoints, not production chat models.
- They were designed to study the relationship between induction-style telemetry and load-bearing ICL behavior under matched compute.
- The synthetic interventions are intentionally lightweight and token-level; results should not be interpreted as ruling out richer data-rewrite strategies.
- Because Bi-Induct replaces a fraction of natural data under iso-FLOPs, some trade-offs may reflect natural-text displacement in addition to mechanistic redundancy.
## Citation
If you use this model, please cite:
```bibtex
@misc{sabry2026inductionsignaturesenoughmatchedcompute,
  title={Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning},
  author={Mohammed Sabry and Anya Belz},
  year={2026},
  eprint={2509.22947},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.22947},
}
```