File size: 6,642 Bytes
e6fd416
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - lean4
  - theorem-proving
  - formal-mathematics
  - agentic
  - retrieval-augmented
datasets:
  - m-a-p/OProofs
---

# OProver

> A unified framework for **agentic formal theorem proving** in Lean 4.

- 📄 Paper: [OProver: A Unified Framework for Agentic Formal Theorem Proving](https://huggingface.co/papers/2605.17283) ([arXiv](https://arxiv.org/abs/2605.17283))
- 📚 Collection: [m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)
- 📦 Dataset: [m-a-p/OProofs](https://huggingface.co/datasets/m-a-p/OProofs)

OProver treats theorem proving as a multi-round refinement loop. Given a target
theorem, the prover retrieves top-k compiler-verified proofs from a memory of
prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on
failure—revises the attempt using the compiler feedback in the next round.
The same retrieval and feedback signals are baked into training, so the
training-time interface matches the proving-time interaction.

## Highlights

- **State-of-the-art among open-weight whole-proof provers.** OProver-32B
  attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and
  PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and
  ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five
  benchmarks despite having 4× fewer parameters.
- **Agentic proving is in the policy, not bolted on.** Retrieval, multi-turn
  compiler feedback, and iterative repair are all part of the trained policy.
  Ablations show feedback is the dominant driver: removing it costs OProver-32B
  4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds
  0.5–1.7 more.
- **Co-evolving prover and corpus.** Each post-training iteration runs agentic
  rollouts; verified proofs are added to OProofs and re-indexed for retrieval,
  repair trajectories become SFT data, and unresolved hard cases provide RL
  signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 →
  87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.

## What's in this release

The OProver collection ([m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)) bundles the paper,
the corpus, and seven model checkpoints covering both training stages and
both model sizes:

| Repo | Stage | Size | Note |
|---|---|---|---|
| [OProver-8B-Base](https://huggingface.co/m-a-p/OProver-8B-Base) | Continued pretraining only | 8B | Domain-adapted base; before SFT/RL |
| [OProver-32B-Base](https://huggingface.co/m-a-p/OProver-32B-Base) | Continued pretraining only | 32B | Domain-adapted base; before SFT/RL |
| [OProver-8B-Round1](https://huggingface.co/m-a-p/OProver-8B-Round1) | Post-training Round 1 | 8B | After first SFT+RL iteration |
| [OProver-8B-Round2](https://huggingface.co/m-a-p/OProver-8B-Round2) | Post-training Round 2 | 8B | After second SFT+RL iteration |
| [OProver-32B-Round1](https://huggingface.co/m-a-p/OProver-32B-Round1) | Post-training Round 1 | 32B | After first SFT+RL iteration |
| **[OProver-8B](https://huggingface.co/m-a-p/OProver-8B)** | Final | 8B | The 8B prover reported in the paper (Round 3) |
| **[OProver-32B](https://huggingface.co/m-a-p/OProver-32B)** | Final | 32B | The 32B prover reported in the paper (Round 2) |
| [OProofs](https://huggingface.co/datasets/m-a-p/OProofs) | Dataset | 6.86M proofs | Lean 4 corpus used for CPT, SFT, and the retrieval memory |

Use **OProver-8B** or **OProver-32B** for proving. The Base / Round-N
checkpoints are released for reproducibility and ablation studies.

## Dataset: OProofs

OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at
proving time. It is built from three sources: public Lean resources
(NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …),
large-scale autoformalization + agentic proof synthesis from informal math
mined on Common Crawl and GitHub, and traces from OProver's own agentic
proving runs.

![OProofs Lean 4 corpus statistics](./assets/oproofs.png)

## Training

Two stages, both anchored on OProofs.

**1. Continued pretraining (one-time).** A 65B-token mixture of formal
Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%,
Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR
5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output:
`OProver-{8B,32B}-Base`.

**2. Iterative post-training.** Each round runs:

1. Agentic proving with the current prover on a theorem pool, producing
   multi-round rollouts conditioned on retrieved proofs and Lean feedback.
2. **SFT** on round-level repair examples `(s, R, p_{t-1}, f_{t-1}) → p_t`,
   with cross-entropy loss only on the new attempt.
3. **GSPO RL** on hard cases (groups with non-trivial pass-rate). Per-round
   reward `r = 0.8 + 0.2·1[format ok]` if Lean-verified, else 0; advantages
   are pooled across the n×R rounds for the same theorem.
4. Newly verified proofs and repair trajectories are folded back into OProofs
   and re-indexed into the retrieval memory for the next round.

## Results

Pass@32 (n=64) across five Lean 4 benchmarks. **Bold** is best, _underlined_
is second-best.

![Main results: Pass@32 across five Lean 4 benchmarks](./assets/main_results.png)

OProver-32B reaches three best and two second-best across five benchmarks —
the most top placements of any model in the comparison, despite being a
32B dense model versus a 560B MoE or a 671B dense competitor.

## Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2
name = "m-a-p/OProver-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")
```

```python
from datasets import load_dataset
ds = load_dataset("m-a-p/OProofs", split="train")
```

OProver is trained against a multi-round agentic interface: at each round the
input includes the target Lean statement, top-k retrieved verified proofs, the
prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for
the prompt template, and Appendix B for serialization details.

## Citation

```bibtex
@article{ma2026oprover,
  title   = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
  author  = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
  journal = {arXiv preprint arXiv:2605.17283},
  year    = {2026}
}
```