dma2077 commited on
Commit
0e86031
·
verified ·
1 Parent(s): fdcc029

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +141 -0
README.md CHANGED
@@ -1,3 +1,144 @@
1
  ---
2
  license: apache-2.0
 
 
 
 
 
 
 
 
 
 
 
 
3
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - lean4
9
+ - theorem-proving
10
+ - formal-mathematics
11
+ - agentic
12
+ - retrieval-augmented
13
+ datasets:
14
+ - m-a-p/OProofs
15
  ---
16
+
17
+ # OProver
18
+
19
+ > A unified framework for **agentic formal theorem proving** in Lean 4.
20
+
21
+ - 📄 Paper: [OProver: A Unified Framework for Agentic Formal Theorem Proving](https://huggingface.co/papers/2605.17283) ([arXiv](https://arxiv.org/abs/2605.17283))
22
+ - 📚 Collection: [m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)
23
+ - 📦 Dataset: [m-a-p/OProofs](https://huggingface.co/datasets/m-a-p/OProofs)
24
+
25
+ OProver treats theorem proving as a multi-round refinement loop. Given a target
26
+ theorem, the prover retrieves top-k compiler-verified proofs from a memory of
27
+ prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on
28
+ failure—revises the attempt using the compiler feedback in the next round.
29
+ The same retrieval and feedback signals are baked into training, so the
30
+ training-time interface matches the proving-time interaction.
31
+
32
+ ## Highlights
33
+
34
+ - **State-of-the-art among open-weight whole-proof provers.** OProver-32B
35
+ attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and
36
+ PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and
37
+ ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five
38
+ benchmarks despite having 4× fewer parameters.
39
+ - **Agentic proving is in the policy, not bolted on.** Retrieval, multi-turn
40
+ compiler feedback, and iterative repair are all part of the trained policy.
41
+ Ablations show feedback is the dominant driver: removing it costs OProver-32B
42
+ 4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds
43
+ 0.5–1.7 more.
44
+ - **Co-evolving prover and corpus.** Each post-training iteration runs agentic
45
+ rollouts; verified proofs are added to OProofs and re-indexed for retrieval,
46
+ repair trajectories become SFT data, and unresolved hard cases provide RL
47
+ signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 →
48
+ 87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.
49
+
50
+ ## What's in this release
51
+
52
+ The OProver collection ([m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)) bundles the paper,
53
+ the corpus, and seven model checkpoints covering both training stages and
54
+ both model sizes:
55
+
56
+ | Repo | Stage | Size | Note |
57
+ |---|---|---|---|
58
+ | [OProver-8B-Base](https://huggingface.co/m-a-p/OProver-8B-Base) | Continued pretraining only | 8B | Domain-adapted base; before SFT/RL |
59
+ | [OProver-32B-Base](https://huggingface.co/m-a-p/OProver-32B-Base) | Continued pretraining only | 32B | Domain-adapted base; before SFT/RL |
60
+ | [OProver-8B-Round1](https://huggingface.co/m-a-p/OProver-8B-Round1) | Post-training Round 1 | 8B | After first SFT+RL iteration |
61
+ | [OProver-8B-Round2](https://huggingface.co/m-a-p/OProver-8B-Round2) | Post-training Round 2 | 8B | After second SFT+RL iteration |
62
+ | [OProver-32B-Round1](https://huggingface.co/m-a-p/OProver-32B-Round1) | Post-training Round 1 | 32B | After first SFT+RL iteration |
63
+ | **[OProver-8B](https://huggingface.co/m-a-p/OProver-8B)** | Final | 8B | The 8B prover reported in the paper (Round 3) |
64
+ | **[OProver-32B](https://huggingface.co/m-a-p/OProver-32B)** | Final | 32B | The 32B prover reported in the paper (Round 2) |
65
+ | [OProofs](https://huggingface.co/datasets/m-a-p/OProofs) | Dataset | 6.86M proofs | Lean 4 corpus used for CPT, SFT, and the retrieval memory |
66
+
67
+ Use **OProver-8B** or **OProver-32B** for proving. The Base / Round-N
68
+ checkpoints are released for reproducibility and ablation studies.
69
+
70
+ ## Dataset: OProofs
71
+
72
+ OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at
73
+ proving time. It is built from three sources: public Lean resources
74
+ (NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …),
75
+ large-scale autoformalization + agentic proof synthesis from informal math
76
+ mined on Common Crawl and GitHub, and traces from OProver's own agentic
77
+ proving runs.
78
+
79
+ ![OProofs Lean 4 corpus statistics](./assets/oproofs.png)
80
+
81
+ ## Training
82
+
83
+ Two stages, both anchored on OProofs.
84
+
85
+ **1. Continued pretraining (one-time).** A 65B-token mixture of formal
86
+ Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%,
87
+ Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR
88
+ 5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output:
89
+ `OProver-{8B,32B}-Base`.
90
+
91
+ **2. Iterative post-training.** Each round runs:
92
+
93
+ 1. Agentic proving with the current prover on a theorem pool, producing
94
+ multi-round rollouts conditioned on retrieved proofs and Lean feedback.
95
+ 2. **SFT** on round-level repair examples `(s, R, p_{t-1}, f_{t-1}) → p_t`,
96
+ with cross-entropy loss only on the new attempt.
97
+ 3. **GSPO RL** on hard cases (groups with non-trivial pass-rate). Per-round
98
+ reward `r = 0.8 + 0.2·1[format ok]` if Lean-verified, else 0; advantages
99
+ are pooled across the n×R rounds for the same theorem.
100
+ 4. Newly verified proofs and repair trajectories are folded back into OProofs
101
+ and re-indexed into the retrieval memory for the next round.
102
+
103
+ ## Results
104
+
105
+ Pass@32 (n=64) across five Lean 4 benchmarks. **Bold** is best, _underlined_
106
+ is second-best.
107
+
108
+ ![Main results: Pass@32 across five Lean 4 benchmarks](./assets/main_results.png)
109
+
110
+ OProver-32B reaches three best and two second-best across five benchmarks —
111
+ the most top placements of any model in the comparison, despite being a
112
+ 32B dense model versus a 560B MoE or a 671B dense competitor.
113
+
114
+ ## Loading
115
+
116
+ ```python
117
+ from transformers import AutoModelForCausalLM, AutoTokenizer
118
+
119
+ # Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2
120
+ name = "m-a-p/OProver-8B"
121
+ tok = AutoTokenizer.from_pretrained(name)
122
+ model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")
123
+ ```
124
+
125
+ ```python
126
+ from datasets import load_dataset
127
+ ds = load_dataset("m-a-p/OProofs", split="train")
128
+ ```
129
+
130
+ OProver is trained against a multi-round agentic interface: at each round the
131
+ input includes the target Lean statement, top-k retrieved verified proofs, the
132
+ prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for
133
+ the prompt template, and Appendix B for serialization details.
134
+
135
+ ## Citation
136
+
137
+ ```bibtex
138
+ @article{ma2026oprover,
139
+ title = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
140
+ author = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
141
+ journal = {arXiv preprint arXiv:2605.17283},
142
+ year = {2026}
143
+ }
144
+ ```