Text Generation
Transformers
Safetensors
English
qwen3
lean4
theorem-proving
formal-mathematics
agentic
retrieval-augmented
conversational
text-generation-inference
Instructions to use m-a-p/OProver-8B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use m-a-p/OProver-8B-Base with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="m-a-p/OProver-8B-Base") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("m-a-p/OProver-8B-Base") model = AutoModelForCausalLM.from_pretrained("m-a-p/OProver-8B-Base") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use m-a-p/OProver-8B-Base with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "m-a-p/OProver-8B-Base" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/m-a-p/OProver-8B-Base
- SGLang
How to use m-a-p/OProver-8B-Base with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use m-a-p/OProver-8B-Base with Docker Model Runner:
docker model run hf.co/m-a-p/OProver-8B-Base
File size: 6,642 Bytes
2758888 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 | ---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- lean4
- theorem-proving
- formal-mathematics
- agentic
- retrieval-augmented
datasets:
- m-a-p/OProofs
---
# OProver
> A unified framework for **agentic formal theorem proving** in Lean 4.
- 📄 Paper: [OProver: A Unified Framework for Agentic Formal Theorem Proving](https://huggingface.co/papers/2605.17283) ([arXiv](https://arxiv.org/abs/2605.17283))
- 📚 Collection: [m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)
- 📦 Dataset: [m-a-p/OProofs](https://huggingface.co/datasets/m-a-p/OProofs)
OProver treats theorem proving as a multi-round refinement loop. Given a target
theorem, the prover retrieves top-k compiler-verified proofs from a memory of
prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on
failure—revises the attempt using the compiler feedback in the next round.
The same retrieval and feedback signals are baked into training, so the
training-time interface matches the proving-time interaction.
## Highlights
- **State-of-the-art among open-weight whole-proof provers.** OProver-32B
attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and
PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and
ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five
benchmarks despite having 4× fewer parameters.
- **Agentic proving is in the policy, not bolted on.** Retrieval, multi-turn
compiler feedback, and iterative repair are all part of the trained policy.
Ablations show feedback is the dominant driver: removing it costs OProver-32B
4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds
0.5–1.7 more.
- **Co-evolving prover and corpus.** Each post-training iteration runs agentic
rollouts; verified proofs are added to OProofs and re-indexed for retrieval,
repair trajectories become SFT data, and unresolved hard cases provide RL
signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 →
87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.
## What's in this release
The OProver collection ([m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)) bundles the paper,
the corpus, and seven model checkpoints covering both training stages and
both model sizes:
| Repo | Stage | Size | Note |
|---|---|---|---|
| [OProver-8B-Base](https://huggingface.co/m-a-p/OProver-8B-Base) | Continued pretraining only | 8B | Domain-adapted base; before SFT/RL |
| [OProver-32B-Base](https://huggingface.co/m-a-p/OProver-32B-Base) | Continued pretraining only | 32B | Domain-adapted base; before SFT/RL |
| [OProver-8B-Round1](https://huggingface.co/m-a-p/OProver-8B-Round1) | Post-training Round 1 | 8B | After first SFT+RL iteration |
| [OProver-8B-Round2](https://huggingface.co/m-a-p/OProver-8B-Round2) | Post-training Round 2 | 8B | After second SFT+RL iteration |
| [OProver-32B-Round1](https://huggingface.co/m-a-p/OProver-32B-Round1) | Post-training Round 1 | 32B | After first SFT+RL iteration |
| **[OProver-8B](https://huggingface.co/m-a-p/OProver-8B)** | Final | 8B | The 8B prover reported in the paper (Round 3) |
| **[OProver-32B](https://huggingface.co/m-a-p/OProver-32B)** | Final | 32B | The 32B prover reported in the paper (Round 2) |
| [OProofs](https://huggingface.co/datasets/m-a-p/OProofs) | Dataset | 6.86M proofs | Lean 4 corpus used for CPT, SFT, and the retrieval memory |
Use **OProver-8B** or **OProver-32B** for proving. The Base / Round-N
checkpoints are released for reproducibility and ablation studies.
## Dataset: OProofs
OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at
proving time. It is built from three sources: public Lean resources
(NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …),
large-scale autoformalization + agentic proof synthesis from informal math
mined on Common Crawl and GitHub, and traces from OProver's own agentic
proving runs.

## Training
Two stages, both anchored on OProofs.
**1. Continued pretraining (one-time).** A 65B-token mixture of formal
Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%,
Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR
5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output:
`OProver-{8B,32B}-Base`.
**2. Iterative post-training.** Each round runs:
1. Agentic proving with the current prover on a theorem pool, producing
multi-round rollouts conditioned on retrieved proofs and Lean feedback.
2. **SFT** on round-level repair examples `(s, R, p_{t-1}, f_{t-1}) → p_t`,
with cross-entropy loss only on the new attempt.
3. **GSPO RL** on hard cases (groups with non-trivial pass-rate). Per-round
reward `r = 0.8 + 0.2·1[format ok]` if Lean-verified, else 0; advantages
are pooled across the n×R rounds for the same theorem.
4. Newly verified proofs and repair trajectories are folded back into OProofs
and re-indexed into the retrieval memory for the next round.
## Results
Pass@32 (n=64) across five Lean 4 benchmarks. **Bold** is best, _underlined_
is second-best.

OProver-32B reaches three best and two second-best across five benchmarks —
the most top placements of any model in the comparison, despite being a
32B dense model versus a 560B MoE or a 671B dense competitor.
## Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2
name = "m-a-p/OProver-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")
```
```python
from datasets import load_dataset
ds = load_dataset("m-a-p/OProofs", split="train")
```
OProver is trained against a multi-round agentic interface: at each round the
input includes the target Lean statement, top-k retrieved verified proofs, the
prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for
the prompt template, and Appendix B for serialization details.
## Citation
```bibtex
@article{ma2026oprover,
title = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
author = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
journal = {arXiv preprint arXiv:2605.17283},
year = {2026}
}
```
|