Instructions to use m-a-p/OProver-8B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use m-a-p/OProver-8B-Base with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="m-a-p/OProver-8B-Base") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("m-a-p/OProver-8B-Base") model = AutoModelForCausalLM.from_pretrained("m-a-p/OProver-8B-Base") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use m-a-p/OProver-8B-Base with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "m-a-p/OProver-8B-Base" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/m-a-p/OProver-8B-Base
- SGLang
How to use m-a-p/OProver-8B-Base with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Base", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use m-a-p/OProver-8B-Base with Docker Model Runner:
docker model run hf.co/m-a-p/OProver-8B-Base
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- lean4
- theorem-proving
- formal-mathematics
- agentic
- retrieval-augmented
datasets:
- m-a-p/OProofs
OProver
A unified framework for agentic formal theorem proving in Lean 4.
- 📄 Paper: OProver: A Unified Framework for Agentic Formal Theorem Proving (arXiv)
- 📚 Collection: m-a-p/OProver
- 📦 Dataset: m-a-p/OProofs
OProver treats theorem proving as a multi-round refinement loop. Given a target theorem, the prover retrieves top-k compiler-verified proofs from a memory of prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on failure—revises the attempt using the compiler feedback in the next round. The same retrieval and feedback signals are baked into training, so the training-time interface matches the proving-time interaction.
Highlights
- State-of-the-art among open-weight whole-proof provers. OProver-32B attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five benchmarks despite having 4× fewer parameters.
- Agentic proving is in the policy, not bolted on. Retrieval, multi-turn compiler feedback, and iterative repair are all part of the trained policy. Ablations show feedback is the dominant driver: removing it costs OProver-32B 4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds 0.5–1.7 more.
- Co-evolving prover and corpus. Each post-training iteration runs agentic rollouts; verified proofs are added to OProofs and re-indexed for retrieval, repair trajectories become SFT data, and unresolved hard cases provide RL signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 → 87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3.
What's in this release
The OProver collection (m-a-p/OProver) bundles the paper, the corpus, and seven model checkpoints covering both training stages and both model sizes:
| Repo | Stage | Size | Note |
|---|---|---|---|
| OProver-8B-Base | Continued pretraining only | 8B | Domain-adapted base; before SFT/RL |
| OProver-32B-Base | Continued pretraining only | 32B | Domain-adapted base; before SFT/RL |
| OProver-8B-Round1 | Post-training Round 1 | 8B | After first SFT+RL iteration |
| OProver-8B-Round2 | Post-training Round 2 | 8B | After second SFT+RL iteration |
| OProver-32B-Round1 | Post-training Round 1 | 32B | After first SFT+RL iteration |
| OProver-8B | Final | 8B | The 8B prover reported in the paper (Round 3) |
| OProver-32B | Final | 32B | The 32B prover reported in the paper (Round 2) |
| OProofs | Dataset | 6.86M proofs | Lean 4 corpus used for CPT, SFT, and the retrieval memory |
Use OProver-8B or OProver-32B for proving. The Base / Round-N checkpoints are released for reproducibility and ablation studies.
Dataset: OProofs
OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at proving time. It is built from three sources: public Lean resources (NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …), large-scale autoformalization + agentic proof synthesis from informal math mined on Common Crawl and GitHub, and traces from OProver's own agentic proving runs.
Training
Two stages, both anchored on OProofs.
1. Continued pretraining (one-time). A 65B-token mixture of formal
Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%,
Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR
5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output:
OProver-{8B,32B}-Base.
2. Iterative post-training. Each round runs:
- Agentic proving with the current prover on a theorem pool, producing multi-round rollouts conditioned on retrieved proofs and Lean feedback.
- SFT on round-level repair examples
(s, R, p_{t-1}, f_{t-1}) → p_t, with cross-entropy loss only on the new attempt. - GSPO RL on hard cases (groups with non-trivial pass-rate). Per-round
reward
r = 0.8 + 0.2·1[format ok]if Lean-verified, else 0; advantages are pooled across the n×R rounds for the same theorem. - Newly verified proofs and repair trajectories are folded back into OProofs and re-indexed into the retrieval memory for the next round.
Results
Pass@32 (n=64) across five Lean 4 benchmarks. Bold is best, underlined is second-best.
OProver-32B reaches three best and two second-best across five benchmarks — the most top placements of any model in the comparison, despite being a 32B dense model versus a 560B MoE or a 671B dense competitor.
Loading
from transformers import AutoModelForCausalLM, AutoTokenizer
# Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2
name = "m-a-p/OProver-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto")
from datasets import load_dataset
ds = load_dataset("m-a-p/OProofs", split="train")
OProver is trained against a multi-round agentic interface: at each round the input includes the target Lean statement, top-k retrieved verified proofs, the prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for the prompt template, and Appendix B for serialization details.
Citation
@article{ma2026oprover,
title = {OProver: A Unified Framework for Agentic Formal Theorem Proving},
author = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang},
journal = {arXiv preprint arXiv:2605.17283},
year = {2026}
}

