Text Generation
Transformers
Safetensors
English
qwen3
lean4
theorem-proving
formal-mathematics
agentic
retrieval-augmented
conversational
text-generation-inference
Instructions to use m-a-p/OProver-8B-Round2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use m-a-p/OProver-8B-Round2 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="m-a-p/OProver-8B-Round2") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("m-a-p/OProver-8B-Round2") model = AutoModelForCausalLM.from_pretrained("m-a-p/OProver-8B-Round2") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use m-a-p/OProver-8B-Round2 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "m-a-p/OProver-8B-Round2" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Round2", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/m-a-p/OProver-8B-Round2
- SGLang
How to use m-a-p/OProver-8B-Round2 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Round2" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Round2", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "m-a-p/OProver-8B-Round2" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "m-a-p/OProver-8B-Round2", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use m-a-p/OProver-8B-Round2 with Docker Model Runner:
docker model run hf.co/m-a-p/OProver-8B-Round2
| license: apache-2.0 | |
| language: | |
| - en | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| tags: | |
| - lean4 | |
| - theorem-proving | |
| - formal-mathematics | |
| - agentic | |
| - retrieval-augmented | |
| datasets: | |
| - m-a-p/OProofs | |
| # OProver | |
| > A unified framework for **agentic formal theorem proving** in Lean 4. | |
| - 📄 Paper: [OProver: A Unified Framework for Agentic Formal Theorem Proving](https://huggingface.co/papers/2605.17283) ([arXiv](https://arxiv.org/abs/2605.17283)) | |
| - 📚 Collection: [m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover) | |
| - 📦 Dataset: [m-a-p/OProofs](https://huggingface.co/datasets/m-a-p/OProofs) | |
| OProver treats theorem proving as a multi-round refinement loop. Given a target | |
| theorem, the prover retrieves top-k compiler-verified proofs from a memory of | |
| prior proofs, generates a proof attempt, runs the Lean 4 compiler, and—on | |
| failure—revises the attempt using the compiler feedback in the next round. | |
| The same retrieval and feedback signals are baked into training, so the | |
| training-time interface matches the proving-time interaction. | |
| ## Highlights | |
| - **State-of-the-art among open-weight whole-proof provers.** OProver-32B | |
| attains the best Pass@32 on MiniF2F (93.3), ProverBench (58.2), and | |
| PutnamBench (11.3), and the second-best on MathOlympiad (22.8) and | |
| ProofNet (33.2). Even OProver-8B beats Goedel-Prover-V2-32B on all five | |
| benchmarks despite having 4× fewer parameters. | |
| - **Agentic proving is in the policy, not bolted on.** Retrieval, multi-turn | |
| compiler feedback, and iterative repair are all part of the trained policy. | |
| Ablations show feedback is the dominant driver: removing it costs OProver-32B | |
| 4.9–7.4 Pass@32 points across benchmarks; removing retrieval on top adds | |
| 0.5–1.7 more. | |
| - **Co-evolving prover and corpus.** Each post-training iteration runs agentic | |
| rollouts; verified proofs are added to OProofs and re-indexed for retrieval, | |
| repair trajectories become SFT data, and unresolved hard cases provide RL | |
| signal. MiniF2F-Test Pass@32 grows monotonically: OProver-8B 79.5 → 86.2 → | |
| 87.0 → 91.8 across rounds; OProver-32B 84.7 → 88.1 → 93.3. | |
| ## What's in this release | |
| The OProver collection ([m-a-p/OProver](https://huggingface.co/collections/m-a-p/oprover)) bundles the paper, | |
| the corpus, and seven model checkpoints covering both training stages and | |
| both model sizes: | |
| | Repo | Stage | Size | Note | | |
| |---|---|---|---| | |
| | [OProver-8B-Base](https://huggingface.co/m-a-p/OProver-8B-Base) | Continued pretraining only | 8B | Domain-adapted base; before SFT/RL | | |
| | [OProver-32B-Base](https://huggingface.co/m-a-p/OProver-32B-Base) | Continued pretraining only | 32B | Domain-adapted base; before SFT/RL | | |
| | [OProver-8B-Round1](https://huggingface.co/m-a-p/OProver-8B-Round1) | Post-training Round 1 | 8B | After first SFT+RL iteration | | |
| | [OProver-8B-Round2](https://huggingface.co/m-a-p/OProver-8B-Round2) | Post-training Round 2 | 8B | After second SFT+RL iteration | | |
| | [OProver-32B-Round1](https://huggingface.co/m-a-p/OProver-32B-Round1) | Post-training Round 1 | 32B | After first SFT+RL iteration | | |
| | **[OProver-8B](https://huggingface.co/m-a-p/OProver-8B)** | Final | 8B | The 8B prover reported in the paper (Round 3) | | |
| | **[OProver-32B](https://huggingface.co/m-a-p/OProver-32B)** | Final | 32B | The 32B prover reported in the paper (Round 2) | | |
| | [OProofs](https://huggingface.co/datasets/m-a-p/OProofs) | Dataset | 6.86M proofs | Lean 4 corpus used for CPT, SFT, and the retrieval memory | | |
| Use **OProver-8B** or **OProver-32B** for proving. The Base / Round-N | |
| checkpoints are released for reproducibility and ablation studies. | |
| ## Dataset: OProofs | |
| OProofs is a large-scale Lean 4 corpus that doubles as the retrieval memory at | |
| proving time. It is built from three sources: public Lean resources | |
| (NuminaMath-LEAN, Lean-Workbook, Leanabell-FormalStmt, Goedel-Pset, …), | |
| large-scale autoformalization + agentic proof synthesis from informal math | |
| mined on Common Crawl and GitHub, and traces from OProver's own agentic | |
| proving runs. | |
|  | |
| ## Training | |
| Two stages, both anchored on OProofs. | |
| **1. Continued pretraining (one-time).** A 65B-token mixture of formal | |
| Lean (≈30%, from OProofs), code (≈20%, OpenCoder), mathematics (≈40%, | |
| Nemotron-Math-4-Plus), and long-CoT (≈10%, ProLong-64K). AdamW, peak LR | |
| 5e-5, cosine with 3% warmup, batch 512, sequence length 8192. Output: | |
| `OProver-{8B,32B}-Base`. | |
| **2. Iterative post-training.** Each round runs: | |
| 1. Agentic proving with the current prover on a theorem pool, producing | |
| multi-round rollouts conditioned on retrieved proofs and Lean feedback. | |
| 2. **SFT** on round-level repair examples `(s, R, p_{t-1}, f_{t-1}) → p_t`, | |
| with cross-entropy loss only on the new attempt. | |
| 3. **GSPO RL** on hard cases (groups with non-trivial pass-rate). Per-round | |
| reward `r = 0.8 + 0.2·1[format ok]` if Lean-verified, else 0; advantages | |
| are pooled across the n×R rounds for the same theorem. | |
| 4. Newly verified proofs and repair trajectories are folded back into OProofs | |
| and re-indexed into the retrieval memory for the next round. | |
| ## Results | |
| Pass@32 (n=64) across five Lean 4 benchmarks. **Bold** is best, _underlined_ | |
| is second-best. | |
|  | |
| OProver-32B reaches three best and two second-best across five benchmarks — | |
| the most top placements of any model in the comparison, despite being a | |
| 32B dense model versus a 560B MoE or a 671B dense competitor. | |
| ## Loading | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # Pick any of: OProver-8B, OProver-32B, *-Base, *-Round1, *-Round2 | |
| name = "m-a-p/OProver-8B" | |
| tok = AutoTokenizer.from_pretrained(name) | |
| model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="bfloat16", device_map="auto") | |
| ``` | |
| ```python | |
| from datasets import load_dataset | |
| ds = load_dataset("m-a-p/OProofs", split="train") | |
| ``` | |
| OProver is trained against a multi-round agentic interface: at each round the | |
| input includes the target Lean statement, top-k retrieved verified proofs, the | |
| prior proof attempt, and the Lean compiler feedback. See the paper §2.1 for | |
| the prompt template, and Appendix B for serialization details. | |
| ## Citation | |
| ```bibtex | |
| @article{ma2026oprover, | |
| title = {OProver: A Unified Framework for Agentic Formal Theorem Proving}, | |
| author = {David Ma and Kaijing Ma and Shawn Guo and Yunfeng Shi and Enduo Zhao and Jiajun Shi and Zhaoxiang Zhang and Gavin Cheung and Jiaheng Liu and Zili Wang}, | |
| journal = {arXiv preprint arXiv:2605.17283}, | |
| year = {2026} | |
| } | |
| ``` | |