---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
library_name: transformers
tags:
- qwen3
- reinforcement-learning
- verifiable-rewards
- process-rewards
- agentic-reasoning
- reasoning
- minesweeper
arxiv: 2605.10325
---
# VPR-Minesweeper

[**Project Page**](https://thu-nics.github.io/VPR/) | [**Paper**](https://arxiv.org/abs/2605.10325) | [**Code**](https://github.com/thu-nics/VPR)

## Model Description

This is the Minesweeper-trained checkpoint from **VPR: Verifiable Process Rewards for Agentic Reasoning**, initialized from Qwen3-4B.

VPR turns verifiable oracles into dense turn-level rewards for long-horizon agentic reasoning. This checkpoint is trained with **posterior-based VPR** on Minesweeper, where posterior mine probabilities provide step-level feedback for revealing safe cells and flagging certain mines under partial observability.

## Overview

Reinforcement learning from verifiable rewards usually rewards only final success. In long-horizon agentic tasks, this creates a credit assignment problem: a trajectory may fail after many correct steps, or succeed despite flawed intermediate decisions.

VPR studies densely verifiable agentic reasoning problems, where each intermediate action can be checked by a task-specific oracle. Instead of learning a noisy process reward model or estimating step values through extra rollouts, VPR uses the task structure itself to provide reliable turn-level supervision.

## Method

VPR converts sparse trajectory-level feedback into dense process rewards by scoring every turn with the task-specific oracle:

```text
r_t^VPR = V(s_t, a_t)
```

Here, V(s_t, a_t) is the oracle's verdict for action a_t taken in state s_t, so each turn receives its own verifiable reward instead of sharing a single trajectory-level outcome.

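As a rough sketch of this shaping (not the released training code; `Verifier`, `dense_vpr_rewards`, and `sparse_outcome_rewards` are illustrative names), each turn in a trajectory gets its own reward from the oracle instead of sharing one terminal outcome:

```python
from typing import Callable, List, Tuple

# Hypothetical types: a trajectory is a list of (state, action) pairs and the
# verifier maps each pair to a scalar verdict, as in r_t^VPR = V(s_t, a_t).
Trajectory = List[Tuple[dict, str]]
Verifier = Callable[[dict, str], float]

def dense_vpr_rewards(trajectory: Trajectory, verify: Verifier) -> List[float]:
    """Assign one verifiable reward per turn instead of a single terminal reward."""
    return [verify(state, action) for state, action in trajectory]

def sparse_outcome_rewards(trajectory: Trajectory, success: bool) -> List[float]:
    """Outcome-only baseline: every turn is zero except the final success signal."""
    rewards = [0.0] * len(trajectory)
    if rewards:
        rewards[-1] = 1.0 if success else 0.0
    return rewards
```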

For Minesweeper, VPR uses a posterior-based verifier. The oracle computes posterior mine probabilities and rewards actions that reveal provably safe cells or flag certain mines.
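
A minimal sketch of such a check, assuming the oracle exposes per-cell posterior mine probabilities (`posterior_turn_reward` and its exact scoring are illustrative, not the paper's implementation):

```python
def posterior_turn_reward(posterior: dict, action: str, cell: tuple) -> float:
    """Posterior-based verdict for one Minesweeper move.

    `posterior[cell]` is the posterior probability that `cell` hides a mine,
    computed by the oracle from the currently revealed board.
    """
    p_mine = posterior[cell]
    if action == "reveal":
        # Provably safe reveal: posterior mine probability is zero.
        return 1.0 if p_mine == 0.0 else 0.0
    if action == "flag":
        # Certain mine: posterior mine probability is one.
        return 1.0 if p_mine == 1.0 else 0.0
    return 0.0
```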

## Results

### In-Domain Minesweeper

| Method | Success Rate (%) | Completion Rate (%) |
| |---|---:|---:| |
| | Base | 0.78 | 73.71 | |
| | OR | 3.91 | 77.26 | |
| | MC-PR | 2.34 | 78.67 | |
| | **VPR** | **10.39** | **80.27** | |

### Zero-Shot Transfer

| | Benchmark group | Metric | VPR-Minesweeper | |
| |---|---:|---:| |
| | General reasoning | Average pass@1 | **62.59** | |
| | General reasoning | GSM8K | **94.82** | |
| | General reasoning | MATH-500 | **85.00** | |
| | General reasoning | MMLU-Pro | **67.98** | |
| | Agentic reasoning | ALFWorld success rate | **28.61** | |
| | Agentic reasoning | WebShop score | 30.38 | |
| | Agentic reasoning | WebShop success rate | 1.93 | |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nics-efc/VPR-Minesweeper"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve this step by step: If a train travels 180 miles in 3 hours, what is its average speed?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
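
With `enable_thinking=True`, Qwen3-based checkpoints typically wrap their reasoning in a `<think>...</think>` block before the final answer. A small follow-up sketch, continuing from the snippet above and assuming this checkpoint inherits the Qwen3 `</think>` special token, separates the two:

```python
# Continue from the Usage snippet above: split the <think> reasoning from the answer.
gen_ids = outputs[0][inputs.input_ids.shape[-1]:].tolist()
think_end_id = tokenizer.convert_tokens_to_ids("</think>")
try:
    # Find the last </think> token among the newly generated tokens.
    split = len(gen_ids) - gen_ids[::-1].index(think_end_id)
except ValueError:
    split = 0  # no thinking block was emitted
thinking = tokenizer.decode(gen_ids[:split], skip_special_tokens=True).strip()
answer = tokenizer.decode(gen_ids[split:], skip_special_tokens=True).strip()
print(answer)
```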

## Intended Use

This checkpoint is intended for research on verifiable rewards, process supervision, reinforcement learning for LLM agents, and transfer from game-like agentic training environments to broader reasoning tasks.

The released checkpoint contains the trained language model. Environment simulators, verifiers, and training code are provided in the project repository.

## Citation

If you find this model helpful, please cite:

```bibtex
@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325}
}
```