---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
library_name: transformers
tags:
- qwen3
- reinforcement-learning
- verifiable-rewards
- process-rewards
- agentic-reasoning
- reasoning
- sudoku
arxiv: 2605.10325
---

![VPR overview](https://thu-nics.github.io/VPR/static/images/overview.png)

# VPR-Sudoku

[**🌐 Project Page**](https://thu-nics.github.io/VPR/) | [**📝 Paper**](https://arxiv.org/abs/2605.10325) | [**💻 Code**](https://github.com/thu-nics/VPR)

## Model Description

This is the Sudoku-trained checkpoint from **VPR: Verifiable Process Rewards for Agentic Reasoning**, initialized from Qwen3-4B. VPR turns verifiable oracles into dense turn-level rewards for long-horizon agentic reasoning. This checkpoint is trained with **constraint-based VPR** on Sudoku, where a constraint oracle verifies whether each filled digit is consistent with the puzzle solution.

## Overview

Reinforcement learning from verifiable rewards usually rewards only final success. In long-horizon agentic tasks, this creates a credit-assignment problem: a trajectory may fail after many correct steps, or succeed despite flawed intermediate decisions. VPR studies densely verifiable agentic reasoning problems, where each intermediate action can be checked by a task-specific oracle. Instead of learning a noisy process reward model or estimating step values through extra rollouts, VPR uses the task structure itself to provide reliable turn-level supervision.

## Method

![VPR method](https://thu-nics.github.io/VPR/static/images/vpr.png)

VPR converts sparse trajectory-level feedback into dense process rewards:

```text
r_t^VPR = V(s_t, a_t)
```

For Sudoku, VPR uses a constraint-based verifier. The oracle rewards digit placements that are consistent with the valid solution, giving immediate feedback for intermediate reasoning steps.
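As an illustration of the idea (not the released verifier — the function name, grid size, and reward values below are assumptions), a constraint oracle of this kind can be sketched in a few lines: given the known solution grid, each proposed digit placement is checked immediately and scored as a per-turn reward.

```python
# Minimal sketch of a constraint-based turn-level verifier for Sudoku.
# The reward scheme (+1 for a consistent placement, -1 otherwise) is an
# illustrative assumption, not the exact values used by VPR.

def vpr_turn_reward(solution, row, col, digit):
    """Dense per-turn reward for placing `digit` at (row, col).

    `solution` is the fully solved grid; the oracle simply checks the
    proposed placement against it, so every intermediate step gets
    immediate feedback instead of waiting for trajectory-level success.
    """
    return 1.0 if solution[row][col] == digit else -1.0

# Example: a 4x4 Sudoku solution grid (illustrative).
solution = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]

print(vpr_turn_reward(solution, 0, 0, 1))  # consistent placement -> 1.0
print(vpr_turn_reward(solution, 0, 1, 3))  # inconsistent placement -> -1.0
```

Because the oracle compares against a known solution, it is exact and free of the noise that a learned process reward model or Monte Carlo rollout estimate would introduce.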
## Results

### In-Domain Sudoku

| Method | Success Rate | Completion Rate |
|---|---:|---:|
| Base | 3.91 | 63.24 |
| OR | 48.43 | 82.80 |
| MC-PR | 34.72 | 77.39 |
| **VPR** | **56.25** | **85.13** |

### Zero-Shot Transfer

| Benchmark group | Metric | VPR-Sudoku |
|---|---:|---:|
| General reasoning | Average pass@1 | **62.34** |
| General reasoning | GPQA-Diamond | **50.17** |
| General reasoning | BBH | 88.82 |
| General reasoning | MMLU-Pro | 67.88 |
| Agentic reasoning | ALFWorld success rate | 25.62 |
| Agentic reasoning | WebShop score | **34.29** |
| Agentic reasoning | WebShop success rate | **2.20** |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nics-efc/VPR-Sudoku"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve this step by step: If a train travels 180 miles in 3 hours, what is its average speed?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Intended Use

This checkpoint is intended for research on verifiable rewards, process supervision, reinforcement learning for LLM agents, and transfer from game-like agentic training environments to broader reasoning tasks. The released checkpoint contains the trained language model. Environment simulators, verifiers, and training code are provided in the project repository.
## Citation

If you find this model helpful, please cite:

```bibtex
@misc{yuan2026verifiable,
      title={Verifiable Process Rewards for Agentic Reasoning},
      author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
      year={2026},
      eprint={2605.10325},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.10325}
}
```