VPR-Sudoku

🌐 Project Page | 📝 Paper | 💻 Code

Model Description

This is the Sudoku-trained checkpoint from VPR: Verifiable Process Rewards for Agentic Reasoning, initialized from Qwen3-4B.

VPR turns verifiable oracles into dense turn-level rewards for long-horizon agentic reasoning. This checkpoint is trained with constraint-based VPR on Sudoku, where a constraint oracle verifies whether each filled digit is consistent with the puzzle solution.

Overview

Reinforcement learning from verifiable rewards usually rewards only final success. In long-horizon agentic tasks, this creates a credit assignment problem: a trajectory may fail after many correct steps, or succeed despite flawed intermediate decisions.

VPR studies densely-verifiable agentic reasoning problems, where each intermediate action can be checked by a task-specific oracle. Instead of learning a noisy process reward model or estimating step values through extra rollouts, VPR uses task structure itself to provide reliable turn-level supervision.
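The difference between outcome-only and dense turn-level supervision can be sketched in a few lines. This is a hypothetical illustration, not code from the VPR release: `outcome_reward` and `dense_process_reward` are made-up names, and the oracle verdicts are given as a boolean list.

```python
# Hypothetical illustration of sparse vs. dense credit assignment for a
# trajectory in which step 3 is flawed. Not the VPR implementation.

def outcome_reward(step_correct: list[bool]) -> list[float]:
    """Sparse: every step shares the single trajectory-level outcome."""
    final = 1.0 if all(step_correct) else 0.0
    return [final] * len(step_correct)

def dense_process_reward(step_correct: list[bool]) -> list[float]:
    """Dense: each step is scored by its own oracle verdict V(s_t, a_t)."""
    return [1.0 if ok else 0.0 for ok in step_correct]

trajectory = [True, True, False, True]  # one flawed intermediate step
print(outcome_reward(trajectory))        # [0.0, 0.0, 0.0, 0.0]
print(dense_process_reward(trajectory))  # [1.0, 1.0, 0.0, 1.0]
```

Under outcome-only reward, the three correct steps receive the same zero signal as the flawed one; the dense oracle verdicts localize the error.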

Method

Figure: overview of the VPR method.

VPR converts sparse trajectory-level feedback into dense process rewards:

r_t^VPR = V(s_t, a_t)

For Sudoku, VPR uses a constraint-based verifier. The oracle rewards digit placements that are consistent with the valid solution, giving immediate feedback for intermediate reasoning steps.
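A minimal sketch of such a constraint oracle, assuming access to the reference solution grid. The function name, grid representation, and 4×4 toy puzzle are illustrative assumptions, not the released VPR verifier (standard Sudoku would use a 9×9 grid).

```python
# Illustrative constraint oracle: a digit placement earns reward 1.0 iff it
# matches the reference solution at that cell. Hypothetical helper, not the
# released VPR verifier.

Grid = list[list[int]]

def verify_placement(solution: Grid, row: int, col: int, digit: int) -> float:
    """Turn-level verifier V(s_t, a_t): oracle check of one filled digit."""
    return 1.0 if solution[row][col] == digit else 0.0

# 4x4 toy solution grid.
solution = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
print(verify_placement(solution, 0, 0, 1))  # correct digit -> 1.0
print(verify_placement(solution, 0, 1, 3))  # wrong digit   -> 0.0
```

Because every placement is checked immediately, the agent receives a reward signal at every turn rather than only at the end of the puzzle.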

Results

In-Domain Sudoku

| Method | Success Rate | Completion Rate |
|--------|-------------:|----------------:|
| Base   | 3.91         | 63.24           |
| OR     | 48.43        | 82.80           |
| MC-PR  | 34.72        | 77.39           |
| VPR    | 56.25        | 85.13           |

Zero-Shot Transfer

| Benchmark group    | Metric               | VPR-Sudoku |
|--------------------|----------------------|-----------:|
| General reasoning  | Average pass@1       | 62.34      |
| General reasoning  | GPQA-Diamond         | 50.17      |
| General reasoning  | BBH                  | 88.82      |
| General reasoning  | MMLU-Pro             | 67.88      |
| Agentic reasoning  | ALFWorld success rate | 25.62     |
| Agentic reasoning  | WebShop score        | 34.29      |
| Agentic reasoning  | WebShop success rate | 2.20       |

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nics-efc/VPR-Sudoku"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve this step by step: If a train travels 180 miles in 3 hours, what is its average speed?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # emit the model's reasoning trace before the answer
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
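With `enable_thinking=True`, Qwen3-family models emit a reasoning segment terminated by a `</think>` token before the final answer. The snippet below sketches how to separate the two from the generated token ids; the token id `151668` is the `</think>` id documented for the Qwen3 tokenizer, but verify it for this checkpoint, and note `split_thinking` is a helper name introduced here, not part of any library.

```python
# Optional post-processing for thinking mode: split newly generated token
# ids at the "</think>" boundary. THINK_END_ID is assumed from the Qwen3
# tokenizer; confirm it for this checkpoint before relying on it.
THINK_END_ID = 151668

def split_thinking(generated_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (thinking_ids, answer_ids); thinking includes the end token."""
    try:
        # position just after the last occurrence of </think>
        idx = len(generated_ids) - generated_ids[::-1].index(THINK_END_ID)
    except ValueError:
        idx = 0  # no thinking segment was produced
    return generated_ids[:idx], generated_ids[idx:]

thinking, answer = split_thinking([11, 22, THINK_END_ID, 33, 44])
print(thinking, answer)  # [11, 22, 151668] [33, 44]
```

In practice, pass `outputs[0][len(inputs.input_ids[0]):].tolist()` from the snippet above, then decode each half with `tokenizer.decode(..., skip_special_tokens=True)`.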

Intended Use

This checkpoint is intended for research on verifiable rewards, process supervision, reinforcement learning for LLM agents, and transfer from game-like agentic training environments to broader reasoning tasks.

The released checkpoint contains the trained language model. Environment simulators, verifiers, and training code are provided in the project repository.

Citation

If you find this model helpful, please cite:

@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325}
}