WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
Paper • 2601.21872
Reasoning process reward model for web agents: models, data, and WebPRMBench. ICLR 2026.
Note 📄 Published at ICLR 2026
Note 🏆 Best model (Qwen3-8B backbone) — 76.66% Avg. BoN Acc on WebPRMBench
Note 🤖 Qwen2.5-7B backbone — 74.60% Avg. BoN Acc on WebPRMBench
Note ⚡ Efficient model (Qwen3-4B backbone) — 72.55% Avg. BoN Acc on WebPRMBench
Note 🤖 Qwen2.5-3B backbone — 59.06% Avg. BoN Acc on WebPRMBench
Note 📊 Evaluation benchmark — 1,150 states, 4,600 pairwise instances across 4 web environments
Note 📊 Training data — SFT (9,642 examples) + RL (18,921 pairs)
Note 📊 72 reward-guided search trajectories on WebArena-Lite