WebArbiter-4B-Qwen3

A principle-guided reasoning Process Reward Model for web agents

Published at ICLR 2026

Introduction

WebArbiter-4B-Qwen3 is a 4B reasoning Process Reward Model (PRM) for web agents, built on Qwen3-4B. It demonstrates that stronger base models amplify the benefits of principle-guided reasoning distillation — achieving an Avg. BoN Acc of 72.55% with roughly half the parameters of WebArbiter-7B (Qwen2.5), which scores 74.60%.

Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

Highlights

Parameter-efficient: Approaches WebArbiter-7B (Qwen2.5) performance (72.55 vs 74.60 Avg. BoN Acc) with roughly half the parameters.
Reasoning as reward: Generates structured <State>, <Criteria>, <Analysis>, and <Answer> outputs with auditable reasoning chains.
Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state.
Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO).
Cross-backbone generalization: Same training pipeline as Qwen2.5 variants; only backbone-specific hyperparameters differ.

Results on WebPRMBench

Models marked with ⋆ are ours. Bold = best at comparable scale.

Model	Mind2Web		WebArena		AssistantBench		WorkArena		Avg.
	Pair	BoN	Pair	BoN	Pair	BoN	Pair	BoN	Pair	BoN
Proprietary LLM-as-judge
GPT-4o	79.99	52.62	84.58	66.67	85.83	66.67	84.33	55.19	83.68	60.29
GPT-5	80.86	62.39	84.83	71.64	81.67	63.33	81.14	64.62	82.13	65.50
WebPRMs (3~4B)
WebShepherd-3B	87.50	65.21	68.16	41.29	66.67	46.67	50.00	21.23	68.08	43.60
⋆ WebArbiter-3B (Qwen2.5)	93.32	78.42	81.97	56.22	78.33	46.67	81.01	54.81	83.65	59.06
⋆ WebArbiter-4B (Qwen3)	98.55	94.73	83.21	61.19	92.50	83.33	76.68	50.96	87.73	72.55

WebArbiter-4B (Qwen3) outperforms WebArbiter-3B (Qwen2.5) on Mind2Web, WebArena, and AssistantBench, but trails it on WorkArena (Pair 76.68 vs 81.01; BoN 50.96 vs 54.81). On average, BoN Acc improves from 59.06% to 72.55%, which approaches WebArbiter-7B (Qwen2.5) at 74.60% with roughly half the parameters.
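
As a quick sanity check, the Avg. BoN value for the WebArbiter-4B (Qwen3) row can be reproduced as an unweighted mean over the four environments. That the average is unweighted is an inference from the numbers in the table, not something the card states explicitly.

```python
# BoN accuracies for WebArbiter-4B (Qwen3), copied from the table above.
bon_4b = {
    "Mind2Web": 94.73,
    "WebArena": 61.19,
    "AssistantBench": 83.33,
    "WorkArena": 50.96,
}

# Assumption: Avg. is a plain (unweighted) mean across environments.
avg_bon_4b = sum(bon_4b.values()) / len(bon_4b)

# Gain over the reported WebArbiter-3B (Qwen2.5) average, in percentage points.
gain_over_3b = 72.55 - 59.06

print(f"{avg_bon_4b:.2f}")    # 72.55
print(f"{gain_over_3b:.2f}")  # 13.49
```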

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-4B-Qwen3"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

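# Wrap the prompt as a single user turn and apply the model's chat template.
# return_dict=True also returns the attention mask, which generate() can use.
messages = [{"role": "user", "content": user_prompt}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Greedy decoding keeps the verdict deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)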
print(response)

Example output:

<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
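
To use the verdict programmatically, the tagged sections can be pulled out of the response text. The following is a minimal sketch that assumes the output follows the tag layout of the example above. `parse_verdict` is a hypothetical helper, not part of the released code, and it returns `None` for anything it cannot parse rather than guessing.

```python
import re

# Section tags taken from the example output above.
SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_verdict(response: str) -> dict:
    """Split a WebArbiter-style response into its tagged sections.

    Missing sections map to None. "preferred" is 1 or 2, or None when
    no well-formed "Response N" verdict is found.
    """
    sections = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        sections[tag.lower()] = match.group(1).strip() if match else None

    preferred = None
    if sections["answer"] is not None:
        verdict = re.fullmatch(r"Response\s*([12])", sections["answer"])
        if verdict:
            preferred = int(verdict.group(1))
    sections["preferred"] = preferred
    return sections


example = (
    "<State>The user is on the DuckDuckGo homepage.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the search box.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
parsed = parse_verdict(example)
print(parsed["preferred"])  # 1
```

Treating an unparseable response as `None` lets the caller decide whether to retry, skip the step, or fall back to another signal.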

Training Details

	Stage 1: Reasoning Distillation	Stage 2: RLVR
Method	Supervised fine-tuning (SFT)	GRPO with binary verifiable rewards
Data	9,642 teacher-distilled examples	18,921 preference pairs
Teacher	o3	—
Base Model	Qwen3-4B	Stage 1 checkpoint
Fine-tuning	LoRA	FSDP + LoRA
Framework	LLaMA-Factory	veRL
Hardware	8 × NVIDIA A100-80GB	8 × NVIDIA A100-80GB
Source Data	WebPRM Collection (~30k step-level preference pairs from Mind2Web)

All variants use the same training data, distillation strategy, and RL procedure; only backbone-specific hyperparameters differ. See the paper (Appendix C) for full details.
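
To make the Stage 2 signal concrete, here is a rough sketch of a binary verifiable reward together with the group-relative advantage normalization commonly used in GRPO. It is an illustration under assumptions, not the authors' implementation. The function names are hypothetical, the exact reward definition (for example any format checks) is given in the paper rather than here, and implementations differ on whether they use the population or sample standard deviation.

```python
import re
from statistics import mean, pstdev


def binary_reward(response: str, gold: int) -> float:
    """Return 1.0 if the <Answer> verdict names the gold response, else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one preference pair whose gold label is Response 1.
rollouts = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
]
rewards = [binary_reward(r, gold=1) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 1.0, 1.0]
```

With this normalization, rollouts that reach the correct verdict receive a positive advantage and the incorrect one a negative advantage. A group in which every rollout agrees yields all-zero advantages and therefore no gradient signal.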

Intended Uses

WebArbiter-4B-Qwen3 is designed to:

Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
Provide interpretable feedback: Generate structured justifications explaining why one action is preferred.
Deploy under resource constraints: Deliver strong performance at 4B parameters, approaching 7B-level accuracy with roughly half the parameters.
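
Because the model compares two candidates at a time, using it for Best-of-N selection requires some scheme for reducing N candidates to one. One simple option, sketched below, is a sequential knockout that keeps the current winner and compares it against each remaining candidate. This is an assumed selection scheme for illustration, not necessarily the protocol used in the paper. `judge` is a hypothetical callable that would wrap a model call and return 1 or 2, and `toy_judge` merely stands in for it.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge: Callable[[str, str], int]) -> str:
    """Pick one candidate via sequential pairwise comparison (N - 1 judge calls).

    `judge(a, b)` is a hypothetical callable returning 1 if `a` is
    preferred and 2 if `b` is preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(best, challenger) == 2:
            best = challenger
    return best


def toy_judge(a: str, b: str) -> int:
    """Stand-in for a model call: prefers the longer action string."""
    return 1 if len(a) >= len(b) else 2


actions = ["click [3]", "fill [1] 'weather'", "scroll down"]
print(best_of_n(actions, toy_judge))  # fill [1] 'weather'
```

A knockout needs only N - 1 comparisons, at the cost of being sensitive to a single wrong verdict. A round-robin with vote counting is more robust but requires on the order of N² calls.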

Limitations

Text-only observations: Relies on accessibility tree representations without visual observations.
English-only: Training and evaluation are conducted exclusively in English-language web environments.
Safe-action bias: May sometimes overvalue cautious actions because the accessibility tree does not encode interaction effects.

License

This model is released under Apache 2.0, following the base model Qwen3-4B.

Related Resources

Resource	Link
WebArbiter-8B-Qwen3 (strongest)	ZYao720/WebArbiter-8B-Qwen3
WebArbiter-7B (Qwen2.5)	ZYao720/WebArbiter-7B
WebArbiter-3B (Qwen2.5)	ZYao720/WebArbiter-3B
WEBPRMBENCH (benchmark)	ZYao720/WEBPRMBENCH
Training Data	ZYao720/WebArbiter-Data
Search Trajectories	ZYao720/WebArbiter-Trajectories

Citation

@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents}, 
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872}, 
}