Weak spot of the model

#1
by NikolaiSkripko - opened

First, I want to say that this work is excellent! Thanks a lot for open-sourcing it.

I would like to point out one weak spot of this reward model.

Consider a chat with 6 messages:

1) user: what's the weather in Moscow and London?
2) assistant: get_weather(Moscow)
3) tool result: 16F
4) assistant: get_weather(London)
5) tool result: 47F
6) assistant: The weather ... 
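To make the failure mode concrete, here is the same six-message trajectory in an OpenAI-style tool-calling message format (the exact field names vary by serving stack; the `call_1`/`call_2` IDs and the final answer text are illustrative):

```python
import json

# The six-message trajectory above, in a common tool-calling chat schema.
messages = [
    {"role": "user", "content": "what's the weather in Moscow and London?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function", "function": {
            "name": "get_weather", "arguments": json.dumps({"city": "Moscow"})}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "16F"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_2", "type": "function", "function": {
            "name": "get_weather", "arguments": json.dumps({"city": "London"})}}]},
    {"role": "tool", "tool_call_id": "call_2", "content": "47F"},
    {"role": "assistant",
     "content": "The weather is 16F in Moscow and 47F in London."},
]

# Scoring message 2 in isolation: the reward model only sees the prefix up to
# the call being judged, so the later London call is invisible to it.
prefix_for_step_2 = messages[:2]
print(len(prefix_for_step_2))
```

Because the prefix ends right after the Moscow call, a reward model that prefers parallel calls has no way to see that London is handled one turn later.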

If we try to score message number 2, ToolRM will say something like:
'The assistant should have called get_weather twice. Hence, it made a mistake.'

This becomes a headache with LLMs that do not support parallel calls (like GPT OSS or our Russian GigaChat models).

I propose a solution: pass the whole chat to ToolRM (separately repeating the message you would like to evaluate), or at least include some messages after the assistant response being evaluated (e.g., add IDs to the messages and ask the model to evaluate the message with a specified ID).
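A minimal sketch of that ID-based idea: number every message, show the full trajectory, and ask the judge to score one message by ID. The prompt wording and the `build_eval_prompt` helper below are hypothetical, not ToolRM's native input format:

```python
def format_message(i, msg):
    """Render one chat message with an explicit ID prefix (hypothetical format)."""
    return f"[{i}] {msg['role']}: {msg['content']}"

def build_eval_prompt(messages, target_id):
    """Show the whole trajectory, then ask the judge to score one message by ID."""
    transcript = "\n".join(
        format_message(i, m) for i, m in enumerate(messages, start=1))
    return (
        f"{transcript}\n\n"
        f"Evaluate ONLY message [{target_id}] above. "
        f"Later messages are provided as context."
    )

chat = [
    {"role": "user", "content": "what's the weather in Moscow and London?"},
    {"role": "assistant", "content": "get_weather(Moscow)"},
    {"role": "tool", "content": "16F"},
    {"role": "assistant", "content": "get_weather(London)"},
    {"role": "tool", "content": "47F"},
    {"role": "assistant", "content": "The weather is 16F in Moscow and 47F in London."},
]

prompt = build_eval_prompt(chat, target_id=2)
```

With the later messages visible, the judge can see that London was covered by a follow-up call rather than omitted.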

Owner

Hi Nikolai,
Thank you for the kind words and for flagging this. ToolRM was trained with an emphasis on function-calling efficiency, often expecting the assistant to issue all required tool calls in a single (parallel) turn. So for a step like get_weather(Moscow), it may infer that the assistant should have called both Moscow and London in parallel—an assumption that doesn’t align with sequential-only tool behavior.

That said, there are a couple of ways to mitigate this:

  1. Inference-time workaround:
    As you suggested, you can pass more context to ToolRM—e.g., merge the tool-call/tool-response messages into a larger chunk and ask ToolRM to evaluate the targeted assistant message with that additional surrounding trajectory (or provide message IDs and specify which one to score).
    Note: This differs from ToolRM’s native training setup and relies on generalization from RL-trained reasoning behavior, so it may help in practice, but we can’t guarantee stability or consistent calibration.
  2. Training-based solution:
    Alternatively, you can retrain a ToolRM variant on preference data that excludes (or re-labels) parallel-call examples, bringing it into closer alignment with sequential-only model behavior.
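For the training-based route, the filtering step could look like the sketch below. The dataset schema (`"messages"` with per-turn `"tool_calls"` lists) is assumed for illustration, not ToolRM's actual preference-data format:

```python
def has_parallel_calls(example):
    """True if any assistant turn issues more than one tool call in the same turn."""
    for msg in example["messages"]:
        if msg.get("role") == "assistant" and len(msg.get("tool_calls", [])) > 1:
            return True
    return False

def filter_sequential_only(dataset):
    """Keep only preference examples whose trajectories call tools one at a time."""
    return [ex for ex in dataset if not has_parallel_calls(ex)]

# Tiny illustrative dataset: the first example uses a parallel turn, the second
# is sequential-only, so filtering keeps just the second.
dataset = [
    {"messages": [
        {"role": "user", "content": "weather in Moscow and London?"},
        {"role": "assistant",
         "tool_calls": [{"name": "get_weather"}, {"name": "get_weather"}]},
    ]},
    {"messages": [
        {"role": "user", "content": "weather in Moscow?"},
        {"role": "assistant", "tool_calls": [{"name": "get_weather"}]},
    ]},
]
kept = filter_sequential_only(dataset)
```

Re-labeling (rather than dropping) parallel examples would preserve more data, but dropping is the simpler first pass.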

More broadly, the current ToolRM is primarily a process reward model designed to score intermediate steps. We may consider building a version that performs fine-grained evaluation over the entire tool-use trajectory, which would address this issue more natively. We still believe parallel tool calling is an important capability—it can significantly improve efficiency for complex requests and is supported by many frontier models—so we intentionally wanted ToolRM to provide feedback that encourages that behavior when it is available.
