ToolRL: Reward is All Tool Learning Needs
Paper: arXiv:2504.13958
This model explores PPO as an alternative to GRPO for tool learning. It was produced by fine-tuning Qwen2.5-7B-Instruct with PPO cold-start training on the ToolRL dataset.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen2.5-7B-Instruct-ToolRL-PPO-Cold",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen2.5-7B-Instruct-ToolRL-PPO-Cold"
)

# The system prompt declares the available tools and the expected output format.
system_prompt = """You are a helpful multi-turn dialogue assistant capable of leveraging tool calls to solve user tasks.

**Available Tools**
1. Name: {tool_name}
   Description: {tool_description}
   Parameters: {tool_params}

**Output Format**
<think> Your thoughts </think>
<tool_call>
{"name": "Tool name", "parameters": {"param": "value"}}
</tool_call>
<response> Final response </response>"""

user_prompt = "**Dialogue Records History**\n<user> {question} </user>"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Move inputs to the model's device (device_map="auto" may place it anywhere).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```
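Given the output format above, the `<tool_call>` payload can be extracted from the decoded response with a small parser. A minimal sketch, where `parse_tool_call` is an illustrative helper and not part of any released code:

```python
import json
import re

def parse_tool_call(text: str):
    """Extract and decode the JSON payload of the first <tool_call> block.

    Returns None when the response contains no tool call (e.g. a plain
    <response> answer).
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Example response in the model's output format.
example = (
    "<think> Add the numbers. </think>\n"
    "<tool_call>\n"
    '{"name": "calculator", "parameters": {"a": 1234, "b": 5678, "op": "+"}}\n'
    "</tool_call>"
)
call = parse_tool_call(example)
print(call["name"])  # calculator
```

The non-greedy `.*?` with `re.DOTALL` keeps the match inside a single `<tool_call>...</tool_call>` pair even when several blocks appear in one response.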
Example outputs:

Input: `"What is 1234 + 5678?"`

```
<think> The user wants to add 1234 and 5678. I will use the calculator tool. </think>
<tool_call>
{"name": "calculator", "parameters": {"a": 1234, "b": 5678, "op": "+"}}
</tool_call>
```

Input: `"What is the weather in Tokyo?"`

```
<think> The user wants to know the current weather in Tokyo. I will use the get_weather tool. </think>
<tool_call>
{"name": "get_weather", "parameters": {"city": "Tokyo"}}
</tool_call>
```

Input: `"Search for latest news about AI"`

```
<think> The user wants to know the latest news about AI. I will use the web_search tool. </think>
<tool_call>
{"name": "web_search", "parameters": {"query": "latest news about AI"}}
</tool_call>
```
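Parsed tool calls like the ones above can be routed to local implementations through a simple name-to-function registry. A sketch with a hypothetical `calculator` backend (the actual tool implementations behind these examples are not specified here):

```python
def calculator(a: float, b: float, op: str) -> float:
    """Toy calculator tool supporting the four basic operators."""
    results = {
        "+": a + b,
        "-": a - b,
        "*": a * b,
        "/": a / b if b else float("nan"),
    }
    return results[op]

# Registry mapping tool names (as emitted in <tool_call>) to callables.
TOOLS = {"calculator": calculator}

def dispatch(call: dict):
    """Route a parsed tool call to the matching local implementation."""
    fn = TOOLS[call["name"]]
    return fn(**call["parameters"])

result = dispatch(
    {"name": "calculator", "parameters": {"a": 1234, "b": 5678, "op": "+"}}
)
print(result)  # 6912
```

In a full loop, the tool's return value would be fed back to the model as an observation before the final `<response>` turn.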
Training configuration:

```python
{
    "algorithm": "PPO",
    "batch_size": 512,
    "epochs": 15,
    "actor_lr": 1e-6,
    "critic_lr": 1e-5,
    "kl_coef": 0.05,
    "max_grad_norm": 1.0,
    "ppo_clip_range": 0.1,
    "normalize_advantages": True,
    "max_prompt_length": 1024,
    "max_response_length": 512,
}
```
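The `ppo_clip_range` and `normalize_advantages` settings correspond to the standard clipped PPO surrogate objective. A pure-Python sketch of that objective (illustrative only, not the actual training code, which operates on per-token log-probabilities):

```python
import math

def ppo_actor_loss(logp_new, logp_old, advantages, clip_range=0.1):
    """Clipped PPO surrogate loss over a batch of samples.

    Matches the config above: clip_range=0.1 (ppo_clip_range) and
    per-batch advantage normalization (normalize_advantages=True).
    """
    # Normalize advantages to zero mean, unit variance.
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    adv = [(a - mean) / (math.sqrt(var) + 1e-8) for a in advantages]

    losses = []
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = max(min(ratio, 1 + clip_range), 1 - clip_range)
        # Pessimistic bound: take the smaller (worse) of the two surrogates.
        losses.append(-min(ratio * a, clipped * a))
    return sum(losses) / len(losses)

# With identical old/new log-probs the ratio is 1, so the loss is ~0.
loss = ppo_actor_loss([-1.0, -0.5], [-1.0, -0.5], [1.0, -1.0])
```

Clipping the ratio to `[0.9, 1.1]` keeps each update close to the behavior policy, which is what makes PPO a more conservative choice than unclipped policy gradients.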
This model was trained with modified reward scaling:

```python
{
    "CORRECTMAX1": 1,     # Reward range [-1, +1] instead of [-3, +3]
    "WITHLENGTH": 0,      # Length reward disabled
    "REFINEDREWARD": 0,   # Refined reward disabled
    "COARSEREWARD": 0,    # Coarse reward disabled
    "STRICTMATCH": 0,     # Strict match disabled
}
```
- Format score: [0.0, +1.0]
- Correctness score: [-1.0, +1.0] (CORRECTMAX1=1)
- Total range: [-1.0, +2.0]

Note: the original ToolRL paper uses a [-3.0, +3.0] correctness range, so results are not directly comparable to the paper without rescaling.
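Under these settings, the total reward is the format score plus the clipped correctness score. A minimal sketch of that combination, with a hypothetical linear rescaling back to the paper's correctness range:

```python
def total_reward(format_ok: bool, correctness: float,
                 correct_max: float = 1.0) -> float:
    """Format reward in [0, 1] plus correctness clipped to
    [-correct_max, +correct_max] (CORRECTMAX1=1)."""
    fmt = 1.0 if format_ok else 0.0
    corr = max(-correct_max, min(correct_max, correctness))
    return fmt + corr

def rescale_to_paper(corr_score: float) -> float:
    """Map a correctness score from [-1, 1] to the paper's [-3, 3] range.

    An assumed linear rescaling for rough comparison only; the paper's
    reward is not simply a scaled version of this one.
    """
    return corr_score * 3.0

print(total_reward(True, 1.0))    # 2.0, the upper bound of [-1.0, +2.0]
print(total_reward(False, -1.0))  # -1.0, the lower bound
```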
```bibtex
@article{toolrl2025,
  title={ToolRL: Reward is All Tool Learning Needs},
  journal={arXiv preprint arXiv:2504.13958},
  year={2025}
}
```