Related Work

This model explores PPO (Proximal Policy Optimization) as an alternative to GRPO (Group Relative Policy Optimization) for tool learning.

Qwen2.5-7B-Instruct-ToolRL-PPO-Cold

Qwen2.5-7B-Instruct fine-tuned with PPO cold-start training on the ToolRL dataset.

Model Description

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Algorithm: PPO (cold start)
  • Dataset: ToolRL rlla_4k (4,000 samples)
  • Hardware: 2× NVIDIA A100 SXM4 80GB
  • Training Time: ~16 hours
  • Final val/test_score: 1.312 (scale: [-1.0, +2.0])
  • Final val/test_format: 0.988 (≈99% correctly formatted outputs)
  • Final val/test_correctness: 0.607 (scale: [-1.0, +1.0])

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/Qwen2.5-7B-Instruct-ToolRL-PPO-Cold",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "YOUR_USERNAME/Qwen2.5-7B-Instruct-ToolRL-PPO-Cold"
)

system_prompt = """You are a helpful multi-turn dialogue assistant capable of leveraging tool calls to solve user tasks.

**Available Tools**
1. Name: {tool_name}
Description: {tool_description}
Parameters: {tool_params}

**Output Format**
<think> Your thoughts </think>
<tool_call>
{"name": "Tool name", "parameters": {"param": "value"}}
</tool_call>
<response> Final response </response>"""

user_prompt = "**Dialogue Records History**\n<user> {question} </user>"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
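The generated text follows the tagged output format shown in the system prompt, so the tool call can be recovered with a small parser. The sketch below is an illustrative helper (not part of the released code), assuming the model emits exactly one well-formed `<tool_call>` block:

```python
import json
import re

def parse_tool_call(text):
    """Extract the JSON payload from the first <tool_call> block, or None if absent."""
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

example = (
    "<think> The user wants to add two numbers. </think>\n"
    "<tool_call>\n"
    '{"name": "calculator", "parameters": {"a": 1234, "b": 5678, "op": "+"}}\n'
    "</tool_call>"
)
print(parse_tool_call(example)["name"])  # calculator
```

If the model emits a `<response>` block instead of a tool call, the helper simply returns `None`, so the caller can fall back to treating the output as a final answer.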

Example Outputs

Calculator

Input: "What is 1234 + 5678?"

<think> The user wants to add 1234 and 5678. I will use the calculator tool. </think>
<tool_call>
{"name": "calculator", "parameters": {"a": 1234, "b": 5678, "op": "+"}}
</tool_call>

Weather

Input: "What is the weather in Tokyo?"

<think> The user wants to know the current weather in Tokyo. I will use the get_weather tool. </think>
<tool_call>
{"name": "get_weather", "parameters": {"city": "Tokyo"}}
</tool_call>

Web Search

Input: "Search for latest news about AI"

<think> The user wants to know the latest news about AI. I will use the web_search tool. </think>
<tool_call>
{"name": "web_search", "parameters": {"query": "latest news about AI"}}
</tool_call>
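A parsed tool call like the ones above can be routed to a local implementation by name. The tool bodies below are hypothetical stand-ins for illustration; the actual tool schemas come from the ToolRL dataset, not from this card:

```python
import operator

# Hypothetical local implementation of the calculator tool shown above.
def calculator(a, b, op):
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    return ops[op](a, b)

TOOLS = {"calculator": calculator}

def dispatch(tool_call):
    """Look up the tool by name and invoke it with the model-supplied parameters."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["parameters"])

print(dispatch({"name": "calculator",
                "parameters": {"a": 1234, "b": 5678, "op": "+"}}))  # 6912
```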

Training Details

Hyperparameters

{
    "algorithm": "PPO",
    "batch_size": 512,
    "epochs": 15,
    "actor_lr": 1e-6,
    "critic_lr": 1e-5,
    "kl_coef": 0.05,
    "max_grad_norm": 1.0,
    "ppo_clip_range": 0.1,
    "normalize_advantages": True,
    "max_prompt_length": 1024,
    "max_response_length": 512,
}
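The `ppo_clip_range` of 0.1 controls PPO's clipped surrogate objective. As a minimal per-token sketch of the standard formulation (not the training framework's actual implementation):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_range=0.1):
    """Clipped surrogate loss for a single token (negative of the PPO objective)."""
    ratio = math.exp(logp_new - logp_old)                       # pi_new / pi_old
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)
    # Taking the min of the unclipped and clipped terms caps the
    # incentive to move the policy far from the old one.
    return -min(ratio * advantage, clipped * advantage)

# A positive-advantage token whose probability already rose a lot:
# the objective is capped at (1 + clip_range) * advantage = 1.1.
print(ppo_clip_loss(logp_new=-0.5, logp_old=-1.0, advantage=1.0))  # -1.1
```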

Training Reward Configuration

This model was trained with modified reward scaling:

{
    "CORRECTMAX1": 1,        # Reward range [-1, +1] instead of [-3, +3]
    "WITHLENGTH": 0,         # Length reward disabled
    "REFINEDREWARD": 0,      # Refined reward disabled
    "COARSEREWARD": 0,       # Coarse reward disabled
    "STRICTMATCH": 0,        # Strict match disabled
}

Reward Scale

Format score:      [0.0,  +1.0]
Correctness score: [-1.0, +1.0]  ← CORRECTMAX1=1
Total range:       [-1.0, +2.0]

Note: the original ToolRL paper uses a [-3.0, +3.0] correctness range, so these
results are not directly comparable to the paper without rescaling.
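The total range [-1.0, +2.0] follows from adding the two component scores, and the CORRECTMAX1 flag amounts to shrinking the paper's correctness range by a factor of three. A sketch of that arithmetic (illustrative helper names, not the training framework's reward code):

```python
def rescale_correctness(score):
    """Map the original ToolRL [-3, +3] correctness range to [-1, +1] (CORRECTMAX1=1)."""
    return score / 3.0

def total_reward(format_score, correctness_score):
    """Additive combination implied by the stated total range [-1.0, +2.0]."""
    return format_score + correctness_score

print(total_reward(1.0, rescale_correctness(3.0)))   # 2.0  (best case)
print(total_reward(0.0, rescale_correctness(-3.0)))  # -1.0 (worst case)
```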

Citation

@article{toolrl2025,
    title={ToolRL: Reward is All Tool Learning Needs},
    year={2025}
}