Qwen-2.5-1.5B-GRPO-Tool-Calling

This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct, specialized for tool-calling and structured reasoning about function arguments. It was trained with GRPO (Group Relative Policy Optimization).

The model emits a `<think>` block for internal reasoning before producing the final `<tool>` call, which lets it plan complex function parameters before committing to them.
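As an illustration, a response in this format can be split into its reasoning and tool-call parts with a small parser (the helper below is hypothetical and not part of this repository; only the `<think>`/`<tool>` layout comes from the card):

```python
import json
import re

def parse_response(text: str):
    """Split a model response into (reasoning, tool_call_dict).

    Assumes the <think>...</think> / <tool>...</tool> layout described above.
    Returns (None, None) if either tag is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    tool = re.search(r"<tool>(.*?)</tool>", text, re.DOTALL)
    if not (think and tool):
        return None, None
    return think.group(1).strip(), json.loads(tool.group(1))

response = (
    "<think>The user wants the weather, so I need the get_weather tool "
    "with the city name.</think>\n"
    '<tool> {"name": "get_weather", "parameters": {"city": "Paris"}} </tool>'
)
reasoning, call = parse_response(response)
print(call["name"])  # get_weather
```

Tool and parameter names here (`get_weather`, `city`) are invented for the example.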

🎯 Model Details

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Adaptation: QLoRA (4-bit quantization)
  • Primary Task: Verifiable Tool-Calling and Argument Generation

πŸ”§ Training Hyperparameters

The model was trained on a single NVIDIA T4 GPU with the following configuration:

| Parameter | Value |
| --- | --- |
| Steps | 25 |
| Effective Batch Size | 16 |
| Rollouts per Query | 4 |
| Max Prompt Length | 512 |
| Max Completion Length | 1024 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 32 |
| Target Modules | All linear layers (q, k, v, o, gate, up, down) |
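The table above maps roughly onto a TRL/PEFT configuration like the following. This is a sketch, not the actual training script: the batch-size split (4 per device × 4 accumulation steps = 16 effective) and the 4-bit settings are assumptions, and exact argument names may vary across `trl`/`peft` versions.

```python
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import GRPOConfig

# QLoRA: load the base model in 4-bit (assumed NF4 with bf16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA rank/alpha and target modules from the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    max_steps=25,
    per_device_train_batch_size=4,   # 4 x 4 accumulation = effective batch of 16 (assumed split)
    gradient_accumulation_steps=4,
    num_generations=4,               # rollouts per query
    max_prompt_length=512,
    max_completion_length=1024,
    bf16=True,
)
```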

🎁 Reward Functions

The model's behavior is shaped by a multi-objective reward system:

  1. Format Reward [Weight: 0.3]: Incentivizes the model to follow the specific XML-style structure:
    `<think> ... </think>`
    `<tool> {"name": "...", "parameters": {...}} </tool>`
    
  2. Correctness Reward [Weight: 0.7]: Validates the output against the ground truth. It awards points based on:
    • Exact match of the tool name.
    • Syntactic validity of the JSON parameters.
    • Semantic match of the arguments provided.
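A minimal sketch of how such a weighted reward could be computed (illustrative only; the actual reward code is not included in this card, and the equal three-way split inside the correctness reward is an assumption):

```python
import json
import re

TOOL_RE = re.compile(r"<think>.*?</think>\s*<tool>(.*?)</tool>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<tool> layout, else 0.0."""
    return 1.0 if TOOL_RE.search(completion) else 0.0

def correctness_reward(completion: str, expected: dict) -> float:
    """Partial credit: valid JSON, matching tool name, matching arguments."""
    m = TOOL_RE.search(completion)
    if not m:
        return 0.0
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.0
    score = 1.0 / 3.0                          # JSON parsed successfully
    if call.get("name") == expected["name"]:
        score += 1.0 / 3.0                     # exact tool-name match
    if call.get("parameters") == expected["parameters"]:
        score += 1.0 / 3.0                     # arguments match the ground truth
    return score

def total_reward(completion: str, expected: dict) -> float:
    # Weights 0.3 / 0.7 as described in the reward system above
    return 0.3 * format_reward(completion) + 0.7 * correctness_reward(completion, expected)

expected = {"name": "get_weather", "parameters": {"city": "Paris"}}
good = ('<think>Need the weather tool.</think>'
        '<tool>{"name": "get_weather", "parameters": {"city": "Paris"}}</tool>')
print(round(total_reward(good, expected), 6))  # 1.0
```

A fully correct, well-formatted completion scores 1.0; a completion with the right format and arguments but the wrong tool name would lose only the name-match share of the correctness term.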