# Qwen-2.5-1.5B-GRPO-Tool-Calling
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct designed to excel at tool-calling and functional reasoning. It was trained using GRPO (Group Relative Policy Optimization).
The model emits a `<think>` block for internal reasoning before producing the final `<tool>` call, enabling it to plan complex function parameters more reliably.
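Completions in this format can be split into their reasoning and tool-call parts with a small parser. A minimal sketch using only the standard library; the `parse_tool_call` helper and the sample completion are illustrative, not part of the model card:

```python
import json
import re


def parse_tool_call(completion: str):
    """Split a completion into its <think> reasoning and the parsed <tool> JSON."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    tool = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
    if tool is None:
        return None, None  # malformed completion: no tool call emitted
    reasoning = think.group(1).strip() if think else ""
    call = json.loads(tool.group(1))  # expected shape: {"name": ..., "parameters": {...}}
    return reasoning, call


completion = (
    "<think>The user wants the weather, so I need get_weather with the city.</think> "
    '<tool> {"name": "get_weather", "parameters": {"city": "Paris"}} </tool>'
)
reasoning, call = parse_tool_call(completion)
print(call["name"])        # get_weather
print(call["parameters"])  # {'city': 'Paris'}
```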
## Model Details
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization)
- Adaptation: QLoRA (4-bit quantization)
- Primary Task: Verifiable Tool-Calling and Argument Generation
## Training Hyperparameters
The model was trained on a single NVIDIA T4 GPU with the following configuration:
| Parameter | Value |
|---|---|
| Steps | 25 |
| Effective Batch Size | 16 |
| Rollouts per Query | 4 |
| Max Prompt Length | 512 |
| Max Completion Length | 1024 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 32 |
| Target Modules | All Linear (q, k, v, o, gate, up, down) |
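The table above maps onto a TRL + PEFT training setup roughly as follows. This is a minimal sketch, assuming the `trl` and `peft` libraries; the field names follow their current APIs, the batch-size split (4 per device × 4 accumulation steps = 16 effective) is an assumption, and the dataset/tokenizer wiring is omitted:

```python
from peft import LoraConfig
from trl import GRPOConfig

# QLoRA adapter over all linear projections (rank and alpha from the table above)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# GRPO settings matching the reported run
training_args = GRPOConfig(
    max_steps=25,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # assumed split: 4 x 4 = effective batch size 16
    num_generations=4,              # rollouts per query
    max_prompt_length=512,
    max_completion_length=1024,
)
```

These two configs would then be passed to `GRPOTrainer` along with the base model, dataset, and reward functions.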
## Reward Functions
The model's behavior is shaped by a multi-objective reward system:
- Format Reward [Weight: 0.3]: Incentivizes the model to follow the specific XML-style structure: `<think> ... </think> <tool> {"name": "...", "parameters": {...}} </tool>`
- Correctness Reward [Weight: 0.7]: Validates the output against the ground truth. It awards points based on:
  - Exact match of the tool name.
  - Syntactic validity of the JSON parameters.
  - Semantic match of the arguments provided.
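The two rewards above could be scored along these lines. A hedged sketch: the function names, the partial-credit split inside the correctness reward, and the sample completion are all illustrative assumptions, since the card does not publish the actual reward code:

```python
import json
import re

# A completion is well-formed if it is exactly <think>...</think> followed by <tool>...</tool>
TOOL_PATTERN = re.compile(r"^<think>.*?</think>\s*<tool>(.*?)</tool>\s*$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Weight 0.3: 1.0 if the completion follows the <think>/<tool> structure, else 0.0."""
    return 1.0 if TOOL_PATTERN.match(completion.strip()) else 0.0


def correctness_reward(completion: str, expected: dict) -> float:
    """Weight 0.7: partial credit for name match, valid JSON, and matching arguments."""
    m = TOOL_PATTERN.match(completion.strip())
    if not m:
        return 0.0
    try:
        call = json.loads(m.group(1))  # syntactic validity of the JSON parameters
    except json.JSONDecodeError:
        return 0.0
    score = 0.2  # JSON parsed successfully (illustrative split)
    if call.get("name") == expected["name"]:  # exact tool-name match
        score += 0.4
    if call.get("parameters") == expected["parameters"]:  # argument match
        score += 0.4
    return score


completion = (
    "<think>Need the forecast.</think>"
    '<tool>{"name": "get_weather", "parameters": {"city": "Paris"}}</tool>'
)
expected = {"name": "get_weather", "parameters": {"city": "Paris"}}
total = 0.3 * format_reward(completion) + 0.7 * correctness_reward(completion, expected)
print(total)  # 1.0 when both format and content match exactly
```

Note the comparison of `parameters` here is an exact dict equality; a semantic match (e.g. tolerating equivalent argument phrasings) would need a looser comparison.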