GRPO-TCR-Qwen3-4B-test

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:

  1. SFT (Supervised Fine-Tuning) — multi-turn agentic cold-start
  2. GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward) — reinforcement learning

Training uses the Open-AgentRL framework based on the DemyAgent methodology.

Note: This is a test run with minimal data (8 train / 4 test samples, 6 steps) to validate the training pipeline configuration.

Training Objective

This model is trained to perform deliberative agentic reasoning — selectively calling a code_interpreter tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.

The GRPO-TCR stage reinforces:

  • Correct final answers (outcome reward)
  • Tool usage attempts even on incorrect answers (Tool Call Reward)
  • Concise responses (overlong penalty)
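
Illustratively, these three components can be combined as in the following minimal sketch. The +0.5 tool bonus and the exact shaping are assumptions for illustration; only max_len, buffer, and factor correspond to values in the configuration table.

```python
def tcr_reward(is_correct, used_tool, resp_len,
               max_len=10480, buffer=3000, factor=1.0):
    """Sketch of a TCR-style shaped reward (illustrative constants)."""
    # Outcome reward: +1 for a correct final answer, -1 otherwise
    reward = 1.0 if is_correct else -1.0
    # Tool Call Reward: soften the penalty for wrong answers that still
    # attempted a tool call, to prevent exploration collapse
    if not is_correct and used_tool:
        reward += 0.5  # hypothetical bonus, not the method's exact value
    # Soft overlong penalty: grows linearly inside the final `buffer` tokens
    overflow = resp_len - (max_len - buffer)
    if overflow > 0:
        reward -= factor * min(overflow / buffer, 1.0)
    return reward
```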

Training Pipeline

Qwen3-4B-Instruct-2507
  │
  ├── Stage 1: SFT (multi-turn agentic cold-start)
  │     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
  │     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
  │
  └── Stage 2: GRPO-TCR (this model)
        Dataset: y-ohtani/open_agentrl_grpo_2k (8 samples = test run, Apache-2.0)
        Epochs: 3, Total steps: 6, Algorithm: GRPO + 5 enhancements (see below)

GRPO-TCR Configuration

Parameter                  Value
Base (SFT model)           qwen3-4b-ra-sft-merged-epoch3
Algorithm                  GRPO (Group Relative Policy Optimization)
Max prompt length          2,560
Max response length        10,480
Max turns                  16
Learning rate              1e-6
Train batch size           4
Responses per prompt (n)   8
PPO mini-batch size        1
Epochs                     3
Total steps                6
Train samples              8 (test run)
Test samples               4
Loss aggregation           token-mean
Clip ratio                 low=0.2, high=0.28 (asymmetric)
KL divergence              Disabled (kl_coef=0.0)
Overlong penalty           buffer=3,000, factor=1.0
Reward manager             DAPO
Rollout engine             vLLM (sync mode, TP=4)
Sequence parallel          4 (Ulysses)
Param/optimizer offload    True (CPU)
GPU memory utilization     0.3
Tool format                Hermes
Hardware                   4x RTX 4090 (24 GB)
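
With n=8 responses sampled per prompt, GRPO replaces a learned critic with group-relative advantages: each rollout's reward is normalized against the statistics of its own group. A minimal sketch (verl's exact estimator and numerical guards may differ):

```python
import statistics

def grpo_advantages(rewards):
    """Normalize each reward in a group of n rollouts for the same prompt
    by the group mean and standard deviation (sketch)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Example group of n=8 shaped rewards for one prompt:
advs = grpo_advantages([1.0, -1.0, -1.0, 1.0, -0.5, 1.0, -1.0, -1.0])
```

By construction the advantages of a group sum to zero, so rollouts are scored relative to their siblings rather than on an absolute scale.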

5 Key Enhancements over Standard GRPO

Enhancement               Purpose
Multi-turn tool calling   Enable agentic reasoning (up to 16 turns)
TCR (Tool Call Reward)    Reward tool usage even on wrong answers to prevent exploration collapse
Asymmetric clipping       Promote exploration by allowing larger probability increases
Overlong penalty          Suppress verbose responses; encourage efficient tool use
KL removal + token-mean   Allow free exploration without a reference-model constraint
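
The asymmetric clipping row corresponds to a decoupled clip range (DAPO-style "clip-higher"): the upper bound 1+0.28 exceeds the lower bound's 1-0.2, so a token's probability ratio can grow further before being clipped. A minimal per-token sketch of the clipped objective, using the low=0.2 / high=0.28 values from the configuration table:

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with a decoupled (asymmetric) clip range.

    ratio: pi_new(token) / pi_old(token); advantage: group-relative advantage.
    """
    # Clip the ratio into [1 - eps_low, 1 + eps_high]
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    # Pessimistic minimum between unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing only once the ratio passes 1.28 instead of 1.2, giving promising low-probability tokens more room to increase.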

Validation Results (Step 6)

Benchmark     Accuracy   Score   Reward
deepscaler    1.0        1.0     1.0
taco_code     0.0        0.5     0.5
numina_math   0.0        -1.1    -1.1

Dataset (RL Stage)

  • Name: y-ohtani/open_agentrl_grpo_2k
  • License: Apache-2.0 (all sources are Apache-2.0 or MIT)
  • Sampling: 8 samples (test run), balanced across 5 sources

Source          Original Dataset                          License      Domain
deepscaler      agentica-org/DeepScaleR-Preview-Dataset   MIT          Math (reasoning)
omni_math       KbsdJames/Omni-MATH                       Apache-2.0   Math (olympiad)
numina_math     AI-MO/NuminaMath-1.5                      Apache-2.0   Math (general)
taco_code       BAAI/TACO                                 Apache-2.0   Coding (algorithm)
leetcode_code   newfacade/LeetCodeDataset                 Apache-2.0   Coding (LeetCode)

All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/GRPO-TCR-Qwen3-4B-test"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
# Render the conversation with the model's chat template, then generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
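
Because the model was trained with Hermes-format tool calls, agentic inference requires passing a tool schema through the chat template and executing the emitted calls externally. A hedged sketch follows; the code_interpreter schema is illustrative, not the exact definition used in training:

```python
# Illustrative code_interpreter tool schema (an assumption, not the
# training-time definition); Qwen chat templates render `tools` entries
# in Hermes format.
code_interpreter = {
    "type": "function",
    "function": {
        "name": "code_interpreter",
        "description": "Execute Python code and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

messages = [{"role": "user", "content": "Compute the 20th Fibonacci number."}]

# Continuing from the usage snippet above (assumes `tokenizer` is loaded):
# text = tokenizer.apply_chat_template(
#     messages, tools=[code_interpreter], tokenize=False, add_generation_prompt=True
# )
# The model may then emit a Hermes-style <tool_call> block; a compatible
# runtime (e.g. SandboxFusion) must execute it and append the result as a
# "tool" message before the next generation turn.
```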

Sources & Terms

Component            Source                           License
Base model           Qwen/Qwen3-4B-Instruct-2507      Apache-2.0
SFT dataset          y-ohtani/open_agentrl_like_sft   Apache-2.0
RL dataset           y-ohtani/open_agentrl_grpo_2k    Apache-2.0
Training framework   Open-AgentRL (verl)              Apache-2.0
Methodology          DemyAgent (arXiv:2507.15997)     —

Users must comply with the base model license and dataset terms.

Intended Use & Limitations

  • Intended: Pipeline validation only. This is a test run to verify the training configuration.
  • Not intended: Any production or evaluation use.
  • Limitations:
    • Trained on only 8 samples (6 steps) as a pipeline validation run.
    • Tool calling requires a compatible execution runtime (e.g., a SandboxFusion-style code interpreter) to run the emitted tool calls.