GRPO-TCR-Qwen3-4B-quarter

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:

  1. SFT (Supervised Fine-Tuning) — multi-turn agentic cold-start
  2. GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward) — reinforcement learning

Training uses the Open-AgentRL framework, following the DemyAgent methodology.

Note: This model is trained on 1/4 of the full dataset (490 samples out of 1,960) for 3 epochs as an intermediate experiment.

Training Objective

This model is trained to perform deliberative agentic reasoning — selectively calling a code_interpreter tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.

The GRPO-TCR stage reinforces:

  • Correct final answers (outcome reward)
  • Tool usage attempts even on incorrect answers (Tool Call Reward)
  • Concise responses (overlong penalty)
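
Taken together, these three signals can be sketched as a single scalar reward. This is a hypothetical illustration only: the function name, the tool-credit magnitude, and the exact penalty shape are assumptions, not values published for this run (the length cap, buffer, and factor are taken from the configuration table below).

```python
def tcr_reward(correct: bool, used_tool: bool, resp_len: int,
               max_len: int = 10480, buffer: int = 3000,
               factor: float = 1.0) -> float:
    """Illustrative GRPO-TCR-style reward (not the framework's actual API)."""
    # Outcome reward: +1 for a correct final answer, -1 otherwise.
    r = 1.0 if correct else -1.0
    # Tool Call Reward: partial credit for attempting a tool call even on a
    # wrong answer, to prevent exploration collapse (magnitude is assumed).
    if not correct and used_tool:
        r += 0.5
    # Overlong penalty: linear penalty once the response enters the last
    # `buffer` tokens before the hard length cap.
    overflow = resp_len - (max_len - buffer)
    if overflow > 0:
        r -= factor * overflow / buffer
    return r
```

The key property is that a wrong answer with a tool attempt scores above a wrong answer without one, so the policy is never pushed away from tool use while it is still failing.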

Training Pipeline

Qwen3-4B-Instruct-2507
  │
  ├── Stage 1: SFT (multi-turn agentic cold-start)
  │     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
  │     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
  │
  └── Stage 2: GRPO-TCR (this model)
        Dataset: y-ohtani/open_agentrl_grpo_2k (490 samples = 1/4, Apache-2.0)
        Epochs: 3, Total steps: 366, Algorithm: GRPO + 5 enhancements (see below)

GRPO-TCR Configuration

Parameter                  Value
Base (SFT model)           qwen3-4b-ra-sft-merged-epoch3
Algorithm                  GRPO (Group Relative Policy Optimization)
Max prompt length          2,560
Max response length        10,480
Max turns                  16
Learning rate              1e-6
Train batch size           4
Responses per prompt (n)   8
PPO mini batch size        1
Epochs                     3
Total steps                366
Train samples              490 (1/4 of full dataset)
Test samples               10
Loss aggregation           token-mean
Clip ratio                 low=0.2, high=0.28 (asymmetric)
KL divergence              Disabled (kl_coef=0.0)
Overlong penalty           buffer=3,000, factor=1.0
Reward manager             DAPO
Rollout engine             vLLM (sync mode, TP=4)
Sequence parallel          4 (Ulysses)
Param/optimizer offload    True (CPU)
GPU memory utilization     0.3
Tool format                Hermes
Hardware                   4x RTX 4090 (24GB)
Training time              ~13 hours
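
The core of GRPO is that advantages are computed relative to the group of responses sampled from the same prompt (n = 8 in this run) rather than from a learned value function. A minimal sketch of that standardization (illustrative helper, not the framework's code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize each response's reward against its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

With n = 8 rollouts per prompt, a single correct response in an otherwise failing group receives a strongly positive advantage, which is what drives learning without a critic model.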

5 Key Enhancements over Standard GRPO

Enhancement               Purpose
Multi-turn tool calling   Enable agentic reasoning (up to 16 turns)
TCR (Tool Call Reward)    Reward tool usage even on wrong answers to prevent exploration collapse
Asymmetric clipping       Promote exploration by allowing larger probability increases than decreases
Overlong penalty          Suppress verbose responses and encourage efficient tool use
KL removal + token-mean   Allow free exploration without a reference-model constraint
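
The asymmetric clipping row can be made concrete with the run's own settings (low=0.2, high=0.28). A per-token sketch of the clipped surrogate objective, assuming the standard PPO formulation:

```python
def clipped_objective(ratio: float, advantage: float,
                      clip_low: float = 0.2, clip_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with asymmetric bounds.

    Standard PPO clips the probability ratio to [1 - eps, 1 + eps]; here the
    upper bound (1 + 0.28) is wider than the lower bound (1 - 0.2), so the
    policy may increase the probability of positively-rewarded tokens further
    before the clip engages.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)
```

With a symmetric clip of 0.2, a ratio of 1.25 on a positive-advantage token would already be capped at 1.2; the asymmetric bound lets it run to 1.28.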

Final Validation Results (Step 366)

Benchmark      Accuracy   Score    Reward
deepscaler     1.0        1.0      1.0
taco_code      0.0        0.4375   0.4375
numina_math    0.0        -1.1     -1.1
gpqa_science   0.0        -1.1     -1.1
omni_math      0.0        -1.1     -1.1

Dataset (RL Stage)

  • Name: y-ohtani/open_agentrl_grpo_2k
  • License: Apache-2.0 (all sources are Apache-2.0 or MIT)
  • Sampling: 490 samples (1/4 of 1,960 train split), balanced across 5 sources

Source          Original Dataset                          License      Domain
deepscaler      agentica-org/DeepScaleR-Preview-Dataset   MIT          Math (reasoning)
omni_math       KbsdJames/Omni-MATH                       Apache-2.0   Math (olympiad)
numina_math     AI-MO/NuminaMath-1.5                      Apache-2.0   Math (general)
taco_code       BAAI/TACO                                 Apache-2.0   Coding (algorithm)
leetcode_code   newfacade/LeetCodeDataset                 Apache-2.0   Coding (LeetCode)
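
The "balanced across 5 sources" sampling works out to 490 / 5 = 98 examples per source. A rough sketch of such a subsampling step (hypothetical code; the actual split script is not published in this repository):

```python
import random

def balanced_subsample(examples: list[dict], n_total: int = 490,
                       seed: int = 0) -> list[dict]:
    """Draw an equal number of examples from each source, n_total in all."""
    by_source: dict[str, list[dict]] = {}
    for ex in examples:
        by_source.setdefault(ex["source"], []).append(ex)
    per_source = n_total // len(by_source)  # 490 // 5 sources = 98 each
    rng = random.Random(seed)  # fixed seed for a reproducible split
    subset = []
    for items in by_source.values():
        subset.extend(rng.sample(items, min(per_source, len(items))))
    return subset
```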

All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/GRPO-TCR-Qwen3-4B-quarter"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
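
Since the configuration table lists Hermes as the tool format, the model is expected to emit tool calls as a JSON object wrapped in <tool_call>...</tool_call> tags. A minimal parser sketch for extracting those calls from generated text (the tag format is an assumption based on the Hermes convention; verify against your tool-calling runtime):

```python
import json
import re

# Hermes-style tool calls: <tool_call>{"name": ..., "arguments": ...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool-call JSON object found in the model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]
```

In a multi-turn loop, each extracted call would be executed by the runtime (e.g. a code interpreter sandbox) and the result appended as a tool message before generating the next turn.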

Sources & Terms

Component            Source                           License
Base model           Qwen/Qwen3-4B-Instruct-2507      Apache-2.0
SFT dataset          y-ohtani/open_agentrl_like_sft   Apache-2.0
RL dataset           y-ohtani/open_agentrl_grpo_2k    Apache-2.0
Training framework   Open-AgentRL (verl)              Apache-2.0
Methodology          DemyAgent (arXiv:2507.15997)

Users must comply with the base model license and dataset terms.

Intended Use & Limitations

  • Intended: Agentic reasoning tasks with tool use (math, coding). Best results when used with a code interpreter tool in multi-turn settings.
  • Not intended: Production deployment without further evaluation.
  • Limitations:
    • Trained on 490 samples (1/4 of the 2K balanced dataset) as an intermediate experiment.
    • Performance on non-math/non-coding tasks may degrade compared to the base instruct model.
    • Tool calling requires a compatible runtime (e.g., SandboxFusion, code interpreter).
    • Benchmark scores other than deepscaler are low; training on the full dataset is required.