GRPO-TCR-Qwen3-4B-test

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:

  1. SFT (Supervised Fine-Tuning) — multi-turn agentic cold-start
  2. GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward) — reinforcement learning

Training uses the Open-AgentRL framework based on the DemyAgent methodology.

Note: This is a test run with minimal data (8 train / 4 test samples, 6 steps) to validate the training pipeline configuration.

Training Objective

This model is trained to perform deliberative agentic reasoning — selectively calling a code_interpreter tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.

The GRPO-TCR stage reinforces:

  • Correct final answers (outcome reward)
  • Tool usage attempts even on incorrect answers (Tool Call Reward)
  • Concise responses (overlong penalty)
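
Illustratively, these three components can be combined as in the following minimal sketch. The +0.5 tool bonus and the exact shaping are assumptions for illustration; only max_len, buffer, and factor correspond to values in the configuration table.

```python
def tcr_reward(is_correct, used_tool, resp_len,
               max_len=10480, buffer=3000, factor=1.0):
    """Sketch of a TCR-style shaped reward (illustrative constants)."""
    # Outcome reward: +1 for a correct final answer, -1 otherwise
    reward = 1.0 if is_correct else -1.0
    # Tool Call Reward: soften the penalty for wrong answers that still
    # attempted a tool call, to prevent exploration collapse
    if not is_correct and used_tool:
        reward += 0.5  # hypothetical bonus, not the method's exact value
    # Soft overlong penalty: grows linearly inside the final `buffer` tokens
    overflow = resp_len - (max_len - buffer)
    if overflow > 0:
        reward -= factor * min(overflow / buffer, 1.0)
    return reward
```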

Training Pipeline

Qwen3-4B-Instruct-2507
  │
  ├── Stage 1: SFT (multi-turn agentic cold-start)
  │     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
  │     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
  │
  └── Stage 2: GRPO-TCR (this model)
        Dataset: y-ohtani/open_agentrl_grpo_2k (8 samples = test run, Apache-2.0)
        Epochs: 3, Total steps: 6, Algorithm: GRPO + 5 enhancements (see below)

GRPO-TCR Configuration

Parameter                  Value
Base (SFT model)           qwen3-4b-ra-sft-merged-epoch3
Algorithm                  GRPO (Group Relative Policy Optimization)
Max prompt length          2,560
Max response length        10,480
Max turns                  16
Learning rate              1e-6
Train batch size           4
Responses per prompt (n)   8
PPO mini-batch size        1
Epochs                     3
Total steps                6
Train samples              8 (test run)
Test samples               4
Loss aggregation           token-mean
Clip ratio                 low=0.2, high=0.28 (asymmetric)
KL divergence              Disabled (kl_coef=0.0)
Overlong penalty           buffer=3,000, factor=1.0
Reward manager             DAPO
Rollout engine             vLLM (sync mode, TP=4)
Sequence parallel          4 (Ulysses)
Param/optimizer offload    True (CPU)
GPU memory utilization     0.3
Tool format                Hermes
Hardware                   4x RTX 4090 (24 GB)
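
With n=8 responses sampled per prompt, GRPO replaces a learned critic with group-relative advantages: each rollout's reward is normalized against the statistics of its own group. A minimal sketch (verl's exact estimator and numerical guards may differ):

```python
import statistics

def grpo_advantages(rewards):
    """Normalize each reward in a group of n rollouts for the same prompt
    by the group mean and standard deviation (sketch)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Example group of n=8 shaped rewards for one prompt:
advs = grpo_advantages([1.0, -1.0, -1.0, 1.0, -0.5, 1.0, -1.0, -1.0])
```

By construction the advantages of a group sum to zero, so rollouts are scored relative to their siblings rather than on an absolute scale.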

5 Key Enhancements over Standard GRPO

Enhancement               Purpose
Multi-turn tool calling   Enable agentic reasoning (up to 16 turns)
TCR (Tool Call Reward)    Reward tool usage even on wrong answers to prevent exploration collapse
Asymmetric clipping       Promote exploration by allowing larger probability increases
Overlong penalty          Suppress verbose responses; encourage efficient tool use
KL removal + token-mean   Allow free exploration without a reference-model constraint
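
The asymmetric clipping row corresponds to a decoupled clip range (DAPO-style "clip-higher"): the upper bound 1+0.28 exceeds the lower bound's 1-0.2, so a token's probability ratio can grow further before being clipped. A minimal per-token sketch of the clipped objective, using the low=0.2 / high=0.28 values from the configuration table:

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with a decoupled (asymmetric) clip range.

    ratio: pi_new(token) / pi_old(token); advantage: group-relative advantage.
    """
    # Clip the ratio into [1 - eps_low, 1 + eps_high]
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    # Pessimistic minimum between unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing only once the ratio passes 1.28 instead of 1.2, giving promising low-probability tokens more room to increase.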

Validation Results (Step 6)

Benchmark     Accuracy   Score   Reward
deepscaler    1.0        1.0     1.0
taco_code     0.0        0.5     0.5
numina_math   0.0        -1.1    -1.1

Dataset (RL Stage)

  • Name: y-ohtani/open_agentrl_grpo_2k
  • License: Apache-2.0 (all sources are Apache-2.0 or MIT)
  • Sampling: 8 samples (test run), balanced across 5 sources

Source          Original Dataset                          License      Domain
deepscaler      agentica-org/DeepScaleR-Preview-Dataset   MIT          Math (reasoning)
omni_math       KbsdJames/Omni-MATH                       Apache-2.0   Math (olympiad)
numina_math     AI-MO/NuminaMath-1.5                      Apache-2.0   Math (general)
taco_code       BAAI/TACO                                 Apache-2.0   Coding (algorithm)
leetcode_code   newfacade/LeetCodeDataset                 Apache-2.0   Coding (LeetCode)

All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/GRPO-TCR-Qwen3-4B-test"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
# Render the conversation with the model's chat template, then generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
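
Because the model was trained with Hermes-format tool calls, agentic inference requires passing a tool schema through the chat template and executing the emitted calls externally. A hedged sketch follows; the code_interpreter schema is illustrative, not the exact definition used in training:

```python
# Illustrative code_interpreter tool schema (an assumption, not the
# training-time definition); Qwen chat templates render `tools` entries
# in Hermes format.
code_interpreter = {
    "type": "function",
    "function": {
        "name": "code_interpreter",
        "description": "Execute Python code and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

messages = [{"role": "user", "content": "Compute the 20th Fibonacci number."}]

# Continuing from the usage snippet above (assumes `tokenizer` is loaded):
# text = tokenizer.apply_chat_template(
#     messages, tools=[code_interpreter], tokenize=False, add_generation_prompt=True
# )
# The model may then emit a Hermes-style <tool_call> block; a compatible
# runtime (e.g. SandboxFusion) must execute it and append the result as a
# "tool" message before the next generation turn.
```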

Sources & Terms

Component            Source                           License
Base model           Qwen/Qwen3-4B-Instruct-2507      Apache-2.0
SFT dataset          y-ohtani/open_agentrl_like_sft   Apache-2.0
RL dataset           y-ohtani/open_agentrl_grpo_2k    Apache-2.0
Training framework   Open-AgentRL (verl)              Apache-2.0
Methodology          DemyAgent (arXiv:2507.15997)     —

Users must comply with the base model license and dataset terms.

Intended Use & Limitations

  • Intended: Pipeline validation only. This is a test run to verify the training configuration.
  • Not intended: Any production or evaluation use.
  • Limitations:
    • Trained on only 8 samples (6 steps) as a pipeline validation run.
    • Tool calling requires a compatible execution runtime (e.g., a SandboxFusion-style code interpreter) to run the emitted tool calls.