# GRPO-TCR-Qwen3-4B-quarter

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:
- SFT (Supervised Fine-Tuning) — multi-turn agentic cold-start
- GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward) — reinforcement learning
Training uses the Open-AgentRL framework based on the DemyAgent methodology.
**Note:** This model was trained on 1/4 of the full dataset (490 of 1,960 samples) for 3 epochs as an intermediate experiment.
## Training Objective

This model is trained to perform deliberative agentic reasoning: selectively calling a `code_interpreter` tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.
The GRPO-TCR stage reinforces:
- Correct final answers (outcome reward)
- Tool usage attempts even on incorrect answers (Tool Call Reward)
- Concise responses (overlong penalty)
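The three reward components above can be sketched as a single scalar reward. This is a minimal illustration, not the training code: the actual logic lives in the Open-AgentRL/verl DAPO reward manager, and the `tool_call_bonus` value here is a hypothetical placeholder.

```python
def tcr_reward(is_correct, made_tool_call, response_len,
               max_len=10_480, overlong_buffer=3_000, penalty_factor=1.0,
               tool_call_bonus=0.1):
    """Sketch of a GRPO-TCR-style scalar reward.

    Components (mirroring the list above):
      - outcome reward: +1 for a correct final answer, -1 otherwise
      - tool call reward: small bonus for attempting a tool call even
        on a wrong answer (prevents exploration collapse)
      - overlong penalty: DAPO-style soft penalty that ramps linearly
        from 0 to -penalty_factor inside the last `overlong_buffer`
        tokens before `max_len`
    """
    reward = 1.0 if is_correct else -1.0
    if made_tool_call and not is_correct:
        reward += tool_call_bonus  # hypothetical bonus value
    soft_start = max_len - overlong_buffer
    if response_len > soft_start:
        overflow = min(response_len - soft_start, overlong_buffer)
        reward -= penalty_factor * overflow / overlong_buffer
    return reward
```

Under this shaping, a wrong answer that at least attempted a tool call scores slightly above a wrong answer that never called the tool, which is the signal TCR is meant to preserve.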
## Training Pipeline

```
Qwen3-4B-Instruct-2507
│
├── Stage 1: SFT (multi-turn agentic cold-start)
│     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
│     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
│
└── Stage 2: GRPO-TCR (this model)
      Dataset: y-ohtani/open_agentrl_grpo_2k (490 samples = 1/4, Apache-2.0)
      Epochs: 3, Total steps: 366, Algorithm: GRPO + 5 enhancements (see below)
```
## GRPO-TCR Configuration
| Parameter | Value |
|---|---|
| Base (SFT model) | qwen3-4b-ra-sft-merged-epoch3 |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Max prompt length | 2,560 |
| Max response length | 10,480 |
| Max turns | 16 |
| Learning rate | 1e-6 |
| Train batch size | 4 |
| Responses per prompt (n) | 8 |
| PPO mini batch size | 1 |
| Epochs | 3 |
| Total steps | 366 |
| Train samples | 490 (1/4 of full dataset) |
| Test samples | 10 |
| Loss aggregation | token-mean |
| Clip ratio | low=0.2, high=0.28 (asymmetric) |
| KL divergence | Disabled (kl_coef=0.0) |
| Overlong penalty | buffer=3,000, factor=1.0 |
| Reward manager | DAPO |
| Rollout engine | vLLM (sync mode, TP=4) |
| Sequence parallel | 4 (Ulysses) |
| Param/optimizer offload | True (CPU) |
| GPU memory utilization | 0.3 |
| Tool format | Hermes |
| Hardware | 4x RTX 4090 (24GB) |
| Training time | ~13 hours |
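The table above maps roughly onto verl-style launch options as follows. This is an illustrative sketch only: exact key names and nesting vary across verl / Open-AgentRL versions, so treat it as orientation rather than a copy-paste config.

```yaml
# Illustrative verl-style configuration (key names may differ by version)
data:
  max_prompt_length: 2560
  max_response_length: 10480
  train_batch_size: 4
algorithm:
  adv_estimator: grpo
  kl_ctrl:
    kl_coef: 0.0          # KL divergence disabled
actor_rollout_ref:
  actor:
    optim:
      lr: 1.0e-6
    ppo_mini_batch_size: 1
    clip_ratio_low: 0.2   # asymmetric clipping
    clip_ratio_high: 0.28
    loss_agg_mode: token-mean
    use_kl_loss: false
  rollout:
    name: vllm            # sync mode
    n: 8                  # responses per prompt
    tensor_model_parallel_size: 4
    gpu_memory_utilization: 0.3
reward_model:
  reward_manager: dapo
  overlong_buffer:
    enable: true
    len: 3000
    penalty_factor: 1.0
```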
## 5 Key Enhancements over Standard GRPO
| Enhancement | Purpose |
|---|---|
| Multi-turn tool calling | Enable agentic reasoning (up to 16 turns) |
| TCR (Tool Call Reward) | Reward tool usage even on wrong answers to prevent exploration collapse |
| Asymmetric clipping | Promote exploration by allowing larger probability increases |
| Overlong penalty | Suppress verbose responses, encourage efficient tool use |
| KL removal + token-mean | Allow free exploration without reference model constraint |
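Three of these enhancements (group-relative advantages, asymmetric clipping, token-mean aggregation) can be sketched in a few lines. This is a simplified pedagogical sketch, not the verl implementation; the `1e-6` epsilon is an assumed stabilizer.

```python
import math

def group_advantages(rewards):
    """GRPO: normalize rewards within a group of n rollouts per prompt,
    so advantages are relative to the group's own mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

def grpo_token_loss(logp_new, logp_old, advantage,
                    clip_low=0.2, clip_high=0.28):
    """Per-token clipped surrogate loss with asymmetric clipping:
    the ratio may rise to 1 + clip_high (more room for exploration)
    but is clipped at 1 - clip_low on the downside."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_low, min(ratio, 1.0 + clip_high))
    # negate because we minimize the loss but maximize the objective
    return -min(ratio * advantage, clipped * advantage)

def token_mean(losses_per_seq):
    """'token-mean' aggregation: average over every token in the batch,
    so long responses are not down-weighted sequence by sequence."""
    all_tokens = [t for seq in losses_per_seq for t in seq]
    return sum(all_tokens) / len(all_tokens)
```

With no KL term and no reference model, the clipping range is the only constraint on how far each update can move the policy, which is why the asymmetric upper bound matters for exploration.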
## Final Validation Results (Step 366)
| Benchmark | Accuracy | Score | Reward |
|---|---|---|---|
| deepscaler | 1.0 | 1.0 | 1.0 |
| taco_code | 0.0 | 0.4375 | 0.4375 |
| numina_math | 0.0 | -1.1 | -1.1 |
| gpqa_science | 0.0 | -1.1 | -1.1 |
| omni_math | 0.0 | -1.1 | -1.1 |
## Dataset (RL Stage)
- Name: y-ohtani/open_agentrl_grpo_2k
- License: Apache-2.0 (all sources are Apache-2.0 or MIT)
- Sampling: 490 samples (1/4 of 1,960 train split), balanced across 5 sources
| Source | Original Dataset | License | Domain |
|---|---|---|---|
| deepscaler | agentica-org/DeepScaleR-Preview-Dataset | MIT | Math (reasoning) |
| omni_math | KbsdJames/Omni-MATH | Apache-2.0 | Math (olympiad) |
| numina_math | AI-MO/NuminaMath-1.5 | Apache-2.0 | Math (general) |
| taco_code | BAAI/TACO | Apache-2.0 | Coding (algorithm) |
| leetcode_code | newfacade/LeetCodeDataset | Apache-2.0 | Coding (LeetCode) |
All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/GRPO-TCR-Qwen3-4B-quarter"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
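With the Hermes tool format, the model emits tool calls as `<tool_call>…</tool_call>` blocks containing JSON (`{"name": ..., "arguments": ...}`). A minimal parser for such output might look like the sketch below; an agent runtime (e.g. Open-AgentRL with a sandboxed code interpreter) normally handles this for you.

```python
import json
import re

# Matches Hermes-format tool-call blocks in a model response
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract Hermes-format tool calls from a model response.
    Blocks containing malformed JSON are silently skipped."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue
    return calls
```

In a multi-turn loop, each parsed call would be executed (e.g. by a `code_interpreter` sandbox) and its output appended as a tool-role message before generating the next turn.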
## Sources & Terms
| Component | Source | License |
|---|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 | Apache-2.0 |
| SFT dataset | y-ohtani/open_agentrl_like_sft | Apache-2.0 |
| RL dataset | y-ohtani/open_agentrl_grpo_2k | Apache-2.0 |
| Training framework | Open-AgentRL (verl) | Apache-2.0 |
| Methodology | DemyAgent (arXiv:2507.15997) | — |
Users must comply with the base model license and dataset terms.
## Intended Use & Limitations
- Intended: Agentic reasoning tasks with tool use (math, coding). Best results when used with a code interpreter tool in multi-turn settings.
- Not intended: Production deployment without further evaluation.
- Limitations:
- Trained on 490 samples (1/4 of the 2K balanced dataset) as an intermediate experiment.
- Performance on non-math/non-coding tasks may degrade compared to the base instruct model.
- Tool calling requires a compatible runtime (e.g., SandboxFusion, code interpreter).
- Benchmarks other than deepscaler score poorly; training on the full dataset is needed.