Qwen3-30B-A3B-ToolAgent-GRPO
A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Prime Intellect to improve tool-calling behavior in AI coding agents.
This adapter teaches the model to:
- Use tools when needed, and only when needed
- Choose the right tool: bash, python, read, write, find, grep
- Be efficient: minimize tool calls, combine commands
- Handle errors gracefully: recover from failures, adapt approach
- Know when NOT to use tools: answer factual questions directly
Key Results
| Metric | Base Model | + Adapter | Change |
|---|---|---|---|
| Overall Reward | 0.942 | 0.965 | +0.023 |
| Task Completion | 96.7% | 100% | +3.3% |
| Failures | 1/30 | 0/30 | Fixed |
Category Breakdown (held-out eval)
| Category | Base | Adapter | Δ | Notes |
|---|---|---|---|---|
| file_ops | 0.785 | 0.929 | +0.144 | Biggest improvement: JSON creation, file search |
| multi_step | 0.935 | 0.967 | +0.032 | Better error recovery and data pipelines |
| code_execution | 0.955 | 0.955 | – | Already strong |
| terminal | 0.955 | 0.955 | – | Already strong |
| zero_tool | 1.000 | 1.000 | – | Knowledge preserved perfectly |
| self_improvement | 1.000 | 1.000 | – | Reasoning preserved perfectly |
| planning | 0.914 | 0.907 | -0.007 | Negligible |
No regressions. The adapter improved tool-use capabilities without degrading knowledge or reasoning.
Model Details
| Field | Value |
|---|---|
| Base Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Architecture | MoE (Mixture of Experts), 30B total params, 3B active |
| Adapter Type | LoRA (PEFT) |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | experts, k_proj, o_proj, q_proj, v_proj |
| Adapter Size | 3.1 GB |
| Training Method | GRPO (Group Relative Policy Optimization) |
| Training Platform | Prime Intellect hosted RL |
| Training Cost | Free (PI beta) + ~$8 judge calls via OpenRouter |
Training Details
Method
Trained using GRPO β a reinforcement learning algorithm that optimizes policy via group-relative advantages. No SFT pre-training was used; the adapter was trained directly from the base model using RL only.
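To illustrate the group-relative idea, here is a minimal sketch (not Prime Intellect's actual trainer code): each prompt gets a group of rollouts, and each rollout's reward is normalized against its own group's mean and standard deviation, so the policy update pushes toward rollouts that beat their siblings.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# This is illustrative only; the hosted trainer's implementation is not published here.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for G rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # Rollouts above the group mean get positive advantage, below get negative.
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_relative_advantages([0.9, 0.7, 1.0, 0.6])
```

Because advantages are centered within each group, no separate value network or baseline model is needed.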
ToolUseRubric β 4-Dimension Scoring
The reward signal comes from a custom ToolUseRubric with four weighted dimensions:
| Dimension | Weight | Type | What it measures |
|---|---|---|---|
| Task Completion | 0.50 | LLM Judge | Did the agent solve the task correctly? Scored by gpt-4.1-nano via OpenRouter |
| Tool Outcomes | 0.20 | Heuristic | Did tool calls execute successfully (no errors)? |
| Efficiency | 0.15 | Heuristic | Were tool calls within budget (not excessive)? |
| Dummy Call Detection | 0.15 | Heuristic | No redundant calls, results referenced in response? |
The LLM judge is critical: without it, the model quickly learns to game the heuristic metrics (reward hits 1.0 by step ~35 by simply never using tools). With the judge, the model must actually solve tasks correctly.
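In pseudocode, the weighted combination looks roughly like this (a sketch; the dimension names are illustrative, not the rubric's actual field names):

```python
# Illustrative sketch of the 4-dimension weighted reward.
# Dimension keys are hypothetical; weights are the ones listed in the table above.
WEIGHTS = {
    "task_completion": 0.50,  # LLM judge
    "tool_outcomes":   0.20,  # heuristic
    "efficiency":      0.15,  # heuristic
    "dummy_calls":     0.15,  # heuristic
}

def combined_reward(scores: dict) -> float:
    """Each score is in [0, 1]; returns the weighted sum, also in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example: perfect task completion and tool outcomes, slightly over budget.
r = combined_reward({
    "task_completion": 1.0,
    "tool_outcomes": 1.0,
    "efficiency": 0.8,
    "dummy_calls": 1.0,
})
# 0.50 + 0.20 + 0.12 + 0.15 = 0.97
```

Note that the judge alone controls half the reward, which is what makes heuristic gaming unprofitable.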
Training Configuration
```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 4096

[[env]]
id = "anarion/pi_agent_env"
args = { max_turns = 10 }

[checkpoints]
interval = 25
```
Training Curve
| Step | Reward | Task Completion | Tool Calls/sample |
|---|---|---|---|
| 1 | 0.904 | 0.862 | 0.46 |
| 25 | 0.882 | 0.855 | 0.65 |
| 50 | 0.893 | 0.873 | 0.78 |
| 75 | 0.912 | 0.901 | 0.82 |
| 100 | 0.905 | 0.889 | 0.85 |
| 125 | 0.901 | 0.882 | 0.80 |
| 147 (best) | 0.959 | 0.932 | 0.88 |
| 150 | 0.907 | 0.890 | 0.85 |
| 175 | 0.910 | 0.897 | 0.77 |
| 199 | 0.917 | 0.890 | 0.54 |
Overall trend: first 20 steps avg = 0.882 → last 20 steps avg = 0.908 (Δ +0.026)
Key observations:
- Steady improvement: no reward collapse or instability
- Tool usage increased: the model learned to use tools when appropriate (0.46 → 0.77 avg calls/sample)
- No reward hacking: the judge prevented gaming of heuristic metrics
- Best reward: 0.959 at step 147
Training Dataset
598 tasks across 7 categories, generated synthetically and converted to Pi Agent format:
| Category | Count | Description |
|---|---|---|
| zero_tool | 247 | Factual, arithmetic, and reasoning questions; must NOT use tools |
| code_execution | 100 | Python computation tasks |
| terminal | 99 | Shell command execution |
| file_ops | 54 | File create/read/write/search |
| self_improvement | 34 | Meta-reasoning about AI capabilities |
| planning | 32 | Plan-then-execute workflows |
| multi_step | 32 | Efficient multi-tool workflows |
Tools Available During Training
| Tool | Description |
|---|---|
| `bash` | Execute shell commands |
| `python` | Execute Python code |
| `read` | Read file contents |
| `write` | Write/create files |
| `find` | Find files by glob pattern |
| `grep` | Search file contents by regex |
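The card does not publish the environment's actual tool schemas, but as a hypothetical sketch, a tool like `bash` could be exposed to the model in the common OpenAI-style function-calling format:

```python
# Hypothetical schema sketch for the `bash` tool; parameter names ("command")
# are assumptions, not the Pi Agent environment's published schema.
bash_tool = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute shell commands",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "Shell command to run",
                },
            },
            "required": ["command"],
        },
    },
}
```

Schemas like this are what the model sees at inference time, so keeping tool names and descriptions consistent with training matters for transfer.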
Usage
With PEFT + Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base_model, "Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

messages = [
    {"role": "system", "content": "You are a helpful AI coding assistant with access to tools."},
    {"role": "user", "content": "Create a file /tmp/hello.txt with 'Hello World', then read it back."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
With Prime Intellect Inference (if deployed)
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-pi-api-key",
    base_url="https://api.pinference.ai/api/v1",
    default_headers={"X-Prime-Team-ID": "your-team-id"},
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507:vaz8dmj5genxl8v94ufat6xt",
    messages=[
        {"role": "user", "content": "Write Python code to compute the 15th Fibonacci number."}
    ],
)
print(response.choices[0].message.content)
```
Lessons Learned
1. Heuristic rubrics get gamed instantly
Without an LLM judge, the model learned within ~35 steps to never use tools and write long plausible-sounding responses. Reward hit 1.0 and stayed there. The model was optimizing the metric, not the task.
2. LLM judge is essential but cheap
Using gpt-4.1-nano via OpenRouter costs only ~$5-10 for a full 200-step run. This prevented reward hacking completely and led to genuine capability improvements.
3. RL alone works (no SFT needed)
This adapter was trained with RL only, with no supervised fine-tuning phase. The base model (Qwen3-30B-A3B-Instruct) already has decent instruction-following, and GRPO was able to improve tool-use behavior on top of that.
4. MoE models are efficient for RL
The 30B MoE model (3B active parameters) trains much faster than a dense 30B model while still having access to broad knowledge. Good balance of capability vs training speed.
5. Tool usage increased, not decreased
The trained model uses more tools than the base model (0.77 vs 0.46 calls/sample), but more appropriately. It learned when tools are genuinely needed rather than trying to answer everything from memory.
Evaluation Details
Evaluated on 30 held-out tasks never seen during training, using heuristic scoring (no judge). Tasks test the same categories as training but with different specific questions.
Per-Task Results
| Task | Category | Base | Adapter | Δ |
|---|---|---|---|---|
| eval_001 | zero_tool | 1.000 | 1.000 | – |
| eval_002 | zero_tool | 1.000 | 1.000 | – |
| eval_003 | zero_tool | 1.000 | 1.000 | – |
| eval_004 | zero_tool | 1.000 | 1.000 | – |
| eval_005 | zero_tool | 1.000 | 1.000 | – |
| eval_006 | terminal | 0.963 | 0.963 | – |
| eval_007 | terminal | 0.963 | 0.963 | – |
| eval_008 | terminal | 0.963 | 0.963 | – |
| eval_009 | terminal | 0.925 | 0.925 | – |
| eval_010 | code_exec | 0.925 | 0.925 | – |
| eval_011 | code_exec | 0.963 | 0.963 | – |
| eval_012 | code_exec | 0.963 | 0.963 | – |
| eval_013 | code_exec | 0.963 | 0.963 | – |
| eval_014 | file_ops | 0.950 | 0.950 | – |
| eval_015 | file_ops | 0.285 | 0.863 | +0.578 |
| eval_016 | file_ops | 0.950 | 0.950 | – |
| eval_017 | multi_step | 0.963 | 0.963 | – |
| eval_018 | multi_step | 0.963 | 0.963 | – |
| eval_019 | multi_step | 0.852 | 0.981 | +0.129 |
| eval_020 | planning | 0.950 | 0.938 | -0.012 |
| eval_021 | planning | 0.877 | 0.877 | – |
| eval_022 | self_imp | 1.000 | 1.000 | – |
| eval_023 | self_imp | 1.000 | 1.000 | – |
| eval_024 | code_exec | 0.963 | 0.963 | – |
| eval_025 | terminal | 0.963 | 0.963 | – |
| eval_026 | zero_tool | 1.000 | 1.000 | – |
| eval_027 | zero_tool | 1.000 | 1.000 | – |
| eval_028 | file_ops | 0.955 | 0.955 | – |
| eval_029 | multi_step | 0.963 | 0.963 | – |
| eval_030 | zero_tool | 1.000 | 1.000 | – |
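The headline numbers in Key Results are consistent with this table: averaging the 30 per-task scores reproduces the 0.942 (base) and 0.965 (adapter) overall rewards. A quick check, with scores transcribed from the table above:

```python
# Per-task heuristic scores transcribed from the table above (eval_001..eval_030).
base = (
    [1.000] * 5                      # eval_001-005 zero_tool
    + [0.963] * 3 + [0.925]          # eval_006-009 terminal
    + [0.925] + [0.963] * 3          # eval_010-013 code_exec
    + [0.950, 0.285, 0.950]          # eval_014-016 file_ops
    + [0.963, 0.963, 0.852]          # eval_017-019 multi_step
    + [0.950, 0.877]                 # eval_020-021 planning
    + [1.000, 1.000]                 # eval_022-023 self_imp
    + [0.963, 0.963]                 # eval_024-025 code_exec, terminal
    + [1.000, 1.000]                 # eval_026-027 zero_tool
    + [0.955, 0.963, 1.000]          # eval_028-030 file_ops, multi_step, zero_tool
)
adapter = list(base)
adapter[14] = 0.863  # eval_015: +0.578
adapter[18] = 0.981  # eval_019: +0.129
adapter[19] = 0.938  # eval_020: -0.012

base_mean = sum(base) / len(base)          # rounds to 0.942
adapter_mean = sum(adapter) / len(adapter)  # rounds to 0.965
```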
Training Infrastructure
- Platform: Prime Intellect hosted RL training (free during beta)
- Environment: Custom `pi_agent_env` (published as `anarion/pi_agent_env` on PI Hub)
- Judge Model: `openai/gpt-4.1-nano` via OpenRouter
- Agent Framework: Pi Agent by Mario Zechner
- Run ID: `b4m4eammrloy61ifoo34ndn5`
- Adapter ID: `vaz8dmj5genxl8v94ufat6xt`
- Training Duration: ~2 hours
- Checkpoints: 5 saved (steps 75, 100, 125, 150, 175)
Citation
```bibtex
@misc{qwen3-30b-toolagent-grpo-2026,
  title={Qwen3-30B-A3B-ToolAgent-GRPO: RL-Trained Tool-Use Adapter},
  author={Indelwin},
  year={2026},
  url={https://huggingface.co/Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO},
  note={LoRA adapter trained via GRPO on Prime Intellect for improved tool-calling in coding agents}
}
```
License
This adapter inherits the license of the base model (Qwen/Qwen3-30B-A3B-Instruct-2507) β Apache 2.0.