Qwen3-30B-A3B-ToolAgent-GRPO
A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Prime Intellect to improve tool-calling behavior in AI coding agents.
This adapter teaches the model to:
- Use tools when needed, and only when needed
- Choose the right tool: bash, python, read, write, find, grep
- Be efficient: minimize tool calls, combine commands
- Handle errors gracefully: recover from failures, adapt approach
- Know when NOT to use tools: answer factual questions directly
Key Results
| Metric | Base Model | + Adapter | Change |
|---|---|---|---|
| Overall Reward | 0.942 | 0.965 | +0.023 |
| Task Completion | 96.7% | 100% | +3.3% |
| Failures | 1/30 | 0/30 | Fixed |
Category Breakdown (held-out eval)
| Category | Base | Adapter | Δ | Notes |
|---|---|---|---|---|
| file_ops | 0.785 | 0.929 | +0.144 | Biggest improvement: JSON creation, file search |
| multi_step | 0.935 | 0.967 | +0.032 | Better error recovery and data pipelines |
| code_execution | 0.955 | 0.955 | – | Already strong |
| terminal | 0.955 | 0.955 | – | Already strong |
| zero_tool | 1.000 | 1.000 | – | Knowledge preserved perfectly |
| self_improvement | 1.000 | 1.000 | – | Reasoning preserved perfectly |
| planning | 0.914 | 0.907 | -0.007 | Negligible |
No regressions. The adapter improved tool-use capabilities without degrading knowledge or reasoning.
Model Details
| Field | Value |
|---|---|
| Base Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Architecture | MoE (Mixture of Experts), 30B total params, 3B active |
| Adapter Type | LoRA (PEFT) |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | experts, k_proj, o_proj, q_proj, v_proj |
| Adapter Size | 3.1 GB |
| Training Method | GRPO (Group Relative Policy Optimization) |
| Training Platform | Prime Intellect hosted RL |
| Training Cost | Free (PI beta) + ~$8 judge calls via OpenRouter |
Training Details
Method
Trained using GRPO β a reinforcement learning algorithm that optimizes policy via group-relative advantages. No SFT pre-training was used; the adapter was trained directly from the base model using RL only.
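To illustrate the group-relative idea, here is a minimal sketch (not Prime Intellect's actual trainer code): each prompt gets a group of rollouts, and each rollout's reward is normalized against its own group's mean and standard deviation, so the policy update pushes toward rollouts that beat their siblings.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# This is illustrative only; the hosted trainer's implementation is not published here.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for G rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # Rollouts above the group mean get positive advantage, below get negative.
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_relative_advantages([0.9, 0.7, 1.0, 0.6])
```

Because advantages are centered within each group, no separate value network or baseline model is needed.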
ToolUseRubric β 4-Dimension Scoring
The reward signal comes from a custom ToolUseRubric with four weighted dimensions:
| Dimension | Weight | Type | What it measures |
|---|---|---|---|
| Task Completion | 0.50 | LLM Judge | Did the agent solve the task correctly? Scored by gpt-4.1-nano via OpenRouter |
| Tool Outcomes | 0.20 | Heuristic | Did tool calls execute successfully (no errors)? |
| Efficiency | 0.15 | Heuristic | Were tool calls within budget (not excessive)? |
| Dummy Call Detection | 0.15 | Heuristic | No redundant calls, results referenced in response? |
The LLM judge is critical: without it, the model quickly learns to game the heuristic metrics (reward hits 1.0 by step ~35 by simply never using tools). With the judge, the model must actually solve tasks correctly.
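In pseudocode, the weighted combination looks roughly like this (a sketch; the dimension names are illustrative, not the rubric's actual field names):

```python
# Illustrative sketch of the 4-dimension weighted reward.
# Dimension keys are hypothetical; weights are the ones listed in the table above.
WEIGHTS = {
    "task_completion": 0.50,  # LLM judge
    "tool_outcomes":   0.20,  # heuristic
    "efficiency":      0.15,  # heuristic
    "dummy_calls":     0.15,  # heuristic
}

def combined_reward(scores: dict) -> float:
    """Each score is in [0, 1]; returns the weighted sum, also in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example: perfect task completion and tool outcomes, slightly over budget.
r = combined_reward({
    "task_completion": 1.0,
    "tool_outcomes": 1.0,
    "efficiency": 0.8,
    "dummy_calls": 1.0,
})
# 0.50 + 0.20 + 0.12 + 0.15 = 0.97
```

Note that the judge alone controls half the reward, which is what makes heuristic gaming unprofitable.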
Training Configuration
```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 4096

[[env]]
id = "anarion/pi_agent_env"
args = { max_turns = 10 }

[checkpoints]
interval = 25
```
Training Curve
| Step | Reward | Task Completion | Tool Calls/sample |
|---|---|---|---|
| 1 | 0.904 | 0.862 | 0.46 |
| 25 | 0.882 | 0.855 | 0.65 |
| 50 | 0.893 | 0.873 | 0.78 |
| 75 | 0.912 | 0.901 | 0.82 |
| 100 | 0.905 | 0.889 | 0.85 |
| 125 | 0.901 | 0.882 | 0.80 |
| 147 (best) | 0.959 | 0.932 | 0.88 |
| 150 | 0.907 | 0.890 | 0.85 |
| 175 | 0.910 | 0.897 | 0.77 |
| 199 | 0.917 | 0.890 | 0.54 |
Overall trend: first 20 steps avg = 0.882 → last 20 steps avg = 0.908 (Δ +0.026)
Key observations:
- Steady improvement: no reward collapse or instability
- Tool usage increased: the model learned to use tools when appropriate (0.46 → 0.77 avg calls/sample)
- No reward hacking: the judge prevented gaming of heuristic metrics
- Best reward: 0.959 at step 147
Training Dataset
598 tasks across 7 categories, generated synthetically and converted to Pi Agent format:
| Category | Count | Description |
|---|---|---|
| zero_tool | 247 | Factual, arithmetic, and reasoning questions; must NOT use tools |
| code_execution | 100 | Python computation tasks |
| terminal | 99 | Shell command execution |
| file_ops | 54 | File create/read/write/search |
| self_improvement | 34 | Meta-reasoning about AI capabilities |
| planning | 32 | Plan-then-execute workflows |
| multi_step | 32 | Efficient multi-tool workflows |
Tools Available During Training
| Tool | Description |
|---|---|
| `bash` | Execute shell commands |
| `python` | Execute Python code |
| `read` | Read file contents |
| `write` | Write/create files |
| `find` | Find files by glob pattern |
| `grep` | Search file contents by regex |
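The card does not publish the environment's actual tool schemas, but as a hypothetical sketch, a tool like `bash` could be exposed to the model in the common OpenAI-style function-calling format:

```python
# Hypothetical schema sketch for the `bash` tool; parameter names ("command")
# are assumptions, not the Pi Agent environment's published schema.
bash_tool = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute shell commands",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "Shell command to run",
                },
            },
            "required": ["command"],
        },
    },
}
```

Schemas like this are what the model sees at inference time, so keeping tool names and descriptions consistent with training matters for transfer.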
Usage
With PEFT + Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base_model, "Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

messages = [
    {"role": "system", "content": "You are a helpful AI coding assistant with access to tools."},
    {"role": "user", "content": "Create a file /tmp/hello.txt with 'Hello World', then read it back."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
With Prime Intellect Inference (if deployed)
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-pi-api-key",
    base_url="https://api.pinference.ai/api/v1",
    default_headers={"X-Prime-Team-ID": "your-team-id"},
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507:vaz8dmj5genxl8v94ufat6xt",
    messages=[
        {"role": "user", "content": "Write Python code to compute the 15th Fibonacci number."}
    ],
)
print(response.choices[0].message.content)
```
Lessons Learned
1. Heuristic rubrics get gamed instantly
Without an LLM judge, the model learned within ~35 steps to never use tools and write long plausible-sounding responses. Reward hit 1.0 and stayed there. The model was optimizing the metric, not the task.
2. LLM judge is essential but cheap
Using gpt-4.1-nano via OpenRouter costs only ~$5-10 for a full 200-step run. This prevented reward hacking completely and led to genuine capability improvements.
3. RL alone works (no SFT needed)
This adapter was trained with RL only, with no supervised fine-tuning phase. The base model (Qwen3-30B-A3B-Instruct) already has decent instruction-following, and GRPO was able to improve tool-use behavior on top of that.
4. MoE models are efficient for RL
The 30B MoE model (3B active parameters) trains much faster than a dense 30B model while still having access to broad knowledge. Good balance of capability vs training speed.
5. Tool usage increased, not decreased
The trained model uses more tools than the base model (0.77 vs 0.46 calls/sample), but more appropriately. It learned when tools are genuinely needed rather than trying to answer everything from memory.
Evaluation Details
Evaluated on 30 held-out tasks never seen during training, using heuristic scoring (no judge). Tasks test the same categories as training but with different specific questions.
Per-Task Results
| Task | Category | Base | Adapter | Δ |
|---|---|---|---|---|
| eval_001 | zero_tool | 1.000 | 1.000 | – |
| eval_002 | zero_tool | 1.000 | 1.000 | – |
| eval_003 | zero_tool | 1.000 | 1.000 | – |
| eval_004 | zero_tool | 1.000 | 1.000 | – |
| eval_005 | zero_tool | 1.000 | 1.000 | – |
| eval_006 | terminal | 0.963 | 0.963 | – |
| eval_007 | terminal | 0.963 | 0.963 | – |
| eval_008 | terminal | 0.963 | 0.963 | – |
| eval_009 | terminal | 0.925 | 0.925 | – |
| eval_010 | code_exec | 0.925 | 0.925 | – |
| eval_011 | code_exec | 0.963 | 0.963 | – |
| eval_012 | code_exec | 0.963 | 0.963 | – |
| eval_013 | code_exec | 0.963 | 0.963 | – |
| eval_014 | file_ops | 0.950 | 0.950 | – |
| eval_015 | file_ops | 0.285 | 0.863 | +0.578 |
| eval_016 | file_ops | 0.950 | 0.950 | – |
| eval_017 | multi_step | 0.963 | 0.963 | – |
| eval_018 | multi_step | 0.963 | 0.963 | – |
| eval_019 | multi_step | 0.852 | 0.981 | +0.129 |
| eval_020 | planning | 0.950 | 0.938 | -0.012 |
| eval_021 | planning | 0.877 | 0.877 | – |
| eval_022 | self_imp | 1.000 | 1.000 | – |
| eval_023 | self_imp | 1.000 | 1.000 | – |
| eval_024 | code_exec | 0.963 | 0.963 | – |
| eval_025 | terminal | 0.963 | 0.963 | – |
| eval_026 | zero_tool | 1.000 | 1.000 | – |
| eval_027 | zero_tool | 1.000 | 1.000 | – |
| eval_028 | file_ops | 0.955 | 0.955 | – |
| eval_029 | multi_step | 0.963 | 0.963 | – |
| eval_030 | zero_tool | 1.000 | 1.000 | – |
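The headline numbers in Key Results are consistent with this table: averaging the 30 per-task scores reproduces the 0.942 (base) and 0.965 (adapter) overall rewards. A quick check, with scores transcribed from the table above:

```python
# Per-task heuristic scores transcribed from the table above (eval_001..eval_030).
base = (
    [1.000] * 5                      # eval_001-005 zero_tool
    + [0.963] * 3 + [0.925]          # eval_006-009 terminal
    + [0.925] + [0.963] * 3          # eval_010-013 code_exec
    + [0.950, 0.285, 0.950]          # eval_014-016 file_ops
    + [0.963, 0.963, 0.852]          # eval_017-019 multi_step
    + [0.950, 0.877]                 # eval_020-021 planning
    + [1.000, 1.000]                 # eval_022-023 self_imp
    + [0.963, 0.963]                 # eval_024-025 code_exec, terminal
    + [1.000, 1.000]                 # eval_026-027 zero_tool
    + [0.955, 0.963, 1.000]          # eval_028-030 file_ops, multi_step, zero_tool
)
adapter = list(base)
adapter[14] = 0.863  # eval_015: +0.578
adapter[18] = 0.981  # eval_019: +0.129
adapter[19] = 0.938  # eval_020: -0.012

base_mean = sum(base) / len(base)          # rounds to 0.942
adapter_mean = sum(adapter) / len(adapter)  # rounds to 0.965
```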
Training Infrastructure
- Platform: Prime Intellect hosted RL training (free during beta)
- Environment: Custom `pi_agent_env` (published as `anarion/pi_agent_env` on PI Hub)
- Judge Model: `openai/gpt-4.1-nano` via OpenRouter
- Agent Framework: Pi Agent by Mario Zechner
- Run ID: `b4m4eammrloy61ifoo34ndn5`
- Adapter ID: `vaz8dmj5genxl8v94ufat6xt`
- Training Duration: ~2 hours
- Checkpoints: 5 saved (steps 75, 100, 125, 150, 175)
Citation
```bibtex
@misc{qwen3-30b-toolagent-grpo-2026,
  title={Qwen3-30B-A3B-ToolAgent-GRPO: RL-Trained Tool-Use Adapter},
  author={Indelwin},
  year={2026},
  url={https://huggingface.co/Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO},
  note={LoRA adapter trained via GRPO on Prime Intellect for improved tool-calling in coding agents}
}
```
License
This adapter inherits the license of the base model (Qwen/Qwen3-30B-A3B-Instruct-2507) β Apache 2.0.