Qwen3-30B-A3B-ToolAgent-GRPO

A LoRA adapter trained via GRPO (Group Relative Policy Optimization) on Prime Intellect to improve tool-calling behaviour in AI coding agents.

This adapter teaches the model to:

  • Use tools when needed, and only when needed
  • Choose the right tool: bash, python, read, write, find, grep
  • Be efficient: minimize tool calls, combine commands
  • Handle errors gracefully: recover from failures, adapt approach
  • Know when NOT to use tools: answer factual questions directly

Key Results

Metric             Base Model    + Adapter    Change
Overall Reward     0.942         0.965        +0.023
Task Completion    96.7%         100%         +3.3 pts
Failures           1/30          0/30         Fixed

Category Breakdown (held-out eval)

Category            Base     Adapter    Δ         Notes
file_ops            0.785    0.929      +0.144    Biggest improvement: JSON creation, file search
multi_step          0.935    0.967      +0.032    Better error recovery and data pipelines
code_execution      0.955    0.955      –         Already strong
terminal            0.955    0.955      –         Already strong
zero_tool           1.000    1.000      –         Knowledge preserved perfectly
self_improvement    1.000    1.000      –         Reasoning preserved perfectly
planning            0.914    0.907      -0.007    Negligible

No meaningful regressions: the adapter improved tool-use capabilities without degrading knowledge or reasoning.

Model Details

Base Model           Qwen/Qwen3-30B-A3B-Instruct-2507
Architecture         MoE (Mixture of Experts), 30B total params, 3B active
Adapter Type         LoRA (PEFT)
LoRA Rank            16
LoRA Alpha           32
Target Modules       experts, k_proj, o_proj, q_proj, v_proj
Adapter Size         3.1 GB
Training Method      GRPO (Group Relative Policy Optimization)
Training Platform    Prime Intellect hosted RL
Training Cost        Free (PI beta) + ~$8 judge calls via OpenRouter

Training Details

Method

Trained using GRPO, a reinforcement learning algorithm that optimizes the policy via group-relative advantages. No SFT phase was used; the adapter was trained directly from the base model using RL only.
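The group-relative advantage at the core of GRPO can be sketched as follows. This is an illustrative implementation, not the actual code of Prime Intellect's trainer: each prompt's rollout rewards are normalized against the group's own mean and standard deviation, so no learned value function is needed.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the group mean and
    standard deviation -- the GRPO advantage estimate for one prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# With rollouts_per_example = 8, each prompt yields a group of 8 rewards:
advantages = group_relative_advantages([0.9, 0.7, 1.0, 0.5, 0.8, 0.6, 0.9, 0.4])
# Advantages sum to ~0; the best rollout in the group gets the largest one.
```

Rollouts above the group average get positive advantages and are reinforced; those below get negative ones, all without a critic network.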

ToolUseRubric: 4-Dimension Scoring

The reward signal comes from a custom ToolUseRubric with four weighted dimensions:

Dimension               Weight    Type         What it measures
Task Completion         0.50      LLM judge    Did the agent solve the task correctly? Scored by gpt-4.1-nano via OpenRouter
Tool Outcomes           0.20      Heuristic    Did tool calls execute successfully (no errors)?
Efficiency              0.15      Heuristic    Were tool calls within budget (not excessive)?
Dummy Call Detection    0.15      Heuristic    No redundant calls; results referenced in the response?

The LLM judge is critical: without it, the model quickly learns to game the heuristic metrics (reward hits 1.0 by step ~35 simply by never using tools). With the judge, the model must actually solve tasks correctly.
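A minimal sketch of how the four dimensions might combine into a single reward. The weights come from the table above; the dictionary keys and function name are illustrative, not the actual ToolUseRubric API:

```python
# Weights from the rubric table; each dimension score lies in [0, 1].
WEIGHTS = {
    "task_completion": 0.50,   # LLM judge verdict
    "tool_outcomes":   0.20,   # fraction of tool calls that succeeded
    "efficiency":      0.15,   # within the call budget?
    "dummy_calls":     0.15,   # no redundant/unreferenced calls?
}

def combined_reward(scores: dict) -> float:
    """Weighted sum of the four rubric dimensions."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A rollout that solved the task but wasted one redundant tool call:
r = combined_reward({
    "task_completion": 1.0,
    "tool_outcomes": 1.0,
    "efficiency": 1.0,
    "dummy_calls": 0.5,
})  # 0.50 + 0.20 + 0.15 + 0.075 = 0.925
```

Because the judge carries half the weight, a trajectory that games the three heuristics but fails the task is capped at 0.50, which is what blocks the reward hacking described above.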

Training Configuration

model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 4096

[[env]]
id = "anarion/pi_agent_env"
args = { max_turns = 10 }

[checkpoints]
interval = 25

Training Curve

Step    Reward    Task Completion    Tool Calls/sample
  1     0.904         0.862              0.46
 25     0.882         0.855              0.65
 50     0.893         0.873              0.78
 75     0.912         0.901              0.82
100     0.905         0.889              0.85
125     0.901         0.882              0.80
147     0.959 ★       0.932              0.88        ← best step
150     0.907         0.890              0.85
175     0.910         0.897              0.77
199     0.917         0.890              0.54

Overall trend: first 20 avg = 0.882 → last 20 avg = 0.908 (Δ +0.026)

Key observations:

  • Steady improvement: no reward collapse or instability
  • Tool usage increased: the model learned to use tools when appropriate (0.46 → 0.77 avg)
  • No reward hacking: the judge prevented gaming of the heuristic metrics
  • Best reward 0.959 at step 147
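The first-20 vs last-20 comparison quoted above can be reproduced with a small helper. The reward series here is illustrative, since the full per-step log is not included in this card:

```python
def window_trend(rewards, k=20):
    """Average of the first k and last k entries, plus the delta --
    the 'first 20 avg -> last 20 avg' metric used above."""
    n = min(k, len(rewards))
    first = sum(rewards[:n]) / n
    last = sum(rewards[-n:]) / n
    return first, last, last - first

# Illustrative series (not the real training log), with k = 2:
first, last, delta = window_trend([0.88, 0.89, 0.90, 0.91, 0.92, 0.93], k=2)
# first = 0.885, last = 0.925, delta ≈ +0.04
```

Window averages smooth out the per-step noise visible in the curve (e.g. the dip right after the 0.959 peak at step 147).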

Training Dataset

598 tasks across 7 categories, generated synthetically and converted to Pi Agent format:

Category            Count    Description
zero_tool           247      Factual, arithmetic, reasoning; must NOT use tools
code_execution      100      Python computation tasks
terminal            99       Shell command execution
file_ops            54       File create/read/write/search
self_improvement    34       Meta-reasoning about AI capabilities
planning            32       Plan-then-execute workflows
multi_step          32       Efficient multi-tool workflows

Tools Available During Training

Tool Description
bash Execute shell commands
python Execute Python code
read Read file contents
write Write/create files
find Find files by glob pattern
grep Search file contents by regex
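When serving the adapter behind an OpenAI-compatible API, the six tools can be exposed as function-calling schemas along these lines. The parameter shapes below are assumptions for illustration; the card does not publish the exact tool definitions used in training:

```python
import json

# Hypothetical OpenAI-style schemas for two of the six tools.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Execute shell commands",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
]

names = [t["function"]["name"] for t in TOOLS]
payload = json.dumps(TOOLS)  # what would go in the `tools` field of a request
```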

Usage

With PEFT + Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base_model, "Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

messages = [
    {"role": "system", "content": "You are a helpful AI coding assistant with access to tools."},
    {"role": "user", "content": "Create a file /tmp/hello.txt with 'Hello World', then read it back."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Prime Intellect Inference (if deployed)

from openai import OpenAI

client = OpenAI(
    api_key="your-pi-api-key",
    base_url="https://api.pinference.ai/api/v1",
    default_headers={"X-Prime-Team-ID": "your-team-id"},
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507:vaz8dmj5genxl8v94ufat6xt",
    messages=[
        {"role": "user", "content": "Write Python code to compute the 15th Fibonacci number."}
    ],
)
print(response.choices[0].message.content)

Lessons Learned

1. Heuristic rubrics get gamed instantly

Without an LLM judge, the model learned within ~35 steps to never use tools and write long plausible-sounding responses. Reward hit 1.0 and stayed there. The model was optimizing the metric, not the task.

2. LLM judge is essential but cheap

Using gpt-4.1-nano via OpenRouter costs only ~$5-10 for a full 200-step run. This prevented reward hacking completely and led to genuine capability improvements.

3. RL alone works (no SFT needed)

This adapter was trained with RL only; there was no supervised fine-tuning phase. The base model (Qwen3-30B-A3B-Instruct) already has decent instruction-following, and GRPO was able to improve tool-use behavior on top of that.

4. MoE models are efficient for RL

The 30B MoE model (3B active parameters) trains much faster than a dense 30B model while still having access to broad knowledge. Good balance of capability vs training speed.

5. Tool usage increased, not decreased

The trained model uses more tools than the base model (0.77 vs 0.46 calls/sample), but more appropriately. It learned when tools are genuinely needed rather than trying to answer everything from memory.

Evaluation Details

Evaluated on 30 held-out tasks never seen during training, using heuristic scoring (no judge). Tasks test the same categories as training but with different specific questions.
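As an illustration of budget-based heuristic scoring, here is one plausible shape for the efficiency dimension. The budget and penalty rate are assumptions; the eval harness's exact thresholds are not published:

```python
def efficiency_score(n_calls: int, budget: int) -> float:
    """1.0 when within the tool-call budget, then a linear penalty of
    0.25 per extra call, floored at 0. Penalty rate is an assumption."""
    if n_calls <= budget:
        return 1.0
    return max(0.0, 1.0 - 0.25 * (n_calls - budget))

assert efficiency_score(2, 3) == 1.0   # under budget: full score
assert efficiency_score(5, 3) == 0.5   # two calls over budget
```

A scheme like this explains why "combine commands" pays off: one bash call doing `mkdir -p d && echo hi > d/f.txt` scores better than two separate calls once the budget is tight.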

Per-Task Results

Task        Category     Base     Adapter    Δ
eval_001    zero_tool    1.000    1.000      –
eval_002    zero_tool    1.000    1.000      –
eval_003    zero_tool    1.000    1.000      –
eval_004    zero_tool    1.000    1.000      –
eval_005    zero_tool    1.000    1.000      –
eval_006    terminal     0.963    0.963      –
eval_007    terminal     0.963    0.963      –
eval_008    terminal     0.963    0.963      –
eval_009    terminal     0.925    0.925      –
eval_010    code_exec    0.925    0.925      –
eval_011    code_exec    0.963    0.963      –
eval_012    code_exec    0.963    0.963      –
eval_013    code_exec    0.963    0.963      –
eval_014    file_ops     0.950    0.950      –
eval_015    file_ops     0.285    0.863      +0.578
eval_016    file_ops     0.950    0.950      –
eval_017    multi_step   0.963    0.963      –
eval_018    multi_step   0.963    0.963      –
eval_019    multi_step   0.852    0.981      +0.129
eval_020    planning     0.950    0.938      -0.012
eval_021    planning     0.877    0.877      –
eval_022    self_imp     1.000    1.000      –
eval_023    self_imp     1.000    1.000      –
eval_024    code_exec    0.963    0.963      –
eval_025    terminal     0.963    0.963      –
eval_026    zero_tool    1.000    1.000      –
eval_027    zero_tool    1.000    1.000      –
eval_028    file_ops     0.955    0.955      –
eval_029    multi_step   0.963    0.963      –
eval_030    zero_tool    1.000    1.000      –

Training Infrastructure

  • Platform: Prime Intellect hosted RL training (free during beta)
  • Environment: Custom pi_agent_env (published as anarion/pi_agent_env on PI Hub)
  • Judge Model: openai/gpt-4.1-nano via OpenRouter
  • Agent Framework: Pi Agent by Mario Zechner
  • Run ID: b4m4eammrloy61ifoo34ndn5
  • Adapter ID: vaz8dmj5genxl8v94ufat6xt
  • Training Duration: ~2 hours
  • Checkpoints: 5 saved (steps 75, 100, 125, 150, 175)

Citation

@misc{qwen3-30b-toolagent-grpo-2026,
  title={Qwen3-30B-A3B-ToolAgent-GRPO: RL-Trained Tool-Use Adapter},
  author={Indelwin},
  year={2026},
  url={https://huggingface.co/Indelwin/Qwen3-30B-A3B-ToolAgent-GRPO},
  note={LoRA adapter trained via GRPO on Prime Intellect for improved tool-calling in coding agents}
}

License

This adapter inherits the license of the base model (Qwen/Qwen3-30B-A3B-Instruct-2507): Apache 2.0.
