# Gemma-4-E2B On-Demand Plugin Orchestrator

A 2B-active-parameter plugin-selection and orchestration model, trained by AIREV for the On-Demand plugin platform.

Given a user request and a candidate pool of plugins, this model picks the correct subset, orders them with proper dependencies, and hydrates every API parameter, emitting a valid JSON plan that can be executed directly.
## Eval results (100-sample held-out On-Demand set)
| Model | Mean | JSON valid | Plugin-ID match | Count match | No-hallucinate | Hydrated | Deps chain |
|---|---|---|---|---|---|---|---|
| Gemma-4-E2B SFT-only (baseline) | 0.9180 | 95.0% | 89.0% | 93.0% | 95.0% | 95.0% | 87.0% |
| Gemma-4-E2B SFT + GRPO (this model) | 0.9400 | 97.0% | 91.0% | 96.0% | 97.0% | 97.0% | 89.0% |
The +2.2pp gain in mean score corresponds to a 26.8% relative error reduction over SFT-only.
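The relative-error-reduction figure follows directly from the two mean scores in the table above:

```python
# The +2.2pp absolute gain and 26.8% relative error reduction,
# derived from the two mean scores in the eval table.
sft_mean, grpo_mean = 0.9180, 0.9400
abs_gain = grpo_mean - sft_mean                            # absolute gain in mean score
rel_reduction = (grpo_mean - sft_mean) / (1.0 - sft_mean)  # share of remaining error removed
print(f"+{abs_gain * 100:.1f}pp, {rel_reduction:.1%} relative error reduction")
# +2.2pp, 26.8% relative error reduction
```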
### By category (GRPO wins on multi-step chains)

| Category | SFT | GRPO | Δ (pp) |
|---|---|---|---|
| 1_step | 0.923 | 0.923 | — |
| 2_step | 0.950 | 1.000 | +5.0 |
| 3_step | 0.936 | 0.976 | +4.0 |
| 4_step | 0.781 | 0.791 | +1.0 |
| multi_turn | 1.000 | 1.000 | — |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "airev-ai/gemma-4-e2b-ondemand"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

system = (
    "You are an AI agent orchestrator. Given a user request and available "
    "plugins/tools, generate a precise multi-step execution plan as a valid "
    "JSON object. Each step must use available plugins with correct parameters, "
    "proper types, and valid JSON formatting."
)

candidates = [
    {"pluginId": "plugin-1714851345", "name": "Nutrition BOT",
     "description": "Nutrition type stuff", "identifier": "rest_api", "method": "POST"},
    {"pluginId": "plugin-1768545918", "name": "kinetiqai-exercise-scoring",
     "description": "Analyzes workout form using PoseTracker data",
     "identifier": "rest_api", "method": "POST"},
    # ... more candidates
]

user_msg = (
    "YOUR TASK IS TO GENERATE A JSON STRICTLY and CORRECTLY\n"
    f"{json.dumps(candidates, indent=2)}\n\n"
    "User Request: I want to improve my fitness routine: analyze my workout "
    "form, then get nutrition guidance."
)

prompt = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user_msg}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=1024, temperature=0.1, do_sample=True, top_p=0.9)
response = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Response contains <think>...</think> followed by JSON:
# {"plugins": [{"pluginId": "...", "api_request_parameters": {...},
#               "all_parameters_hydrated": true, "dependencies": [...]}]}
```
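To consume the output programmatically, the `<think>` trace can be stripped before parsing. A minimal sketch, assuming the JSON plan directly follows the closing `</think>` tag as in the commented example above:

```python
import json, re

def parse_plan(response: str) -> dict:
    """Drop the <think>...</think> trace, then parse the remaining JSON plan."""
    body = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return json.loads(body)

demo = '<think>both plugins are needed</think>{"plugins": []}'
print(parse_plan(demo))  # {'plugins': []}
```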
## Output format

The model emits:

- A `<think>...</think>` reasoning trace explaining plugin selection
- A JSON object `{"plugins": [...]}` where each plugin has:
  - `pluginId` (from the candidate list, never hallucinated)
  - `name`, `description`, `identifier`, `method`
  - `api_request_parameters` (fully hydrated, no placeholders)
  - `all_parameters_hydrated: true`
  - `dependencies: []` (list of pluginIds that must run first)
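The contract above can be checked mechanically. A minimal validator sketch; the helper name and exact rule set are illustrative, not taken from the released eval scripts:

```python
def validate_plan(plan: dict, candidate_ids: set) -> list:
    """Return a list of violations of the output contract above."""
    errors, seen = [], set()
    for step in plan.get("plugins", []):
        pid = step.get("pluginId")
        if pid not in candidate_ids:
            errors.append(f"hallucinated pluginId: {pid}")
        if not step.get("all_parameters_hydrated"):
            errors.append(f"{pid}: parameters not fully hydrated")
        for dep in step.get("dependencies", []):
            if dep not in seen:  # dependencies must appear earlier in the plan
                errors.append(f"{pid}: dependency {dep} does not run earlier")
        seen.add(pid)
    return errors

# Using the two candidate plugins from the usage example:
plan = {"plugins": [
    {"pluginId": "plugin-1768545918", "all_parameters_hydrated": True,
     "dependencies": []},
    {"pluginId": "plugin-1714851345", "all_parameters_hydrated": True,
     "dependencies": ["plugin-1768545918"]},
]}
print(validate_plan(plan, {"plugin-1714851345", "plugin-1768545918"}))  # []
```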
## Training pipeline
### 1. Data: 64,992 cleaned samples

- Source: real On-Demand production traces + synthetic plans
- Cleaning pipeline: deduplicated, JSON-validated, thinking tokens enforced, parameters hydrated (no `example.com`, no empty values, no placeholders)
- Judge: Claude Opus 4.6 via Vertex AI, 100 parallel workers
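The cleaning rules can be illustrated with a hypothetical filter (not the actual pipeline; the placeholder list is an assumption):

```python
import json

# Assumed placeholder markers; the real cleaning pipeline may check more.
PLACEHOLDERS = ("example.com", "placeholder", "YOUR_")

def keep_sample(raw: str, seen: set) -> bool:
    """Apply the cleaning rules: dedupe, enforce thinking tokens,
    JSON-validate the plan, reject placeholder or empty parameters."""
    if raw in seen:                                   # deduplicate
        return False
    seen.add(raw)
    if "<think>" not in raw or "</think>" not in raw:  # thinking tokens enforced
        return False
    body = raw.split("</think>", 1)[1].strip()
    try:
        plan = json.loads(body)                        # JSON-validated
    except json.JSONDecodeError:
        return False
    for step in plan.get("plugins", []):
        for v in step.get("api_request_parameters", {}).values():
            if v in ("", None):                        # no empty values
                return False
            if isinstance(v, str) and any(p in v for p in PLACEHOLDERS):
                return False                           # no placeholders
    return True
```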
### 2. SFT: 194,976 steps, 3 epochs

- Base: `google/gemma-4-E2B` (5B total, 2B active, GDN hybrid)
- Optimizer: Adafactor (AdamW causes a CUDA illegal memory access on Gemma 4)
- Single GPU only, no scheduler, no gradient clipping
- LR = 2e-5, batch_size = 1, grad_accum = 1, max_length = 1024
- Final loss: 0.1496 avg
- ~20 hours on 1× H100 80GB
- Eval score: 0.918
### 3. AutoResearch: 24-iteration hyperparameter search

- Claude Opus 4.6 mutations with a ratchet (keep the best config)
- Best finding: `num_plugins=6` candidates per prompt (down from 8)
- Everything else stayed at defaults: `lr=1e-6`, `num_generations=4`, `top_p=0.9`
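The ratchet loop can be sketched as follows. In practice the mutation comes from Claude Opus 4.6; here a stand-in `mutate` callable and a toy scoring function are assumed:

```python
import random

def ratchet_search(base_cfg, mutate, evaluate, iterations=24):
    """Keep-best hyperparameter search: accept a mutated config
    only if it strictly improves the eval score (the 'ratchet')."""
    best_cfg, best_score = base_cfg, evaluate(base_cfg)
    for _ in range(iterations):
        cand = mutate(best_cfg)
        score = evaluate(cand)
        if score > best_score:
            best_cfg, best_score = cand, score
    return best_cfg, best_score

# Toy stand-ins: score peaks at num_plugins=6, mutations nudge by +/-1.
def evaluate(cfg):
    return 1.0 - 0.02 * abs(cfg["num_plugins"] - 6)

def mutate(cfg):
    return {"num_plugins": cfg["num_plugins"] + random.choice([-1, 1])}

random.seed(0)
best, score = ratchet_search({"num_plugins": 8}, mutate, evaluate)
```

Because worse candidates are discarded, the score never drops below the starting config's score.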
### 4. GRPO: 570 steps with a plugin-selection reward

The reward (0.0–1.0) combines:

- 0.10 for valid JSON
- 0.15 if every pick is in the available candidate list
- 0.25 × (correct_picked / total_correct)
- 0.20 bonus for no wrong picks
- −0.10 × wrong picks (capped at 3)
- 0.10 for an exact count match
- 0.15 × hydration ratio
Best checkpoint: step 500, eval score 0.940. The training reward (avg20) peaked at 0.811 at step 473 before the run entered a noise-induced tail.
## Architecture notes

- Gemma-4-E2B: 5B total params, 2B active (MoE), 128K context, GDN-style
- Thinking tokens are always active; the model learned to use `<think>` for plugin reasoning
- Adafactor is mandatory for training; AdamW hits an illegal memory access
- Single GPU only; `device_map="auto"` causes crashes
- No LR scheduler and no gradient clipping; these also destabilize training
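Given the Adafactor requirement, a minimal fixed-LR optimizer setup might look like the following. It is shown on a stand-in module rather than the loaded Gemma model, and `scale_parameter=False` / `relative_step=False` are assumptions needed so the manual LR from the SFT section is used as-is:

```python
import torch
from transformers import Adafactor

# Stand-in module; in practice, pass the loaded Gemma model's parameters.
model = torch.nn.Linear(4, 4)

# Fixed-LR Adafactor: with scale_parameter/relative_step off,
# lr=2e-5 is applied directly (matching the SFT config above).
optimizer = Adafactor(
    model.parameters(),
    lr=2e-5,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
optimizer.step()
```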
## Code
Full open-source training pipeline, AutoResearch harness, GRPO reward function, and eval scripts available at: github.com/mk42-ai/gemma-4-e2b-ondemand
## Acknowledgements
- Google DeepMind for Gemma 4
- Berkeley RAIL for the BFCL benchmark methodology
- HuggingFace TRL team for GRPO reference implementation
- AIREV infrastructure team for 8Γ H100 cluster access
## Citation

```bibtex
@misc{gemma4e2b_ondemand_2026,
  title  = {Gemma-4-E2B On-Demand: Plugin Orchestration via SFT + GRPO},
  author = {Khalid, Muhammed and AIREV},
  year   = {2026},
  url    = {https://huggingface.co/airev-ai/gemma-4-e2b-ondemand},
}
```