Gemma-4-E2B On-Demand Plugin Orchestrator

A 2B-active-parameter plugin-selection and orchestration model, trained by AIREV for the On-Demand plugin platform.

Given a user request and a candidate pool of plugins, this model picks the correct subset, orders them with proper dependencies, and hydrates every API parameter, emitting a valid JSON plan that can be executed directly.


Eval results (100-sample held-out On-Demand set)

| Model | Mean | JSON valid | Plugin-ID match | Count match | No-hallucinate | Hydrated | Deps chain |
|---|---|---|---|---|---|---|---|
| Gemma-4-E2B SFT-only (baseline) | 0.9180 | 95.0% | 89.0% | 93.0% | 95.0% | 95.0% | 87.0% |
| Gemma-4-E2B SFT + GRPO (this model) | 0.9400 | 97.0% | 91.0% | 96.0% | 97.0% | 97.0% | 89.0% |

The +2.2pp gain in mean score corresponds to a 26.8% relative error reduction over SFT-only.
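The relative-error figure follows directly from the two mean scores; a quick check:

```python
# Verify the relative error reduction implied by the two mean eval scores.
sft_err = 1 - 0.9180   # SFT-only mean error
grpo_err = 1 - 0.9400  # SFT + GRPO mean error
reduction = (sft_err - grpo_err) / sft_err
print(f"{reduction:.1%}")  # -> 26.8%
```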

By category (GRPO wins on multi-step chains)

| Category | SFT | GRPO | Δ |
|---|---|---|---|
| 1_step | 0.923 | 0.923 | – |
| 2_step | 0.950 | 1.000 | +5.0 |
| 3_step | 0.936 | 0.976 | +4.0 |
| 4_step | 0.781 | 0.791 | +1.0 |
| multi_turn | 1.000 | 1.000 | – |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "airev-ai/gemma-4-e2b-ondemand"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

system = ("You are an AI agent orchestrator. Given a user request and available "
          "plugins/tools, generate a precise multi-step execution plan as a valid "
          "JSON object. Each step must use available plugins with correct parameters, "
          "proper types, and valid JSON formatting.")

candidates = [
    {"pluginId": "plugin-1714851345", "name": "Nutrition BOT",
     "description": "Nutrition type stuff", "identifier": "rest_api", "method": "POST"},
    {"pluginId": "plugin-1768545918", "name": "kinetiqai-exercise-scoring",
     "description": "Analyzes workout form using PoseTracker data",
     "identifier": "rest_api", "method": "POST"},
    # ... more candidates
]

user_msg = (
    f"YOUR TASK IS TO GENERATE A JSON STRICTLY and CORRECTLY\n"
    f"{json.dumps(candidates, indent=2)}\n\n"
    f"User Request: I want to improve my fitness routine: analyze my workout "
    f"form, then get nutrition guidance."
)

prompt = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user_msg}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=1024, temperature=0.1, do_sample=True, top_p=0.9)
response = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Response contains <think>...</think> followed by JSON:
# {"plugins": [{"pluginId": "...", "api_request_parameters": {...},
#               "all_parameters_hydrated": true, "dependencies": [...]}]}
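The completion interleaves the reasoning trace with the plan, so a small post-processing step is useful. A sketch (the helper name is ours; it assumes the `<think>` tag format shown above and a single top-level JSON object):

```python
import json
import re

def parse_plan(response: str) -> dict:
    """Strip the <think>...</think> trace and parse the remaining JSON plan."""
    body = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    # Tolerate trailing text by scanning to the brace that closes the object.
    # (Simplification: assumes no braces inside JSON string values.)
    start = body.index("{")
    depth = 0
    for i, ch in enumerate(body[start:], start):
        depth += {"{": 1, "}": -1}.get(ch, 0)
        if depth == 0:
            return json.loads(body[start : i + 1])
    raise ValueError("no complete JSON object found")

plan = parse_plan('<think>pick two plugins</think>\n{"plugins": []}')
print(plan)  # -> {'plugins': []}
```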

Output format

The model emits:

  1. A <think>...</think> reasoning trace explaining plugin selection
  2. A JSON object: {"plugins": [...]} where each plugin has:
    • pluginId (from the candidate list; never hallucinated)
    • name, description, identifier, method
    • api_request_parameters: fully hydrated, no placeholders
    • all_parameters_hydrated: true
    • dependencies: the list of pluginIds that must run first ([] when none)
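For illustration, here is a hypothetical plan matching this schema, using the two candidate plugins from the Usage example (all parameter values are invented):

```python
import json

# Hypothetical hydrated plan: form scoring runs first, nutrition depends on it.
plan = {
    "plugins": [
        {
            "pluginId": "plugin-1768545918",
            "name": "kinetiqai-exercise-scoring",
            "identifier": "rest_api",
            "method": "POST",
            "api_request_parameters": {"exercise": "squat", "video_url": "s3://uploads/squat-session.mp4"},
            "all_parameters_hydrated": True,
            "dependencies": [],
        },
        {
            "pluginId": "plugin-1714851345",
            "name": "Nutrition BOT",
            "identifier": "rest_api",
            "method": "POST",
            "api_request_parameters": {"goal": "muscle gain"},
            "all_parameters_hydrated": True,
            "dependencies": ["plugin-1768545918"],  # must run after the scoring step
        },
    ]
}
print(json.dumps(plan, indent=2))
```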

Training pipeline

1. Data: 64,992 cleaned samples

  • Source: real On-Demand production traces + synthetic plans
  • Cleaning pipeline: deduplicated, JSON-validated, thinking tokens enforced, parameters hydrated (no example.com, no empty values, no placeholders)
  • Judge: Claude Opus 4.6 via Vertex AI, 100 parallel workers
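The cleaning filters can be sketched as a single keep/drop predicate (field names and the placeholder list are illustrative; the LLM-judge pass is separate):

```python
import json

PLACEHOLDERS = ("example.com", "YOUR_", "<placeholder>", "TODO")

def keep_sample(sample: dict, seen: set) -> bool:
    """Apply the cleaning filters: deduplicate, enforce thinking tokens,
    validate JSON, and reject empty or placeholder parameter values."""
    key = hash(sample["completion"])
    if key in seen:                              # deduplicated
        return False
    seen.add(key)
    if "<think>" not in sample["completion"]:    # thinking tokens enforced
        return False
    try:
        plan = json.loads(sample["completion"].rsplit("</think>", 1)[-1])
    except json.JSONDecodeError:                 # JSON-validated
        return False
    for plugin in plan.get("plugins", []):
        for value in plugin.get("api_request_parameters", {}).values():
            if value in ("", None) or any(p in str(value) for p in PLACEHOLDERS):
                return False                     # parameters hydrated
    return True
```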

2. SFT: 194,976 steps, 3 epochs

  • Base: google/gemma-4-E2B (5B total, 2B active, GDN hybrid)
  • Optimizer: Adafactor (AdamW causes CUDA illegal memory access on Gemma 4)
  • Single GPU only, no scheduler, no gradient clipping
  • LR = 2e-5, batch_size = 1, grad_accum = 1, max_length = 1024
  • Final loss: 0.1496 avg
  • ~20 hours on 1× H100 80GB
  • Eval score: 0.918

3. AutoResearch: 24-iteration hyperparameter search

  • Claude Opus 4.6 mutations with ratchet (keep best config)
  • Best finding: num_plugins=6 candidates per prompt (down from 8)
  • Everything else stayed at defaults: lr=1e-6, num_generations=4, top_p=0.9
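The ratchet is a simple hill-climbing loop. A sketch with stand-in `mutate`/`evaluate` functions (in the real run, mutations are proposed by Claude and evaluation is a GRPO training + eval cycle):

```python
def autoresearch(base_config, evaluate, mutate, iterations=24):
    """Mutation search with a ratchet: a mutated config is kept only if it
    beats the best score seen so far; otherwise we keep searching from the best."""
    best_config, best_score = dict(base_config), evaluate(base_config)
    for _ in range(iterations):
        candidate = mutate(dict(best_config))   # always mutate the current best
        score = evaluate(candidate)             # e.g. run GRPO + eval under this config
        if score > best_score:                  # ratchet: only improvements survive
            best_config, best_score = candidate, score
    return best_config, best_score

# Toy demo: the search walks num_plugins down from 8 and ratchets at 6.
best, score = autoresearch(
    {"num_plugins": 8},
    evaluate=lambda c: -abs(c["num_plugins"] - 6),             # stand-in objective
    mutate=lambda c: {**c, "num_plugins": c["num_plugins"] - 1},
)
print(best)  # -> {'num_plugins': 6}
```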

4. GRPO: 570 steps with the plugin-selection reward

Reward (0.0–1.0) combines:

  • 0.10 valid JSON
  • 0.15 all picks in available candidate list
  • 0.25 × (correct_picked / total_correct)
  • 0.20 bonus for no wrong picks
  • −0.10 × wrong picks (capped at 3)
  • 0.10 exact count match
  • 0.15 × hydration ratio
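A sketch of how these components can combine (names are illustrative, not the repo's code; note the listed positive weights sum to 0.95, so this sketch tops out slightly below 1.0):

```python
def plugin_reward(plan, gold_ids, candidate_ids):
    """Combine the listed reward components into a single clipped score.

    plan: parsed model output ({"plugins": [...]}) or None if JSON was invalid.
    gold_ids: set of plugin IDs used by the reference plan.
    candidate_ids: set of plugin IDs offered in the prompt.
    """
    if plan is None:
        return 0.0
    reward = 0.10                                        # valid JSON
    picked = {p.get("pluginId") for p in plan.get("plugins", [])}
    if picked <= candidate_ids:
        reward += 0.15                                   # all picks from the candidate list
    correct = len(picked & gold_ids)
    wrong = len(picked - gold_ids)
    reward += 0.25 * (correct / max(len(gold_ids), 1))   # recall over gold plugins
    if wrong == 0:
        reward += 0.20                                   # bonus for no wrong picks
    reward -= 0.10 * min(wrong, 3)                       # penalty, capped at 3 wrong picks
    if len(picked) == len(gold_ids):
        reward += 0.10                                   # exact count match
    hydrated = [bool(p.get("all_parameters_hydrated")) for p in plan.get("plugins", [])]
    if hydrated:
        reward += 0.15 * (sum(hydrated) / len(hydrated)) # hydration ratio
    return max(0.0, min(1.0, reward))
```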

Best checkpoint: step 500 (eval score 0.940). The peak training-reward avg20 of 0.811 was hit at step 473, before the run entered a noisy tail.

Architecture notes

  • Gemma-4-E2B = 5B total params, 2B active (MoE), 128K context, GDN hybrid architecture
  • Thinking tokens always active: the model learned to use <think> for plugin reasoning
  • Adafactor is mandatory for training; AdamW hits an illegal memory access
  • Single GPU only; device_map="auto" causes crashes
  • No LR scheduler and no gradient clipping; both destabilize training
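A configuration fragment reflecting these constraints, using `transformers.Adafactor` with an explicit constant learning rate (the stand-in module and exact flag choices are our assumption, not the repo's code):

```python
import torch
from transformers import Adafactor

model = torch.nn.Linear(16, 16)  # stand-in module; replace with the Gemma model

# Adafactor with an explicit constant LR: no scheduler, no gradient clipping,
# matching the training constraints listed above.
optimizer = Adafactor(
    model.parameters(),
    lr=2e-5,
    scale_parameter=False,  # use the explicit lr rather than Adafactor's internal scaling
    relative_step=False,    # disable the relative-step schedule (constant LR)
    warmup_init=False,
)
```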

Code

Full open-source training pipeline, AutoResearch harness, GRPO reward function, and eval scripts available at: github.com/mk42-ai/gemma-4-e2b-ondemand

Acknowledgements

  • Google DeepMind for Gemma 4
  • Berkeley RAIL for the BFCL benchmark methodology
  • HuggingFace TRL team for GRPO reference implementation
  • AIREV infrastructure team for 8× H100 cluster access

Citation

@misc{gemma4e2b_ondemand_2026,
  title   = {Gemma-4-E2B On-Demand: Plugin Orchestration via SFT + GRPO},
  author  = {Khalid, Muhammed and AIREV},
  year    = {2026},
  url     = {https://huggingface.co/airev-ai/gemma-4-e2b-ondemand},
}