🚀 Gemma-4 Agent — PARL-style Multi-hop Agent

🌟 Model Overview

This model is a highly capable, autonomous multi-hop agent fine-tuned from the google/gemma-4-E4B-it base model [1, 2]. Its core architecture and reasoning mechanisms are heavily inspired by the Kimi K2.5 — Visual Agentic Intelligence research paper (arXiv:2602.02276) [1, 2].

Operating strictly on a robust Agent Loop framework (Think → Act → Observe → Repeat), the model uses Compressed History to retain critical context within a 12,000-token agent window, ensuring it never loses track of its goals during long tasks [2, 3].
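The compression step of the loop can be sketched as follows. `AGENT_WINDOW`, `compress_history`, `rough_token_count`, and the ~4-characters-per-token estimate are illustrative assumptions, not the model's actual implementation:

```python
# Sketch of the Think → Act → Observe loop's history compression.
AGENT_WINDOW = 12_000  # token budget for the agent's working context

def rough_token_count(messages):
    # Crude estimate: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compress_history(messages):
    # Keep the goal (first message) and the most recent turns,
    # collapsing older turns into one summary message so the agent
    # never loses track of its objective.
    if rough_token_count(messages) <= AGENT_WINDOW:
        return messages
    head, tail = messages[:1], messages[-6:]
    summary = {
        "role": "system",
        "content": "Summary of earlier steps: "
        + " | ".join(m["content"][:80] for m in messages[1:-6]),
    }
    return head + [summary] + tail
```

A real implementation would use the tokenizer for exact counts and an LLM-generated summary rather than truncation.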

🎯 Intended Uses & Agent Modes

The agent is equipped with three specialized system prompts for distinct operational modes [4, 5]:

  1. 🔍 Agentic Search (search): Specializes in multi-hop information retrieval and summarizes results with accurate citations [5].
  2. 💻 Coding Agent (coding): Designed to autonomously write, execute, and debug code via an isolated sandbox [5].
  3. 👑 Orchestrator - Agent Swarm (orchestrator): Acts as a master controller, breaking down massive tasks and delegating them to multiple sub-agents for parallel execution [5].
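Selecting a mode amounts to choosing a system prompt; a minimal dispatch might look like the sketch below. The prompt texts and names (`SYSTEM_PROMPTS`, `build_messages`) are placeholders, not the model's actual prompts:

```python
# Illustrative mode → system-prompt dispatch; prompt texts are placeholders.
SYSTEM_PROMPTS = {
    "search": "You are an agentic search assistant. Retrieve multi-hop facts and cite sources.",
    "coding": "You are a coding agent. Write, execute, and debug code in the sandbox.",
    "orchestrator": "You are an orchestrator. Decompose the task and delegate to sub-agents.",
}

def build_messages(mode, user_task):
    # Reject unknown modes early rather than falling back silently.
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": user_task},
    ]
```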

🔥 What's New in v2.1 (H100 & Long-Context Optimizations)

  • vLLM Acceleration: Merged LoRA weights to enable lightning-fast evaluation on vLLM (H100), achieving up to 10x faster inference [2].
  • 16k Training Window: Expanded RL_MAX_LEN to 16,384 tokens [2].
  • Self-Correction: The model is explicitly trained to handle tool errors and resolve SyntaxError issues autonomously [2, 6].
  • Memory Safety: Designed to run seamlessly on an H100 GPU with stable VRAM consumption at ~33-35GB [6].

🏆 Evaluation & Benchmarking

The model was rigorously evaluated on the MuSiQue-Answerable benchmark (Multi-hop Reasoning) [2, 7].

Benchmark Comparison (10 samples) [7]:

| Metric | Baseline (Pre-RL) | Gemma-4 RL (Step 26) | Improvement |
|---|---|---|---|
| Pass Rate (F1 > 0.3) | 40% | 70% | 🚀 +75% |
| Average F1 Score | 0.2292 | 0.4912 | 🚀 +114% |
| Average Hops | 3.90 | 3.60 | -7.7% (fewer hops) |
| Tool Success Rate | 45% | 92% | 🚀 +104% |

The figures above come from evaluation on the MuSiQue-Answerable dataset, which emphasizes multi-hop retrieval. The results show that the GRPO process clearly improved the agent's accuracy and its decision-making when using complex tools [6].

🧠 Training & RL Pipeline

  • Base Architecture: Gemma4ClippableLinear (a custom architecture with recent patches resolving PEFT crashes) [8].
  • Dataset Priority: The RL dataset (dataset/rl_prompts.jsonl) was strictly reprioritized to focus on Multihop Search (2,500 items) and Math (1,180 items) [8].
  • Memory Optimization: Incorporates Context Management (Compression) during GRPO training to fix Out of Memory (OOM) issues and memory leaks [8].
  • GRPO Optimization: Employs Group-relative Advantages (comparing 4 concurrent generations) and an Iterative Backward pass to maximize VRAM efficiency before syncing gradients [3].
  • Auto-Checkpointing: Saves and pushes model weights to the Hugging Face Hub after every training step [6].
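The group-relative advantage computation at the heart of GRPO can be sketched in a few lines; the function name is illustrative, and the real pipeline applies these advantages during the iterative backward pass described above:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO: each generation's advantage is its reward minus the group
    # mean, normalized by the group standard deviation. With a group of
    # 4 concurrent generations, no learned value function is needed.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```

Because advantages are relative within the group, a uniformly good (or bad) group contributes no gradient signal, which is what stabilizes training.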

⚖️ Reward System (PARL Framework)

The PARL reward system is designed to keep training stable [1, 3, 5]:

  • r_outcome (Smart Boost): Incorporates strict Entity and Number verification, effectively preventing keyword-stuffing hallucinations [3].
  • r_efficiency & r_format: Grants bonuses for completing tasks with fewer hops, while strictly enforcing 100% JSON structure compliance [3].
  • Task-Specific Rewards: Search is evaluated via a recall-biased F3 score, while coding uses dry-run success [5]. Recent patches also fixed a NameError in grpo_step and a missing use_f1 variable [8].
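The recall-biased F3 score is the standard F-beta formula with beta = 3, and the reward terms above might combine as in this sketch. The 0.8/0.2 weights and the hard JSON gate are illustrative assumptions, not the documented reward weights:

```python
def f_beta(precision, recall, beta=3.0):
    # Recall-biased F-score: beta=3 weights recall 9x over precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def total_reward(r_outcome, hops, max_hops, json_valid):
    # r_format as a hard gate: invalid JSON zeroes the whole reward
    # (assumed; enforces the 100% structure-compliance requirement).
    if not json_valid:
        return 0.0
    # r_efficiency: linear bonus for finishing in fewer hops (assumed).
    r_efficiency = max(0.0, 1.0 - hops / max_hops)
    return 0.8 * r_outcome + 0.2 * r_efficiency
```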

🛠️ Infrastructure & Sandbox Features

  • Isolated Sandbox (.agent_venv): All code runs securely inside a dedicated virtual environment, allowing the agent to safely pip install without affecting the host [5, 9].
  • Remote Sandbox Routing 🌟: To save precious VRAM on cloud training instances, terminal commands are routed via Ngrok to a remote machine; subprocess.run is wrapped in asyncio.to_thread so the blocking call does not stall the event loop and trigger internal server errors [9, 10].
  • Web Automation: Fully integrated with Playwright to extract data from JavaScript-rendered web pages [1].
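The subprocess-in-a-thread pattern from the sandbox routing above can be sketched as follows; the function name is illustrative, and the command runs locally here rather than over the Ngrok tunnel:

```python
import asyncio
import subprocess

async def run_in_sandbox(cmd):
    # subprocess.run blocks its thread, so wrap it in asyncio.to_thread
    # to keep the server's event loop responsive. In the real setup the
    # command would be forwarded to a remote machine over an Ngrok
    # tunnel; here it runs locally for illustration.
    result = await asyncio.to_thread(
        subprocess.run, cmd, capture_output=True, text=True, timeout=60
    )
    return result.returncode, result.stdout
```

Usage: `code, out = asyncio.run(run_in_sandbox(["echo", "hello"]))`.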

🙏 Acknowledgments

Special thanks to Lightning AI for providing the powerful compute infrastructure and seamless environment that made the training, fine-tuning, and continuous development of this multi-hop agent possible.
