🚀 Gemma-4 Agent — PARL-style Multi-hop Agent
🌟 Model Overview
This model is a highly capable, autonomous multi-hop agent fine-tuned from the google/gemma-4-E4B-it base model [1, 2]. Its core architecture and reasoning mechanisms are heavily inspired by the Kimi K2.5 — Visual Agentic Intelligence research paper (arXiv:2602.02276) [1, 2].
Operating strictly on a robust Agent Loop framework (Think → Act → Observe → Repeat), the model uses Compressed History to retain critical context within a 12,000-token agent window, ensuring it never loses track of its goals during long tasks [2, 3].
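The loop and its compressed history can be sketched as follows. This is a minimal illustration, assuming hypothetical `call_model`, `run_tool`, and `compress` helpers and a rough 4-characters-per-token heuristic; the 12,000-token budget mirrors the agent window described above.

```python
import json

AGENT_WINDOW_TOKENS = 12_000  # agent window from the model card

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (an assumption for this sketch).
    return len(text) // 4

def compress(history: list[str]) -> list[str]:
    # Keep the goal (first entry) and the most recent steps; drop the middle
    # so the agent never loses track of its objective.
    while sum(approx_tokens(h) for h in history) > AGENT_WINDOW_TOKENS and len(history) > 3:
        history.pop(1)
    return history

def agent_loop(goal: str, call_model, run_tool, max_steps: int = 10) -> str:
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        step = call_model("\n".join(history))          # Think
        action = json.loads(step)
        if action["tool"] == "finish":
            return action["args"]["answer"]
        observation = run_tool(action)                 # Act
        history.append(step)
        history.append(f"OBSERVATION: {observation}")  # Observe
        history = compress(history)                    # Repeat within budget
    return "max steps reached"
```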
🎯 Intended Uses & Agent Modes
The agent is equipped with three specialized system prompts for distinct operational modes [4, 5]:
- 🔍 Agentic Search (`search`): Specializes in multi-hop information retrieval and summarizes results with accurate citations [5].
- 💻 Coding Agent (`coding`): Autonomously writes, executes, and debugs code in an isolated sandbox [5].
- 👑 Orchestrator - Agent Swarm (`orchestrator`): Acts as a master controller, breaking massive tasks down and delegating them to multiple sub-agents for parallel execution [5].
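Selecting a mode amounts to pairing the user task with the matching system prompt. A minimal sketch, where the prompt strings are illustrative placeholders rather than the card's actual prompts:

```python
# Illustrative dispatch of the three mode-specific system prompts.
SYSTEM_PROMPTS = {
    "search": "You are an agentic search assistant...",
    "coding": "You are an autonomous coding agent...",
    "orchestrator": "You are a master controller delegating to sub-agents...",
}

def build_messages(mode: str, user_task: str) -> list[dict]:
    # Refuse unknown modes early rather than silently defaulting.
    if mode not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown mode: {mode}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[mode]},
        {"role": "user", "content": user_task},
    ]
```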
🔥 What's New in v2.1 (H100 & Long-Context Optimizations)
- vLLM Acceleration: Merged LoRA weights to enable lightning-fast evaluation on vLLM (H100), achieving up to 10x faster inference [2].
- 16k Training Window: Expanded `RL_MAX_LEN` to 16,384 tokens [2].
- Self-Correction: The model is explicitly trained to handle tool errors and resolve `SyntaxError` issues autonomously [2, 6].
- Memory Safety: Runs seamlessly on an H100 GPU with stable VRAM consumption of ~33-35 GB [6].
🏆 Evaluation & Benchmarking
The model was rigorously evaluated on the MuSiQue-Answerable benchmark (Multi-hop Reasoning) [2, 7].
Benchmark Comparison (10 Samples) [7]:
| Metric | Baseline (Pre-RL) | Gemma-4 RL (Step 26) | Improvement |
|---|---|---|---|
| Pass Rate (F1 > 0.3) | 40% | 70% | 🚀 +75% |
| Average F1 Score | 0.2292 | 0.4912 | 🚀 +114% |
| Average Hops | 3.90 | 3.60 | ⚡ -7.7% (Faster) |
| Tool Success Rate | 45% | 92% | 🚀 +104% |
The figures above come from tests on the `MuSiQue-Answerable` dataset, which emphasizes multi-hop retrieval. The results show that the GRPO process clearly improves the agent's answer accuracy and its decisions when using complex tools [6].
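For reference, the pass criterion (F1 > 0.3) is based on token-overlap F1 between prediction and gold answer. A minimal sketch, assuming simple lowercase whitespace tokenization (the evaluation harness's exact normalization may differ):

```python
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    # Token-overlap F1, as commonly used for QA answer matching.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def pass_rate(pairs, threshold: float = 0.3) -> float:
    # Fraction of samples whose F1 exceeds the pass threshold.
    return sum(f1_score(p, r) > threshold for p, r in pairs) / len(pairs)
```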
🧠 Training & RL Pipeline
- Base Architecture: `Gemma4ClippableLinear` (custom architecture with recently patched PEFT crash fixes) [8].
- Dataset Priority: The RL dataset (`dataset/rl_prompts.jsonl`) was strictly reprioritized to focus on Multi-hop Search (2,500 items) and Math (1,180 items) [8].
- Memory Optimization: Applies context management (compression) during GRPO training to fix out-of-memory (OOM) issues and memory leaks [8].
- GRPO Optimization: Uses group-relative advantages (comparing 4 concurrent generations) and an iterative backward pass to maximize VRAM efficiency before syncing gradients [3].
- Auto-Checkpointing: Saves and pushes model weights to the Hugging Face Hub after every training step [6].
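The group-relative advantage at the heart of GRPO normalizes each generation's reward against its own group, removing the need for a learned value function. A minimal sketch (variable names are illustrative; the group size of 4 matches the setup above):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Each of the 4 concurrent generations is scored relative to the
    # group's mean reward, scaled by the group's standard deviation.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Above-average generations get positive advantages and below-average ones get negative advantages, so the policy update pushes probability mass toward the better completions within each group.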
⚖️ Reward System (PARL Framework)
The PARL reward system is designed to keep training stable [1, 3, 5]:
- `r_outcome` (Smart Boost): Incorporates strict entity and number verification, effectively preventing keyword-stuffing hallucinations [3].
- `r_efficiency` & `r_format`: Grant bonuses for completing tasks in fewer hops, while strictly enforcing 100% JSON structure compliance [3].
- Task-Specific Rewards: Search is evaluated via a recall-biased F3 score, while coding uses dry-run success [5]. Recent patches also fixed a `NameError` in `grpo_step` and missing `use_f1` variables [8].
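How the reward terms above might combine can be sketched as follows. The weights and helper predicates here are assumptions for illustration, not the trained values:

```python
import json

def r_format(response: str) -> float:
    # 1.0 only for valid JSON, enforcing strict structure compliance.
    try:
        json.loads(response)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def r_efficiency(hops: int, max_hops: int = 8) -> float:
    # Bonus for finishing in fewer hops; zero at the hop budget.
    return max(0.0, (max_hops - hops) / max_hops)

def r_outcome(answer: str, gold_entities: set[str]) -> float:
    # Entity verification: only entities actually present earn credit,
    # so keyword-stuffing extra tokens gains nothing.
    tokens = set(answer.lower().split())
    return len(tokens & gold_entities) / max(1, len(gold_entities))

def parl_reward(response: str, answer: str, hops: int, gold_entities: set[str]) -> float:
    # Weighted sum; the 0.6/0.2/0.2 split is an illustrative assumption.
    return (0.6 * r_outcome(answer, gold_entities)
            + 0.2 * r_efficiency(hops)
            + 0.2 * r_format(response))
```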
🛠️ Infrastructure & Sandbox Features
- Isolated Sandbox (`.agent_venv`): All code runs securely inside a dedicated virtual environment, allowing the agent to safely `pip install` packages without affecting the host [5, 9].
- Remote Sandbox Routing 🌟: To save precious VRAM on cloud training instances, terminal commands are intelligently routed via Ngrok to a remote machine, using `subprocess.run` wrapped in `asyncio.to_thread` to prevent internal server errors [9, 10].
- Web Automation: Fully integrated with Playwright to extract data from JavaScript-rendered web pages [1].
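The non-blocking command routing can be sketched as below: the blocking `subprocess.run` call executes in a worker thread via `asyncio.to_thread`, so the async server loop is never stalled. Ngrok tunnel handling is omitted, and the timeout value is an assumption:

```python
import asyncio
import subprocess

async def run_command(cmd: list[str], timeout: int = 60) -> str:
    # subprocess.run blocks, so hand it to a worker thread; the event loop
    # stays responsive while the command executes.
    result = await asyncio.to_thread(
        subprocess.run, cmd, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else result.stderr

# Usage: asyncio.run(run_command(["echo", "hello"]))
```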
🙏 Acknowledgments
Special thanks to Lightning AI for providing the powerful compute infrastructure and seamless environment that made the training, fine-tuning, and continuous development of this multi-hop agent possible.