# 02 — Research: Papers, Datasets & Key Findings

## 🔬 Our Research Mission

We asked: *"What's the best way to train a small model for tool-calling? What do the papers say?"*

We searched:

- Research papers on arXiv (via HuggingFace papers)
- Existing datasets on HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples

---

## 📄 Landmark Papers We Found

### 1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)

**What it proved:**

- A **1.1B parameter model** fine-tuned for tool-calling can match **GPT-4-Turbo** at function-calling tasks
- The key is **high-quality synthetic data** (they generated 80K examples with GPT-4)
- They used the **LLMCompiler** framework: the model outputs a plan with dependencies, then tools execute in order

**Their training recipe:**

- Base models: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: graph isomorphism (does the model's tool-call DAG match the ground truth?)
- Tool RAG: a small model (DeBERTa) selects which tools to use before calling the LLM

**How we use it:**

- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls)
- We use a simpler version (no separate Tool RAG; the model handles selection)

---

### 2. STAR Framework (arXiv:2602.03022)

**What it proved:**

- **Qwen3-1.7B beats Llama-3.1-8B** at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training outperform larger models with worse pre-training

**How we use it:**

- **CONFIRMED our base model choice**: Qwen3-1.7B is the sweet spot
- Proves we don't need a bigger model — quality of pre-training matters more

---
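TinyAgent's graph-isomorphism metric above (does the predicted tool-call DAG match the reference plan?) can be approximated by comparing canonical signatures of the two DAGs. This is our own simplified sketch with hypothetical tool names, not the paper's implementation:

```python
def plan_signature(plan):
    """Canonical signature of a tool-call plan DAG.

    `plan` maps a call id to (tool_name, [ids of dependency calls]).
    Plans that invoke the same tools with the same dependency structure
    get equal signatures regardless of call ids or ordering; this is a
    simplified stand-in for a full graph-isomorphism check.
    """
    def sig(call_id):
        tool, deps = plan[call_id]
        # A node's signature is its tool plus the sorted signatures
        # of everything it depends on.
        return (tool, tuple(sorted(sig(d) for d in deps)))
    return tuple(sorted(sig(c) for c in plan))

# Same plan, different call ids and ordering -> equal signatures.
predicted = {
    "c1": ("get_weather", []),
    "c2": ("send_email", ["c1"]),
}
reference = {
    "b": ("send_email", ["a"]),
    "a": ("get_weather", []),
}
print(plan_signature(predicted) == plan_signature(reference))  # -> True
```

A scored evaluation would loop this check over a held-out set of reference plans and report the match rate.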
### 3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)

**What it proved:**

- Real-world agent training needs **continuous environment-task discovery**
- They use **MCP servers** as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data

**How we use it:**

- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers

---

### 4. MCP-Universe (arXiv:2508.14704)

**What it proved:**

- A comprehensive benchmark for evaluating LLMs on **real MCP servers**
- Tests tool discovery, tool invocation, and response handling
- Reveals performance disparities between open-source and closed-source models

**How we use it:**

- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training

---

### 5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)

**What it proved:**

- You can fine-tune huge models by training only tiny adapter matrices
- Achieves performance comparable to full fine-tuning with 1000× fewer trainable parameters

**How we use it:**

- Core technique for our training — makes it affordable on a T4 GPU

---
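The LoRA idea above can be shown numerically: freeze the pretrained weight `W` and train only a low-rank pair `B @ A`. The layer shape here is illustrative (a hypothetical 2048×2048 linear layer), not Qwen3's actual dimensions:

```python
import numpy as np

d, k, r = 2048, 2048, 16          # illustrative layer shape and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))   # pretrained weight, frozen
A = rng.standard_normal((r, k))   # trainable low-rank factor
B = np.zeros((d, r))              # trainable, zero-init so training starts at W

W_eff = W + B @ A                 # effective adapted weight (initially == W)

full_ft = d * k                   # params updated by full fine-tuning
lora_ft = r * (d + k)             # params updated by LoRA
print(f"LoRA trains {lora_ft:,} of {full_ft:,} params "
      f"({full_ft // lora_ft}x fewer)")
# -> LoRA trains 65,536 of 4,194,304 params (64x fewer)
```

Because `B @ A` can be merged into `W` after training, the adapted layer has the same inference cost as the original — the LoRA paper's "no added latency" property.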
### 6. LoRA Without Regret (Thinking Machines Lab, 2025)

**What it proved:**

- Applying LoRA to **ALL linear layers** (not just attention projections) matches full fine-tuning quality
- Previous wisdom was to apply LoRA only to q_proj and v_proj

**How we use it:**

- We set `target_modules="all-linear"` for best quality
- This is our "secret sauce" for making LoRA as good as full fine-tuning

---

## 📊 Datasets We Discovered

### Existing Tool-Calling Datasets (On HuggingFace Hub)

| Dataset | Size | Format | Notes |
|---------|------|--------|-------|
| **glaiveai/glaive-function-calling-v2** | ~100K | Conversations | Most popular, Apache 2.0 |
| **glaiveai/glaive-function-calling** | ~52K | Conversations | Earlier version |
| **togethercomputer/glaive-function-calling-v2-formatted** | ~100K | Conversations | Community formatted |
| **lilacai/glaive-function-calling-v2-sharegpt** | ~100K | ShareGPT format | Good for chat models |
| **Salesforce/xlam-function-calling** | ~60K | JSON | Diverse domains |
| **NousResearch/hermes-function-calling** | ~20K | Conversations | Hermes format |

### Our Existing Dataset

**muhammadtlha944/mcp-agent-training-data**

- **Train:** 15,694 examples (63.2 MB)
- **Validation:** 826 examples (3.2 MB)
- **Format:** `messages` column with role/content pairs
- **Content:** Mixed function-calling, JSON output, clarification, safety

**Quality assessment:**

- ✅ Good: Has system/user/assistant messages in proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary — model might get confused about expected format
- ⚠️ Concern: Only ~16K examples;
TinyAgent used 80K
- ⚠️ Concern: No explicit MCP format examples

---

## 🔧 APIs & Libraries (Current Versions)

### TRL (Transformers Reinforcement Learning)

- **SFTTrainer** with a `peft_config` parameter for LoRA
- **SFTConfig** for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs

### PEFT (Parameter-Efficient Fine-Tuning)

- `LoraConfig` with `target_modules="all-linear"`
- Adapters are ~100MB for a 2B model with r=16

### Key Parameters (From TRL Docs)

- Learning rate for LoRA SFT: **2e-4** (10× higher than full fine-tuning)
- Batch size strategy: small per-device batch + gradient accumulation
- For a 2B model on a T4: batch_size=4, accumulation=4 → effective batch=16

---

## 🎯 Our Research Conclusions

### What Works (Backed by Papers)

1. **Small models CAN do tool-calling well** — TinyAgent showed a 1.1B model can match GPT-4-Turbo at focused tasks
2. **Qwen3-1.7B is the best base** — the STAR paper shows it beats larger models
3. **LoRA with all-linear targets matches full FT** — the "LoRA Without Regret" paper
4. **~16K examples is workable** — TinyAgent used 80K, but saw good results with less
5. **MCP is the future protocol** — multiple 2025 papers use MCP for benchmarks

### Our Choices (And Why)

| Decision | Choice | Reason |
|----------|--------|--------|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits a T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard, prevents overfitting |
| Learning rate | 2e-4 | 10× base rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |

---

## 🔜 Next Step

Read `03-architecture.md` to understand HOW the agent harness works.
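The hyperparameters from the tables and API notes above can be collected into one training sketch. This is a plan under assumptions, not tested code: it presumes recent `trl`/`peft`/`datasets` releases, and `lora_alpha=32`, the output directory, and the `validation` split name are our own placeholder choices (the notes only fix r=16 and the batch/learning-rate settings):

```python
# Sketch: LoRA SFT of Qwen3-1.7B on our dataset, per the recipe above.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

peft_config = LoraConfig(
    r=16,                          # rank, proportional to dataset size
    lora_alpha=32,                 # assumption: common 2x-rank scaling
    target_modules="all-linear",   # "LoRA Without Regret" recipe
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen3-1.7b-mcp-agent",   # placeholder name
    num_train_epochs=3,
    learning_rate=2e-4,                  # 10x the full fine-tuning rate
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 4 * 4 = 16
    report_to="trackio",
    disable_tqdm=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumption: split is named this
    peft_config=peft_config,
)
trainer.train()
```

SFTTrainer applies the chat template to the `messages` column automatically, which is why the dataset format above matters.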