# 02 – Research: Papers, Datasets & Key Findings

## 🔬 Our Research Mission

We asked: *"What's the best way to train a small model for tool-calling? What do the papers say?"*

We searched:
- Research papers on arXiv (via HuggingFace Papers)
- Existing datasets on the HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples

---

## 📚 Landmark Papers We Found

### 1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)

**What it proved:**
- A **1.1B-parameter model** fine-tuned for tool-calling can match **GPT-4-Turbo** at function-calling tasks
- The key is **high-quality synthetic data** (they generated 80K examples with GPT-4)
- They used the **LLMCompiler** framework: the model outputs a plan with dependencies, then tools execute in dependency order

**Their training recipe:**
- Base models: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: graph isomorphism (does the model's tool-call DAG match the ground truth? see the sketch below)
- Tool RAG: a small model (DeBERTa) selects which tools to expose before calling the LLM
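
To make the metric concrete, here is a minimal sketch of a graph-isomorphism check over tool-call plans (our illustration, not TinyAgent's code), assuming `networkx` and a simple plan format with `id`, `tool`, and `deps` fields:

```python
# Illustrative sketch of TinyAgent-style plan matching: compare a predicted
# tool-call DAG against the ground truth via graph isomorphism, matching
# nodes on tool name. Plan format here is an assumption.
import networkx as nx

def build_dag(calls: list[dict]) -> nx.DiGraph:
    """Build a DAG from calls like {"id": 1, "tool": "search", "deps": [0]}."""
    g = nx.DiGraph()
    for call in calls:
        g.add_node(call["id"], tool=call["tool"])
        for dep in call.get("deps", []):
            g.add_edge(dep, call["id"])  # dependency must run first
    return g

def plans_match(predicted: list[dict], reference: list[dict]) -> bool:
    """True if the two plans share structure and tool names."""
    return nx.is_isomorphic(
        build_dag(predicted),
        build_dag(reference),
        node_match=lambda a, b: a["tool"] == b["tool"],
    )

# Same plan, different node ids -> still counts as a match.
pred = [{"id": 0, "tool": "search"}, {"id": 1, "tool": "summarize", "deps": [0]}]
ref  = [{"id": 7, "tool": "search"}, {"id": 9, "tool": "summarize", "deps": [7]}]
assert plans_match(pred, ref)
```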

**How we use it:**
- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls; illustrated below)
- We use a simpler setup (no separate Tool RAG; the model handles tool selection itself)
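
A hypothetical record in that format (the tool, its schema, and the Qwen-style `<tool_call>` tags are illustrative assumptions, not taken from the actual dataset):

```python
# Hypothetical training record (illustrative only): a function schema in the
# system turn, a user query, and the assistant's tool call. The <tool_call>
# tag convention is an assumption borrowed from Qwen-style chat templates.
example = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You can call tools. Available tools:\n"
                '{"name": "get_weather", '
                '"parameters": {"city": {"type": "string"}}}'
            ),
        },
        {"role": "user", "content": "What's the weather in Paris?"},
        {
            "role": "assistant",
            "content": '<tool_call>{"name": "get_weather", '
                       '"arguments": {"city": "Paris"}}</tool_call>',
        },
    ]
}
```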

---

### 2. STAR Framework (arXiv:2602.03022)

**What it proved:**
- **Qwen3-1.7B beats Llama-3.1-8B** at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training can outperform larger models with weaker pre-training

**How we use it:**
- **CONFIRMED our base model choice**: Qwen3-1.7B is the sweet spot
- Shows we don't need a bigger model; pre-training quality matters more

---

### 3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)

**What it proved:**
- Real-world agent training needs **continuous environment-task discovery**
- They use **MCP servers** as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data

**How we use it:**
- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers (see the example request below)
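
For reference, this is the shape of an MCP tool invocation on the wire: a JSON-RPC 2.0 `tools/call` request. The `get_weather` tool and its arguments are illustrative; our model learns to emit the equivalent call inline instead of round-tripping to a live server:

```python
# Shape of an MCP tool invocation: a JSON-RPC 2.0 "tools/call" request.
# The tool name and arguments here are illustrative, not from a real server.
mcp_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",           # a tool advertised by the server
        "arguments": {"city": "Paris"},  # must satisfy the tool's input schema
    },
}
```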

---

### 4. MCP-Universe (arXiv:2508.14704)

**What it provides:**
- A comprehensive benchmark for evaluating LLMs on **real MCP servers**
- Tests of tool discovery, tool invocation, and response handling
- Evidence of performance disparities between open- and closed-source models

**How we use it:**
- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training

---

### 5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)

**What it proved:**
- You can fine-tune huge models by training only tiny low-rank adapter matrices
- Matches full fine-tuning quality while cutting trainable parameters by up to 10,000× (their GPT-3 result)
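
The arithmetic behind that claim, sketched for a single linear layer (dimensions illustrative): LoRA freezes the pretrained weight `W` and trains only the low-rank factors of the update `W + (α/r)·BA`:

```python
# Back-of-the-envelope LoRA parameter count for one linear layer.
# Full fine-tuning updates the whole (d_out x d_in) weight W; LoRA freezes W
# and trains only B (d_out x r) and A (r x d_in), applied as
# W_eff = W + (alpha / r) * B @ A.
d_in, d_out, r = 4096, 4096, 16        # illustrative dimensions

full_params = d_out * d_in             # 16,777,216 trainable weights
lora_params = r * (d_in + d_out)       # 131,072 trainable weights

print(f"{full_params / lora_params:.0f}x fewer")  # 128x fewer for this layer
```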

**How we use it:**
- Core technique for our training; it makes fine-tuning affordable on a T4 GPU

---

### 6. LoRA Without Regret (Thinking Machines Lab, 2025)

**What it proved:**
- Applying LoRA to **ALL linear layers** (not just the attention projections) matches full fine-tuning quality
- Previous wisdom was to apply LoRA only to q_proj and v_proj

**How we use it:**
- We set `target_modules="all-linear"` for best quality (config sketch below)
- This is our "secret sauce" for making LoRA as good as full fine-tuning
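
A minimal sketch of that config using the PEFT library (`lora_alpha=32` is our assumption of the common `2*r` heuristic; this document doesn't pin it down):

```python
# Minimal PEFT LoRA config sketch. "all-linear" targets every linear layer
# (attention + MLP) instead of the classic q_proj/v_proj-only setup.
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # rank, matching the choices table below
    lora_alpha=32,                # assumption: common 2*r heuristic
    target_modules="all-linear",  # the key takeaway from this paper
    task_type="CAUSAL_LM",
)
```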

---

## 📊 Datasets We Discovered

### Existing Tool-Calling Datasets (On the HuggingFace Hub)

| Dataset | Size | Format | Notes |
|---------|------|--------|-------|
| **glaiveai/glaive-function-calling-v2** | ~100K | Conversations | Most popular, Apache 2.0 |
| **glaiveai/glaive-function-calling** | ~52K | Conversations | Earlier version |
| **togethercomputer/glaive-function-calling-v2-formatted** | ~100K | Conversations | Community-formatted |
| **lilacai/glaive-function-calling-v2-sharegpt** | ~100K | ShareGPT format | Good for chat models |
| **Salesforce/xlam-function-calling-60k** | ~60K | JSON | Diverse domains |
| **NousResearch/hermes-function-calling-v1** | ~20K | Conversations | Hermes format |

### Our Existing Dataset

**muhammadtlha944/mcp-agent-training-data**
- **Train:** 15,694 examples (63.2 MB)
- **Validation:** 826 examples (3.2 MB)
- **Format:** `messages` column with role/content pairs
- **Content:** Mixed function-calling, JSON output, clarification, and safety examples
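
A quick way to load and inspect it (assumes the `datasets` library; the repo id is the one above):

```python
# Load the dataset from the Hub and peek at the first training record.
from datasets import load_dataset

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(ds)                          # expect 'train' and 'validation' splits
print(ds["train"][0]["messages"])  # a list of {"role": ..., "content": ...} dicts
```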

**Quality assessment:**
- ✅ Good: Has system/user/assistant messages in the proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary, so the model might get confused about the expected format
- ⚠️ Concern: Only ~16K examples; TinyAgent used 80K
- ⚠️ Concern: No explicit MCP-format examples

---

## 🔧 APIs & Libraries (Current Versions)

### TRL (Transformer Reinforcement Learning)
- **SFTTrainer** with a `peft_config` parameter for LoRA
- **SFTConfig** for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs

### PEFT (Parameter-Efficient Fine-Tuning)
- `LoraConfig` with `target_modules="all-linear"`
- Adapters are ~100 MB for a 2B model with r=16

### Key Parameters (From the TRL Docs)
- Learning rate for LoRA SFT: **2e-4** (10× higher than for full fine-tuning)
- Batch size strategy: small per-device batch + gradient accumulation
- For a 2B model on a T4: batch_size=4, accumulation=4 → effective batch of 16
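
Putting it all together, a minimal end-to-end training sketch. The numeric hyperparameters are the ones listed above; the model id, `lora_alpha`, and `output_dir` are our assumptions:

```python
# End-to-end LoRA SFT sketch with TRL + PEFT, using the hyperparameters
# listed in this section. Model id, lora_alpha, and output_dir are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",                # assumed Hub id for the base model
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    args=SFTConfig(
        output_dir="qwen3-1.7b-tool-lora",  # illustrative
        num_train_epochs=3,
        learning_rate=2e-4,                 # ~10x the full fine-tuning rate
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,      # 4 x 4 = effective batch of 16
        report_to="trackio",
        disable_tqdm=True,
    ),
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,                      # assumption: 2*r heuristic
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    ),
)
trainer.train()
```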

---

## 🎯 Our Research Conclusions

### What Works (Backed by Papers)

1. **Small models CAN do tool-calling well** – TinyAgent showed a 1.1B model ≈ GPT-4-Turbo at focused tasks
2. **Qwen3-1.7B is the best base** – the STAR paper shows it beats larger models
3. **LoRA with all-linear targets matches full FT** – the "LoRA Without Regret" finding
4. **~16K examples is workable** – TinyAgent used 80K but saw good results with less
5. **MCP is the future protocol** – multiple 2025 papers use MCP for benchmarks

### Our Choices (And Why)

| Decision | Choice | Reason |
|----------|--------|--------|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits a T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard; prevents overfitting |
| Learning rate | 2e-4 | 10× the full fine-tuning rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |

---

## 👉 Next Step

Read `03-architecture.md` to understand HOW the agent harness works.