# 02 - Research: Papers, Datasets & Key Findings
## 🔬 Our Research Mission
We asked: *"What's the best way to train a small model for tool-calling? What do the papers say?"*
We searched:
- Research papers on arXiv (via HuggingFace papers)
- Existing datasets on HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples
---
## 📄 Landmark Papers We Found
### 1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)
**What it proved:**
- A **1.1B parameter model** fine-tuned for tool-calling can match **GPT-4-Turbo** at function-calling tasks
- The key is **high-quality synthetic data** (they generated 80K examples with GPT-4)
- They used **LLMCompiler** framework: model outputs a plan with dependencies, then tools execute in order
**Their training recipe:**
- Base model: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: Graph isomorphism (does the model's tool-call DAG match the ground truth?)
- Tool RAG: Small model (DeBERTa) selects which tools to use before calling the LLM
**How we use it:**
- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls)
- We use a simpler version (no separate Tool RAG, model handles selection)
---
### 2. STAR Framework (arXiv:2602.03022)
**What it proved:**
- **Qwen3-1.7B beats Llama-3.1-8B** at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training outperform larger models with worse pre-training
**How we use it:**
- **CONFIRMED our base model choice**: Qwen3-1.7B is the sweet spot
- Proves we don't need a bigger model: quality of pre-training matters more
---
### 3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)
**What it proved:**
- Real-world agent training needs **continuous environment-task discovery**
- They use **MCP servers** as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data
**How we use it:**
- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers
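For context, this is the wire format the model needs to internalize. MCP uses JSON-RPC 2.0, and a tool invocation is a `tools/call` request with `name` and `arguments` in `params`; the tool name and arguments below are invented for illustration.

```python
import json

# Illustrative MCP "tools/call" request (JSON-RPC 2.0). The `get_weather`
# tool and its arguments are made up; only the envelope shape is MCP's.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"city": "Lahore", "unit": "celsius"},
    },
}
print(json.dumps(request, indent=2))
```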
---
### 4. MCP-Universe (arXiv:2508.14704)
**What it proved:**
- Comprehensive benchmark for evaluating LLMs on **real MCP servers**
- Tests tool discovery, tool invocation, and response handling
- Reveals performance disparities between open and closed-source models
**How we use it:**
- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training
---
### 5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
**What it proved:**
- You can fine-tune large models by training only small low-rank adapter matrices while the base weights stay frozen
- Achieves performance comparable to full fine-tuning with up to 10,000× fewer trainable parameters
**How we use it:**
- Core technique for our training: makes it affordable on a T4 GPU
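The savings are easy to sanity-check with back-of-envelope arithmetic for a single weight matrix (the dimensions below are illustrative, not Qwen3's actual shapes):

```python
# LoRA parameter count for one weight matrix W of shape (d_out, d_in):
# full fine-tuning trains d_out * d_in params; LoRA trains B (d_out x r)
# and A (r x d_in), i.e. r * (d_out + d_in) params.
d_out, d_in, r = 2048, 2048, 16
full = d_out * d_in           # 4,194,304 trainable params
lora = r * (d_out + d_in)     # 65,536 trainable params
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # ratio: 64x
```

The ratio grows with matrix size, which is why the savings are far larger on billion-parameter models.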
---
### 6. LoRA Without Regret (Thinking Machines Lab, 2025)
**What it proved:**
- Applying LoRA to **ALL linear layers** (not just attention projections) matches full fine-tuning quality
- Previous wisdom was to only apply LoRA to q_proj and v_proj
**How we use it:**
- We set `target_modules="all-linear"` for best quality
- This is our "secret sauce" for making LoRA as good as full fine-tuning
---
## 📊 Datasets We Discovered
### Existing Tool-Calling Datasets (On HuggingFace Hub)
| Dataset | Size | Format | Notes |
|---------|------|--------|-------|
| **glaiveai/glaive-function-calling-v2** | ~100K | Conversations | Most popular, Apache 2.0 |
| **glaiveai/glaive-function-calling** | ~52K | Conversations | Earlier version |
| **togethercomputer/glaive-function-calling-v2-formatted** | ~100K | Conversations | Community formatted |
| **lilacai/glaive-function-calling-v2-sharegpt** | ~100K | ShareGPT format | Good for chat models |
| **Salesforce/xlam-function-calling** | ~60K | JSON | Diverse domains |
| **NousResearch/hermes-function-calling** | ~20K | Conversations | Hermes format |
### Our Existing Dataset
**muhammadtlha944/mcp-agent-training-data**
- **Train:** 15,694 examples (63.2 MB)
- **Validation:** 826 examples (3.2 MB)
- **Format:** `messages` column with role/content pairs
- **Content:** Mixed function-calling, JSON output, clarification, safety
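A record in the `messages` format described above looks roughly like this (the content itself is invented for illustration):

```python
# Illustrative record with a `messages` column of role/content pairs.
example = {
    "messages": [
        {"role": "system",
         "content": "You can call tools. Respond with JSON tool calls."},
        {"role": "user", "content": "What's the weather in Lahore?"},
        {"role": "assistant",
         "content": '{"name": "get_weather", "arguments": {"city": "Lahore"}}'},
    ]
}
roles = [m["role"] for m in example["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```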
**Quality assessment:**
- ✅ Good: Has system/user/assistant messages in the proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary; the model might get confused about the expected format
- ⚠️ Concern: Only ~16K examples (TinyAgent used 80K)
- ⚠️ Concern: No explicit MCP-format examples
---
## 🔧 APIs & Libraries (Current Versions)
### TRL (Transformers Reinforcement Learning)
- **SFTTrainer** with `peft_config` parameter for LoRA
- **SFTConfig** for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs
### PEFT (Parameter-Efficient Fine-Tuning)
- `LoraConfig` with `target_modules="all-linear"`
- Adapters are ~100MB for 2B model with r=16
### Key Parameters (From TRL Docs):
- Learning rate for LoRA SFT: **2e-4** (10× higher than for full fine-tuning)
- Batch size strategy: small per-device batch + gradient accumulation
- For a ~2B model on a T4: batch_size=4, accumulation=4 → effective batch = 16
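Putting the pieces above together, a minimal sketch of the SFT setup might look like the following. The `LoraConfig`/`SFTConfig`/`SFTTrainer` names are the real PEFT/TRL APIs, but `lora_alpha=32` (the common 2×r heuristic), the output directory, and the `train_dataset` placeholder are assumptions of this sketch, not confirmed project settings.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,                 # assumption: common 2*r heuristic
    target_modules="all-linear",   # per "LoRA Without Regret"
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="mcp-agent-1.7b-sft",     # illustrative path
    learning_rate=2e-4,                  # 10x the full-FT rate
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 16
    num_train_epochs=3,
    report_to="trackio",
    disable_tqdm=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    args=args,
    train_dataset=train_dataset,         # dataset with a `messages` column
    peft_config=peft_config,
)
trainer.train()
```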
---
## 🎯 Our Research Conclusions
### What Works (Backed by Papers)
1. **Small models CAN do tool-calling well**: TinyAgent showed a 1.1B model can match GPT-4-Turbo at focused tasks
2. **Qwen3-1.7B is a strong base**: the STAR paper shows it beats larger models at function calling
3. **LoRA with all-linear targets matches full FT**: per the "LoRA Without Regret" report
4. **~16K examples is workable**: TinyAgent used 80K but reported good results with less data
5. **MCP is the future protocol**: multiple 2025 papers build their benchmarks on it
### Our Choices (And Why)
| Decision | Choice | Reason |
|----------|--------|--------|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard, prevents overfitting |
| Learning rate | 2e-4 | 10× base rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |
---
## 🔜 Next Step
Read `03-architecture.md` to understand HOW the agent harness works.