# 02 — Research: Papers, Datasets & Key Findings

## 🔬 Our Research Mission

We asked: *"What's the best way to train a small model for tool-calling? What do the papers say?"*

We searched:

- Research papers on arXiv (via HuggingFace papers)
- Existing datasets on HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples

---

## 📄 Landmark Papers We Found

### 1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)

**What it proved:**

- A **1.1B parameter model** fine-tuned for tool-calling can match **GPT-4-Turbo** at function-calling tasks
- The key is **high-quality synthetic data** (they generated 80K examples with GPT-4)
- They used the **LLMCompiler** framework: the model outputs a plan with dependencies, then tools execute in order

**Their training recipe:**

- Base models: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: graph isomorphism (does the model's tool-call DAG match the ground truth?)
- Tool RAG: a small model (DeBERTa) selects which tools to use before calling the LLM

**How we use it:**

- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls)
- We use a simpler version (no separate Tool RAG; the model handles selection)

---

### 2. STAR Framework (arXiv:2602.03022)

**What it proved:**

- **Qwen3-1.7B beats Llama-3.1-8B** at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training outperform larger models with worse pre-training

**How we use it:**

- **CONFIRMED our base model choice**: Qwen3-1.7B is the sweet spot
- Proves we don't need a bigger model — quality of pre-training matters more

---
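TinyAgent's graph-isomorphism metric above (does the predicted tool-call DAG match the reference plan?) can be approximated by comparing canonical signatures of the two DAGs. This is our own simplified sketch with hypothetical tool names, not the paper's implementation:

```python
def plan_signature(plan):
    """Canonical signature of a tool-call plan DAG.

    `plan` maps a call id to (tool_name, [ids of dependency calls]).
    Plans that invoke the same tools with the same dependency structure
    get equal signatures regardless of call ids or ordering; this is a
    simplified stand-in for a full graph-isomorphism check.
    """
    def sig(call_id):
        tool, deps = plan[call_id]
        # A node's signature is its tool plus the sorted signatures
        # of everything it depends on.
        return (tool, tuple(sorted(sig(d) for d in deps)))
    return tuple(sorted(sig(c) for c in plan))

# Same plan, different call ids and ordering -> equal signatures.
predicted = {
    "c1": ("get_weather", []),
    "c2": ("send_email", ["c1"]),
}
reference = {
    "b": ("send_email", ["a"]),
    "a": ("get_weather", []),
}
print(plan_signature(predicted) == plan_signature(reference))  # -> True
```

A scored evaluation would loop this check over a held-out set of reference plans and report the match rate.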
### 3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)

**What it proved:**

- Real-world agent training needs **continuous environment-task discovery**
- They use **MCP servers** as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data

**How we use it:**

- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers

---

### 4. MCP-Universe (arXiv:2508.14704)

**What it proved:**

- A comprehensive benchmark for evaluating LLMs on **real MCP servers**
- Tests tool discovery, tool invocation, and response handling
- Reveals performance disparities between open-source and closed-source models

**How we use it:**

- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training

---

### 5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)

**What it proved:**

- You can fine-tune huge models by training only tiny adapter matrices
- Achieves performance comparable to full fine-tuning with 1000× fewer trainable parameters

**How we use it:**

- Core technique for our training — makes it affordable on a T4 GPU

---
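The LoRA idea above can be shown numerically: freeze the pretrained weight `W` and train only a low-rank pair `B @ A`. The layer shape here is illustrative (a hypothetical 2048×2048 linear layer), not Qwen3's actual dimensions:

```python
import numpy as np

d, k, r = 2048, 2048, 16          # illustrative layer shape and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))   # pretrained weight, frozen
A = rng.standard_normal((r, k))   # trainable low-rank factor
B = np.zeros((d, r))              # trainable, zero-init so training starts at W

W_eff = W + B @ A                 # effective adapted weight (initially == W)

full_ft = d * k                   # params updated by full fine-tuning
lora_ft = r * (d + k)             # params updated by LoRA
print(f"LoRA trains {lora_ft:,} of {full_ft:,} params "
      f"({full_ft // lora_ft}x fewer)")
# -> LoRA trains 65,536 of 4,194,304 params (64x fewer)
```

Because `B @ A` can be merged into `W` after training, the adapted layer has the same inference cost as the original — the LoRA paper's "no added latency" property.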
### 6. LoRA Without Regret (Thinking Machines Lab, 2025)

**What it proved:**

- Applying LoRA to **ALL linear layers** (not just attention projections) matches full fine-tuning quality
- Previous wisdom was to apply LoRA only to q_proj and v_proj

**How we use it:**

- We set `target_modules="all-linear"` for best quality
- This is our "secret sauce" for making LoRA as good as full fine-tuning

---

## 📊 Datasets We Discovered

### Existing Tool-Calling Datasets (On HuggingFace Hub)

| Dataset | Size | Format | Notes |
|---------|------|--------|-------|
| **glaiveai/glaive-function-calling-v2** | ~100K | Conversations | Most popular, Apache 2.0 |
| **glaiveai/glaive-function-calling** | ~52K | Conversations | Earlier version |
| **togethercomputer/glaive-function-calling-v2-formatted** | ~100K | Conversations | Community formatted |
| **lilacai/glaive-function-calling-v2-sharegpt** | ~100K | ShareGPT format | Good for chat models |
| **Salesforce/xlam-function-calling** | ~60K | JSON | Diverse domains |
| **NousResearch/hermes-function-calling** | ~20K | Conversations | Hermes format |

### Our Existing Dataset

**muhammadtlha944/mcp-agent-training-data**

- **Train:** 15,694 examples (63.2 MB)
- **Validation:** 826 examples (3.2 MB)
- **Format:** `messages` column with role/content pairs
- **Content:** Mixed function-calling, JSON output, clarification, safety

**Quality assessment:**

- ✅ Good: Has system/user/assistant messages in proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary — model might get confused about expected format
- ⚠️ Concern: Only ~16K examples;
TinyAgent used 80K
- ⚠️ Concern: No explicit MCP format examples

---

## 🔧 APIs & Libraries (Current Versions)

### TRL (Transformers Reinforcement Learning)

- **SFTTrainer** with a `peft_config` parameter for LoRA
- **SFTConfig** for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs

### PEFT (Parameter-Efficient Fine-Tuning)

- `LoraConfig` with `target_modules="all-linear"`
- Adapters are ~100MB for a 2B model with r=16

### Key Parameters (From TRL Docs)

- Learning rate for LoRA SFT: **2e-4** (10× higher than full fine-tuning)
- Batch size strategy: small per-device batch + gradient accumulation
- For a 2B model on a T4: batch_size=4, accumulation=4 → effective batch=16

---

## 🎯 Our Research Conclusions

### What Works (Backed by Papers)

1. **Small models CAN do tool-calling well** — TinyAgent showed a 1.1B model can match GPT-4-Turbo at focused tasks
2. **Qwen3-1.7B is the best base** — the STAR paper shows it beats larger models
3. **LoRA with all-linear targets matches full FT** — the "LoRA Without Regret" paper
4. **~16K examples is workable** — TinyAgent used 80K, but saw good results with less
5. **MCP is the future protocol** — multiple 2025 papers use MCP for benchmarks

### Our Choices (And Why)

| Decision | Choice | Reason |
|----------|--------|--------|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits a T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard, prevents overfitting |
| Learning rate | 2e-4 | 10× base rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |

---

## 🔜 Next Step

Read `03-architecture.md` to understand HOW the agent harness works.
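The hyperparameters from the tables and API notes above can be collected into one training sketch. This is a plan under assumptions, not tested code: it presumes recent `trl`/`peft`/`datasets` releases, and `lora_alpha=32`, the output directory, and the `validation` split name are our own placeholder choices (the notes only fix r=16 and the batch/learning-rate settings):

```python
# Sketch: LoRA SFT of Qwen3-1.7B on our dataset, per the recipe above.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

peft_config = LoraConfig(
    r=16,                          # rank, proportional to dataset size
    lora_alpha=32,                 # assumption: common 2x-rank scaling
    target_modules="all-linear",   # "LoRA Without Regret" recipe
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen3-1.7b-mcp-agent",   # placeholder name
    num_train_epochs=3,
    learning_rate=2e-4,                  # 10x the full fine-tuning rate
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 4 * 4 = 16
    report_to="trackio",
    disable_tqdm=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumption: split is named this
    peft_config=peft_config,
)
trainer.train()
```

SFTTrainer applies the chat template to the `messages` column automatically, which is why the dataset format above matters.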