# 02 – Research: Papers, Datasets & Key Findings

## 🔬 Our Research Mission

We asked: *"What's the best way to train a small model for tool-calling? What do the papers say?"*

We searched:
- Research papers on arXiv (via HuggingFace Papers)
- Existing datasets on the HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples

---

## 📚 Landmark Papers We Found

### 1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)

**What it proved:**
- A **1.1B-parameter model** fine-tuned for tool-calling can match **GPT-4-Turbo** at function-calling tasks
- The key is **high-quality synthetic data** (they generated 80K examples with GPT-4)
- They used the **LLMCompiler** framework: the model outputs a plan with dependencies, then tools execute in dependency order

**Their training recipe:**
- Base models: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: graph isomorphism (does the model's tool-call DAG match the ground truth? see the sketch below)
- Tool RAG: a small model (DeBERTa) selects which tools to expose before calling the LLM
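
To make the metric concrete, here is a minimal sketch of a graph-isomorphism check over tool-call plans (our illustration, not TinyAgent's code), assuming `networkx` and a simple plan format with `id`, `tool`, and `deps` fields:

```python
# Illustrative sketch of TinyAgent-style plan matching: compare a predicted
# tool-call DAG against the ground truth via graph isomorphism, matching
# nodes on tool name. Plan format here is an assumption.
import networkx as nx

def build_dag(calls: list[dict]) -> nx.DiGraph:
    """Build a DAG from calls like {"id": 1, "tool": "search", "deps": [0]}."""
    g = nx.DiGraph()
    for call in calls:
        g.add_node(call["id"], tool=call["tool"])
        for dep in call.get("deps", []):
            g.add_edge(dep, call["id"])  # dependency must run first
    return g

def plans_match(predicted: list[dict], reference: list[dict]) -> bool:
    """True if the two plans share structure and tool names."""
    return nx.is_isomorphic(
        build_dag(predicted),
        build_dag(reference),
        node_match=lambda a, b: a["tool"] == b["tool"],
    )

# Same plan, different node ids -> still counts as a match.
pred = [{"id": 0, "tool": "search"}, {"id": 1, "tool": "summarize", "deps": [0]}]
ref  = [{"id": 7, "tool": "search"}, {"id": 9, "tool": "summarize", "deps": [7]}]
assert plans_match(pred, ref)
```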

**How we use it:**
- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls; illustrated below)
- We use a simpler setup (no separate Tool RAG; the model handles tool selection itself)
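
A hypothetical record in that format (the tool, its schema, and the Qwen-style `<tool_call>` tags are illustrative assumptions, not taken from the actual dataset):

```python
# Hypothetical training record (illustrative only): a function schema in the
# system turn, a user query, and the assistant's tool call. The <tool_call>
# tag convention is an assumption borrowed from Qwen-style chat templates.
example = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You can call tools. Available tools:\n"
                '{"name": "get_weather", '
                '"parameters": {"city": {"type": "string"}}}'
            ),
        },
        {"role": "user", "content": "What's the weather in Paris?"},
        {
            "role": "assistant",
            "content": '<tool_call>{"name": "get_weather", '
                       '"arguments": {"city": "Paris"}}</tool_call>',
        },
    ]
}
```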

---

### 2. STAR Framework (arXiv:2602.03022)

**What it proved:**
- **Qwen3-1.7B beats Llama-3.1-8B** at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training can outperform larger models with weaker pre-training

**How we use it:**
- **CONFIRMED our base model choice**: Qwen3-1.7B is the sweet spot
- Shows we don't need a bigger model; pre-training quality matters more

---

### 3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)

**What it proved:**
- Real-world agent training needs **continuous environment-task discovery**
- They use **MCP servers** as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data

**How we use it:**
- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers (see the example request below)
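
For reference, this is the shape of an MCP tool invocation on the wire: a JSON-RPC 2.0 `tools/call` request. The `get_weather` tool and its arguments are illustrative; our model learns to emit the equivalent call inline instead of round-tripping to a live server:

```python
# Shape of an MCP tool invocation: a JSON-RPC 2.0 "tools/call" request.
# The tool name and arguments here are illustrative, not from a real server.
mcp_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",           # a tool advertised by the server
        "arguments": {"city": "Paris"},  # must satisfy the tool's input schema
    },
}
```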

---

### 4. MCP-Universe (arXiv:2508.14704)

**What it provides:**
- A comprehensive benchmark for evaluating LLMs on **real MCP servers**
- Tests of tool discovery, tool invocation, and response handling
- Evidence of performance disparities between open- and closed-source models

**How we use it:**
- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training

---

### 5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)

**What it proved:**
- You can fine-tune huge models by training only tiny low-rank adapter matrices
- Matches full fine-tuning quality while cutting trainable parameters by up to 10,000× (their GPT-3 result)
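
The arithmetic behind that claim, sketched for a single linear layer (dimensions illustrative): LoRA freezes the pretrained weight `W` and trains only the low-rank factors of the update `W + (α/r)·BA`:

```python
# Back-of-the-envelope LoRA parameter count for one linear layer.
# Full fine-tuning updates the whole (d_out x d_in) weight W; LoRA freezes W
# and trains only B (d_out x r) and A (r x d_in), applied as
# W_eff = W + (alpha / r) * B @ A.
d_in, d_out, r = 4096, 4096, 16        # illustrative dimensions

full_params = d_out * d_in             # 16,777,216 trainable weights
lora_params = r * (d_in + d_out)       # 131,072 trainable weights

print(f"{full_params / lora_params:.0f}x fewer")  # 128x fewer for this layer
```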

**How we use it:**
- Core technique for our training; it makes fine-tuning affordable on a T4 GPU

---

### 6. LoRA Without Regret (Thinking Machines Lab, 2025)

**What it proved:**
- Applying LoRA to **ALL linear layers** (not just the attention projections) matches full fine-tuning quality
- Previous wisdom was to apply LoRA only to q_proj and v_proj

**How we use it:**
- We set `target_modules="all-linear"` for best quality (config sketch below)
- This is our "secret sauce" for making LoRA as good as full fine-tuning
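
A minimal sketch of that config using the PEFT library (`lora_alpha=32` is our assumption of the common `2*r` heuristic; this document doesn't pin it down):

```python
# Minimal PEFT LoRA config sketch. "all-linear" targets every linear layer
# (attention + MLP) instead of the classic q_proj/v_proj-only setup.
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # rank, matching the choices table below
    lora_alpha=32,                # assumption: common 2*r heuristic
    target_modules="all-linear",  # the key takeaway from this paper
    task_type="CAUSAL_LM",
)
```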

---

## 📊 Datasets We Discovered

### Existing Tool-Calling Datasets (On the HuggingFace Hub)

| Dataset | Size | Format | Notes |
|---------|------|--------|-------|
| **glaiveai/glaive-function-calling-v2** | ~100K | Conversations | Most popular, Apache 2.0 |
| **glaiveai/glaive-function-calling** | ~52K | Conversations | Earlier version |
| **togethercomputer/glaive-function-calling-v2-formatted** | ~100K | Conversations | Community-formatted |
| **lilacai/glaive-function-calling-v2-sharegpt** | ~100K | ShareGPT format | Good for chat models |
| **Salesforce/xlam-function-calling-60k** | ~60K | JSON | Diverse domains |
| **NousResearch/hermes-function-calling-v1** | ~20K | Conversations | Hermes format |

### Our Existing Dataset

**muhammadtlha944/mcp-agent-training-data**
- **Train:** 15,694 examples (63.2 MB)
- **Validation:** 826 examples (3.2 MB)
- **Format:** `messages` column with role/content pairs
- **Content:** Mixed function-calling, JSON output, clarification, and safety examples
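
A quick way to load and inspect it (assumes the `datasets` library; the repo id is the one above):

```python
# Load the dataset from the Hub and peek at the first training record.
from datasets import load_dataset

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(ds)                          # expect 'train' and 'validation' splits
print(ds["train"][0]["messages"])  # a list of {"role": ..., "content": ...} dicts
```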

**Quality assessment:**
- ✅ Good: Has system/user/assistant messages in the proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary, so the model might get confused about the expected format
- ⚠️ Concern: Only ~16K examples; TinyAgent used 80K
- ⚠️ Concern: No explicit MCP-format examples

---

## 🔧 APIs & Libraries (Current Versions)

### TRL (Transformer Reinforcement Learning)
- **SFTTrainer** with a `peft_config` parameter for LoRA
- **SFTConfig** for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs

### PEFT (Parameter-Efficient Fine-Tuning)
- `LoraConfig` with `target_modules="all-linear"`
- Adapters are ~100 MB for a 2B model with r=16

### Key Parameters (From the TRL Docs)
- Learning rate for LoRA SFT: **2e-4** (10× higher than for full fine-tuning)
- Batch size strategy: small per-device batch + gradient accumulation
- For a 2B model on a T4: batch_size=4, accumulation=4 → effective batch of 16
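
Putting it all together, a minimal end-to-end training sketch. The numeric hyperparameters are the ones listed above; the model id, `lora_alpha`, and `output_dir` are our assumptions:

```python
# End-to-end LoRA SFT sketch with TRL + PEFT, using the hyperparameters
# listed in this section. Model id, lora_alpha, and output_dir are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",                # assumed Hub id for the base model
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    args=SFTConfig(
        output_dir="qwen3-1.7b-tool-lora",  # illustrative
        num_train_epochs=3,
        learning_rate=2e-4,                 # ~10x the full fine-tuning rate
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,      # 4 x 4 = effective batch of 16
        report_to="trackio",
        disable_tqdm=True,
    ),
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,                      # assumption: 2*r heuristic
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    ),
)
trainer.train()
```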

---

## 🎯 Our Research Conclusions

### What Works (Backed by Papers)

1. **Small models CAN do tool-calling well** – TinyAgent showed a 1.1B model ≈ GPT-4-Turbo at focused tasks
2. **Qwen3-1.7B is the best base** – the STAR paper shows it beats larger models
3. **LoRA with all-linear targets matches full FT** – the "LoRA Without Regret" finding
4. **~16K examples is workable** – TinyAgent used 80K but saw good results with less
5. **MCP is the future protocol** – multiple 2025 papers use MCP for benchmarks

### Our Choices (And Why)

| Decision | Choice | Reason |
|----------|--------|--------|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits a T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard; prevents overfitting |
| Learning rate | 2e-4 | 10× the full fine-tuning rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |

---

## 👉 Next Step

Read `03-architecture.md` to understand HOW the agent harness works.