02 – Research: Papers, Datasets & Key Findings
Our Research Mission
We asked: "What's the best way to train a small model for tool-calling? What do the papers say?"
We searched:
- Research papers on arXiv (via HuggingFace papers)
- Existing datasets on HuggingFace Hub
- Current TRL/Transformers APIs (to avoid outdated code)
- Existing repos and training examples
Landmark Papers We Found
1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)
What it proved:
- A 1.1B parameter model fine-tuned for tool-calling can match GPT-4-Turbo at function-calling tasks
- The key is high-quality synthetic data (they generated 80K examples with GPT-4)
- They used the LLMCompiler framework: the model outputs a plan with dependencies, then tools execute in order
Their training recipe:
- Base models: TinyLlama-1.1B and Wizard-2-7B
- Dataset: 80K synthetic function-calling plans
- Metric: Graph isomorphism (does the model's tool-call DAG match the ground truth? see the sketch after this list)
- Tool RAG: Small model (DeBERTa) selects which tools to use before calling the LLM
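To make the plan-with-dependencies idea and the graph-isomorphism metric concrete, here is a small illustration (our own sketch, not TinyAgent's code; the plan format and tool names are hypothetical) that checks whether a predicted tool-call DAG matches the ground truth:

```python
# Sketch only: compare a predicted tool-call plan against a ground-truth plan
# by checking that their dependency DAGs are isomorphic (same structure, same tools).
# The plan format here is hypothetical; TinyAgent/LLMCompiler define their own schema.
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

def plan_to_dag(plan):
    """plan: list of {"id": ..., "tool": ..., "depends_on": [ids]} steps."""
    dag = nx.DiGraph()
    for step in plan:
        dag.add_node(step["id"], tool=step["tool"])
        for dep in step.get("depends_on", []):
            dag.add_edge(dep, step["id"])
    return dag

def plans_match(predicted, ground_truth):
    """True if the two tool-call DAGs are isomorphic with matching tool names."""
    matcher = DiGraphMatcher(
        plan_to_dag(predicted),
        plan_to_dag(ground_truth),
        node_match=lambda a, b: a["tool"] == b["tool"],
    )
    return matcher.is_isomorphic()

predicted = [
    {"id": 1, "tool": "get_weather", "depends_on": []},
    {"id": 2, "tool": "send_email", "depends_on": [1]},
]
ground_truth = [
    {"id": "a", "tool": "get_weather", "depends_on": []},
    {"id": "b", "tool": "send_email", "depends_on": ["a"]},
]
print(plans_match(predicted, ground_truth))  # True: same structure, same tools
```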
How we use it:
- Proves small models CAN work for agents
- Inspired our training data format (function schemas + user queries + tool calls)
- We use a simpler version (no separate Tool RAG, model handles selection)
2. STAR Framework (arXiv:2602.03022)
What it proved:
- Qwen3-1.7B beats Llama-3.1-8B at function-calling benchmarks
- The Qwen3 family has strong built-in instruction-following capabilities
- Smaller models with good pre-training outperform larger models with worse pre-training
How we use it:
- CONFIRMED our base model choice: Qwen3-1.7B is the sweet spot
- Proves we don't need a bigger model: quality of pre-training matters more
3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)
What it proved:
- Real-world agent training needs continuous environment-task discovery
- They use MCP servers as one source of environment themes (from Smithery.ai)
- They build a self-evolving loop: train → evaluate → discover gaps → expand data
How we use it:
- Confirms MCP is the right protocol to focus on
- Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers
4. MCP-Universe (arXiv:2508.14704)
What it proved:
- Comprehensive benchmark for evaluating LLMs on real MCP servers
- Tests tool discovery, tool invocation, and response handling
- Reveals performance disparities between open and closed-source models
How we use it:
- Shows that MCP tool-calling is a real, testable skill
- We can test our model against this benchmark after training
5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
What it proved:
- You can fine-tune huge models by only training tiny adapter matrices
- Achieves performance comparable to full fine-tuning while training orders of magnitude fewer parameters (the paper reports up to ~10,000× fewer for GPT-3)
How we use it:
- Core technique for our training: makes it affordable on a T4 GPU (see the parameter-count sketch below)
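As a rough sense of scale (our own arithmetic, not numbers from the paper): LoRA keeps the pretrained weight W0 (shape d×k) frozen and trains only a low-rank update, so the per-layer trainable parameter count drops from d·k to r·(d+k):

$$
W' = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}
$$

For example, a 2048×2048 projection with r=16 trains 2 · 2048 · 16 = 65,536 parameters instead of 2048² ≈ 4.2M, roughly a 64× reduction per matrix, while the frozen base weights are never duplicated.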
6. LoRA Without Regret (Thinking Machines Lab, 2025)
What it proved:
- Applying LoRA to ALL linear layers (not just attention projections) matches full fine-tuning quality
- Previous wisdom was to only apply LoRA to q_proj and v_proj
How we use it:
- We set `target_modules="all-linear"` for best quality
- This is our "secret sauce" for making LoRA as good as full fine-tuning
Datasets We Discovered
Existing Tool-Calling Datasets (on the Hugging Face Hub)
| Dataset | Size | Format | Notes |
|---|---|---|---|
| glaiveai/glaive-function-calling-v2 | ~100K | Conversations | Most popular, Apache 2.0 |
| glaiveai/glaive-function-calling | ~52K | Conversations | Earlier version |
| togethercomputer/glaive-function-calling-v2-formatted | ~100K | Conversations | Community formatted |
| lilacai/glaive-function-calling-v2-sharegpt | ~100K | ShareGPT format | Good for chat models |
| Salesforce/xlam-function-calling-60k | ~60K | JSON | Diverse domains |
| NousResearch/hermes-function-calling | ~20K | Conversations | Hermes format |
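Any of these can be pulled straight from the Hub for a quick look; a minimal sketch using the most popular one (column names differ from dataset to dataset):

```python
# Minimal sketch: load one of the Hub datasets listed above and inspect a raw record.
from datasets import load_dataset

ds = load_dataset("glaiveai/glaive-function-calling-v2", split="train")
print(ds.column_names)  # the conversation columns this dataset exposes
print(ds[0])            # one raw function-calling conversation
```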
Our Existing Dataset
muhammadtlha944/mcp-agent-training-data
- Train: 15,694 examples (63.2 MB)
- Validation: 826 examples (3.2 MB)
- Format: `messages` column with role/content pairs (illustrative example below)
- Content: mixed function-calling, JSON output, clarification, and safety examples
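For illustration only (a made-up record, not one copied from the dataset), a single example in this messages layout looks roughly like:

```python
# Hypothetical record in the `messages` role/content format; real system prompts
# and tool schemas in the dataset vary from example to example.
example = {
    "messages": [
        {"role": "system", "content": "You can call tools. Available: get_weather(city: str) -> str"},
        {"role": "user", "content": "What's the weather in Paris right now?"},
        {"role": "assistant", "content": '{"tool": "get_weather", "arguments": {"city": "Paris"}}'},
    ]
}
```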
Quality assessment:
- ✅ Good: Has system/user/assistant messages in the proper format
- ✅ Good: Covers multiple tool-calling patterns
- ⚠️ Concern: System prompts vary, so the model might get confused about the expected format
- ⚠️ Concern: Only ~16K examples, whereas TinyAgent used 80K
- ⚠️ Concern: No explicit MCP-format examples
APIs & Libraries (Current Versions)
TRL (Transformer Reinforcement Learning)
- `SFTTrainer` with a `peft_config` parameter for LoRA
- `SFTConfig` for training arguments
- `report_to="trackio"` for monitoring
- `disable_tqdm=True` for clean logs
PEFT (Parameter-Efficient Fine-Tuning)
- `LoraConfig` with `target_modules="all-linear"` (see the sketch below)
- Adapters are ~100 MB for a ~2B-parameter model with r=16
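A minimal sketch of that config (the `lora_alpha` value is our assumption, not something prescribed by the sources above):

```python
# PEFT LoRA config sketch: rank and target_modules follow the choices in this doc.
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # LoRA rank (see the decisions table below)
    lora_alpha=32,                # 2*r heuristic; our assumption
    target_modules="all-linear",  # adapters on every linear layer ("LoRA Without Regret")
    task_type="CAUSAL_LM",
)
```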
Key Parameters (From TRL Docs):
- Learning rate for LoRA SFT: 2e-4 (10× higher than for full fine-tuning)
- Batch size strategy: Small per-device batch + gradient accumulation
- For a ~2B model on a T4: batch_size=4, accumulation=4 → effective batch of 16 (see the training sketch below)
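Putting those pieces together, a minimal training sketch with these parameters (the output directory and split names are our assumptions, and TRL argument names occasionally shift between versions):

```python
# TRL SFT + LoRA sketch using the hyperparameters listed above.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

args = SFTConfig(
    output_dir="qwen3-1.7b-mcp-lora",   # assumed name
    learning_rate=2e-4,                 # ~10x the full fine-tuning rate
    per_device_train_batch_size=4,      # small per-device batch that fits a T4
    gradient_accumulation_steps=4,      # 4 x 4 -> effective batch of 16
    num_train_epochs=3,
    report_to="trackio",                # lightweight experiment tracking
    disable_tqdm=True,                  # clean logs
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",            # base model id on the Hub
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"], # assumed split name
    peft_config=LoraConfig(r=16, target_modules="all-linear", task_type="CAUSAL_LM"),
)
trainer.train()
```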
Our Research Conclusions
What Works (Backed by Papers)
- Small models CAN do tool-calling well: TinyAgent showed 1.1B ≈ GPT-4-Turbo at focused tasks
- Qwen3-1.7B is the best base: the STAR paper shows it beats larger models
- LoRA with all-linear targets matches full FT: the "LoRA Without Regret" paper
- ~16K examples is workable: TinyAgent used 80K but got good results with less as well
- MCP is the future protocol: multiple 2025 papers use MCP for benchmarks
Our Choices (And Why)
| Decision | Choice | Reason |
|---|---|---|
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard, prevents overfitting |
| Learning rate | 2e-4 | 10× base rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |
Next Step
Read 03-architecture.md to understand HOW the agent harness works.