
02 - Research: Papers, Datasets & Key Findings

🔬 Our Research Mission

We asked: "What's the best way to train a small model for tool-calling? What do the papers say?"

We searched:

  • Research papers on arXiv (via HuggingFace papers)
  • Existing datasets on HuggingFace Hub
  • Current TRL/Transformers APIs (to avoid outdated code)
  • Existing repos and training examples

📄 Landmark Papers We Found

1. TinyAgent: Function Calling at the Edge (arXiv:2409.00608)

What it proved:

  • A 1.1B parameter model fine-tuned for tool-calling can match GPT-4-Turbo at function-calling tasks
  • The key is high-quality synthetic data (they generated 80K examples with GPT-4)
  • They used LLMCompiler framework: model outputs a plan with dependencies, then tools execute in order

Their training recipe:

  • Base models: TinyLlama-1.1B and Wizard-2-7B
  • Dataset: 80K synthetic function-calling plans
  • Metric: Graph isomorphism (does the model's tool-call DAG match the ground truth?)
  • Tool RAG: Small model (DeBERTa) selects which tools to use before calling the LLM
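TinyAgent's metric asks whether the predicted plan calls the same tools with the same dependency structure as the reference plan, regardless of how the nodes are labeled. A minimal sketch of that idea (our own one-level simplification, not the paper's code) might look like:

```python
# Sketch of a DAG-equality check in the spirit of TinyAgent's
# graph-isomorphism metric (our simplification, not the paper's code).
# A plan is a dict: node id -> (tool_name, set of node ids it depends on).

def plans_match(predicted, reference):
    """True if both plans call the same tools with the same dependency
    structure, ignoring the arbitrary node ids.

    Note: comparing each node by (tool, tools of its direct dependencies)
    is a one-level approximation of full graph isomorphism.
    """
    if len(predicted) != len(reference):
        return False

    def canon(plan):
        # Canonical form: each node becomes (tool, sorted dependency tools)
        dep_tools = {
            nid: tuple(sorted(plan[d][0] for d in deps))
            for nid, (tool, deps) in plan.items()
        }
        return sorted((tool, dep_tools[nid]) for nid, (tool, _) in plan.items())

    return canon(predicted) == canon(reference)

pred = {0: ("search", set()), 1: ("summarize", {0})}
ref = {"a": ("search", set()), "b": ("summarize", {"a"})}
print(plans_match(pred, ref))  # True: same tools, same dependency shape
```

The node ids differ between the two plans, but the canonical forms agree, so the plans count as a match.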

How we use it:

  • Proves small models CAN work for agents
  • Inspired our training data format (function schemas + user queries + tool calls)
  • We use a simpler version (no separate Tool RAG; the model handles tool selection itself)

2. STAR Framework (arXiv:2602.03022)

What it proved:

  • Qwen3-1.7B beats Llama-3.1-8B at function-calling benchmarks
  • The Qwen3 family has strong built-in instruction-following capabilities
  • Smaller models with good pre-training outperform larger models with worse pre-training

How we use it:

  • CONFIRMED our base model choice: Qwen3-1.7B is the sweet spot
  • Proves we don't need a bigger model β€” quality of pre-training matters more

3. Agent-World: Scaling Real-World Agent Training (arXiv:2604.18292)

What it proved:

  • Real-world agent training needs continuous environment-task discovery
  • They use MCP servers as one source of environment themes (from Smithery.ai)
  • They build a self-evolving loop: train → evaluate → discover gaps → expand data

How we use it:

  • Confirms MCP is the right protocol to focus on
  • Inspired us to embed MCP knowledge INTO the model rather than calling external MCP servers

4. MCP-Universe (arXiv:2508.14704)

What it proved:

  • Comprehensive benchmark for evaluating LLMs on real MCP servers
  • Tests tool discovery, tool invocation, and response handling
  • Reveals performance gaps between open- and closed-source models

How we use it:

  • Shows that MCP tool-calling is a real, testable skill
  • We can test our model against this benchmark after training

5. LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)

What it proved:

  • You can fine-tune huge models by only training tiny adapter matrices
  • Achieves performance comparable to full fine-tuning while training up to 10,000× fewer parameters

How we use it:

  • Core technique for our training β€” makes it affordable on T4 GPU

6. LoRA Without Regret (Thinking Machines Lab, 2025)

What it proved:

  • Applying LoRA to ALL linear layers (not just attention projections) matches full fine-tuning quality
  • Previous wisdom was to only apply LoRA to q_proj and v_proj

How we use it:

  • We set target_modules="all-linear" for best quality
  • This is our "secret sauce" for making LoRA as good as full fine-tuning

📊 Datasets We Discovered

Existing Tool-Calling Datasets (On HuggingFace Hub)

| Dataset | Size | Format | Notes |
| --- | --- | --- | --- |
| glaiveai/glaive-function-calling-v2 | ~100K | Conversations | Most popular, Apache 2.0 |
| glaiveai/glaive-function-calling | ~52K | Conversations | Earlier version |
| togethercomputer/glaive-function-calling-v2-formatted | ~100K | Conversations | Community formatted |
| lilacai/glaive-function-calling-v2-sharegpt | ~100K | ShareGPT format | Good for chat models |
| Salesforce/xlam-function-calling | ~60K | JSON | Diverse domains |
| NousResearch/hermes-function-calling | ~20K | Conversations | Hermes format |

Our Existing Dataset

muhammadtlha944/mcp-agent-training-data

  • Train: 15,694 examples (63.2 MB)
  • Validation: 826 examples (3.2 MB)
  • Format: messages column with role/content pairs
  • Content: Mixed function-calling, JSON output, clarification, safety
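To make the format concrete, here is a hypothetical record in the messages layout (illustrative values we made up, not an actual row from the dataset):

```python
# Hypothetical example record in the dataset's `messages` format
# (illustrative values, not an actual row from the dataset).
example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant with access to tools: "
                       '[{"name": "get_weather", "parameters": {"city": "string"}}]',
        },
        {"role": "user", "content": "What's the weather in Lahore?"},
        {
            "role": "assistant",
            "content": '{"tool_call": {"name": "get_weather", '
                       '"arguments": {"city": "Lahore"}}}',
        },
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```

Each row is one conversation: a system message carrying the tool schemas, a user query, and an assistant turn containing the tool call.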

Quality assessment:

  • ✅ Good: Has system/user/assistant messages in proper format
  • ✅ Good: Covers multiple tool-calling patterns
  • ⚠️ Concern: System prompts vary - the model might get confused about the expected format
  • ⚠️ Concern: Only ~16K examples; TinyAgent used 80K
  • ⚠️ Concern: No explicit MCP format examples

🔧 APIs & Libraries (Current Versions)

TRL (Transformers Reinforcement Learning)

  • SFTTrainer with peft_config parameter for LoRA
  • SFTConfig for training arguments
  • report_to="trackio" for monitoring
  • disable_tqdm=True for clean logs

PEFT (Parameter-Efficient Fine-Tuning)

  • LoraConfig with target_modules="all-linear"
  • Adapters are ~100MB for 2B model with r=16
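The adapter-size figure can be sanity-checked with quick arithmetic: for each linear layer of shape (d_out, d_in), a rank-r LoRA adapter adds r × (d_in + d_out) parameters. The dimensions below are our assumptions for a roughly Qwen3-1.7B-scale model, not values read from the actual checkpoint:

```python
# Back-of-envelope LoRA adapter size for an "all-linear" config.
# Dimensions are assumptions for a ~2B-scale model, not measured values.
r = 16
hidden, intermediate, n_layers = 2048, 6144, 28

# Per-layer linear shapes (d_out, d_in): attention q/k/v/o + MLP gate/up/down
linears = [
    (hidden, hidden), (hidden, hidden), (hidden, hidden), (hidden, hidden),
    (intermediate, hidden), (intermediate, hidden), (hidden, intermediate),
]
lora_params = n_layers * sum(r * (d_in + d_out) for d_out, d_in in linears)
size_mb = lora_params * 2 / 1e6  # fp16 = 2 bytes per parameter
print(f"{lora_params / 1e6:.1f}M params ~ {size_mb:.0f} MB")  # ~18.4M params, ~37 MB
```

Stored in fp32 (as PEFT adapters often are) this roughly doubles, which puts it in the same ballpark as the ~100MB figure above once extra saved modules are included.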

Key Parameters (From TRL Docs):

  • Learning rate for LoRA SFT: 2e-4 (10× higher than full fine-tuning)
  • Batch size strategy: Small per-device batch + gradient accumulation
  • For 2B model on T4: batch_size=4, accumulation=4 → effective batch=16
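Putting these parameters together, a minimal sketch of the training setup (our reading of the current TRL/PEFT APIs; `lora_alpha`, `output_dir`, and `train_dataset` are placeholders or assumptions, not values from a verified script):

```python
# Sketch of the LoRA SFT setup described above (assumed TRL/PEFT usage;
# not a verified end-to-end script).
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,                  # common convention: 2 * r (our assumption)
    target_modules="all-linear",    # per "LoRA Without Regret"
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="mcp-agent-1.7b-sft",     # placeholder path
    learning_rate=2e-4,                  # ~10x the full fine-tuning rate
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 4 * 4 = 16
    num_train_epochs=3,
    report_to="trackio",
    disable_tqdm=True,                   # clean logs
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    args=training_args,
    train_dataset=train_dataset,         # dataset with a `messages` column
    peft_config=peft_config,
)
trainer.train()
```

Passing `peft_config` to `SFTTrainer` is what turns this into LoRA training rather than full fine-tuning; only the adapter weights are updated and saved.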

🎯 Our Research Conclusions

What Works (Backed by Papers)

  1. Small models CAN do tool-calling well - TinyAgent showed 1.1B ≈ GPT-4 at focused tasks
  2. Qwen3-1.7B is the best base - the STAR paper shows it beats larger models
  3. LoRA with all-linear targets matches full FT - "LoRA Without Regret" paper
  4. ~16K examples is workable - TinyAgent used 80K but still got good results with less
  5. MCP is the future protocol - multiple 2025 papers use MCP for benchmarks

Our Choices (And Why)

| Decision | Choice | Reason |
| --- | --- | --- |
| Base model | Qwen3-1.7B | STAR paper: beats Llama-3.1-8B; fits T4 |
| Training method | LoRA SFT | Affordable, proven quality |
| LoRA rank | r=16 | Proportional to dataset size |
| LoRA target | all-linear | "LoRA Without Regret": matches full FT |
| Epochs | 3 | Standard, prevents overfitting |
| Learning rate | 2e-4 | 10× base rate for LoRA |
| Batch size | 4×4=16 | Fits T4 memory |

🔜 Next Step

Read 03-architecture.md to understand HOW the agent harness works.