
# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

- **Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
- **Created:** April 23, 2026
- **Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|---|---|---|---|
| train | 15,694 | 63.2 MB | Training the model |
| validation | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

Why this format is perfect:

- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models
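
As a sanity check, here is a minimal sketch of loading the dataset and rendering one example through a chat template. The tokenizer checkpoint is a placeholder, not necessarily our actual base model:

```python
# Minimal sketch: load the dataset and render one example through a chat
# template. SFTTrainer does this rendering automatically when it sees a
# `messages` column; doing it by hand is a useful sanity check.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(ds)  # expect: train (15,694 rows) and validation (826 rows)

# Placeholder checkpoint; substitute whichever base model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
rendered = tokenizer.apply_chat_template(ds["train"][0]["messages"], tokenize=False)
print(rendered[:500])
```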

πŸ” Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

### Type 1: JSON Schema Function Calling (~30%)

```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

What it teaches: Generate structured JSON output following a schema.

### Type 2: Expert Function Composer (~40%)

```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

What it teaches: Choose the right function and provide correct arguments.

### Type 3: Tool Use with XML Tags (~20%)

```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

What it teaches: Parse XML-formatted tool schemas and generate tool calls.

### Type 4: Information Extraction (~10%)

```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

What it teaches: Extract structured data from unstructured text.
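
These percentages come from eyeballing a small sample. A rough way to check them across the full split is to match on characteristic system-prompt phrases; the keywords below are heuristic guesses, not ground truth:

```python
# Estimate the type mix by matching characteristic system-prompt phrases.
# The keywords are assumptions based on the samples above.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")

def classify(example):
    system = next(
        (m["content"] for m in example["messages"] if m["role"] == "system"), ""
    )
    lowered = system.lower()
    if "json schema" in lowered:
        return "json_schema"
    if "composing functions" in lowered:
        return "function_composer"
    if "<tools>" in system:
        return "xml_tools"
    if "information extraction" in lowered:
        return "extraction"
    return "other"

counts = Counter(classify(ex) for ex in ds)
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(ds):.1%})")
```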


## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles

The model sees multiple ways of presenting tools:

- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

Benefit: Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats

The model learns to output:

- Raw JSON objects
- JSON wrapped in Markdown code fences (```json ... ```)
- XML `<tool_call>` tags
- Plain text when no tool is needed

Benefit: Model adapts to different output format requirements.

### 3. Mixed Tasks

- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format

All examples use the standard `messages` format with role/content pairs. This is the format SFTTrainer expects — no preprocessing needed.

### 5. Reasonable Size

15,694 training examples is enough for LoRA fine-tuning:

- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), 16K is proportional
- Rule of thumb: 1K examples per unit of LoRA rank → 16K / 16 = 1K ✓

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

Problem: System prompts vary wildly between examples:

  • "You are a helpful assistant that answers in JSON"
  • "You are an expert in composing functions"
  • "You are a function calling AI model"
  • "You are an expert structured information extraction AI model"

Impact: The model might get confused about its "identity." It doesn't have a consistent persona.

Solution: Standardize system prompts to something like:

```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```
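
One way to apply this is a single `datasets.map` pass; a sketch (in practice you would also carry each example's tool definitions over from the old prompt rather than discard them):

```python
# Sketch: swap every example's system prompt for the single MCP-Agent persona.
# Caveat: real preprocessing should extract per-example tool definitions from
# the old prompt and re-insert them; this sketch just replaces the text.
from datasets import load_dataset

MCP_AGENT_SYSTEM = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to help "
    "users. You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize(example):
    msgs = list(example["messages"])
    if msgs and msgs[0]["role"] == "system":
        msgs[0] = {"role": "system", "content": MCP_AGENT_SYSTEM}
    else:
        msgs.insert(0, {"role": "system", "content": MCP_AGENT_SYSTEM})
    return {"messages": msgs}

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
ds = ds.map(standardize)
```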

### Issue 2: No Explicit MCP Format (HIGH)

Problem: The dataset is named "MCP" but examples use generic function-calling, not MCP's JSON-RPC format:

What we have:

{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}

What MCP uses:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

Impact: Model won't generate true MCP format. It'll generate generic tool calls.

Solution: Add MCP-specific examples or post-process model output to wrap in MCP format.
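
The post-processing route is cheap. A sketch of the wrapper, assuming the model emits the generic `{"tool": ..., "arguments": ...}` shape shown above (JSON-RPC requests also need a unique `id`):

```python
# Sketch: wrap a generic tool call in an MCP tools/call request.
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry a unique id

def to_mcp_call(generic: str) -> str:
    call = json.loads(generic)
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_ids),
            "method": "tools/call",
            "params": {"name": call["tool"], "arguments": call.get("arguments", {})},
        },
        indent=2,
    )

print(to_mcp_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```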

### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

Problem: Most examples are single-turn (one user question → one assistant response).

What we need more of:

User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."

Impact: Model might struggle with multi-step tasks where later steps depend on earlier results.

Solution: Add more multi-turn conversation examples showing tool result → next tool call chains.
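
In the `messages` format, one such chain could look like the sketch below. We give tool results the `tool` role, which many chat templates support; if the base model's template lacks it, fold results into `user` turns instead (an assumption, not a requirement of the dataset):

```python
# Sketch of one multi-step chain in `messages` form. Tool names and outputs
# are illustrative.
multi_step_example = {
    "messages": [
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant",
         "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant",
         "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments in the repo..."},
    ]
}
```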

### Issue 4: No Clarification Examples (MEDIUM)

Problem: Missing examples where the model asks for missing information.

What we need:

User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best 
options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"

Impact: Model might hallucinate (make up) missing information instead of asking.

Solution: Add 500-1000 clarification examples.
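
These are easy to template. A sketch of a generator (the slot questions are illustrative; real coverage should span many tools and missing-field combinations):

```python
# Sketch: build one "ask before acting" training example from a template.
def make_clarification_example(request: str, questions: list[str]) -> dict:
    follow_up = "I'd be happy to help! To find the best options, I need a few details:\n"
    follow_up += "\n".join(f"- {q}" for q in questions)
    return {
        "messages": [
            {"role": "user", "content": request},
            {"role": "assistant", "content": follow_up},
        ]
    }

example = make_clarification_example(
    "Book a flight",
    ["Where are you flying from?", "Where are you flying to?", "What dates?"],
)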

### Issue 5: Limited Safety/Refusal Examples (LOW)

Problem: Only ~1,194 examples out of 15,694 are safety-related. That's only 7.6%.

What we need more of:

User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause 
irreversible data loss. If you need to clean up specific files, 
please tell me which ones and I'll help you safely."

Impact: Model might not be robust enough at refusing dangerous requests.

Solution: Add more safety examples (target: 15-20% of dataset).
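
To track progress toward that target, a crude keyword count works as a first pass (the marker phrases are assumptions; a small classifier would be more reliable):

```python
# Rough sketch: estimate the refusal share with a keyword heuristic.
from datasets import load_dataset

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm unable", "can't help with")

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
n = sum(
    any(m in ex["messages"][-1]["content"].lower() for m in REFUSAL_MARKERS)
    for ex in ds
)
print(f"{n} / {len(ds)} = {n / len(ds):.1%} refusal-style examples")
```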

### Issue 6: No Real Tool Execution Results (MEDIUM)

Problem: All assistant responses show "perfect" tool calls with no error handling.

What we need:

User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file β€” permission denied. This might 
be a protected system file. Is there a different file I can help with?"

Impact: Model won't know how to handle tool failures gracefully.

Solution: Add error-handling examples.
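
Such examples follow the same multi-turn pattern as the chain sketch in Issue 3, with the failure carried in the tool-result turn:

```python
# Sketch of one error-recovery example (same `tool`-role assumption as above).
error_example = {
    "messages": [
        {"role": "user", "content": "Read the file /secret.txt"},
        {"role": "assistant",
         "content": '{"tool": "read_file", "arguments": {"path": "/secret.txt"}}'},
        {"role": "tool", "content": "Error: Permission denied"},
        {"role": "assistant",
         "content": "I cannot read that file: permission denied. Is there a "
                    "different file I can help with?"},
    ]
}
```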


## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|---|---|---|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper `messages` format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure-recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

Overall: 6/10 — Good foundation but needs improvement for a truly robust agent.


## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)

- Pros: Fastest to get started; still works for basic tool-calling
- Cons: Won't generate true MCP format; might struggle with multi-step tasks
- When: If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)

Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---|---|---|
| glaiveai/glaive-function-calling-v2 | More diverse function-calling | ~100K |
| Salesforce/xlam-function-calling-60k | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed behavior | ~1K |
| Custom safety | Refusal patterns | ~1K |

Total after augmentation: ~20K high-quality examples (subsampling the two large public sets rather than taking them whole).
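
Mechanically, mixing comes down to converting every source to the `messages` schema and concatenating. A sketch (it assumes the base set's only column is `messages`; each public set needs its own adapter, not shown):

```python
# Sketch: append custom `messages`-format examples to the base training set.
from datasets import Dataset, concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")

custom = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "Book a flight"},
        {"role": "assistant", "content": "To help with that, I need a few details: ..."},
    ]},
    # ... MCP-format, multi-step, and safety examples go here too
])

mixed = concatenate_datasets([base, custom]).shuffle(seed=42)
print(mixed)
```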

### Option C: Regenerate from Scratch (Best Quality, Most Work)

Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:

- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

Cost: ~$50-100 in API costs. Time: 1-2 days of work.
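
A heavily hedged sketch of the generation loop using the OpenAI Python client; the model name, prompt, and the crude parse-or-drop filter are all placeholders to adapt:

```python
# Sketch: generate synthetic MCP-format examples and keep only parseable ones.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one training example for an MCP tool-calling agent as a JSON object "
    "with a `messages` list (system, user, assistant turns). The assistant turn "
    "must be a JSON-RPC 2.0 tools/call request. Output only the JSON object."
)

examples = []
for _ in range(10):  # scale up with batching, retries, and deduplication
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder generator model
        messages=[{"role": "user", "content": PROMPT}],
    )
    try:
        examples.append(json.loads(resp.choices[0].message.content))
    except json.JSONDecodeError:
        continue  # crude quality filter: drop outputs that don't parse
```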


πŸ† Our Recommendation

Start with Option A (use existing data), then gradually improve to Option B.

Why?

  1. The existing data is good enough for a first version
  2. You can see results quickly (training in ~2 hours)
  3. Once the model is trained, you can evaluate it and identify specific gaps
  4. Then add targeted data for those gaps
  5. Retrain with better data

This is the agile approach: build → measure → improve → repeat.


## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)

- Standardize system prompts to a consistent MCP-Agent persona
- Add 200 MCP JSON-RPC format examples
- Add 200 multi-step chain examples

### After First Training (Based on Evaluation)

- Test the model on MCP format → if it fails, add more MCP examples (a minimal format check is sketched below)
- Test the model on multi-step tasks → if it fails, add more chain examples
- Test the model on unclear queries → if it hallucinates, add clarification examples
- Test the model on dangerous requests → if it doesn't refuse, add safety examples
- Test the model on tool failures → if it doesn't recover, add error-handling examples
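
The first check can be fully mechanical; a sketch (how you produce `completion` depends on your inference setup):

```python
# Sketch: does a completion parse as a JSON-RPC 2.0 tools/call request?
import json

def is_valid_mcp_call(text: str) -> bool:
    try:
        msg = json.loads(text)
    except json.JSONDecodeError:
        return False
    params = msg.get("params")
    return (
        msg.get("jsonrpc") == "2.0"
        and msg.get("method") == "tools/call"
        and isinstance(params, dict)
        and isinstance(params.get("name"), str)
    )

completion = (
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
    '"params": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}'
)
assert is_valid_mcp_call(completion)
```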

### Long-Term

- Evaluate against MCP-AgentBench (arXiv:2509.09734)
- Evaluate against LiveMCPBench (arXiv:2508.01780)
- Benchmark against commercial models
- Collect real user interactions and add them to the training data (continuous learning)

## 🎓 Key Takeaways

1. Our dataset is a solid foundation — 16K examples in the proper format
2. But it's not perfect — it lacks MCP specificity, multi-step chains, and clarification
3. Start simple, iterate — train a first version, then improve based on results
4. Quality > Quantity — better to have 10K perfect examples than 100K mediocre ones
5. Test-driven data improvement — train → evaluate → identify gaps → add data → retrain

## 🔜 Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.