
# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

- **Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
- **Created:** April 23, 2026
- **Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|---|---|---|---|
| train | 15,694 | 63.2 MB | Training the model |
| validation | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

Why this format is perfect:

- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models
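
As a sanity check, here is a minimal sketch of loading the dataset and rendering one example through a chat template. The tokenizer checkpoint is a placeholder, not necessarily our actual base model:

```python
# Minimal sketch: load the dataset and render one example through a chat
# template. SFTTrainer does this rendering automatically when it sees a
# `messages` column; doing it by hand is a useful sanity check.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(ds)  # expect: train (15,694 rows) and validation (826 rows)

# Placeholder checkpoint; substitute whichever base model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
rendered = tokenizer.apply_chat_template(ds["train"][0]["messages"], tokenize=False)
print(rendered[:500])
```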

πŸ” Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

### Type 1: JSON Schema Function Calling (~30%)

```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

What it teaches: Generate structured JSON output following a schema.

### Type 2: Expert Function Composer (~40%)

```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

What it teaches: Choose the right function and provide correct arguments.

### Type 3: Tool Use with XML Tags (~20%)

```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

What it teaches: Parse XML-formatted tool schemas and generate tool calls.

### Type 4: Information Extraction (~10%)

```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

What it teaches: Extract structured data from unstructured text.
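
These percentages come from eyeballing a small sample. A rough way to check them across the full split is to match on characteristic system-prompt phrases; the keywords below are heuristic guesses, not ground truth:

```python
# Estimate the type mix by matching characteristic system-prompt phrases.
# The keywords are assumptions based on the samples above.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")

def classify(example):
    system = next(
        (m["content"] for m in example["messages"] if m["role"] == "system"), ""
    )
    lowered = system.lower()
    if "json schema" in lowered:
        return "json_schema"
    if "composing functions" in lowered:
        return "function_composer"
    if "<tools>" in system:
        return "xml_tools"
    if "information extraction" in lowered:
        return "extraction"
    return "other"

counts = Counter(classify(ex) for ex in ds)
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(ds):.1%})")
```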


## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles

The model sees multiple ways of presenting tools:

- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

Benefit: Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats

The model learns to output:

- Raw JSON objects
- JSON wrapped in Markdown code fences (```json ... ```)
- XML `<tool_call>` tags
- Plain text when no tool is needed

Benefit: Model adapts to different output format requirements.

### 3. Mixed Tasks

- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format

All examples use the standard `messages` format with role/content pairs. This is the format SFTTrainer expects — no preprocessing needed.

### 5. Reasonable Size

15,694 training examples is enough for LoRA fine-tuning:

- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), 16K is proportional
- Rule of thumb: 1K examples per unit of LoRA rank → 16K / 16 = 1K ✓

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

Problem: System prompts vary wildly between examples:

  • "You are a helpful assistant that answers in JSON"
  • "You are an expert in composing functions"
  • "You are a function calling AI model"
  • "You are an expert structured information extraction AI model"

Impact: The model might get confused about its "identity." It doesn't have a consistent persona.

Solution: Standardize system prompts to something like:

```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```
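
One way to apply this is a single `datasets.map` pass; a sketch (in practice you would also carry each example's tool definitions over from the old prompt rather than discard them):

```python
# Sketch: swap every example's system prompt for the single MCP-Agent persona.
# Caveat: real preprocessing should extract per-example tool definitions from
# the old prompt and re-insert them; this sketch just replaces the text.
from datasets import load_dataset

MCP_AGENT_SYSTEM = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to help "
    "users. You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize(example):
    msgs = list(example["messages"])
    if msgs and msgs[0]["role"] == "system":
        msgs[0] = {"role": "system", "content": MCP_AGENT_SYSTEM}
    else:
        msgs.insert(0, {"role": "system", "content": MCP_AGENT_SYSTEM})
    return {"messages": msgs}

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
ds = ds.map(standardize)
```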

### Issue 2: No Explicit MCP Format (HIGH)

Problem: The dataset is named "MCP" but examples use generic function-calling, not MCP's JSON-RPC format:

What we have:

{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}

What MCP uses:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

Impact: Model won't generate true MCP format. It'll generate generic tool calls.

Solution: Add MCP-specific examples or post-process model output to wrap in MCP format.
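
The post-processing route is cheap. A sketch of the wrapper, assuming the model emits the generic `{"tool": ..., "arguments": ...}` shape shown above (JSON-RPC requests also need a unique `id`):

```python
# Sketch: wrap a generic tool call in an MCP tools/call request.
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry a unique id

def to_mcp_call(generic: str) -> str:
    call = json.loads(generic)
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_ids),
            "method": "tools/call",
            "params": {"name": call["tool"], "arguments": call.get("arguments", {})},
        },
        indent=2,
    )

print(to_mcp_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```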

### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

Problem: Most examples are single-turn (one user question → one assistant response).

What we need more of:

User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."

Impact: Model might struggle with multi-step tasks where later steps depend on earlier results.

Solution: Add more multi-turn conversation examples showing tool result → next tool call chains.
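
In the `messages` format, one such chain could look like the sketch below. We give tool results the `tool` role, which many chat templates support; if the base model's template lacks it, fold results into `user` turns instead (an assumption, not a requirement of the dataset):

```python
# Sketch of one multi-step chain in `messages` form. Tool names and outputs
# are illustrative.
multi_step_example = {
    "messages": [
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant",
         "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant",
         "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments in the repo..."},
    ]
}
```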

### Issue 4: No Clarification Examples (MEDIUM)

Problem: Missing examples where the model asks for missing information.

What we need:

User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best 
options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"

Impact: Model might hallucinate (make up) missing information instead of asking.

Solution: Add 500-1000 clarification examples.
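
These are easy to template. A sketch of a generator (the slot questions are illustrative; real coverage should span many tools and missing-field combinations):

```python
# Sketch: build one "ask before acting" training example from a template.
def make_clarification_example(request: str, questions: list[str]) -> dict:
    follow_up = "I'd be happy to help! To find the best options, I need a few details:\n"
    follow_up += "\n".join(f"- {q}" for q in questions)
    return {
        "messages": [
            {"role": "user", "content": request},
            {"role": "assistant", "content": follow_up},
        ]
    }

example = make_clarification_example(
    "Book a flight",
    ["Where are you flying from?", "Where are you flying to?", "What dates?"],
)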

### Issue 5: Limited Safety/Refusal Examples (LOW)

Problem: Only ~1,194 examples out of 15,694 are safety-related. That's only 7.6%.

What we need more of:

User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause 
irreversible data loss. If you need to clean up specific files, 
please tell me which ones and I'll help you safely."

Impact: Model might not be robust enough at refusing dangerous requests.

Solution: Add more safety examples (target: 15-20% of dataset).
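
To track progress toward that target, a crude keyword count works as a first pass (the marker phrases are assumptions; a small classifier would be more reliable):

```python
# Rough sketch: estimate the refusal share with a keyword heuristic.
from datasets import load_dataset

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm unable", "can't help with")

ds = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
n = sum(
    any(m in ex["messages"][-1]["content"].lower() for m in REFUSAL_MARKERS)
    for ex in ds
)
print(f"{n} / {len(ds)} = {n / len(ds):.1%} refusal-style examples")
```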

### Issue 6: No Real Tool Execution Results (MEDIUM)

Problem: All assistant responses show "perfect" tool calls with no error handling.

What we need:

User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file β€” permission denied. This might 
be a protected system file. Is there a different file I can help with?"

Impact: Model won't know how to handle tool failures gracefully.

Solution: Add error-handling examples.
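
Such examples follow the same multi-turn pattern as the chain sketch in Issue 3, with the failure carried in the tool-result turn:

```python
# Sketch of one error-recovery example (same `tool`-role assumption as above).
error_example = {
    "messages": [
        {"role": "user", "content": "Read the file /secret.txt"},
        {"role": "assistant",
         "content": '{"tool": "read_file", "arguments": {"path": "/secret.txt"}}'},
        {"role": "tool", "content": "Error: Permission denied"},
        {"role": "assistant",
         "content": "I cannot read that file: permission denied. Is there a "
                    "different file I can help with?"},
    ]
}
```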


## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|---|---|---|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper `messages` format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure-recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

Overall: 6/10 — Good foundation but needs improvement for a truly robust agent.


## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)

- Pros: Fastest to get started; still works for basic tool-calling
- Cons: Won't generate true MCP format; might struggle with multi-step tasks
- When: If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)

Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---|---|---|
| glaiveai/glaive-function-calling-v2 | More diverse function-calling | ~100K |
| Salesforce/xlam-function-calling-60k | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed behavior | ~1K |
| Custom safety | Refusal patterns | ~1K |

Total after augmentation: ~20K high-quality examples (subsampling the two large public sets rather than taking them whole).
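
Mechanically, mixing comes down to converting every source to the `messages` schema and concatenating. A sketch (it assumes the base set's only column is `messages`; each public set needs its own adapter, not shown):

```python
# Sketch: append custom `messages`-format examples to the base training set.
from datasets import Dataset, concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")

custom = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "Book a flight"},
        {"role": "assistant", "content": "To help with that, I need a few details: ..."},
    ]},
    # ... MCP-format, multi-step, and safety examples go here too
])

mixed = concatenate_datasets([base, custom]).shuffle(seed=42)
print(mixed)
```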

### Option C: Regenerate from Scratch (Best Quality, Most Work)

Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:

- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

Cost: ~$50-100 in API costs. Time: 1-2 days of work.
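
A heavily hedged sketch of the generation loop using the OpenAI Python client; the model name, prompt, and the crude parse-or-drop filter are all placeholders to adapt:

```python
# Sketch: generate synthetic MCP-format examples and keep only parseable ones.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one training example for an MCP tool-calling agent as a JSON object "
    "with a `messages` list (system, user, assistant turns). The assistant turn "
    "must be a JSON-RPC 2.0 tools/call request. Output only the JSON object."
)

examples = []
for _ in range(10):  # scale up with batching, retries, and deduplication
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder generator model
        messages=[{"role": "user", "content": PROMPT}],
    )
    try:
        examples.append(json.loads(resp.choices[0].message.content))
    except json.JSONDecodeError:
        continue  # crude quality filter: drop outputs that don't parse
```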


πŸ† Our Recommendation

Start with Option A (use existing data), then gradually improve to Option B.

Why?

  1. The existing data is good enough for a first version
  2. You can see results quickly (training in ~2 hours)
  3. Once the model is trained, you can evaluate it and identify specific gaps
  4. Then add targeted data for those gaps
  5. Retrain with better data

This is the agile approach: build → measure → improve → repeat.


## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)

- Standardize system prompts to a consistent MCP-Agent persona
- Add 200 MCP JSON-RPC format examples
- Add 200 multi-step chain examples

### After First Training (Based on Evaluation)

- Test the model on MCP format → if it fails, add more MCP examples (a minimal format check is sketched below)
- Test the model on multi-step tasks → if it fails, add more chain examples
- Test the model on unclear queries → if it hallucinates, add clarification examples
- Test the model on dangerous requests → if it doesn't refuse, add safety examples
- Test the model on tool failures → if it doesn't recover, add error-handling examples
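
The first check can be fully mechanical; a sketch (how you produce `completion` depends on your inference setup):

```python
# Sketch: does a completion parse as a JSON-RPC 2.0 tools/call request?
import json

def is_valid_mcp_call(text: str) -> bool:
    try:
        msg = json.loads(text)
    except json.JSONDecodeError:
        return False
    params = msg.get("params")
    return (
        msg.get("jsonrpc") == "2.0"
        and msg.get("method") == "tools/call"
        and isinstance(params, dict)
        and isinstance(params.get("name"), str)
    )

completion = (
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
    '"params": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}'
)
assert is_valid_mcp_call(completion)
```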

### Long-Term

- Evaluate against MCP-AgentBench (arXiv:2509.09734)
- Evaluate against LiveMCPBench (arXiv:2508.01780)
- Benchmark against commercial models
- Collect real user interactions and add them to the training data (continuous learning)

## 🎓 Key Takeaways

1. Our dataset is a solid foundation — 16K examples in the proper format
2. But it's not perfect — it lacks MCP specificity, multi-step chains, and clarification
3. Start simple, iterate — train a first version, then improve based on results
4. Quality > Quantity — better to have 10K perfect examples than 100K mediocre ones
5. Test-driven data improvement — train → evaluate → identify gaps → add data → retrain

## 🔜 Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.