# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve
## 📊 Current Dataset Overview
**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total
### Splits
| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |
### Format
Each example has a `messages` column with a list of dictionaries:
```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```
**Why this format is perfect:**
- ✅ SFTTrainer automatically detects `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models (see the loading sketch below)
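As a quick sanity check, here is a minimal loading sketch (assuming the `datasets` library is installed and the dataset ID above is accessible) that pulls both splits and prints one conversation:

```python
from datasets import load_dataset

# Load both splits of the dataset described above.
ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(ds)  # expect: train (~15,694 rows) and validation (~826 rows)

# Each row's `messages` field is a list of {"role", "content"} dicts,
# which SFTTrainer can consume directly as conversational data.
for turn in ds["train"][0]["messages"]:
    print(f'{turn["role"]}: {turn["content"][:80]}')
```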
---
## 🔍 Deep Dive: What's In Our Data?
### Sample Analysis (10 random examples)
From our inspection, the dataset contains these types of conversations:
#### Type 1: JSON Schema Function Calling (~30%)
```
System: "You are a helpful assistant that answers in JSON.
Here's the json schema you must adhere to:
{...}"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```
**What it teaches:** Generate structured JSON output following a schema.
#### Type 2: Expert Function Composer (~40%)
```
System: "You are an expert in composing functions. You are given
a question and a set of possible functions. Based on the
question, you will need to make one or more function/tool
calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```
**What it teaches:** Choose the right function and provide correct arguments.
#### Type 3: Tool Use with XML Tags (~20%)
```
System: "You are a function calling AI model. You are provided
with function signatures within XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```
**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.
#### Type 4: Information Extraction (~10%)
```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```
**What it teaches:** Extract structured data from unstructured text.
---
## ✅ What's Good About Our Dataset
### 1. Diverse Prompt Styles
The model sees multiple ways of presenting tools:
- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing
**Benefit:** Model becomes robust — it can handle different tool presentation formats.
### 2. Multiple Response Formats
The model learns to output:
- Raw JSON objects
- JSON wrapped in fenced code blocks (`` ```json ... ``` ``)
- XML tool_call tags
- Plain text when no tool is needed
**Benefit:** Model adapts to different output format requirements.
### 3. Mixed Tasks
- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation
### 4. Proper Conversation Format
All examples use the standard `messages` format with role/content pairs.
This is the format SFTTrainer expects — no preprocessing needed.
### 5. Reasonable Size
15,694 training examples is enough for LoRA fine-tuning:
- TinyAgent paper used 80K for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), ~16K examples is proportionally sized
- Rule of thumb: ~1K examples per unit of LoRA rank → 16K examples / r=16 ≈ 1K per rank ✓ (see the training sketch below)
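A minimal r=16 training sketch with TRL and PEFT; the base model ID here is an assumed placeholder, not something this document specifies:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")

# r=16 LoRA adapter; lower rank reduces overfitting risk on ~16K examples.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed base model, swap for your own
    train_dataset=ds["train"],            # `messages` format is detected automatically
    eval_dataset=ds["validation"],
    peft_config=peft_config,
    args=SFTConfig(output_dir="mcp-agent-lora"),
)
# trainer.train()
```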
---
## ⚠️ What's Missing / Could Be Better
### Issue 1: Inconsistent System Prompts (MEDIUM)
**Problem:** System prompts vary wildly between examples:
- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"
**Impact:** The model might get confused about its "identity." It doesn't have a consistent persona.
**Solution:** Standardize system prompts to something like:
```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```
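A hedged sketch of applying this across the dataset with `Dataset.map`; the exact prompt text, and the assumption that each conversation has at most one leading system turn, are mine:

```python
STANDARD_SYSTEM_PROMPT = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to help users. "
    "You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    """Replace (or prepend) the system turn with the single MCP-Agent persona."""
    messages = list(example["messages"])
    if messages and messages[0]["role"] == "system":
        messages[0] = {"role": "system", "content": STANDARD_SYSTEM_PROMPT}
    else:
        messages.insert(0, {"role": "system", "content": STANDARD_SYSTEM_PROMPT})
    return {"messages": messages}

ds = ds.map(standardize_system_prompt)
```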
### Issue 2: No Explicit MCP Format (HIGH)
**Problem:** The dataset is named "MCP" but examples use generic function-calling, not MCP's JSON-RPC format:
**What we have:**
```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```
**What MCP uses:**
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```
**Impact:** Model won't generate true MCP format. It'll generate generic tool calls.
**Solution:** Add MCP-specific examples or post-process model output to wrap in MCP format.
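If we take the post-processing route, a hedged sketch of wrapping the model's generic call into an MCP-style `tools/call` request follows; the request `id` counter and the exact `{"tool", "arguments"}` output shape are assumptions based on the samples above:

```python
import json
from itertools import count

_request_ids = count(1)

def to_mcp_request(generic_call: str) -> dict:
    """Wrap a generic {"tool": ..., "arguments": ...} call in MCP-style JSON-RPC."""
    call = json.loads(generic_call)
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),  # JSON-RPC requests carry an id (assumed handling)
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    }

print(to_mcp_request('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```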
### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)
**Problem:** Most examples are single-turn (one user question → one assistant response).
**What we need more of:**
```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```
**Impact:** Model might struggle with multi-step tasks where later steps depend on earlier results.
**Solution:** Add more multi-turn conversation examples showing tool result → next tool call chains.
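In the dataset's `messages` format, one such example could look like the sketch below; the `tool` role name and the literal content strings are illustrative assumptions, so match whatever role your chat template uses for tool results:

```python
multi_step_example = {
    "messages": [
        {"role": "system", "content": "You are MCP-Agent, an autonomous AI assistant..."},
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant", "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant", "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments..."},
    ]
}
```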
### Issue 4: No Clarification Examples (MEDIUM)
**Problem:** Missing examples where the model asks for missing information.
**What we need:**
```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best
options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"
```
**Impact:** Model might hallucinate (make up) missing information instead of asking.
**Solution:** Add 500-1000 clarification examples.
### Issue 5: Limited Safety/Refusal Examples (LOW)
**Problem:** Only ~1,194 of the 15,694 examples are safety-related, i.e. roughly 7.6%.
**What we need more of:**
```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause
irreversible data loss. If you need to clean up specific files,
please tell me which ones and I'll help you safely."
```
**Impact:** Model might not be robust enough at refusing dangerous requests.
**Solution:** Add more safety examples (target: 15-20% of dataset).
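To track progress toward that 15-20% target, here is a rough counting sketch; the refusal keyword list is a heuristic of mine, not something the dataset defines:

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "not able to help")  # assumed heuristics

def looks_like_refusal(example):
    """Heuristic: any assistant turn containing a refusal phrase."""
    return any(
        marker in turn["content"].lower()
        for turn in example["messages"]
        if turn["role"] == "assistant"
        for marker in REFUSAL_MARKERS
    )

safety_count = sum(looks_like_refusal(ex) for ex in ds["train"])
print(f"{safety_count} / {len(ds['train'])} = {safety_count / len(ds['train']):.1%}")
```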
### Issue 6: No Real Tool Execution Results (MEDIUM)
**Problem:** All assistant responses show "perfect" tool calls with no error handling.
**What we need:**
```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file — permission denied. This might
be a protected system file. Is there a different file I can help with?"
```
**Impact:** Model won't know how to handle tool failures gracefully.
**Solution:** Add error-handling examples.
---
## 📈 Dataset Quality Scorecard
| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper messages format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |
**Overall:** 6/10 — Good foundation but needs improvement for a truly robust agent.
---
## 🎯 Improvement Plan
### Option A: Use As-Is (Quick Start)
- **Pros:** Fastest to get started, still works for basic tool-calling
- **Cons:** Won't generate true MCP format, might struggle with multi-step
- **When:** If you want to see results ASAP and iterate
### Option B: Augment with Better Data (Recommended)
Add these datasets to improve coverage:
| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function-calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |
**Total after augmentation:** ~20K high-quality examples (sampling subsets of the larger public datasets rather than using them in full)
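Mechanically, Option B is a mix-and-concatenate job. A hedged sketch follows; the sample size is arbitrary, and each source needs its own conversion into the `messages` format, which is omitted here:

```python
from datasets import load_dataset, concatenate_datasets

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")

# Sample a slice of one augmentation source from the table above.
glaive = load_dataset("glaiveai/glaive-function-calling-v2", split="train")
glaive_sample = glaive.shuffle(seed=42).select(range(3_000))

# Glaive rows are not in the `messages` format; they must be converted first, e.g.:
# converted = glaive_sample.map(convert_to_messages, remove_columns=glaive_sample.column_names)
# combined = concatenate_datasets([base, converted]).shuffle(seed=42)
```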
### Option C: Regenerate from Scratch (Best Quality, Most Work)
Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:
- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality
**Cost:** ~$50-100 in API costs
**Time:** 1-2 days of work
---
## 🏆 Our Recommendation
**Start with Option A** (use existing data), then **gradually improve** to Option B.
Why?
1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with better data
This is the **agile approach**: build → measure → improve → repeat.
---
## 📋 Specific Action Items for Dataset Improvement
### Immediate (Before First Training)
- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples
### After First Training (Based on Evaluation)
- [ ] Test model on MCP format → if fails, add more MCP examples
- [ ] Test model on multi-step tasks → if fails, add more chain examples
- [ ] Test model on unclear queries → if hallucinates, add clarification examples
- [ ] Test model on dangerous requests → if doesn't refuse, add safety examples
- [ ] Test model on tool failures → if doesn't recover, add error-handling examples
### Long-Term
- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add to training data (continuous learning)
---
## 🎓 Key Takeaways
1. **Our dataset is a solid foundation** — 16K examples in proper format
2. **But it's not perfect** — lacks MCP specificity, multi-step chains, clarification
3. **Start simple, iterate** — Train first version, then improve based on results
4. **Quality > Quantity** — Better to have 10K perfect examples than 100K mediocre ones
5. **Test-driven data improvement** — Train → evaluate → identify gaps → add data → retrain
---
## 🔜 Next Step
Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.