# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format is perfect:**

- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models

---

## 🔍 Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)

```
System: "You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to: {...}"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)

```
System: "You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)

```
System: "You are a function calling AI model. You are provided with function signatures within XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)

```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles

The model sees multiple ways of presenting tools:

- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats

The model learns to output:

- Raw JSON objects
- JSON wrapped in code blocks (```json ... ```)
- XML `tool_call` tags
- Plain text when no tool is needed

**Benefit:** Model adapts to different output format requirements.

### 3. Mixed Tasks

- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format

All examples use the standard `messages` format with role/content pairs. This is the format SFTTrainer expects — no preprocessing needed.
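To make the "no preprocessing needed" point concrete, here is a minimal training sketch, assuming TRL's `SFTTrainer`, `peft`, and a recent `trl` version; the base model name, output directory, and hyperparameters are placeholders, not decisions made in this plan:

```python
# Minimal sketch: SFTTrainer picks up the `messages` column and applies the
# base model's chat template on its own. The model name and hyperparameters
# below are placeholders, not choices made in this document.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",    # placeholder base model
    train_dataset=dataset["train"],         # 15,694 examples
    eval_dataset=dataset["validation"],     # 826 examples
    args=SFTConfig(output_dir="mcp-agent-lora", num_train_epochs=1),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```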
### 5. Reasonable Size

15,694 training examples are enough for LoRA fine-tuning:

- The TinyAgent paper used 80K examples for an r=64 LoRA
- With r=16 (lower rank = less overfitting risk), 16K is proportional
- Rule of thumb: ~1K examples per unit of LoRA rank → 16K examples / rank 16 = 1K per rank ✓

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:

- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model might get confused about its "identity." It doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:

```
You are MCP-Agent, an autonomous AI assistant that uses tools to help users.
You have access to the following tools: [tools]. Use JSON-RPC format for tool calls.
```
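One way to apply this fix before training is a single `datasets.map` pass over the `messages` column. A sketch, assuming the standardized prompt above is the one you settle on; the constant and function names are illustrative:

```python
# Sketch: overwrite (or prepend) the system message with one consistent persona.
# MCP_AGENT_SYSTEM_PROMPT mirrors the suggested wording above and is illustrative.
from datasets import load_dataset

MCP_AGENT_SYSTEM_PROMPT = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to help users. "
    "You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        messages[0]["content"] = MCP_AGENT_SYSTEM_PROMPT   # replace the existing persona
    else:
        messages = [{"role": "system", "content": MCP_AGENT_SYSTEM_PROMPT}] + messages
    return {"messages": messages}

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
dataset = dataset.map(standardize_system_prompt)   # applies to both splits
```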
### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but the examples use generic function-calling, not MCP's JSON-RPC format.

**What we have:**

```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**

```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** Model won't generate true MCP format. It'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format.
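If you take the post-processing route, the wrapper itself is small. A minimal sketch, assuming the model emits the generic `{"tool": ..., "arguments": ...}` shape shown under "What we have"; the function name is illustrative:

```python
# Sketch: wrap the model's generic tool call in an MCP-style JSON-RPC 2.0 request.
# Note: a full JSON-RPC request also carries an "id" field (omitted in the format
# comparison above); add one if your MCP client requires it.
import json

def to_mcp_tool_call(generic_call: str) -> str:
    call = json.loads(generic_call)
    mcp_request = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    }
    return json.dumps(mcp_request, indent=2)

# The generic call from the example becomes a tools/call request:
print(to_mcp_tool_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```

The same wrapper could also be run over the training data itself if you later decide to teach the model to emit MCP format directly.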
### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**

```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** Model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool result → next tool call chains.

### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**

```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"
```

**Impact:** Model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1,000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 examples are safety-related (about 7.6%).

**What we need more of:**

```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause irreversible data loss. If you need to clean up specific files, please tell me which ones and I'll help you safely."
```

**Impact:** Model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**

```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file — permission denied. This might be a protected system file. Is there a different file I can help with?"
```

**Impact:** Model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper messages format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10 — a good foundation, but it needs improvement for a truly robust agent.

---

## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)

- **Pros:** Fastest to get started; still works for basic tool-calling
- **Cons:** Won't generate true MCP format; might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)

Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function-calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples (sampling subsets of the large public datasets rather than adding them wholesale)

### Option C: Regenerate from Scratch (Best Quality, Most Work)

Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:

- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 in API costs
**Time:** 1-2 days of work

---

## 🏆 Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B. Why?

1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with better data

This is the **agile approach**: build → measure → improve → repeat.

---

## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)

- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)

- [ ] Test the model on MCP format → if it fails, add more MCP examples
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples

### Long-Term

- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## 🎓 Key Takeaways

1. **Our dataset is a solid foundation** — 16K examples in the proper format
2. **But it's not perfect** — it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate** — train a first version, then improve based on results
4. **Quality > Quantity** — better to have 10K perfect examples than 100K mediocre ones
5. **Test-driven data improvement** — train → evaluate → identify gaps → add data → retrain

---

## 🔜 Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.