# 05 - Dataset Analysis: What We Have, What's Missing, How to Improve

## Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column containing a list of role/content dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format works well:**
- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models
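
A quick way to confirm the format locally is a minimal sketch with the `datasets` library (it assumes the dataset ID above is publicly accessible, or that you are authenticated with the Hub):

```python
from datasets import load_dataset

# Load both splits from the Hugging Face Hub.
ds = load_dataset("muhammadtlha944/mcp-agent-training-data")

print(ds)  # Split names, example counts, and column names.

# Inspect the first training example: a list of role/content dicts.
for turn in ds["train"][0]["messages"]:
    print(f"{turn['role']}: {turn['content'][:80]}")
```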

---

## Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)
```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)
```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)
```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)
```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles
The model sees multiple ways of presenting tools:
- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** The model becomes robust; it can handle different tool presentation formats.

### 2. Multiple Response Formats
The model learns to output:
- Raw JSON objects
- JSON wrapped in fenced code blocks
- XML tool_call tags
- Plain text when no tool is needed

**Benefit:** The model adapts to different output format requirements.

### 3. Mixed Tasks
- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format
All examples use the standard `messages` format with role/content pairs.
This is the format SFTTrainer expects, so no preprocessing is needed.

### 5. Reasonable Size
15,694 training examples are enough for LoRA fine-tuning:
- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), ~16K is roughly proportional
- Rule of thumb: 1K examples per LoRA rank → 16K / 16 = 1K ✅

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:
- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model may get confused about its "identity"; it doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:
```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```
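
One way to apply this across the whole dataset is a sketch like the following (it assumes every example stores its conversation in a `messages` list, as shown earlier; splicing each example's real tool list into the `[tools]` placeholder is left out):

```python
from datasets import load_dataset

STANDARD_SYSTEM_PROMPT = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to "
    "help users. You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    """Replace (or prepend) the system message with the standard persona."""
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # NOTE: this discards any tool list embedded in the original prompt;
        # in practice you would extract it and fill the [tools] placeholder.
        messages[0]["content"] = STANDARD_SYSTEM_PROMPT
    else:
        messages = [{"role": "system", "content": STANDARD_SYSTEM_PROMPT}] + messages
    return {"messages": messages}

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
ds = ds.map(standardize_system_prompt)  # Applies to every split.
```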

### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but its examples use generic function calling, not MCP's JSON-RPC format.

**What we have:**
```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** The model won't generate true MCP format; it'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format.
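
If you take the post-processing route, the wrapper is small. A sketch (note that real JSON-RPC requests also carry an `id`, which the example above omits, so the wrapper adds one):

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id.

def to_mcp_call(generic_call: str) -> str:
    """Wrap a generic {"tool": ..., "arguments": ...} call in an MCP
    tools/call JSON-RPC envelope."""
    call = json.loads(generic_call)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    })

print(to_mcp_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```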

### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**
```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** The model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool-result → next-tool-call chains, as sketched below.
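
In `messages` form, one such chain could look like this (tool names and contents are illustrative; whether results are stored under a `tool` role or folded into `user`/`system` turns depends on the chat template you train with):

```python
# A hypothetical multi-step training example in messages format.
multi_step_example = {
    "messages": [
        {"role": "system", "content": "You are MCP-Agent. Tools: git_clone, grep."},
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant", "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant", "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments in the repo."},
    ]
}
```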

### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**
```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best
options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"
```

**Impact:** The model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1,000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 training examples are safety-related; that's just 7.6%.

**What we need more of:**
```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause
irreversible data loss. If you need to clean up specific files,
please tell me which ones and I'll help you safely."
```

**Impact:** The model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**
```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file: permission denied. This might
be a protected system file. Is there a different file I can help with?"
```

**Impact:** The model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper `messages` format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure-recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10. A good foundation, but it needs improvement before the agent is truly robust.

---

## Improvement Plan

### Option A: Use As-Is (Quick Start)
- **Pros:** Fastest way to get started; still works for basic tool calling
- **Cons:** Won't generate true MCP format; might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)
Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples (the custom sets in full, plus sampled subsets of the larger public sources)
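
Mixing is straightforward once every source uses the same `messages` schema. A sketch for appending hand-written custom examples to the base dataset (the example content is hypothetical; the public datasets above ship in their own schemas and would need conversion first):

```python
from datasets import Dataset, concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
# Keep only the messages column so schemas match for concatenation.
base = base.remove_columns([c for c in base.column_names if c != "messages"])

custom = Dataset.from_list([
    {"messages": [
        {"role": "system", "content": "You are MCP-Agent..."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "content":
         '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
         '"params": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}'},
    ]},
    # ...more custom MCP, multi-step, clarification, and safety examples
])
# If the inferred features differ: custom = custom.cast(base.features)

augmented = concatenate_datasets([base, custom]).shuffle(seed=42)
```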

### Option C: Regenerate from Scratch (Best Quality, Most Work)
Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:
- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 of API usage
**Time:** 1-2 days of work
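
A generation loop can start as simply as the sketch below, using the OpenAI Python client (the prompt wording, the `gpt-4o` model name, and the JSON-parse filter are all choices to tune, not fixed requirements):

```python
import json
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment.

PROMPT = (
    "Generate one training conversation for an MCP tool-calling agent as a "
    "JSON list of {role, content} messages. The assistant must emit MCP "
    "JSON-RPC tools/call requests, hit one tool error, and recover from it."
)

examples = []
for _ in range(100):  # Scale toward 50K+ with batching and retries.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    try:
        messages = json.loads(response.choices[0].message.content)
        examples.append({"messages": messages})
    except json.JSONDecodeError:
        continue  # Crude quality filter: drop generations that aren't valid JSON.
```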

---

## Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B.

Why?
1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with the better data

This is the **agile approach**: build → measure → improve → repeat.

---

## Specific Action Items for Dataset Improvement

### Immediate (Before First Training)
- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)
- [ ] Test the model on MCP format → if it fails, add more MCP examples (the format check sketched after this list can automate this)
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples
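
The first checkbox can be automated with a strict format check. A sketch (it validates only the JSON-RPC envelope, not whether the chosen tool or its arguments are correct):

```python
import json

def is_valid_mcp_call(output: str) -> bool:
    """Return True if model output is a well-formed MCP tools/call request."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("jsonrpc") == "2.0"
        and call.get("method") == "tools/call"
        and isinstance(call.get("params"), dict)
        and "name" in call["params"]
    )

print(is_valid_mcp_call(
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
    '"params": {"name": "get_weather", "arguments": {}}}'
))  # True
```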

### Long-Term
- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## Key Takeaways

1. **Our dataset is a solid foundation:** 16K examples in the proper format
2. **But it's not perfect:** it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate:** train a first version, then improve based on results
4. **Quality > Quantity:** better to have 10K excellent examples than 100K mediocre ones
5. **Test-driven data improvement:** train → evaluate → identify gaps → add data → retrain

---

## Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.
|