# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format is perfect:**

- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models

---

## 🔍 Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)

```
System: "You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to: {...}"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)

```
System: "You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)

```
System: "You are a function calling AI model. You are provided with function signatures within XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)

```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles

The model sees multiple ways of presenting tools:

- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats

The model learns to output:

- Raw JSON objects
- JSON wrapped in code blocks (```json ... ```)
- XML `tool_call` tags
- Plain text when no tool is needed

**Benefit:** Model adapts to different output format requirements.

### 3. Mixed Tasks

- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format

All examples use the standard `messages` format with role/content pairs. This is the format SFTTrainer expects — no preprocessing needed.
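To make the "no preprocessing needed" point concrete, here is a minimal training sketch, assuming TRL's `SFTTrainer`, `peft`, and a recent `trl` version; the base model name, output directory, and hyperparameters are placeholders, not decisions made in this plan:

```python
# Minimal sketch: SFTTrainer picks up the `messages` column and applies the
# base model's chat template on its own. The model name and hyperparameters
# below are placeholders, not choices made in this document.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",    # placeholder base model
    train_dataset=dataset["train"],         # 15,694 examples
    eval_dataset=dataset["validation"],     # 826 examples
    args=SFTConfig(output_dir="mcp-agent-lora", num_train_epochs=1),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```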
### 5. Reasonable Size

15,694 training examples are enough for LoRA fine-tuning:

- The TinyAgent paper used 80K examples for an r=64 LoRA
- With r=16 (lower rank = less overfitting risk), 16K is proportional
- Rule of thumb: ~1K examples per unit of LoRA rank → 16K examples / rank 16 = 1K per rank ✓

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:

- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model might get confused about its "identity." It doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:

```
You are MCP-Agent, an autonomous AI assistant that uses tools to help users.
You have access to the following tools: [tools]. Use JSON-RPC format for tool calls.
```
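One way to apply this fix before training is a single `datasets.map` pass over the `messages` column. A sketch, assuming the standardized prompt above is the one you settle on; the constant and function names are illustrative:

```python
# Sketch: overwrite (or prepend) the system message with one consistent persona.
# MCP_AGENT_SYSTEM_PROMPT mirrors the suggested wording above and is illustrative.
from datasets import load_dataset

MCP_AGENT_SYSTEM_PROMPT = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to help users. "
    "You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        messages[0]["content"] = MCP_AGENT_SYSTEM_PROMPT   # replace the existing persona
    else:
        messages = [{"role": "system", "content": MCP_AGENT_SYSTEM_PROMPT}] + messages
    return {"messages": messages}

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
dataset = dataset.map(standardize_system_prompt)   # applies to both splits
```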
### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but the examples use generic function-calling, not MCP's JSON-RPC format.

**What we have:**

```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**

```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** Model won't generate true MCP format. It'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format.
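If you take the post-processing route, the wrapper itself is small. A minimal sketch, assuming the model emits the generic `{"tool": ..., "arguments": ...}` shape shown under "What we have"; the function name is illustrative:

```python
# Sketch: wrap the model's generic tool call in an MCP-style JSON-RPC 2.0 request.
# Note: a full JSON-RPC request also carries an "id" field (omitted in the format
# comparison above); add one if your MCP client requires it.
import json

def to_mcp_tool_call(generic_call: str) -> str:
    call = json.loads(generic_call)
    mcp_request = {
        "jsonrpc": "2.0",
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    }
    return json.dumps(mcp_request, indent=2)

# The generic call from the example becomes a tools/call request:
print(to_mcp_tool_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```

The same wrapper could also be run over the training data itself if you later decide to teach the model to emit MCP format directly.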
### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**

```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** Model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool result → next tool call chains.

### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**

```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"
```

**Impact:** Model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1,000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 examples are safety-related (about 7.6%).

**What we need more of:**

```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause irreversible data loss. If you need to clean up specific files, please tell me which ones and I'll help you safely."
```

**Impact:** Model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**

```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file — permission denied. This might be a protected system file. Is there a different file I can help with?"
```

**Impact:** Model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper messages format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10 — a good foundation, but it needs improvement for a truly robust agent.

---

## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)

- **Pros:** Fastest to get started; still works for basic tool-calling
- **Cons:** Won't generate true MCP format; might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)

Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function-calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples (sampling subsets of the large public datasets rather than adding them wholesale)

### Option C: Regenerate from Scratch (Best Quality, Most Work)

Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:

- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 in API costs
**Time:** 1-2 days of work

---

## 🏆 Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B. Why?

1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with better data

This is the **agile approach**: build → measure → improve → repeat.

---

## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)

- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)

- [ ] Test the model on MCP format → if it fails, add more MCP examples
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples

### Long-Term

- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## 🎓 Key Takeaways

1. **Our dataset is a solid foundation** — 16K examples in the proper format
2. **But it's not perfect** — it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate** — train a first version, then improve based on results
4. **Quality > Quantity** — better to have 10K perfect examples than 100K mediocre ones
5. **Test-driven data improvement** — train → evaluate → identify gaps → add data → retrain

---

## 🔜 Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.