# 05 - Dataset Analysis: What We Have, What's Missing, How to Improve

## Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column containing a list of role/content dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format works well:**
- ✅ SFTTrainer automatically detects the `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models
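
A quick way to confirm the format locally is a minimal sketch with the `datasets` library (it assumes the dataset ID above is publicly accessible, or that you are authenticated with the Hub):

```python
from datasets import load_dataset

# Load both splits from the Hugging Face Hub.
ds = load_dataset("muhammadtlha944/mcp-agent-training-data")

print(ds)  # Split names, example counts, and column names.

# Inspect the first training example: a list of role/content dicts.
for turn in ds["train"][0]["messages"]:
    print(f"{turn['role']}: {turn['content'][:80]}")
```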

---

## Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)
```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)
```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)
```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)
```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles
The model sees multiple ways of presenting tools:
- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** The model becomes robust; it can handle different tool presentation formats.

### 2. Multiple Response Formats
The model learns to output:
- Raw JSON objects
- JSON wrapped in fenced code blocks
- XML tool_call tags
- Plain text when no tool is needed

**Benefit:** The model adapts to different output format requirements.

### 3. Mixed Tasks
- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format
All examples use the standard `messages` format with role/content pairs.
This is the format SFTTrainer expects, so no preprocessing is needed.

### 5. Reasonable Size
15,694 training examples are enough for LoRA fine-tuning:
- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), ~16K is roughly proportional
- Rule of thumb: 1K examples per LoRA rank → 16K / 16 = 1K ✅

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:
- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model may get confused about its "identity"; it doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:
```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```
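
One way to apply this across the whole dataset is a sketch like the following (it assumes every example stores its conversation in a `messages` list, as shown earlier; splicing each example's real tool list into the `[tools]` placeholder is left out):

```python
from datasets import load_dataset

STANDARD_SYSTEM_PROMPT = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to "
    "help users. You have access to the following tools: [tools]. "
    "Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    """Replace (or prepend) the system message with the standard persona."""
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # NOTE: this discards any tool list embedded in the original prompt;
        # in practice you would extract it and fill the [tools] placeholder.
        messages[0]["content"] = STANDARD_SYSTEM_PROMPT
    else:
        messages = [{"role": "system", "content": STANDARD_SYSTEM_PROMPT}] + messages
    return {"messages": messages}

ds = load_dataset("muhammadtlha944/mcp-agent-training-data")
ds = ds.map(standardize_system_prompt)  # Applies to every split.
```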

### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but its examples use generic function calling, not MCP's JSON-RPC format.

**What we have:**
```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** The model won't generate true MCP format; it'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format.
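
If you take the post-processing route, the wrapper is small. A sketch (note that real JSON-RPC requests also carry an `id`, which the example above omits, so the wrapper adds one):

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id.

def to_mcp_call(generic_call: str) -> str:
    """Wrap a generic {"tool": ..., "arguments": ...} call in an MCP
    tools/call JSON-RPC envelope."""
    call = json.loads(generic_call)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    })

print(to_mcp_call('{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}'))
```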

### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**
```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** The model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool-result → next-tool-call chains, as sketched below.
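
In `messages` form, one such chain could look like this (tool names and contents are illustrative; whether results are stored under a `tool` role or folded into `user`/`system` turns depends on the chat template you train with):

```python
# A hypothetical multi-step training example in messages format.
multi_step_example = {
    "messages": [
        {"role": "system", "content": "You are MCP-Agent. Tools: git_clone, grep."},
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant", "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant", "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments in the repo."},
    ]
}
```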

### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**
```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best
options, I need a few details:
- Where are you flying from?
- Where are you flying to?
- What dates?
- Any preferences (airline, time, price range)?"
```

**Impact:** The model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1,000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 training examples are safety-related; that's just 7.6%.

**What we need more of:**
```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause
irreversible data loss. If you need to clean up specific files,
please tell me which ones and I'll help you safely."
```

**Impact:** The model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**
```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file: permission denied. This might
be a protected system file. Is there a different file I can help with?"
```

**Impact:** The model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper `messages` format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure-recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10. A good foundation, but it needs improvement before the agent is truly robust.

---

## Improvement Plan

### Option A: Use As-Is (Quick Start)
- **Pros:** Fastest way to get started; still works for basic tool calling
- **Cons:** Won't generate true MCP format; might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)
Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples (the custom sets in full, plus sampled subsets of the larger public sources)
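
Mixing is straightforward once every source uses the same `messages` schema. A sketch for appending hand-written custom examples to the base dataset (the example content is hypothetical; the public datasets above ship in their own schemas and would need conversion first):

```python
from datasets import Dataset, concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
# Keep only the messages column so schemas match for concatenation.
base = base.remove_columns([c for c in base.column_names if c != "messages"])

custom = Dataset.from_list([
    {"messages": [
        {"role": "system", "content": "You are MCP-Agent..."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "content":
         '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
         '"params": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}'},
    ]},
    # ...more custom MCP, multi-step, clarification, and safety examples
])
# If the inferred features differ: custom = custom.cast(base.features)

augmented = concatenate_datasets([base, custom]).shuffle(seed=42)
```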

### Option C: Regenerate from Scratch (Best Quality, Most Work)
Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:
- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 of API usage
**Time:** 1-2 days of work
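
A generation loop can start as simply as the sketch below, using the OpenAI Python client (the prompt wording, the `gpt-4o` model name, and the JSON-parse filter are all choices to tune, not fixed requirements):

```python
import json
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment.

PROMPT = (
    "Generate one training conversation for an MCP tool-calling agent as a "
    "JSON list of {role, content} messages. The assistant must emit MCP "
    "JSON-RPC tools/call requests, hit one tool error, and recover from it."
)

examples = []
for _ in range(100):  # Scale toward 50K+ with batching and retries.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    try:
        messages = json.loads(response.choices[0].message.content)
        examples.append({"messages": messages})
    except json.JSONDecodeError:
        continue  # Crude quality filter: drop generations that aren't valid JSON.
```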

---

## Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B.

Why?
1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with the better data

This is the **agile approach**: build → measure → improve → repeat.

---

## Specific Action Items for Dataset Improvement

### Immediate (Before First Training)
- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)
- [ ] Test the model on MCP format → if it fails, add more MCP examples (the format check sketched after this list can automate this)
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples
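
The first checkbox can be automated with a strict format check. A sketch (it validates only the JSON-RPC envelope, not whether the chosen tool or its arguments are correct):

```python
import json

def is_valid_mcp_call(output: str) -> bool:
    """Return True if model output is a well-formed MCP tools/call request."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("jsonrpc") == "2.0"
        and call.get("method") == "tools/call"
        and isinstance(call.get("params"), dict)
        and "name" in call["params"]
    )

print(is_valid_mcp_call(
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", '
    '"params": {"name": "get_weather", "arguments": {}}}'
))  # True
```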

### Long-Term
- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## Key Takeaways

1. **Our dataset is a solid foundation:** 16K examples in the proper format
2. **But it's not perfect:** it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate:** train a first version, then improve based on results
4. **Quality > Quantity:** better to have 10K excellent examples than 100K mediocre ones
5. **Test-driven data improvement:** train → evaluate → identify gaps → add data → retrain

---

## Next Step

Read `06-execution-plan.md` to see the exact step-by-step plan of what we'll do when you say START.
|