# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format is perfect:**
- ✅ SFTTrainer automatically detects `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models

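For concreteness, here is a minimal sketch of how this format is consumed at training time. The base model and output directory are placeholders, and the exact `SFTTrainer`/`SFTConfig` arguments vary across TRL versions:

```python
# Minimal SFT sketch; assumes trl, transformers, and datasets are installed.
# The base model and output directory are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
    args=SFTConfig(output_dir="./mcp-agent-sft"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

Because every example already carries a `messages` column, SFTTrainer applies the chat template itself; no manual formatting function is needed.
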
---

## 🔍 Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)
```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)
```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)
```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)
```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles
The model sees multiple ways of presenting tools:
- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats
The model learns to output:
- Raw JSON objects
- JSON wrapped in code blocks (```json...```)
- XML tool_call tags
- Plain text when no tool is needed

**Benefit:** Model adapts to different output format requirements.

### 3. Mixed Tasks
- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format
All examples use the standard `messages` format with role/content pairs.
This is the format SFTTrainer expects — no preprocessing needed.

### 5. Reasonable Size
15,694 training examples is enough for LoRA fine-tuning:
- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), a proportionally smaller dataset (~16K) is reasonable
- Rule of thumb: ~1K examples per unit of LoRA rank → 16K / 16 = 1K ✓

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:
- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model might get confused about its "identity." It doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:
```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```

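Here is a sketch of how that standardization could be applied with `datasets.map`. The persona text and the choice to overwrite (rather than prepend to) the existing system message are assumptions:

```python
# Sketch: rewrite every example's system prompt to one consistent persona.
# MCP_AGENT_PERSONA and the overwrite strategy are assumptions.
from datasets import load_dataset

MCP_AGENT_PERSONA = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to "
    "help users. Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # Replace the varied personas with the single MCP-Agent one.
        messages[0]["content"] = MCP_AGENT_PERSONA
    else:
        # Some examples may lack a system turn entirely; add one.
        messages = [{"role": "system", "content": MCP_AGENT_PERSONA}] + messages
    return {"messages": messages}

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
dataset = dataset.map(standardize_system_prompt)
```
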
### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but the examples use generic function-calling, not MCP's JSON-RPC format.

**What we have:**
```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** Model won't generate true MCP format. It'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format (see the sketch below).

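A minimal sketch of the post-processing route, assuming the model reliably emits the generic `{"tool": ..., "arguments": ...}` shape: wrap its output in an MCP `tools/call` request after generation. The request-id handling is an assumption:

```python
# Sketch: wrap the model's generic tool call in an MCP JSON-RPC request.
# Assumes the model emits {"tool": ..., "arguments": ...} as valid JSON.
import itertools
import json

_request_ids = itertools.count(1)  # simple monotonic request id (assumption)

def to_mcp_request(model_output: str) -> dict:
    call = json.loads(model_output)
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    }
```
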
### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**
```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** Model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool result → next tool call chains.

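For reference, this is how one such chain could be encoded in the dataset's `messages` format. The `tool` role name is an assumption; some chat templates use a different role for tool results:

```python
# Sketch: one multi-step training example in `messages` form.
# The "tool" role is an assumption; match it to your chat template.
example = {
    "messages": [
        {"role": "system", "content": "You are MCP-Agent..."},
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant", "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant", "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments..."},
    ]
}
```
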
### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**
```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best
            options, I need a few details:
            - Where are you flying from?
            - Where are you flying to?
            - What dates?
            - Any preferences (airline, time, price range)?"
```

**Impact:** Model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 examples (~7.6%) are safety-related.

**What we need more of:**
```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause
            irreversible data loss. If you need to clean up specific files,
            please tell me which ones and I'll help you safely."
```

**Impact:** Model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

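To track progress toward that 15-20% target, the safety share can be re-estimated with rough phrase matching. The phrase list below is an illustrative guess, not the criterion behind the ~1,194 figure:

```python
# Sketch: rough estimate of the safety/refusal share via phrase matching.
# REFUSAL_PHRASES is illustrative, not how the ~1,194 count was produced.
from datasets import load_dataset

REFUSAL_PHRASES = ("i cannot", "i can't help", "i won't", "not able to help")

def looks_like_refusal(example) -> bool:
    turns = [m["content"].lower() for m in example["messages"] if m["role"] == "assistant"]
    return any(phrase in turn for turn in turns for phrase in REFUSAL_PHRASES)

train = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
n_safety = sum(looks_like_refusal(ex) for ex in train)
print(f"Estimated safety share: {n_safety / len(train):.1%}")
```
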
### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**
```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file — permission denied. This might
            be a protected system file. Is there a different file I can help with?"
```

**Impact:** Model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper messages format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10 — Good foundation but needs improvement for a truly robust agent.

---

## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)
- **Pros:** Fastest to get started, still works for basic tool-calling
- **Cons:** Won't generate true MCP format, might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)
Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function-calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples

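A sketch of the mixing step for the custom additions, which would share the same `messages` schema; public sets like glaive need a per-source schema conversion first. The file names and output repo id are placeholders:

```python
# Sketch: mix the base dataset with custom augmentation files that already
# use the `messages` schema. File names and repo id are placeholders.
from datasets import concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
custom = load_dataset(
    "json",
    data_files=[
        "mcp_jsonrpc.jsonl",        # true MCP-format examples
        "multi_step_chains.jsonl",  # dependency planning
        "clarification.jsonl",      # ask-when-needed
        "safety.jsonl",             # refusal patterns
    ],
    split="train",
)

# Shuffle so the custom examples are spread through training, not clustered.
mixed = concatenate_datasets([base, custom]).shuffle(seed=42)
mixed.push_to_hub("muhammadtlha944/mcp-agent-training-data-v2")  # placeholder
```
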
### Option C: Regenerate from Scratch (Best Quality, Most Work)
Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:
- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 in API costs
**Time:** 1-2 days of work

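A minimal sketch of the generation loop, using the OpenAI client as one possible backend. The model name, prompt, and parse-based filter are placeholders; a real pipeline would add deduplication and stronger quality checks:

```python
# Sketch: generate synthetic MCP-format conversations with a larger model.
# Model name, prompt, and the crude parse-based filter are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GEN_PROMPT = (
    "Write one training conversation for an MCP tool-calling agent as a JSON "
    'object {"messages": [...]}. The assistant must call tools using MCP '
    'JSON-RPC: {"jsonrpc": "2.0", "method": "tools/call", ...}.'
)

with open("synthetic_mcp.jsonl", "w") as f:
    for _ in range(100):  # scale toward 50K+ in a real run
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model choice
            messages=[{"role": "user", "content": GEN_PROMPT}],
        )
        text = response.choices[0].message.content
        try:
            example = json.loads(text)  # crude quality filter: must parse
        except json.JSONDecodeError:
            continue  # skip malformed generations
        f.write(json.dumps(example) + "\n")
```
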
---

## 🏆 Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B.

Why?
1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with the better data

This is the **agile approach**: build → measure → improve → repeat.

---

## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)
- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)
- [ ] Test the model on MCP format → if it fails, add more MCP examples
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples

### Long-Term
- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## 🎓 Key Takeaways

1. **Our dataset is a solid foundation** — 16K examples in the proper format
2. **But it's not perfect** — it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate** — train a first version, then improve based on results
4. **Quality > Quantity** — better to have 10K perfect examples than 100K mediocre ones
5. **Test-driven data improvement** — train → evaluate → identify gaps → add data → retrain

---

## 🔜 Next Step

Read `06-execution-plan.md` for the exact step-by-step plan of what we'll do when you say START.