# 05 — Dataset Analysis: What We Have, What's Missing, How to Improve

## 📊 Current Dataset Overview

**Dataset ID:** `muhammadtlha944/mcp-agent-training-data`
**Created:** April 23, 2026
**Size:** 66.4 MB total

### Splits

| Split | Examples | Size | Purpose |
|-------|----------|------|---------|
| **train** | 15,694 | 63.2 MB | Training the model |
| **validation** | 826 | 3.2 MB | Testing generalization |

### Format

Each example has a `messages` column with a list of dictionaries:

```json
[
  {"role": "system", "content": "You are an expert in composing functions..."},
  {"role": "user", "content": "Search for hotels near the airport with free WiFi"},
  {"role": "assistant", "content": "{\"tool\": \"search_hotels\", \"arguments\": {...}}"}
]
```

**Why this format is perfect:**
- ✅ SFTTrainer automatically detects `messages` format
- ✅ Applies the model's chat template automatically
- ✅ Preserves conversation structure (system → user → assistant)
- ✅ Standard format for instruction-tuned models

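For concreteness, here is a minimal sketch of how this format is consumed at training time. The base model and output directory are placeholders, and the exact `SFTTrainer`/`SFTConfig` arguments vary across TRL versions:

```python
# Minimal SFT sketch; assumes trl, transformers, and datasets are installed.
# The base model and output directory are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
    args=SFTConfig(output_dir="./mcp-agent-sft"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

Because every example already carries a `messages` column, SFTTrainer applies the chat template itself; no manual formatting function is needed.
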
---

## 🔍 Deep Dive: What's In Our Data?

### Sample Analysis (10 random examples)

From our inspection, the dataset contains these types of conversations:

#### Type 1: JSON Schema Function Calling (~30%)
```
System: "You are a helpful assistant that answers in JSON.
         Here's the json schema you must adhere to:
         <schema>{...}</schema>"
User: "What tools are available?"
Assistant: "{\"code_parsing\": {...}}"
```

**What it teaches:** Generate structured JSON output following a schema.

#### Type 2: Expert Function Composer (~40%)
```
System: "You are an expert in composing functions. You are given
         a question and a set of possible functions. Based on the
         question, you will need to make one or more function/tool
         calls to achieve the purpose."
User: "Find the cheapest flight from NYC to London next Tuesday"
Assistant: "{\"tool\": \"search_flights\", \"arguments\": {...}}"
```

**What it teaches:** Choose the right function and provide correct arguments.

#### Type 3: Tool Use with XML Tags (~20%)
```
System: "You are a function calling AI model. You are provided
         with function signatures within <tools></tools> XML tags."
User: "What's the weather in Tokyo?"
Assistant: "<tool_call>\n{\"name\": \"get_weather\", ...}\n</tool_call>"
```

**What it teaches:** Parse XML-formatted tool schemas and generate tool calls.

#### Type 4: Information Extraction (~10%)
```
System: "You are an expert structured information extraction AI model."
User: "Extract the meeting details from this email..."
Assistant: "{\"meeting_date\": \"...\", \"meeting_time\": \"...\"}"
```

**What it teaches:** Extract structured data from unstructured text.

---

## ✅ What's Good About Our Dataset

### 1. Diverse Prompt Styles
The model sees multiple ways of presenting tools:
- JSON schemas
- XML tags
- Plain text descriptions
- "Expert in composing functions" framing

**Benefit:** Model becomes robust — it can handle different tool presentation formats.

### 2. Multiple Response Formats
The model learns to output:
- Raw JSON objects
- JSON wrapped in code blocks (```json...```)
- XML tool_call tags
- Plain text when no tool is needed

**Benefit:** Model adapts to different output format requirements.

### 3. Mixed Tasks
- Single tool calls
- Multi-step reasoning (implied in some examples)
- Information extraction
- Structured output generation

### 4. Proper Conversation Format
All examples use the standard `messages` format with role/content pairs.
This is the format SFTTrainer expects — no preprocessing needed.

### 5. Reasonable Size
15,694 training examples is enough for LoRA fine-tuning:
- The TinyAgent paper used 80K examples for r=64 LoRA
- With r=16 (lower rank = less overfitting risk), a proportionally smaller dataset (~16K) is reasonable
- Rule of thumb: ~1K examples per unit of LoRA rank → 16K / 16 = 1K ✓

---

## ⚠️ What's Missing / Could Be Better

### Issue 1: Inconsistent System Prompts (MEDIUM)

**Problem:** System prompts vary wildly between examples:
- "You are a helpful assistant that answers in JSON"
- "You are an expert in composing functions"
- "You are a function calling AI model"
- "You are an expert structured information extraction AI model"

**Impact:** The model might get confused about its "identity." It doesn't have a consistent persona.

**Solution:** Standardize system prompts to something like:
```
You are MCP-Agent, an autonomous AI assistant that uses tools to
help users. You have access to the following tools: [tools].
Use JSON-RPC format for tool calls.
```

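Here is a sketch of how that standardization could be applied with `datasets.map`. The persona text and the choice to overwrite (rather than prepend to) the existing system message are assumptions:

```python
# Sketch: rewrite every example's system prompt to one consistent persona.
# MCP_AGENT_PERSONA and the overwrite strategy are assumptions.
from datasets import load_dataset

MCP_AGENT_PERSONA = (
    "You are MCP-Agent, an autonomous AI assistant that uses tools to "
    "help users. Use JSON-RPC format for tool calls."
)

def standardize_system_prompt(example):
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # Replace the varied personas with the single MCP-Agent one.
        messages[0]["content"] = MCP_AGENT_PERSONA
    else:
        # Some examples may lack a system turn entirely; add one.
        messages = [{"role": "system", "content": MCP_AGENT_PERSONA}] + messages
    return {"messages": messages}

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
dataset = dataset.map(standardize_system_prompt)
```
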
### Issue 2: No Explicit MCP Format (HIGH)

**Problem:** The dataset is named "MCP," but the examples use generic function-calling, not MCP's JSON-RPC format.

**What we have:**
```json
{"tool": "search_flights", "arguments": {"from": "NYC", "to": "London"}}
```

**What MCP uses:**
```json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "search_flights",
    "arguments": {"from": "NYC", "to": "London"}
  }
}
```

**Impact:** Model won't generate true MCP format. It'll generate generic tool calls.

**Solution:** Add MCP-specific examples, or post-process model output to wrap it in MCP format (see the sketch below).

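A minimal sketch of the post-processing route, assuming the model reliably emits the generic `{"tool": ..., "arguments": ...}` shape: wrap its output in an MCP `tools/call` request after generation. The request-id handling is an assumption:

```python
# Sketch: wrap the model's generic tool call in an MCP JSON-RPC request.
# Assumes the model emits {"tool": ..., "arguments": ...} as valid JSON.
import itertools
import json

_request_ids = itertools.count(1)  # simple monotonic request id (assumption)

def to_mcp_request(model_output: str) -> dict:
    call = json.loads(model_output)
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": call["tool"],
            "arguments": call.get("arguments", {}),
        },
    }
```
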
### Issue 3: Limited Multi-Step Chain Examples (MEDIUM)

**Problem:** Most examples are single-turn (one user question → one assistant response).

**What we need more of:**
```
User: "Clone the repo and find TODO comments"
Assistant: Tool call 1 (git clone)
System: Result: "Repo cloned"
Assistant: Tool call 2 (grep TODO)
System: Result: "3 TODOs found"
Assistant: "I found 3 TODO comments..."
```

**Impact:** Model might struggle with multi-step tasks where later steps depend on earlier results.

**Solution:** Add more multi-turn conversation examples showing tool result → next tool call chains.

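For reference, this is how one such chain could be encoded in the dataset's `messages` format. The `tool` role name is an assumption; some chat templates use a different role for tool results:

```python
# Sketch: one multi-step training example in `messages` form.
# The "tool" role is an assumption; match it to your chat template.
example = {
    "messages": [
        {"role": "system", "content": "You are MCP-Agent..."},
        {"role": "user", "content": "Clone the repo and find TODO comments"},
        {"role": "assistant", "content": '{"tool": "git_clone", "arguments": {"url": "..."}}'},
        {"role": "tool", "content": "Repo cloned"},
        {"role": "assistant", "content": '{"tool": "grep", "arguments": {"pattern": "TODO"}}'},
        {"role": "tool", "content": "3 TODOs found"},
        {"role": "assistant", "content": "I found 3 TODO comments..."},
    ]
}
```
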
### Issue 4: No Clarification Examples (MEDIUM)

**Problem:** Missing examples where the model asks for missing information.

**What we need:**
```
User: "Book a flight"
Assistant: "I'd be happy to help you book a flight! To find the best
            options, I need a few details:
            - Where are you flying from?
            - Where are you flying to?
            - What dates?
            - Any preferences (airline, time, price range)?"
```

**Impact:** Model might hallucinate (make up) missing information instead of asking.

**Solution:** Add 500-1000 clarification examples.

### Issue 5: Limited Safety/Refusal Examples (LOW)

**Problem:** Only ~1,194 of the 15,694 examples (~7.6%) are safety-related.

**What we need more of:**
```
User: "Delete all files on the system"
Assistant: "I cannot help with that. Deleting all files could cause
            irreversible data loss. If you need to clean up specific files,
            please tell me which ones and I'll help you safely."
```

**Impact:** Model might not be robust enough at refusing dangerous requests.

**Solution:** Add more safety examples (target: 15-20% of the dataset).

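To track progress toward that 15-20% target, the safety share can be re-estimated with rough phrase matching. The phrase list below is an illustrative guess, not the criterion behind the ~1,194 figure:

```python
# Sketch: rough estimate of the safety/refusal share via phrase matching.
# REFUSAL_PHRASES is illustrative, not how the ~1,194 count was produced.
from datasets import load_dataset

REFUSAL_PHRASES = ("i cannot", "i can't help", "i won't", "not able to help")

def looks_like_refusal(example) -> bool:
    turns = [m["content"].lower() for m in example["messages"] if m["role"] == "assistant"]
    return any(phrase in turn for turn in turns for phrase in REFUSAL_PHRASES)

train = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
n_safety = sum(looks_like_refusal(ex) for ex in train)
print(f"Estimated safety share: {n_safety / len(train):.1%}")
```
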
### Issue 6: No Real Tool Execution Results (MEDIUM)

**Problem:** All assistant responses show "perfect" tool calls with no error handling.

**What we need:**
```
User: "Read the file /secret.txt"
Assistant: {"tool": "read_file", "arguments": {"path": "/secret.txt"}}
System: Error: Permission denied
Assistant: "I cannot read that file — permission denied. This might
            be a protected system file. Is there a different file I can help with?"
```

**Impact:** Model won't know how to handle tool failures gracefully.

**Solution:** Add error-handling examples.

---

## 📈 Dataset Quality Scorecard

| Aspect | Score | Notes |
|--------|-------|-------|
| Size | ✅ Good | 16K is adequate for LoRA r=16 |
| Format | ✅ Excellent | Proper messages format |
| Diversity | ✅ Good | Multiple prompt/response styles |
| MCP-specific | ❌ Missing | Uses generic function-calling, not MCP JSON-RPC |
| Multi-step chains | ⚠️ Weak | Mostly single-turn |
| Clarification | ⚠️ Weak | Missing "ask when unclear" examples |
| Safety/Refusal | ⚠️ Okay | Only ~7.6% of data |
| Error handling | ❌ Missing | No failure recovery examples |
| System prompt consistency | ⚠️ Okay | Multiple personas |

**Overall:** 6/10 — Good foundation but needs improvement for a truly robust agent.

---

## 🎯 Improvement Plan

### Option A: Use As-Is (Quick Start)
- **Pros:** Fastest to get started, still works for basic tool-calling
- **Cons:** Won't generate true MCP format, might struggle with multi-step tasks
- **When:** If you want to see results ASAP and iterate

### Option B: Augment with Better Data (Recommended)
Add these datasets to improve coverage:

| Dataset | What It Adds | Size |
|---------|-------------|------|
| **glaiveai/glaive-function-calling-v2** | More diverse function-calling | ~100K |
| **Salesforce/xlam-function-calling-60k** | More tool schemas | ~60K |
| Custom MCP examples | True MCP JSON-RPC format | ~2K |
| Custom multi-step chains | Dependency planning | ~1K |
| Custom clarification | Ask-when-needed | ~1K |
| Custom safety | Refusal patterns | ~1K |

**Total after augmentation:** ~20K high-quality examples

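A sketch of the mixing step for the custom additions, which would share the same `messages` schema; public sets like glaive need a per-source schema conversion first. The file names and output repo id are placeholders:

```python
# Sketch: mix the base dataset with custom augmentation files that already
# use the `messages` schema. File names and repo id are placeholders.
from datasets import concatenate_datasets, load_dataset

base = load_dataset("muhammadtlha944/mcp-agent-training-data", split="train")
custom = load_dataset(
    "json",
    data_files=[
        "mcp_jsonrpc.jsonl",        # true MCP-format examples
        "multi_step_chains.jsonl",  # dependency planning
        "clarification.jsonl",      # ask-when-needed
        "safety.jsonl",             # refusal patterns
    ],
    split="train",
)

# Shuffle so the custom examples are spread through training, not clustered.
mixed = concatenate_datasets([base, custom]).shuffle(seed=42)
mixed.push_to_hub("muhammadtlha944/mcp-agent-training-data-v2")  # placeholder
```
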
### Option C: Regenerate from Scratch (Best Quality, Most Work)
Use a larger model (GPT-4, Claude) to generate synthetic MCP-specific data:
- Generate 50K+ MCP-format conversations
- Include multi-step chains with dependencies
- Include error handling and clarification
- Filter for quality

**Cost:** ~$50-100 in API costs
**Time:** 1-2 days of work

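A minimal sketch of the generation loop, using the OpenAI client as one possible backend. The model name, prompt, and parse-based filter are placeholders; a real pipeline would add deduplication and stronger quality checks:

```python
# Sketch: generate synthetic MCP-format conversations with a larger model.
# Model name, prompt, and the crude parse-based filter are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GEN_PROMPT = (
    "Write one training conversation for an MCP tool-calling agent as a JSON "
    'object {"messages": [...]}. The assistant must call tools using MCP '
    'JSON-RPC: {"jsonrpc": "2.0", "method": "tools/call", ...}.'
)

with open("synthetic_mcp.jsonl", "w") as f:
    for _ in range(100):  # scale toward 50K+ in a real run
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model choice
            messages=[{"role": "user", "content": GEN_PROMPT}],
        )
        text = response.choices[0].message.content
        try:
            example = json.loads(text)  # crude quality filter: must parse
        except json.JSONDecodeError:
            continue  # skip malformed generations
        f.write(json.dumps(example) + "\n")
```
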
---

## 🏆 Our Recommendation

**Start with Option A** (use the existing data), then **gradually improve** toward Option B.

Why?
1. The existing data is good enough for a first version
2. You can see results quickly (training in ~2 hours)
3. Once the model is trained, you can evaluate it and identify specific gaps
4. Then add targeted data for those gaps
5. Retrain with the better data

This is the **agile approach**: build → measure → improve → repeat.

---

## 📋 Specific Action Items for Dataset Improvement

### Immediate (Before First Training)
- [ ] Standardize system prompts to a consistent MCP-Agent persona
- [ ] Add 200 MCP JSON-RPC format examples
- [ ] Add 200 multi-step chain examples

### After First Training (Based on Evaluation)
- [ ] Test the model on MCP format → if it fails, add more MCP examples
- [ ] Test the model on multi-step tasks → if it fails, add more chain examples
- [ ] Test the model on unclear queries → if it hallucinates, add clarification examples
- [ ] Test the model on dangerous requests → if it doesn't refuse, add safety examples
- [ ] Test the model on tool failures → if it doesn't recover, add error-handling examples

### Long-Term
- [ ] Evaluate against MCP-AgentBench (arXiv:2509.09734)
- [ ] Evaluate against LiveMCPBench (arXiv:2508.01780)
- [ ] Benchmark against commercial models
- [ ] Collect real user interactions and add them to the training data (continuous learning)

---

## 🎓 Key Takeaways

1. **Our dataset is a solid foundation** — 16K examples in the proper format
2. **But it's not perfect** — it lacks MCP specificity, multi-step chains, and clarification examples
3. **Start simple, iterate** — train a first version, then improve based on results
4. **Quality > Quantity** — better to have 10K perfect examples than 100K mediocre ones
5. **Test-driven data improvement** — train → evaluate → identify gaps → add data → retrain

---

## 🔜 Next Step

Read `06-execution-plan.md` for the exact step-by-step plan of what we'll do when you say START.