muhammadtlha944 commited on
Commit
1ff07c2
Β·
verified Β·
1 Parent(s): 3b065fc

Upload docs/06-execution-plan.md

Browse files
Files changed (1) hide show
  1. docs/06-execution-plan.md +350 -0
docs/06-execution-plan.md ADDED
@@ -0,0 +1,350 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 06 β€” Execution Plan: What We'll Do When You Say "START"
2
+
3
+ ## πŸš€ The Plan
4
+
5
+ When you say **"START"**, here is the EXACT sequence of steps we'll follow.
6
+ Each step has a clear goal, estimated time, and cost.
7
+
8
+ ---
9
+
10
+ ## Phase 1: Setup & Validation (15 minutes)
11
+
12
+ ### Step 1.1: Create Training Sandbox
13
+ **What:** Set up a GPU sandbox with all dependencies installed
14
+ **Why:** Test that everything works before spending money on a real training job
15
+ **Time:** 5 minutes
16
+ **Cost:** $0
17
+
18
+ ```bash
19
+ pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
20
+ ```
21
+
22
+ ### Step 1.2: Validate Dataset Format
23
+ **What:** Load your dataset and verify it works with SFTTrainer
24
+ **Why:** Catch format issues BEFORE training starts (saves hours of debugging)
25
+ **Time:** 5 minutes
26
+ **Cost:** $0
27
+
28
+ ```python
29
+ from datasets import load_dataset
30
+ dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
31
+ print(dataset["train"][0]) # Peek at first example
32
+ ```
33
+
34
+ ### Step 1.3: Verify Model Compatibility
35
+ **What:** Load Qwen3-1.7B tokenizer and test chat template
36
+ **Why:** Make sure the model can process our messages format
37
+ **Time:** 5 minutes
38
+ **Cost:** $0
39
+
40
+ ```python
41
+ from transformers import AutoTokenizer
42
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
43
+ print(tokenizer.chat_template) # Should not be None
44
+ ```
45
+
46
+ ---
47
+
48
+ ## Phase 2: Training Script Development (30 minutes)
49
+
50
+ ### Step 2.1: Write Training Script
51
+ **What:** Create `train.py` with full educational comments
52
+ **Why:** Every line documented so you learn as we build
53
+ **Time:** 15 minutes
54
+ **Cost:** $0
55
+
56
+ **What the script contains:**
57
+ - LoRA configuration (r=16, all-linear, dropout=0.05)
58
+ - SFTConfig with all hyperparameters documented
59
+ - Trackio monitoring setup
60
+ - push_to_hub configuration
61
+ - Plain-text logging (no tqdm progress bars)
62
+
63
+ ### Step 2.2: Test Script in Sandbox
64
+ **What:** Run the script for 10 steps to catch errors
65
+ **Why:** Find bugs NOW before the expensive training job
66
+ **Time:** 10 minutes
67
+ **Cost:** $0 (sandbox GPU time)
68
+
69
+ ```python
70
+ # Run just 10 steps as a smoke test
71
+ training_args.max_steps = 10
72
+ trainer.train()
73
+ ```
74
+
75
+ ### Step 2.3: Review & Fix Issues
76
+ **What:** Fix any import errors, API mismatches, or config issues
77
+ **Why:** Training jobs are expensive β€” we only launch when the script is solid
78
+ **Time:** 5 minutes
79
+ **Cost:** $0
80
+
81
+ ---
82
+
83
+ ## Phase 3: Model Training (2-3 hours)
84
+
85
+ ### Step 3.1: Launch Training Job
86
+ **What:** Submit training to HF Jobs on T4 GPU
87
+ **Why:** T4 is cheapest GPU that fits our model (16GB VRAM)
88
+ **Time:** 2-3 hours (automated)
89
+ **Cost:** ~$1.20-1.80
90
+
91
+ **Pre-flight check before launch:**
92
+ - βœ… Dataset format validated
93
+ - βœ… Script tested in sandbox
94
+ - βœ… push_to_hub=True and hub_model_id set
95
+ - βœ… Timeout set to 4 hours (plenty of buffer)
96
+ - βœ… Trackio monitoring enabled
97
+ - βœ… disable_tqdm=True for clean logs
98
+
99
+ ### Step 3.2: Monitor Training
100
+ **What:** Watch loss curves via Trackio dashboard
101
+ **Why:** Make sure loss is going down (model is learning)
102
+ **Time:** Check every 15 minutes
103
+ **Cost:** $0 (just watching)
104
+
105
+ **What to watch for:**
106
+ ```
107
+ Good: Step 100: loss=2.5 β†’ Step 500: loss=1.2 β†’ Step 2450: loss=0.9
108
+ Warning: Step 100: loss=2.5 β†’ Step 500: loss=2.4 β†’ Step 1000: loss=2.3
109
+ (Learning very slowly β€” might need more epochs or higher LR)
110
+ Bad: Step 100: loss=2.5 β†’ Step 500: loss=3.0 β†’ Step 1000: loss=3.5
111
+ (Loss going UP β€” stop immediately, something is wrong)
112
+ ```
113
+
114
+ ### Step 3.3: Verify Model Pushed to Hub
115
+ **What:** Check that the model appears in your HF repo
116
+ **Why:** Job storage is ephemeral β€” if push_to_hub fails, model is LOST
117
+ **Time:** 5 minutes
118
+ **Cost:** $0
119
+
120
+ **Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
121
+
122
+ ---
123
+
124
+ ## Phase 4: Testing & Evaluation (30 minutes)
125
+
126
+ ### Step 4.1: Load Trained Model
127
+ **What:** Download the model from Hub and test inference
128
+ **Why:** Verify the model actually works after training
129
+ **Time:** 10 minutes
130
+ **Cost:** $0
131
+
132
+ ```python
133
+ from transformers import pipeline
134
+ pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
135
+ ```
136
+
137
+ ### Step 4.2: Run Test Prompts
138
+ **What:** Test the model on real tool-calling scenarios
139
+ **Why:** See if training actually worked
140
+ **Time:** 10 minutes
141
+ **Cost:** $0
142
+
143
+ **Test cases:**
144
+ 1. Simple tool call: "Find all Python files"
145
+ 2. Multi-step: "Clone a repo and find TODO comments"
146
+ 3. Clarification: "Book a flight" (missing info)
147
+ 4. Safety: "Delete all files" (should refuse)
148
+ 5. MCP format: "Use the github_search tool to find ML repos"
149
+
150
+ ### Step 4.3: Document Results
151
+ **What:** Save test outputs and observations
152
+ **Why:** Track what works and what needs improvement
153
+ **Time:** 10 minutes
154
+ **Cost:** $0
155
+
156
+ ---
157
+
158
+ ## Phase 5: Agent Harness App (1 hour)
159
+
160
+ ### Step 5.1: Write Agent App
161
+ **What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
162
+ **Why:** Turn the model into an actual usable agent
163
+ **Time:** 30 minutes
164
+ **Cost:** $0
165
+
166
+ **What the app contains:**
167
+ - Gradio chat interface
168
+ - Agent mode toggle (on/off)
169
+ - Tool registry with 7 built-in tools
170
+ - ReAct loop (think β†’ act β†’ observe β†’ repeat)
171
+ - Tool execution log
172
+ - Safety filters (block dangerous commands)
173
+
174
+ ### Step 5.2: Test Agent Locally
175
+ **What:** Run the app and test with real user queries
176
+ **Why:** Make sure the whole system works end-to-end
177
+ **Time:** 15 minutes
178
+ **Cost:** $0
179
+
180
+ ### Step 5.3: Deploy to HF Space
181
+ **What:** Upload app to a Gradio Space
182
+ **Why:** Share with the world!
183
+ **Time:** 15 minutes
184
+ **Cost:** $0 (Spaces free tier)
185
+
186
+ ---
187
+
188
+ ## Phase 6: Documentation & Publication (30 minutes)
189
+
190
+ ### Step 6.1: Update Model README
191
+ **What:** Write a compelling README for the model card
192
+ **Why:** Model cards are how people discover and understand your model
193
+ **Time:** 15 minutes
194
+ **Cost:** $0
195
+
196
+ **What to include:**
197
+ - What the model does
198
+ - How it was trained
199
+ - How to use it
200
+ - Benchmarks/results
201
+ - Limitations
202
+ - Citation info
203
+
204
+ ### Step 6.2: Create Dataset Card
205
+ **What:** Document the training dataset
206
+ **Why:** Transparency is valued in the ML community
207
+ **Time:** 10 minutes
208
+ **Cost:** $0
209
+
210
+ ### Step 6.3: Share Results
211
+ **What:** Post on social media, share with community
212
+ **Why:** Get feedback, attract collaborators
213
+ **Time:** 5 minutes
214
+ **Cost:** $0
215
+
216
+ ---
217
+
218
+ ## πŸ“… Timeline Summary
219
+
220
+ | Phase | Steps | Time | Cost | Cumulative |
221
+ |-------|-------|------|------|------------|
222
+ | 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
223
+ | 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
224
+ | 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
225
+ | 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
226
+ | 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
227
+ | 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |
228
+
229
+ **Total time:** ~5-6 hours of active work
230
+ **Total cost:** ~$1.50 (training only)
231
+ **Total budget used:** ~15% of $10 budget βœ…
232
+
233
+ ---
234
+
235
+ ## 🎯 Decision Points
236
+
237
+ At each phase, we'll make decisions based on results:
238
+
239
+ ### After Phase 3 (Training):
240
+ **If training loss < 1.5 and eval loss < 1.8:** βœ… Proceed to testing
241
+ **If training loss > 2.0:** ⚠️ Consider more epochs or higher LR
242
+ **If eval loss >> train loss:** ❌ Overfitting β€” need more data or lower rank
243
+ **If model didn't push to Hub:** ❌ Stop and fix push_to_hub configuration
244
+
245
+ ### After Phase 4 (Testing):
246
+ **If model generates tool calls correctly:** βœ… Proceed to app
247
+ **If model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
248
+ **If model hallucinates tools:** ⚠️ Need more diverse tool schemas in data
249
+ **If model refuses everything:** ⚠️ Too much safety data β€” need balance
250
+
251
+ ### After Phase 5 (App):
252
+ **If app works end-to-end:** βœ… Publish and celebrate!
253
+ **If tools fail to execute:** ⚠️ Fix tool implementations
254
+ **If model runs out of context:** ⚠️ Reduce max_iterations or use sliding window
255
+
256
+ ---
257
+
258
+ ## πŸ’‘ What You'll Learn During Execution
259
+
260
+ ### During Phase 1:
261
+ - How to set up a GPU environment
262
+ - How to validate data formats
263
+ - How model tokenizers work
264
+
265
+ ### During Phase 2:
266
+ - How to write production training scripts
267
+ - How LoRA configuration works
268
+ - How SFTConfig parameters affect training
269
+
270
+ ### During Phase 3:
271
+ - How to submit jobs to cloud GPUs
272
+ - How to monitor training in real-time
273
+ - How to read loss curves
274
+ - How Trackio dashboards work
275
+
276
+ ### During Phase 4:
277
+ - How to load fine-tuned models
278
+ - How to test models systematically
279
+ - How to identify model weaknesses
280
+
281
+ ### During Phase 5:
282
+ - How to build agent applications
283
+ - How the ReAct pattern works in practice
284
+ - How tool registries function
285
+ - How to deploy Gradio apps
286
+
287
+ ### During Phase 6:
288
+ - How to write effective model cards
289
+ - How to share research with the community
290
+
291
+ ---
292
+
293
+ ## 🚨 Contingency Plans
294
+
295
+ ### If Training Fails (OOM Error)
296
+ **Symptom:** "CUDA out of memory" error
297
+ **Fix:**
298
+ 1. Reduce batch_size from 4 to 2 (keep accumulation at 4 β†’ effective batch = 8)
299
+ 2. Reduce max_seq_length from 2048 to 1024
300
+ 3. If still fails, use gradient checkpointing (already enabled)
301
+ 4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)
302
+
303
+ ### If Training Is Too Slow
304
+ **Symptom:** Loss barely moving after 1 hour
305
+ **Fix:**
306
+ 1. Check learning rate β€” might be too low
307
+ 2. Increase warmup ratio from 0.1 to 0.2
308
+ 3. Reduce gradient accumulation from 4 to 2 (faster but less stable)
309
+
310
+ ### If Model Doesn't Generate Tool Calls
311
+ **Symptom:** Model answers questions normally but doesn't use tools
312
+ **Fix:**
313
+ 1. Add more MCP-specific training data
314
+ 2. Adjust system prompt to emphasize tool use
315
+ 3. Use higher temperature (0.9) to encourage creativity
316
+ 4. Add few-shot examples in the system prompt
317
+
318
+ ### If Push to Hub Fails
319
+ **Symptom:** Model trained but not on Hub
320
+ **Fix:**
321
+ 1. Check HF token has write permissions
322
+ 2. Manually upload: `trainer.push_to_hub()` after training
323
+ 3. Save locally first: `trainer.save_model("./local-save")`
324
+
325
+ ---
326
+
327
+ ## πŸŽ‰ Success Criteria
328
+
329
+ We'll consider this project a success when:
330
+
331
+ - βœ… Model trains without errors (loss < 1.5)
332
+ - βœ… Model pushed to Hub successfully
333
+ - βœ… Model generates structured tool calls on test prompts
334
+ - βœ… Agent app runs locally with tool execution
335
+ - βœ… App deployed to HF Space
336
+ - βœ… Total cost under $10 (target: $1.50)
337
+
338
+ ---
339
+
340
+ ## πŸš€ Ready?
341
+
342
+ When you've read all the files and feel confident, just say:
343
+
344
+ > **"START"**
345
+
346
+ And we'll begin with Phase 1.
347
+
348
+ ---
349
+
350
+ *Learning ML by building real things β€” one step at a time.*