# 06 - Execution Plan: What We'll Do When You Say "START"
## 📋 The Plan
When you say **"START"**, here is the EXACT sequence of steps we'll follow.
Each step has a clear goal, estimated time, and cost.
---
## Phase 1: Setup & Validation (15 minutes)
### Step 1.1: Create Training Sandbox
**What:** Set up a GPU sandbox with all dependencies installed
**Why:** Test that everything works before spending money on a real training job
**Time:** 5 minutes
**Cost:** $0
```bash
pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
```
### Step 1.2: Validate Dataset Format
**What:** Load your dataset and verify it works with SFTTrainer
**Why:** Catch format issues BEFORE training starts (saves hours of debugging)
**Time:** 5 minutes
**Cost:** $0
```python
from datasets import load_dataset
dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0]) # Peek at first example
```
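Beyond eyeballing the first example, it's worth failing fast on malformed rows. Below is a minimal per-record check, assuming the conversational `messages` format (a list of role/content pairs) that SFTTrainer consumes; the exact schema of the dataset is an assumption here:

```python
def validate_record(record):
    """Check that one training example has the chat 'messages' structure SFTTrainer expects."""
    messages = record.get("messages")
    assert isinstance(messages, list) and messages, "each record needs a non-empty 'messages' list"
    for msg in messages:
        assert msg.get("role") in {"system", "user", "assistant", "tool"}, f"unexpected role: {msg.get('role')!r}"
        assert isinstance(msg.get("content"), str), "each message needs string 'content'"
    return True

# Example record in the assumed format:
example = {"messages": [
    {"role": "user", "content": "Find all Python files"},
    {"role": "assistant", "content": '{"tool": "file_search", "args": {"pattern": "*.py"}}'},
]}
validate_record(example)  # raises AssertionError on a bad record
```

Running this over `dataset["train"]` in the sandbox costs nothing and catches schema drift before Phase 3.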
### Step 1.3: Verify Model Compatibility
**What:** Load Qwen3-1.7B tokenizer and test chat template
**Why:** Make sure the model can process our messages format
**Time:** 5 minutes
**Cost:** $0
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template) # Should not be None
```
---
## Phase 2: Training Script Development (30 minutes)
### Step 2.1: Write Training Script
**What:** Create `train.py` with full educational comments
**Why:** Every line documented so you learn as we build
**Time:** 15 minutes
**Cost:** $0
**What the script contains:**
- LoRA configuration (r=16, all-linear, dropout=0.05)
- SFTConfig with all hyperparameters documented
- Trackio monitoring setup
- push_to_hub configuration
- Plain-text logging (no tqdm progress bars)
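As a preview, the two configuration objects at the heart of the script might look like the sketch below. Only r=16, all-linear targeting, and dropout=0.05 come from the plan above; every other value (alpha, learning rate, epochs) is a placeholder to tune:

```python
from peft import LoraConfig
from trl import SFTConfig

peft_config = LoraConfig(
    r=16,                           # LoRA rank (from the plan)
    lora_alpha=32,                  # assumed scaling factor
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="MCP-Agent-1.7B",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    learning_rate=2e-4,             # assumed starting point
    num_train_epochs=3,             # assumed
    gradient_checkpointing=True,
    logging_steps=10,
    disable_tqdm=True,              # plain-text logging, no progress bars
    report_to="trackio",            # Trackio monitoring (needs trackio installed)
    push_to_hub=True,
    hub_model_id="muhammadtlha944/MCP-Agent-1.7B",
)
```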
### Step 2.2: Test Script in Sandbox
**What:** Run the script for 10 steps to catch errors
**Why:** Find bugs NOW before the expensive training job
**Time:** 10 minutes
**Cost:** $0 (sandbox GPU time)
```python
# Run just 10 steps as a smoke test
training_args.max_steps = 10
trainer.train()
```
### Step 2.3: Review & Fix Issues
**What:** Fix any import errors, API mismatches, or config issues
**Why:** Training jobs are expensive; we only launch when the script is solid
**Time:** 5 minutes
**Cost:** $0
---
## Phase 3: Model Training (2-3 hours)
### Step 3.1: Launch Training Job
**What:** Submit training to HF Jobs on T4 GPU
**Why:** The T4 is the cheapest GPU that fits our model (16GB VRAM)
**Time:** 2-3 hours (automated)
**Cost:** ~$1.20-1.80
**Pre-flight check before launch:**
- ✅ Dataset format validated
- ✅ Script tested in sandbox
- ✅ push_to_hub=True and hub_model_id set
- ✅ Timeout set to 4 hours (plenty of buffer)
- ✅ Trackio monitoring enabled
- ✅ disable_tqdm=True for clean logs
### Step 3.2: Monitor Training
**What:** Watch loss curves via Trackio dashboard
**Why:** Make sure loss is going down (model is learning)
**Time:** Check every 15 minutes
**Cost:** $0 (just watching)
**What to watch for:**
```
Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9
Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
         (learning very slowly; might need more epochs or a higher LR)
Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
         (loss going UP; stop immediately, something is wrong)
```
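The three patterns above can be turned into a tiny heuristic run against the logged losses. This is a rough rule of thumb, not part of TRL or Trackio, and the 0.5 flatness threshold is an arbitrary choice:

```python
def loss_trend(losses, flat_tol=0.5):
    """Classify a sequence of logged training losses: 'good', 'warning', or 'bad'."""
    drop = losses[0] - losses[-1]
    if drop < 0:
        return "bad"        # loss rising: stop the job and investigate
    if drop < flat_tol:
        return "warning"    # barely moving: consider more epochs or a higher LR
    return "good"

print(loss_trend([2.5, 1.2, 0.9]))   # good
print(loss_trend([2.5, 2.4, 2.3]))   # warning
print(loss_trend([2.5, 3.0, 3.5]))   # bad
```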
### Step 3.3: Verify Model Pushed to Hub
**What:** Check that the model appears in your HF repo
**Why:** Job storage is ephemeral; if push_to_hub fails, the model is LOST
**Time:** 5 minutes
**Cost:** $0
**Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
---
## Phase 4: Testing & Evaluation (30 minutes)
### Step 4.1: Load Trained Model
**What:** Download the model from Hub and test inference
**Why:** Verify the model actually works after training
**Time:** 10 minutes
**Cost:** $0
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
```
### Step 4.2: Run Test Prompts
**What:** Test the model on real tool-calling scenarios
**Why:** See if training actually worked
**Time:** 10 minutes
**Cost:** $0
**Test cases:**
1. Simple tool call: "Find all Python files"
2. Multi-step: "Clone a repo and find TODO comments"
3. Clarification: "Book a flight" (missing info)
4. Safety: "Delete all files" (should refuse)
5. MCP format: "Use the github_search tool to find ML repos"
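One way to run these systematically is a small harness that scans each reply for a JSON tool call. The `extract_tool_call` helper and the `{"tool": ..., "args": ...}` output format are assumptions about how the fine-tuned model responds; swap the stub reply for a real `pipe(prompt)` call:

```python
import json

TEST_PROMPTS = [
    "Find all Python files",
    "Clone a repo and find TODO comments",
    "Book a flight",
    "Delete all files",
    "Use the github_search tool to find ML repos",
]

def extract_tool_call(text):
    """Return the parsed tool call if the reply contains a JSON object with a 'tool' key, else None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and "tool" in obj else None

# With the real model you would loop over TEST_PROMPTS and call pipe(prompt);
# here a stub reply shows the check:
reply = 'I will search for files. {"tool": "file_search", "args": {"pattern": "*.py"}}'
call = extract_tool_call(reply)
print(call["tool"])  # file_search
```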
### Step 4.3: Document Results
**What:** Save test outputs and observations
**Why:** Track what works and what needs improvement
**Time:** 10 minutes
**Cost:** $0
---
## Phase 5: Agent Harness App (1 hour)
### Step 5.1: Write Agent App
**What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
**Why:** Turn the model into an actual usable agent
**Time:** 30 minutes
**Cost:** $0
**What the app contains:**
- Gradio chat interface
- Agent mode toggle (on/off)
- Tool registry with 7 built-in tools
- ReAct loop (think β act β observe β repeat)
- Tool execution log
- Safety filters (block dangerous commands)
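Stripped of the Gradio UI, the core of the app might look like the sketch below. The one-entry tool registry, the JSON tool-call format, and the blocked-phrase filter are illustrative stand-ins, with a scripted function in place of the fine-tuned model:

```python
import json

# Hypothetical registry: name -> callable (the real app would register 7 tools)
TOOLS = {
    "file_search": lambda pattern: f"found 3 files matching {pattern}",
}
BLOCKED = ("rm -rf", "delete all")  # crude safety filter

def run_agent(generate, user_msg, max_iterations=5):
    """think -> act -> observe loop; `generate` stands in for the fine-tuned model."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_iterations):
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)           # model emits {"tool": ..., "args": ...}
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or "tool" not in call:
            return reply                       # plain text = final answer
        if any(b in json.dumps(call).lower() for b in BLOCKED):
            return "Blocked by safety filter."
        result = TOOLS[call["tool"]](**call["args"])
        history.append({"role": "tool", "content": result})  # observe
    return "Stopped: max iterations reached."

# Scripted stand-in for the model: one tool call, then a final answer
replies = iter(['{"tool": "file_search", "args": {"pattern": "*.py"}}',
                "I found 3 Python files."])
print(run_agent(lambda h: next(replies), "Find all Python files"))
```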
### Step 5.2: Test Agent Locally
**What:** Run the app and test with real user queries
**Why:** Make sure the whole system works end-to-end
**Time:** 15 minutes
**Cost:** $0
### Step 5.3: Deploy to HF Space
**What:** Upload app to a Gradio Space
**Why:** Share with the world!
**Time:** 15 minutes
**Cost:** $0 (Spaces free tier)
---
## Phase 6: Documentation & Publication (30 minutes)
### Step 6.1: Update Model README
**What:** Write a compelling README for the model card
**Why:** Model cards are how people discover and understand your model
**Time:** 15 minutes
**Cost:** $0
**What to include:**
- What the model does
- How it was trained
- How to use it
- Benchmarks/results
- Limitations
- Citation info
### Step 6.2: Create Dataset Card
**What:** Document the training dataset
**Why:** Transparency is valued in the ML community
**Time:** 10 minutes
**Cost:** $0
### Step 6.3: Share Results
**What:** Post on social media, share with community
**Why:** Get feedback, attract collaborators
**Time:** 5 minutes
**Cost:** $0
---
## 📅 Timeline Summary
| Phase | Steps | Time | Cost | Cumulative |
|-------|-------|------|------|------------|
| 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
| 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
| 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
| 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
| 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
| 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |
**Total time:** ~5-6 hours end to end (training runs unattended for 2-3 of those)
**Total cost:** ~$1.50 (training only)
**Total budget used:** ~15% of $10 budget ✅
---
## 🎯 Decision Points
At each phase, we'll make decisions based on results:
### After Phase 3 (Training):
**If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
**If training loss > 2.0:** ⚠️ Consider more epochs or a higher LR
**If eval loss >> train loss:** ❌ Overfitting; need more data or a lower LoRA rank
**If model didn't push to Hub:** ❌ Stop and fix the push_to_hub configuration
### After Phase 4 (Testing):
**If model generates tool calls correctly:** ✅ Proceed to app
**If model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
**If model hallucinates tools:** ⚠️ Need more diverse tool schemas in the data
**If model refuses everything:** ⚠️ Too much safety data; rebalance the mix
### After Phase 5 (App):
**If app works end-to-end:** ✅ Publish and celebrate!
**If tools fail to execute:** ⚠️ Fix the tool implementations
**If model runs out of context:** ⚠️ Reduce max_iterations or use a sliding window
---
## 💡 What You'll Learn During Execution
### During Phase 1:
- How to set up a GPU environment
- How to validate data formats
- How model tokenizers work
### During Phase 2:
- How to write production training scripts
- How LoRA configuration works
- How SFTConfig parameters affect training
### During Phase 3:
- How to submit jobs to cloud GPUs
- How to monitor training in real-time
- How to read loss curves
- How Trackio dashboards work
### During Phase 4:
- How to load fine-tuned models
- How to test models systematically
- How to identify model weaknesses
### During Phase 5:
- How to build agent applications
- How the ReAct pattern works in practice
- How tool registries function
- How to deploy Gradio apps
### During Phase 6:
- How to write effective model cards
- How to share research with the community
---
## 🚨 Contingency Plans
### If Training Fails (OOM Error)
**Symptom:** "CUDA out of memory" error
**Fix:**
1. Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
2. Reduce max_seq_length from 2048 to 1024
3. If still fails, use gradient checkpointing (already enabled)
4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)
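In SFTConfig terms, the first two fixes are one-line changes; attribute names can vary across TRL versions, so treat this as a sketch:

```python
# Memory-saving tweaks for a T4, applied in order until training fits
training_args.per_device_train_batch_size = 2  # was 4; with accumulation 4 -> effective batch 8
training_args.max_seq_length = 1024            # was 2048 (called max_length in newer TRL)
training_args.gradient_checkpointing = True    # trade compute for memory (already on)
```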
### If Training Is Too Slow
**Symptom:** Loss barely moving after 1 hour
**Fix:**
1. Check learning rate β might be too low
2. Increase warmup ratio from 0.1 to 0.2
3. Reduce gradient accumulation from 4 to 2 (faster but less stable)
### If Model Doesn't Generate Tool Calls
**Symptom:** Model answers questions normally but doesn't use tools
**Fix:**
1. Add more MCP-specific training data
2. Adjust system prompt to emphasize tool use
3. Use higher temperature (0.9) to encourage creativity
4. Add few-shot examples in the system prompt
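Fixes 2 and 4 can be combined in a single system prompt. The wording and the JSON tool-call convention below are hypothetical and should be matched to whatever format the training data actually uses:

```python
# Hypothetical system prompt with one few-shot tool-call example
SYSTEM_PROMPT = """You are an agent with access to tools. When a tool is needed,
reply ONLY with a JSON object: {"tool": "<name>", "args": {...}}.

Example:
User: Find all Python files
Assistant: {"tool": "file_search", "args": {"pattern": "*.py"}}
"""
```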
### If Push to Hub Fails
**Symptom:** Model trained but not on Hub
**Fix:**
1. Check HF token has write permissions
2. Manually upload: `trainer.push_to_hub()` after training
3. Save locally first: `trainer.save_model("./local-save")`
---
## 🏁 Success Criteria
We'll consider this project a success when:
- ✅ Model trains without errors (loss < 1.5)
- ✅ Model pushed to Hub successfully
- ✅ Model generates structured tool calls on test prompts
- ✅ Agent app runs locally with tool execution
- ✅ App deployed to HF Space
- ✅ Total cost under $10 (target: $1.50)
---
## 🚀 Ready?
When you've read all the files and feel confident, just say:
> **"START"**
And we'll begin with Phase 1.
---
*Learning ML by building real things, one step at a time.*