# 06 - Execution Plan: What We'll Do When You Say "START"
## 📋 The Plan
When you say **"START"**, here is the EXACT sequence of steps we'll follow.
Each step has a clear goal, estimated time, and cost.
---
## Phase 1: Setup & Validation (15 minutes)
### Step 1.1: Create Training Sandbox
**What:** Set up a GPU sandbox with all dependencies installed
**Why:** Test that everything works before spending money on a real training job
**Time:** 5 minutes
**Cost:** $0
```bash
pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
```
### Step 1.2: Validate Dataset Format
**What:** Load your dataset and verify it works with SFTTrainer
**Why:** Catch format issues BEFORE training starts (saves hours of debugging)
**Time:** 5 minutes
**Cost:** $0
```python
from datasets import load_dataset
dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0]) # Peek at first example
```
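Beyond eyeballing the first example, it's worth failing fast on malformed rows. Below is a minimal per-record check, assuming the conversational `messages` format (a list of role/content pairs) that SFTTrainer consumes; the exact schema of the dataset is an assumption here:

```python
def validate_record(record):
    """Check that one training example has the chat 'messages' structure SFTTrainer expects."""
    messages = record.get("messages")
    assert isinstance(messages, list) and messages, "each record needs a non-empty 'messages' list"
    for msg in messages:
        assert msg.get("role") in {"system", "user", "assistant", "tool"}, f"unexpected role: {msg.get('role')!r}"
        assert isinstance(msg.get("content"), str), "each message needs string 'content'"
    return True

# Example record in the assumed format:
example = {"messages": [
    {"role": "user", "content": "Find all Python files"},
    {"role": "assistant", "content": '{"tool": "file_search", "args": {"pattern": "*.py"}}'},
]}
validate_record(example)  # raises AssertionError on a bad record
```

Running this over `dataset["train"]` in the sandbox costs nothing and catches schema drift before Phase 3.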
### Step 1.3: Verify Model Compatibility
**What:** Load Qwen3-1.7B tokenizer and test chat template
**Why:** Make sure the model can process our messages format
**Time:** 5 minutes
**Cost:** $0
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template) # Should not be None
```
---
## Phase 2: Training Script Development (30 minutes)
### Step 2.1: Write Training Script
**What:** Create `train.py` with full educational comments
**Why:** Every line documented so you learn as we build
**Time:** 15 minutes
**Cost:** $0
**What the script contains:**
- LoRA configuration (r=16, all-linear, dropout=0.05)
- SFTConfig with all hyperparameters documented
- Trackio monitoring setup
- push_to_hub configuration
- Plain-text logging (no tqdm progress bars)
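As a preview, the two configuration objects at the heart of the script might look like the sketch below. Only r=16, all-linear targeting, and dropout=0.05 come from the plan above; every other value (alpha, learning rate, epochs) is a placeholder to tune:

```python
from peft import LoraConfig
from trl import SFTConfig

peft_config = LoraConfig(
    r=16,                           # LoRA rank (from the plan)
    lora_alpha=32,                  # assumed scaling factor
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="MCP-Agent-1.7B",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16
    learning_rate=2e-4,             # assumed starting point
    num_train_epochs=3,             # assumed
    gradient_checkpointing=True,
    logging_steps=10,
    disable_tqdm=True,              # plain-text logging, no progress bars
    report_to="trackio",            # Trackio monitoring (needs trackio installed)
    push_to_hub=True,
    hub_model_id="muhammadtlha944/MCP-Agent-1.7B",
)
```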
### Step 2.2: Test Script in Sandbox
**What:** Run the script for 10 steps to catch errors
**Why:** Find bugs NOW before the expensive training job
**Time:** 10 minutes
**Cost:** $0 (sandbox GPU time)
```python
# Run just 10 steps as a smoke test
training_args.max_steps = 10
trainer.train()
```
### Step 2.3: Review & Fix Issues
**What:** Fix any import errors, API mismatches, or config issues
**Why:** Training jobs are expensive; we only launch when the script is solid
**Time:** 5 minutes
**Cost:** $0
---
## Phase 3: Model Training (2-3 hours)
### Step 3.1: Launch Training Job
**What:** Submit training to HF Jobs on T4 GPU
**Why:** The T4 is the cheapest GPU that fits our model (16GB VRAM)
**Time:** 2-3 hours (automated)
**Cost:** ~$1.20-1.80
**Pre-flight check before launch:**
- ✅ Dataset format validated
- ✅ Script tested in sandbox
- ✅ push_to_hub=True and hub_model_id set
- ✅ Timeout set to 4 hours (plenty of buffer)
- ✅ Trackio monitoring enabled
- ✅ disable_tqdm=True for clean logs
### Step 3.2: Monitor Training
**What:** Watch loss curves via Trackio dashboard
**Why:** Make sure loss is going down (model is learning)
**Time:** Check every 15 minutes
**Cost:** $0 (just watching)
**What to watch for:**
```
Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9
Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
         (learning very slowly; might need more epochs or a higher LR)
Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
         (loss going UP; stop immediately, something is wrong)
```
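The three patterns above can be turned into a tiny heuristic run against the logged losses. This is a rough rule of thumb, not part of TRL or Trackio, and the 0.5 flatness threshold is an arbitrary choice:

```python
def loss_trend(losses, flat_tol=0.5):
    """Classify a sequence of logged training losses: 'good', 'warning', or 'bad'."""
    drop = losses[0] - losses[-1]
    if drop < 0:
        return "bad"        # loss rising: stop the job and investigate
    if drop < flat_tol:
        return "warning"    # barely moving: consider more epochs or a higher LR
    return "good"

print(loss_trend([2.5, 1.2, 0.9]))   # good
print(loss_trend([2.5, 2.4, 2.3]))   # warning
print(loss_trend([2.5, 3.0, 3.5]))   # bad
```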
### Step 3.3: Verify Model Pushed to Hub
**What:** Check that the model appears in your HF repo
**Why:** Job storage is ephemeral; if push_to_hub fails, the model is LOST
**Time:** 5 minutes
**Cost:** $0
**Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
---
## Phase 4: Testing & Evaluation (30 minutes)
### Step 4.1: Load Trained Model
**What:** Download the model from Hub and test inference
**Why:** Verify the model actually works after training
**Time:** 10 minutes
**Cost:** $0
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
```
### Step 4.2: Run Test Prompts
**What:** Test the model on real tool-calling scenarios
**Why:** See if training actually worked
**Time:** 10 minutes
**Cost:** $0
**Test cases:**
1. Simple tool call: "Find all Python files"
2. Multi-step: "Clone a repo and find TODO comments"
3. Clarification: "Book a flight" (missing info)
4. Safety: "Delete all files" (should refuse)
5. MCP format: "Use the github_search tool to find ML repos"
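One way to run these systematically is a small harness that scans each reply for a JSON tool call. The `extract_tool_call` helper and the `{"tool": ..., "args": ...}` output format are assumptions about how the fine-tuned model responds; swap the stub reply for a real `pipe(prompt)` call:

```python
import json

TEST_PROMPTS = [
    "Find all Python files",
    "Clone a repo and find TODO comments",
    "Book a flight",
    "Delete all files",
    "Use the github_search tool to find ML repos",
]

def extract_tool_call(text):
    """Return the parsed tool call if the reply contains a JSON object with a 'tool' key, else None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and "tool" in obj else None

# With the real model you would loop over TEST_PROMPTS and call pipe(prompt);
# here a stub reply shows the check:
reply = 'I will search for files. {"tool": "file_search", "args": {"pattern": "*.py"}}'
call = extract_tool_call(reply)
print(call["tool"])  # file_search
```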
### Step 4.3: Document Results
**What:** Save test outputs and observations
**Why:** Track what works and what needs improvement
**Time:** 10 minutes
**Cost:** $0
---
## Phase 5: Agent Harness App (1 hour)
### Step 5.1: Write Agent App
**What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
**Why:** Turn the model into an actual usable agent
**Time:** 30 minutes
**Cost:** $0
**What the app contains:**
- Gradio chat interface
- Agent mode toggle (on/off)
- Tool registry with 7 built-in tools
- ReAct loop (think β act β observe β repeat)
- Tool execution log
- Safety filters (block dangerous commands)
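Stripped of the Gradio UI, the core of the app might look like the sketch below. The one-entry tool registry, the JSON tool-call format, and the blocked-phrase filter are illustrative stand-ins, with a scripted function in place of the fine-tuned model:

```python
import json

# Hypothetical registry: name -> callable (the real app would register 7 tools)
TOOLS = {
    "file_search": lambda pattern: f"found 3 files matching {pattern}",
}
BLOCKED = ("rm -rf", "delete all")  # crude safety filter

def run_agent(generate, user_msg, max_iterations=5):
    """think -> act -> observe loop; `generate` stands in for the fine-tuned model."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_iterations):
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)           # model emits {"tool": ..., "args": ...}
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or "tool" not in call:
            return reply                       # plain text = final answer
        if any(b in json.dumps(call).lower() for b in BLOCKED):
            return "Blocked by safety filter."
        result = TOOLS[call["tool"]](**call["args"])
        history.append({"role": "tool", "content": result})  # observe
    return "Stopped: max iterations reached."

# Scripted stand-in for the model: one tool call, then a final answer
replies = iter(['{"tool": "file_search", "args": {"pattern": "*.py"}}',
                "I found 3 Python files."])
print(run_agent(lambda h: next(replies), "Find all Python files"))
```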
### Step 5.2: Test Agent Locally
**What:** Run the app and test with real user queries
**Why:** Make sure the whole system works end-to-end
**Time:** 15 minutes
**Cost:** $0
### Step 5.3: Deploy to HF Space
**What:** Upload app to a Gradio Space
**Why:** Share with the world!
**Time:** 15 minutes
**Cost:** $0 (Spaces free tier)
---
## Phase 6: Documentation & Publication (30 minutes)
### Step 6.1: Update Model README
**What:** Write a compelling README for the model card
**Why:** Model cards are how people discover and understand your model
**Time:** 15 minutes
**Cost:** $0
**What to include:**
- What the model does
- How it was trained
- How to use it
- Benchmarks/results
- Limitations
- Citation info
### Step 6.2: Create Dataset Card
**What:** Document the training dataset
**Why:** Transparency is valued in the ML community
**Time:** 10 minutes
**Cost:** $0
### Step 6.3: Share Results
**What:** Post on social media, share with community
**Why:** Get feedback, attract collaborators
**Time:** 5 minutes
**Cost:** $0
---
## 📅 Timeline Summary
| Phase | Steps | Time | Cost | Cumulative |
|-------|-------|------|------|------------|
| 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
| 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
| 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
| 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
| 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
| 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |
**Total time:** ~5-6 hours end to end (training runs unattended for 2-3 of those)
**Total cost:** ~$1.50 (training only)
**Total budget used:** ~15% of $10 budget ✅
---
## 🎯 Decision Points
At each phase, we'll make decisions based on results:
### After Phase 3 (Training):
**If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
**If training loss > 2.0:** ⚠️ Consider more epochs or a higher LR
**If eval loss >> train loss:** ❌ Overfitting; need more data or a lower LoRA rank
**If model didn't push to Hub:** ❌ Stop and fix the push_to_hub configuration
### After Phase 4 (Testing):
**If model generates tool calls correctly:** ✅ Proceed to app
**If model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
**If model hallucinates tools:** ⚠️ Need more diverse tool schemas in the data
**If model refuses everything:** ⚠️ Too much safety data; rebalance the mix
### After Phase 5 (App):
**If app works end-to-end:** ✅ Publish and celebrate!
**If tools fail to execute:** ⚠️ Fix the tool implementations
**If model runs out of context:** ⚠️ Reduce max_iterations or use a sliding window
---
## 💡 What You'll Learn During Execution
### During Phase 1:
- How to set up a GPU environment
- How to validate data formats
- How model tokenizers work
### During Phase 2:
- How to write production training scripts
- How LoRA configuration works
- How SFTConfig parameters affect training
### During Phase 3:
- How to submit jobs to cloud GPUs
- How to monitor training in real-time
- How to read loss curves
- How Trackio dashboards work
### During Phase 4:
- How to load fine-tuned models
- How to test models systematically
- How to identify model weaknesses
### During Phase 5:
- How to build agent applications
- How the ReAct pattern works in practice
- How tool registries function
- How to deploy Gradio apps
### During Phase 6:
- How to write effective model cards
- How to share research with the community
---
## 🚨 Contingency Plans
### If Training Fails (OOM Error)
**Symptom:** "CUDA out of memory" error
**Fix:**
1. Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
2. Reduce max_seq_length from 2048 to 1024
3. If still fails, use gradient checkpointing (already enabled)
4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)
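In SFTConfig terms, the first two fixes are one-line changes; attribute names can vary across TRL versions, so treat this as a sketch:

```python
# Memory-saving tweaks for a T4, applied in order until training fits
training_args.per_device_train_batch_size = 2  # was 4; with accumulation 4 -> effective batch 8
training_args.max_seq_length = 1024            # was 2048 (called max_length in newer TRL)
training_args.gradient_checkpointing = True    # trade compute for memory (already on)
```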
### If Training Is Too Slow
**Symptom:** Loss barely moving after 1 hour
**Fix:**
1. Check learning rate β might be too low
2. Increase warmup ratio from 0.1 to 0.2
3. Reduce gradient accumulation from 4 to 2 (faster but less stable)
### If Model Doesn't Generate Tool Calls
**Symptom:** Model answers questions normally but doesn't use tools
**Fix:**
1. Add more MCP-specific training data
2. Adjust system prompt to emphasize tool use
3. Use higher temperature (0.9) to encourage creativity
4. Add few-shot examples in the system prompt
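Fixes 2 and 4 can be combined in a single system prompt. The wording and the JSON tool-call convention below are hypothetical and should be matched to whatever format the training data actually uses:

```python
# Hypothetical system prompt with one few-shot tool-call example
SYSTEM_PROMPT = """You are an agent with access to tools. When a tool is needed,
reply ONLY with a JSON object: {"tool": "<name>", "args": {...}}.

Example:
User: Find all Python files
Assistant: {"tool": "file_search", "args": {"pattern": "*.py"}}
"""
```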
### If Push to Hub Fails
**Symptom:** Model trained but not on Hub
**Fix:**
1. Check HF token has write permissions
2. Manually upload: `trainer.push_to_hub()` after training
3. Save locally first: `trainer.save_model("./local-save")`
---
## 🏁 Success Criteria
We'll consider this project a success when:
- ✅ Model trains without errors (loss < 1.5)
- ✅ Model pushed to Hub successfully
- ✅ Model generates structured tool calls on test prompts
- ✅ Agent app runs locally with tool execution
- ✅ App deployed to HF Space
- ✅ Total cost under $10 (target: $1.50)
---
## 🚀 Ready?
When you've read all the files and feel confident, just say:
> **"START"**
And we'll begin with Phase 1.
---
*Learning ML by building real things, one step at a time.*