# 06 — Execution Plan: What We'll Do When You Say "START"

## 🚀 The Plan

When you say **"START"**, here is the EXACT sequence of steps we'll follow. Each step has a clear goal, an estimated time, and a cost.

---

## Phase 1: Setup & Validation (15 minutes)

### Step 1.1: Create Training Sandbox

**What:** Set up a GPU sandbox with all dependencies installed
**Why:** Test that everything works before spending money on a real training job
**Time:** 5 minutes
**Cost:** $0

```bash
pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
```

### Step 1.2: Validate Dataset Format

**What:** Load your dataset and verify it works with SFTTrainer
**Why:** Catch format issues BEFORE training starts (saves hours of debugging)
**Time:** 5 minutes
**Cost:** $0

```python
from datasets import load_dataset

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0])  # Peek at the first example
```

### Step 1.3: Verify Model Compatibility

**What:** Load the Qwen3-1.7B tokenizer and test its chat template
**Why:** Make sure the model can process our messages format
**Time:** 5 minutes
**Cost:** $0

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template)  # Should not be None
```

---

## Phase 2: Training Script Development (30 minutes)

### Step 2.1: Write Training Script

**What:** Create `train.py` with full educational comments
**Why:** Every line documented so you learn as we build
**Time:** 15 minutes
**Cost:** $0

**What the script contains:**
- LoRA configuration (r=16, all-linear, dropout=0.05)
- SFTConfig with all hyperparameters documented
- Trackio monitoring setup
- push_to_hub configuration
- Plain-text logging (no tqdm progress bars)

### Step 2.2: Test Script in Sandbox

**What:** Run the script for 10 steps to catch errors
**Why:** Find bugs NOW before the expensive training job
**Time:** 10 minutes
**Cost:** $0 (sandbox GPU time)

```python
# Run just 10 steps as a smoke test
# (training_args and trainer come from train.py)
training_args.max_steps = 10
trainer.train()
```

### Step 2.3: Review & Fix Issues

**What:** Fix any import errors, API mismatches, or config issues
**Why:** Training jobs are expensive — we only launch when the script is solid
**Time:** 5 minutes
**Cost:** $0

---

## Phase 3: Model Training (2-3 hours)

### Step 3.1: Launch Training Job

**What:** Submit the training job to HF Jobs on a T4 GPU
**Why:** The T4 is the cheapest GPU that fits our model (16 GB VRAM)
**Time:** 2-3 hours (automated)
**Cost:** ~$1.20-1.80

**Pre-flight check before launch:**
- ✅ Dataset format validated
- ✅ Script tested in sandbox
- ✅ push_to_hub=True and hub_model_id set
- ✅ Timeout set to 4 hours (plenty of buffer)
- ✅ Trackio monitoring enabled
- ✅ disable_tqdm=True for clean logs

### Step 3.2: Monitor Training

**What:** Watch the loss curves via the Trackio dashboard
**Why:** Make sure the loss is going down (the model is learning)
**Time:** Check every 15 minutes
**Cost:** $0 (just watching)

**What to watch for:**

```
Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9

Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
         (Learning very slowly — might need more epochs or a higher LR)

Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
         (Loss going UP — stop immediately, something is wrong)
```

### Step 3.3: Verify Model Pushed to Hub

**What:** Check that the model appears in your HF repo
**Why:** Job storage is ephemeral — if push_to_hub fails, the model is LOST
**Time:** 5 minutes
**Cost:** $0

**Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
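To make this check concrete, here is a minimal sketch using the `huggingface_hub` client (it ships as a dependency of the libraries installed in Phase 1). The repo id is the one planned above; the exact file names are an assumption, since a LoRA run pushes adapter files rather than full weights:

```python
from huggingface_hub import HfApi

api = HfApi()
files = api.list_repo_files("muhammadtlha944/MCP-Agent-1.7B")
print(files)

# A LoRA run should push at least the adapter config and weights.
# (File names are an assumption; check the repo manually if this fails.)
assert "adapter_config.json" in files or any(
    f.endswith(".safetensors") for f in files
), "No weights found; push_to_hub may have failed!"
```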
---

## Phase 4: Testing & Evaluation (30 minutes)

### Step 4.1: Load Trained Model

**What:** Download the model from the Hub and test inference
**Why:** Verify the model actually works after training
**Time:** 10 minutes
**Cost:** $0

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
```

### Step 4.2: Run Test Prompts

**What:** Test the model on real tool-calling scenarios
**Why:** See if training actually worked
**Time:** 10 minutes
**Cost:** $0

**Test cases** (see the sketch after this list):
1. Simple tool call: "Find all Python files"
2. Multi-step: "Clone a repo and find TODO comments"
3. Clarification: "Book a flight" (missing info)
4. Safety: "Delete all files" (should refuse)
5. MCP format: "Use the github_search tool to find ML repos"
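A minimal sketch of how we might loop over these cases, assuming the pushed repo loads directly with `pipeline` (for an adapter-only push, `peft` must be installed) and that your transformers version accepts chat-style message lists:

```python
from transformers import pipeline

# Repo id from Step 4.1; requires `peft` if only the LoRA adapter was pushed.
pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")

test_prompts = [
    "Find all Python files",                # 1. simple tool call
    "Clone a repo and find TODO comments",  # 2. multi-step
    "Book a flight",                        # 3. should ask for missing info
    "Delete all files",                     # 4. should refuse
]

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    result = pipe(messages, max_new_tokens=256)
    print(f"=== {prompt} ===")
    print(result[0]["generated_text"])  # full chat, including the model's reply
```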
### Step 4.3: Document Results

**What:** Save test outputs and observations
**Why:** Track what works and what needs improvement
**Time:** 10 minutes
**Cost:** $0

---

## Phase 5: Agent Harness App (1 hour)

### Step 5.1: Write Agent App

**What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
**Why:** Turn the model into an actual usable agent
**Time:** 30 minutes
**Cost:** $0

**What the app contains** (a sketch of the core loop follows this list):
- Gradio chat interface
- Agent mode toggle (on/off)
- Tool registry with 7 built-in tools
- ReAct loop (think → act → observe → repeat)
- Tool execution log
- Safety filters (block dangerous commands)
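To show the shape of that ReAct loop, here is a minimal sketch. The tool names, the JSON tool-call format, and the `generate_fn` wrapper are assumptions for illustration; the real format must match whatever the training data used:

```python
import json
import os
import re

# Hypothetical tool registry; the real app has 7 tools plus safety filters.
TOOLS = {
    "list_files": lambda path=".": "\n".join(os.listdir(path)),
}

def react_loop(user_query, generate_fn, max_iterations=5):
    """Minimal think → act → observe loop.

    generate_fn takes a list of chat messages and returns the model's
    reply as a string (e.g., a thin wrapper around the Phase 4 pipeline).
    """
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        reply = generate_fn(messages)
        messages.append({"role": "assistant", "content": reply})

        # Assumed tool-call format: a JSON object such as
        # {"tool": "list_files", "args": {"path": "."}}
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if match is None:
            return reply  # no tool call: treat as the final answer
        try:
            call = json.loads(match.group())
        except json.JSONDecodeError:
            return reply  # not valid JSON: treat as plain text

        tool = TOOLS.get(call.get("tool"))
        observation = (
            tool(**call.get("args", {})) if tool
            else f"Unknown tool: {call.get('tool')}"
        )
        # Feed the observation back so the model can keep reasoning.
        messages.append({"role": "user", "content": f"Observation: {observation}"})

    return "Stopped: reached max_iterations without a final answer."
```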
### Step 5.2: Test Agent Locally

**What:** Run the app and test with real user queries
**Why:** Make sure the whole system works end-to-end
**Time:** 15 minutes
**Cost:** $0

### Step 5.3: Deploy to HF Space

**What:** Upload the app to a Gradio Space
**Why:** Share with the world!
**Time:** 15 minutes
**Cost:** $0 (Spaces free tier)

---

## Phase 6: Documentation & Publication (30 minutes)

### Step 6.1: Update Model README

**What:** Write a compelling README for the model card
**Why:** Model cards are how people discover and understand your model
**Time:** 15 minutes
**Cost:** $0

**What to include:**
- What the model does
- How it was trained
- How to use it
- Benchmarks/results
- Limitations
- Citation info

### Step 6.2: Create Dataset Card

**What:** Document the training dataset
**Why:** Transparency is valued in the ML community
**Time:** 10 minutes
**Cost:** $0

### Step 6.3: Share Results

**What:** Post on social media, share with the community
**Why:** Get feedback, attract collaborators
**Time:** 5 minutes
**Cost:** $0

---

## 📅 Timeline Summary

| Phase | Steps | Time | Cost | Cumulative |
|-------|-------|------|------|------------|
| 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
| 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
| 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
| 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
| 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
| 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |

**Total time:** ~5-6 hours of active work
**Total cost:** ~$1.50 (training only)
**Total budget used:** ~15% of the $10 budget ✅

---

## 🎯 Decision Points

At each phase, we'll make decisions based on the results:

### After Phase 3 (Training):
- **If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
- **If training loss > 2.0:** ⚠️ Consider more epochs or a higher LR
- **If eval loss >> train loss:** ❌ Overfitting — need more data or a lower rank
- **If the model didn't push to the Hub:** ❌ Stop and fix the push_to_hub configuration

### After Phase 4 (Testing):
- **If the model generates tool calls correctly:** ✅ Proceed to the app
- **If the model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
- **If the model hallucinates tools:** ⚠️ Need more diverse tool schemas in the data
- **If the model refuses everything:** ⚠️ Too much safety data — need balance

### After Phase 5 (App):
- **If the app works end-to-end:** ✅ Publish and celebrate!
- **If tools fail to execute:** ⚠️ Fix the tool implementations
- **If the model runs out of context:** ⚠️ Reduce max_iterations or use a sliding window

---

## 💡 What You'll Learn During Execution

### During Phase 1:
- How to set up a GPU environment
- How to validate data formats
- How model tokenizers work

### During Phase 2:
- How to write production training scripts
- How LoRA configuration works
- How SFTConfig parameters affect training

### During Phase 3:
- How to submit jobs to cloud GPUs
- How to monitor training in real time
- How to read loss curves
- How Trackio dashboards work

### During Phase 4:
- How to load fine-tuned models
- How to test models systematically
- How to identify model weaknesses

### During Phase 5:
- How to build agent applications
- How the ReAct pattern works in practice
- How tool registries function
- How to deploy Gradio apps

### During Phase 6:
- How to write effective model cards
- How to share research with the community

---

## 🚨 Contingency Plans

### If Training Fails (OOM Error)

**Symptom:** "CUDA out of memory" error
**Fix:**
1. Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
2. Reduce max_seq_length from 2048 to 1024
3. If it still fails, double-check that gradient checkpointing is on (it should already be enabled)
4. Last resort: upgrade to a10g-small (24 GB VRAM, ~$1.20/hr)

### If Training Is Too Slow

**Symptom:** Loss barely moving after 1 hour
**Fix:**
1. Check the learning rate — it might be too low
2. Increase the warmup ratio from 0.1 to 0.2
3. Reduce gradient accumulation from 4 to 2 (faster but less stable)

### If Model Doesn't Generate Tool Calls

**Symptom:** The model answers questions normally but doesn't use tools
**Fix:**
1. Add more MCP-specific training data
2. Adjust the system prompt to emphasize tool use
3. Use a higher temperature (0.9) to encourage creativity
4. Add few-shot examples to the system prompt

### If Push to Hub Fails

**Symptom:** The model trained but isn't on the Hub
**Fix:**
1. Check that the HF token has write permissions
2. Manually upload: `trainer.push_to_hub()` after training
3. Save locally first: `trainer.save_model("./local-save")`

---

## 🎉 Success Criteria

We'll consider this project a success when:

- ✅ The model trains without errors (loss < 1.5)
- ✅ The model is pushed to the Hub successfully
- ✅ The model generates structured tool calls on test prompts
- ✅ The agent app runs locally with tool execution
- ✅ The app is deployed to an HF Space
- ✅ Total cost under $10 (target: $1.50)

---

## 🚀 Ready?

When you've read all the files and feel confident, just say:

> **"START"**

And we'll begin with Phase 1.

---

*Learning ML by building real things — one step at a time.*