| # 06 β Execution Plan: What We'll Do When You Say "START" |
|
|
| ## π The Plan |
|
|
| When you say **"START"**, here is the EXACT sequence of steps we'll follow. |
| Each step has a clear goal, estimated time, and cost. |
|
|
| --- |
|
|
| ## Phase 1: Setup & Validation (15 minutes) |
|
|
| ### Step 1.1: Create Training Sandbox |
| **What:** Set up a GPU sandbox with all dependencies installed |
| **Why:** Test that everything works before spending money on a real training job |
| **Time:** 5 minutes |
| **Cost:** $0 |
|
|
| ```bash |
| pip install transformers trl peft datasets accelerate bitsandbytes torch trackio |
| ``` |
|
|
| ### Step 1.2: Validate Dataset Format |
| **What:** Load your dataset and verify it works with SFTTrainer |
| **Why:** Catch format issues BEFORE training starts (saves hours of debugging) |
| **Time:** 5 minutes |
| **Cost:** $0 |
|
|
| ```python |
| from datasets import load_dataset |
| dataset = load_dataset("muhammadtlha944/mcp-agent-training-data") |
| print(dataset["train"][0]) # Peek at first example |
| ``` |
|
|
| ### Step 1.3: Verify Model Compatibility |
| **What:** Load Qwen3-1.7B tokenizer and test chat template |
| **Why:** Make sure the model can process our messages format |
| **Time:** 5 minutes |
| **Cost:** $0 |
|
|
| ```python |
| from transformers import AutoTokenizer |
| tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B") |
| print(tokenizer.chat_template) # Should not be None |
| ``` |
|
|
| --- |
|
|
| ## Phase 2: Training Script Development (30 minutes) |
|
|
| ### Step 2.1: Write Training Script |
| **What:** Create `train.py` with full educational comments |
| **Why:** Every line documented so you learn as we build |
| **Time:** 15 minutes |
| **Cost:** $0 |
|
|
| **What the script contains:** |
| - LoRA configuration (r=16, all-linear, dropout=0.05) |
| - SFTConfig with all hyperparameters documented |
| - Trackio monitoring setup |
| - push_to_hub configuration |
| - Plain-text logging (no tqdm progress bars) |
|
|
| ### Step 2.2: Test Script in Sandbox |
| **What:** Run the script for 10 steps to catch errors |
| **Why:** Find bugs NOW before the expensive training job |
| **Time:** 10 minutes |
| **Cost:** $0 (sandbox GPU time) |
|
|
| ```python |
| # Run just 10 steps as a smoke test |
| training_args.max_steps = 10 |
| trainer.train() |
| ``` |
|
|
| ### Step 2.3: Review & Fix Issues |
| **What:** Fix any import errors, API mismatches, or config issues |
| **Why:** Training jobs are expensive β we only launch when the script is solid |
| **Time:** 5 minutes |
| **Cost:** $0 |
|
|
| --- |
|
|
| ## Phase 3: Model Training (2-3 hours) |
|
|
| ### Step 3.1: Launch Training Job |
| **What:** Submit training to HF Jobs on T4 GPU |
| **Why:** T4 is cheapest GPU that fits our model (16GB VRAM) |
| **Time:** 2-3 hours (automated) |
| **Cost:** ~$1.20-1.80 |
|
|
| **Pre-flight check before launch:** |
| - β
Dataset format validated |
| - β
Script tested in sandbox |
| - β
push_to_hub=True and hub_model_id set |
| - β
Timeout set to 4 hours (plenty of buffer) |
| - β
Trackio monitoring enabled |
| - β
disable_tqdm=True for clean logs |
| |
| ### Step 3.2: Monitor Training |
| **What:** Watch loss curves via Trackio dashboard |
| **Why:** Make sure loss is going down (model is learning) |
| **Time:** Check every 15 minutes |
| **Cost:** $0 (just watching) |
| |
| **What to watch for:** |
| ``` |
| Good: Step 100: loss=2.5 β Step 500: loss=1.2 β Step 2450: loss=0.9 |
| Warning: Step 100: loss=2.5 β Step 500: loss=2.4 β Step 1000: loss=2.3 |
| (Learning very slowly β might need more epochs or higher LR) |
| Bad: Step 100: loss=2.5 β Step 500: loss=3.0 β Step 1000: loss=3.5 |
| (Loss going UP β stop immediately, something is wrong) |
| ``` |
| |
| ### Step 3.3: Verify Model Pushed to Hub |
| **What:** Check that the model appears in your HF repo |
| **Why:** Job storage is ephemeral β if push_to_hub fails, model is LOST |
| **Time:** 5 minutes |
| **Cost:** $0 |
| |
| **Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B |
| |
| --- |
| |
| ## Phase 4: Testing & Evaluation (30 minutes) |
| |
| ### Step 4.1: Load Trained Model |
| **What:** Download the model from Hub and test inference |
| **Why:** Verify the model actually works after training |
| **Time:** 10 minutes |
| **Cost:** $0 |
| |
| ```python |
| from transformers import pipeline |
| pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B") |
| ``` |
| |
| ### Step 4.2: Run Test Prompts |
| **What:** Test the model on real tool-calling scenarios |
| **Why:** See if training actually worked |
| **Time:** 10 minutes |
| **Cost:** $0 |
| |
| **Test cases:** |
| 1. Simple tool call: "Find all Python files" |
| 2. Multi-step: "Clone a repo and find TODO comments" |
| 3. Clarification: "Book a flight" (missing info) |
| 4. Safety: "Delete all files" (should refuse) |
| 5. MCP format: "Use the github_search tool to find ML repos" |
|
|
| ### Step 4.3: Document Results |
| **What:** Save test outputs and observations |
| **Why:** Track what works and what needs improvement |
| **Time:** 10 minutes |
| **Cost:** $0 |
|
|
| --- |
|
|
| ## Phase 5: Agent Harness App (1 hour) |
|
|
| ### Step 5.1: Write Agent App |
| **What:** Create `app.py` with Gradio UI + ReAct loop + tool registry |
| **Why:** Turn the model into an actual usable agent |
| **Time:** 30 minutes |
| **Cost:** $0 |
|
|
| **What the app contains:** |
| - Gradio chat interface |
| - Agent mode toggle (on/off) |
| - Tool registry with 7 built-in tools |
| - ReAct loop (think β act β observe β repeat) |
| - Tool execution log |
| - Safety filters (block dangerous commands) |
|
|
| ### Step 5.2: Test Agent Locally |
| **What:** Run the app and test with real user queries |
| **Why:** Make sure the whole system works end-to-end |
| **Time:** 15 minutes |
| **Cost:** $0 |
|
|
| ### Step 5.3: Deploy to HF Space |
| **What:** Upload app to a Gradio Space |
| **Why:** Share with the world! |
| **Time:** 15 minutes |
| **Cost:** $0 (Spaces free tier) |
|
|
| --- |
|
|
| ## Phase 6: Documentation & Publication (30 minutes) |
|
|
| ### Step 6.1: Update Model README |
| **What:** Write a compelling README for the model card |
| **Why:** Model cards are how people discover and understand your model |
| **Time:** 15 minutes |
| **Cost:** $0 |
|
|
| **What to include:** |
| - What the model does |
| - How it was trained |
| - How to use it |
| - Benchmarks/results |
| - Limitations |
| - Citation info |
|
|
| ### Step 6.2: Create Dataset Card |
| **What:** Document the training dataset |
| **Why:** Transparency is valued in the ML community |
| **Time:** 10 minutes |
| **Cost:** $0 |
|
|
| ### Step 6.3: Share Results |
| **What:** Post on social media, share with community |
| **Why:** Get feedback, attract collaborators |
| **Time:** 5 minutes |
| **Cost:** $0 |
|
|
| --- |
|
|
| ## π
Timeline Summary |
|
|
| | Phase | Steps | Time | Cost | Cumulative | |
| |-------|-------|------|------|------------| |
| | 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 | |
| | 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 | |
| | 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 | |
| | 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 | |
| | 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 | |
| | 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 | |
|
|
| **Total time:** ~5-6 hours of active work |
| **Total cost:** ~$1.50 (training only) |
| **Total budget used:** ~15% of $10 budget β
|
|
|
| --- |
|
|
| ## π― Decision Points |
|
|
| At each phase, we'll make decisions based on results: |
|
|
| ### After Phase 3 (Training): |
| **If training loss < 1.5 and eval loss < 1.8:** β
Proceed to testing |
| **If training loss > 2.0:** β οΈ Consider more epochs or higher LR |
| **If eval loss >> train loss:** β Overfitting β need more data or lower rank |
| **If model didn't push to Hub:** β Stop and fix push_to_hub configuration |
|
|
| ### After Phase 4 (Testing): |
| **If model generates tool calls correctly:** β
Proceed to app |
| **If model generates text but not tool calls:** β οΈ Need more MCP-specific training data |
| **If model hallucinates tools:** β οΈ Need more diverse tool schemas in data |
| **If model refuses everything:** β οΈ Too much safety data β need balance |
|
|
| ### After Phase 5 (App): |
| **If app works end-to-end:** β
Publish and celebrate! |
| **If tools fail to execute:** β οΈ Fix tool implementations |
| **If model runs out of context:** β οΈ Reduce max_iterations or use sliding window |
| |
| --- |
| |
| ## π‘ What You'll Learn During Execution |
| |
| ### During Phase 1: |
| - How to set up a GPU environment |
| - How to validate data formats |
| - How model tokenizers work |
| |
| ### During Phase 2: |
| - How to write production training scripts |
| - How LoRA configuration works |
| - How SFTConfig parameters affect training |
| |
| ### During Phase 3: |
| - How to submit jobs to cloud GPUs |
| - How to monitor training in real-time |
| - How to read loss curves |
| - How Trackio dashboards work |
| |
| ### During Phase 4: |
| - How to load fine-tuned models |
| - How to test models systematically |
| - How to identify model weaknesses |
| |
| ### During Phase 5: |
| - How to build agent applications |
| - How the ReAct pattern works in practice |
| - How tool registries function |
| - How to deploy Gradio apps |
| |
| ### During Phase 6: |
| - How to write effective model cards |
| - How to share research with the community |
| |
| --- |
| |
| ## π¨ Contingency Plans |
| |
| ### If Training Fails (OOM Error) |
| **Symptom:** "CUDA out of memory" error |
| **Fix:** |
| 1. Reduce batch_size from 4 to 2 (keep accumulation at 4 β effective batch = 8) |
| 2. Reduce max_seq_length from 2048 to 1024 |
| 3. If still fails, use gradient checkpointing (already enabled) |
| 4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr) |
|
|
| ### If Training Is Too Slow |
| **Symptom:** Loss barely moving after 1 hour |
| **Fix:** |
| 1. Check learning rate β might be too low |
| 2. Increase warmup ratio from 0.1 to 0.2 |
| 3. Reduce gradient accumulation from 4 to 2 (faster but less stable) |
|
|
| ### If Model Doesn't Generate Tool Calls |
| **Symptom:** Model answers questions normally but doesn't use tools |
| **Fix:** |
| 1. Add more MCP-specific training data |
| 2. Adjust system prompt to emphasize tool use |
| 3. Use higher temperature (0.9) to encourage creativity |
| 4. Add few-shot examples in the system prompt |
|
|
| ### If Push to Hub Fails |
| **Symptom:** Model trained but not on Hub |
| **Fix:** |
| 1. Check HF token has write permissions |
| 2. Manually upload: `trainer.push_to_hub()` after training |
| 3. Save locally first: `trainer.save_model("./local-save")` |
|
|
| --- |
|
|
| ## π Success Criteria |
|
|
| We'll consider this project a success when: |
|
|
| - β
Model trains without errors (loss < 1.5) |
| - β
Model pushed to Hub successfully |
| - β
Model generates structured tool calls on test prompts |
| - β
Agent app runs locally with tool execution |
| - β
App deployed to HF Space |
| - β
Total cost under $10 (target: $1.50) |
|
|
| --- |
|
|
| ## π Ready? |
|
|
| When you've read all the files and feel confident, just say: |
|
|
| > **"START"** |
|
|
| And we'll begin with Phase 1. |
|
|
| --- |
|
|
| *Learning ML by building real things β one step at a time.* |
|
|