# 06 - Execution Plan: What We'll Do When You Say "START"
## 🚀 The Plan
When you say **"START"**, here is the EXACT sequence of steps we'll follow.
Each step has a clear goal, estimated time, and cost.
---
## Phase 1: Setup & Validation (15 minutes)
### Step 1.1: Create Training Sandbox
**What:** Set up a GPU sandbox with all dependencies installed
**Why:** Test that everything works before spending money on a real training job
**Time:** 5 minutes
**Cost:** $0
```bash
pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
```
### Step 1.2: Validate Dataset Format
**What:** Load your dataset and verify it works with SFTTrainer
**Why:** Catch format issues BEFORE training starts (saves hours of debugging)
**Time:** 5 minutes
**Cost:** $0
```python
from datasets import load_dataset

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0])  # Peek at the first example
# SFTTrainer expects a "messages" column of chat turns; fail fast if it's missing
assert "messages" in dataset["train"].column_names
```
### Step 1.3: Verify Model Compatibility
**What:** Load Qwen3-1.7B tokenizer and test chat template
**Why:** Make sure the model can process our messages format
**Time:** 5 minutes
**Cost:** $0
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template)  # Should not be None
# Render a sample conversation to confirm the template accepts our messages format
sample = [{"role": "user", "content": "Find all Python files"}]
print(tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True))
```
---
## Phase 2: Training Script Development (30 minutes)
### Step 2.1: Write Training Script
**What:** Create `train.py` with full educational comments
**Why:** Every line documented so you learn as we build
**Time:** 15 minutes
**Cost:** $0
**What the script contains:**
- LoRA configuration (r=16, all-linear, dropout=0.05)
- SFTConfig with all hyperparameters documented
- Trackio monitoring setup
- push_to_hub configuration
- Plain-text logging (no tqdm progress bars)
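A minimal sketch of that script's core, assuming TRL's `SFTTrainer`/`SFTConfig` and PEFT's `LoraConfig`; values not pinned down above (alpha, epochs, learning rate, logging cadence) are illustrative placeholders, not final choices:
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

# LoRA settings from the plan above; alpha is an illustrative choice (2x rank)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="MCP-Agent-1.7B",
    num_train_epochs=3,                  # illustrative
    learning_rate=2e-4,                  # illustrative
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 4 x 4 = 16
    warmup_ratio=0.1,
    gradient_checkpointing=True,
    max_seq_length=2048,                 # `max_length` in newer TRL releases
    push_to_hub=True,
    hub_model_id="muhammadtlha944/MCP-Agent-1.7B",
    report_to="trackio",                 # assumes a transformers version with Trackio support
    disable_tqdm=True,                   # plain-text logs, no progress bars
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    train_dataset=dataset["train"],
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
```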
### Step 2.2: Test Script in Sandbox
**What:** Run the script for 10 steps to catch errors
**Why:** Find bugs NOW before the expensive training job
**Time:** 10 minutes
**Cost:** $0 (sandbox GPU time)
```python
# Smoke test: reuse training_args and trainer from train.py,
# capping the run at 10 optimizer steps before calling train()
training_args.max_steps = 10
trainer.train()
```
### Step 2.3: Review & Fix Issues
**What:** Fix any import errors, API mismatches, or config issues
**Why:** Training jobs are expensive; we only launch when the script is solid
**Time:** 5 minutes
**Cost:** $0
---
## Phase 3: Model Training (2-3 hours)
### Step 3.1: Launch Training Job
**What:** Submit training to HF Jobs on T4 GPU
**Why:** The T4 is the cheapest GPU that fits our model (16GB VRAM)
**Time:** 2-3 hours (automated)
**Cost:** ~$1.20-1.80
**Pre-flight check before launch:**
- ✅ Dataset format validated
- ✅ Script tested in sandbox
- ✅ push_to_hub=True and hub_model_id set
- ✅ Timeout set to 4 hours (plenty of buffer)
- ✅ Trackio monitoring enabled
- ✅ disable_tqdm=True for clean logs
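One way to submit the job, assuming the `hf jobs` CLI from a recent `huggingface_hub`; the flavor and flag names here are assumptions, so confirm them with `hf jobs uv run --help` before launching:
```bash
# Flag names below are assumptions; verify with `hf jobs uv run --help`
#   --flavor t4-small  : 16GB T4 tier
#   --timeout 4h       : buffer over the 2-3 hour estimate
#   --secrets HF_TOKEN : required for push_to_hub
hf jobs uv run --flavor t4-small --timeout 4h --secrets HF_TOKEN train.py
```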
### Step 3.2: Monitor Training
**What:** Watch loss curves via Trackio dashboard
**Why:** Make sure loss is going down (model is learning)
**Time:** Check every 15 minutes
**Cost:** $0 (just watching)
**What to watch for:**
```
Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9
Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
         (Learning very slowly; might need more epochs or a higher LR)
Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
         (Loss going UP; stop immediately, something is wrong)
```
### Step 3.3: Verify Model Pushed to Hub
**What:** Check that the model appears in your HF repo
**Why:** Job storage is ephemeral; if push_to_hub fails, the model is LOST
**Time:** 5 minutes
**Cost:** $0
**Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
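A quick programmatic check with `huggingface_hub` (exact file names depend on what the trainer pushed; a LoRA run typically uploads adapter files):
```python
from huggingface_hub import list_repo_files

print(list_repo_files("muhammadtlha944/MCP-Agent-1.7B"))
# Expect config and weight files, e.g. adapter_model.safetensors for a LoRA run
```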
---
## Phase 4: Testing & Evaluation (30 minutes)
### Step 4.1: Load Trained Model
**What:** Download the model from Hub and test inference
**Why:** Verify the model actually works after training
**Time:** 10 minutes
**Cost:** $0
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
messages = [{"role": "user", "content": "Find all Python files in this repo"}]
print(pipe(messages, max_new_tokens=256)[0]["generated_text"])
```
### Step 4.2: Run Test Prompts
**What:** Test the model on real tool-calling scenarios
**Why:** See if training actually worked
**Time:** 10 minutes
**Cost:** $0
**Test cases:**
1. Simple tool call: "Find all Python files"
2. Multi-step: "Clone a repo and find TODO comments"
3. Clarification: "Book a flight" (missing info)
4. Safety: "Delete all files" (should refuse)
5. MCP format: "Use the github_search tool to find ML repos"
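A sketch that runs all five through the `pipe` from Step 4.1 (the prompt wording is illustrative):
```python
test_prompts = [
    "Find all Python files",                        # 1. simple tool call
    "Clone a repo and find TODO comments",          # 2. multi-step
    "Book a flight",                                # 3. should ask for missing info
    "Delete all files",                             # 4. should refuse
    "Use the github_search tool to find ML repos",  # 5. MCP format
]
for prompt in test_prompts:
    result = pipe([{"role": "user", "content": prompt}], max_new_tokens=256)
    print(f"--- {prompt}\n{result[0]['generated_text'][-1]['content']}\n")
```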
### Step 4.3: Document Results
**What:** Save test outputs and observations
**Why:** Track what works and what needs improvement
**Time:** 10 minutes
**Cost:** $0
---
## Phase 5: Agent Harness App (1 hour)
### Step 5.1: Write Agent App
**What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
**Why:** Turn the model into an actual usable agent
**Time:** 30 minutes
**Cost:** $0
**What the app contains:**
- Gradio chat interface
- Agent mode toggle (on/off)
- Tool registry with 7 built-in tools
- ReAct loop (think → act → observe → repeat); see the sketch after this list
- Tool execution log
- Safety filters (block dangerous commands)
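As a sketch of how the ReAct loop, tool registry, and safety filter fit together (the JSON tool-call format, tool names, and blocklist here are assumptions; the real `app.py` will define its own):
```python
import json
import re

BLOCKED = re.compile(r"\brm\s+-rf\b|\bdel\s+/", re.IGNORECASE)  # toy safety filter

# Hypothetical registry: the real app would register 7 tools
TOOLS = {
    "echo": lambda args: args.get("text", ""),
}

def parse_tool_call(text):
    """Pull a {"tool": ..., "args": {...}} JSON object out of model output, if any."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def react_loop(generate, user_query, max_iterations=5):
    """Think -> act -> observe, repeated until the model stops calling tools."""
    history = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        reply = generate(history)                      # think
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None or call["tool"] not in TOOLS:
            return reply                               # final answer, no tool requested
        if BLOCKED.search(json.dumps(call)):
            return "Blocked: dangerous command."       # safety filter
        observation = TOOLS[call["tool"]](call.get("args", {}))  # act
        history.append({"role": "user", "content": f"Observation: {observation}"})  # observe
    return "Stopped: max_iterations reached."
```
A Gradio `gr.ChatInterface` can then wrap `react_loop`, passing the fine-tuned pipeline as `generate` and surfacing each observation in the tool execution log.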
### Step 5.2: Test Agent Locally
**What:** Run the app and test with real user queries
**Why:** Make sure the whole system works end-to-end
**Time:** 15 minutes
**Cost:** $0
### Step 5.3: Deploy to HF Space
**What:** Upload app to a Gradio Space
**Why:** Share with the world!
**Time:** 15 minutes
**Cost:** $0 (Spaces free tier)
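Uploading can be done from the website or programmatically; a sketch with `huggingface_hub` (the Space name `mcp-agent-demo` is a placeholder):
```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "muhammadtlha944/mcp-agent-demo"  # placeholder Space name
api.create_repo(repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
api.upload_file(path_or_fileobj="app.py", path_in_repo="app.py",
                repo_id=repo_id, repo_type="space")
```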
---
## Phase 6: Documentation & Publication (30 minutes)
### Step 6.1: Update Model README
**What:** Write a compelling README for the model card
**Why:** Model cards are how people discover and understand your model
**Time:** 15 minutes
**Cost:** $0
**What to include:**
- What the model does
- How it was trained
- How to use it
- Benchmarks/results
- Limitations
- Citation info
### Step 6.2: Create Dataset Card
**What:** Document the training dataset
**Why:** Transparency is valued in the ML community
**Time:** 10 minutes
**Cost:** $0
### Step 6.3: Share Results
**What:** Post on social media, share with community
**Why:** Get feedback, attract collaborators
**Time:** 5 minutes
**Cost:** $0
---
## 📅 Timeline Summary
| Phase | Steps | Time | Cost | Cumulative |
|-------|-------|------|------|------------|
| 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
| 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
| 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
| 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
| 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
| 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |
**Total time:** ~5-6 hours end to end (training runs unattended for 2-3 of those)
**Total cost:** ~$1.50 (training only)
**Total budget used:** ~15% of the $10 budget ✅
---
## 🎯 Decision Points
At each phase, we'll make decisions based on results:
### After Phase 3 (Training):
**If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
**If training loss > 2.0:** ⚠️ Consider more epochs or a higher LR
**If eval loss >> train loss:** ❌ Overfitting; need more data or a lower rank
**If model didn't push to Hub:** ❌ Stop and fix the push_to_hub configuration
### After Phase 4 (Testing):
**If model generates tool calls correctly:** ✅ Proceed to app
**If model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
**If model hallucinates tools:** ⚠️ Need more diverse tool schemas in the data
**If model refuses everything:** ⚠️ Too much safety data; need balance
### After Phase 5 (App):
**If app works end-to-end:** ✅ Publish and celebrate!
**If tools fail to execute:** ⚠️ Fix the tool implementations
**If model runs out of context:** ⚠️ Reduce max_iterations or use a sliding window
---
## 💡 What You'll Learn During Execution
### During Phase 1:
- How to set up a GPU environment
- How to validate data formats
- How model tokenizers work
### During Phase 2:
- How to write production training scripts
- How LoRA configuration works
- How SFTConfig parameters affect training
### During Phase 3:
- How to submit jobs to cloud GPUs
- How to monitor training in real-time
- How to read loss curves
- How Trackio dashboards work
### During Phase 4:
- How to load fine-tuned models
- How to test models systematically
- How to identify model weaknesses
### During Phase 5:
- How to build agent applications
- How the ReAct pattern works in practice
- How tool registries function
- How to deploy Gradio apps
### During Phase 6:
- How to write effective model cards
- How to share research with the community
---
## 🚨 Contingency Plans
### If Training Fails (OOM Error)
**Symptom:** "CUDA out of memory" error
**Fix:**
1. Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
2. Reduce max_seq_length from 2048 to 1024
3. If it still fails, confirm gradient checkpointing is on (it should already be enabled)
4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)
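In config terms, fixes 1 and 2 amount to the following (the sequence-length field is `max_seq_length` in older TRL releases and `max_length` in newer ones):
```python
training_args.per_device_train_batch_size = 2  # was 4
training_args.gradient_accumulation_steps = 4  # effective batch = 2 x 4 = 8
training_args.max_seq_length = 1024            # was 2048
```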
### If Training Is Too Slow
**Symptom:** Loss barely moving after 1 hour
**Fix:**
1. Check the learning rate; it might be too low
2. Increase warmup ratio from 0.1 to 0.2
3. Reduce gradient accumulation from 4 to 2 (faster but less stable)
### If Model Doesn't Generate Tool Calls
**Symptom:** Model answers questions normally but doesn't use tools
**Fix:**
1. Add more MCP-specific training data
2. Adjust system prompt to emphasize tool use
3. Use a higher temperature (0.9) to encourage creativity
4. Add few-shot examples in the system prompt
### If Push to Hub Fails
**Symptom:** Model trained but not on Hub
**Fix:**
1. Check HF token has write permissions
2. Manually upload: `trainer.push_to_hub()` after training
3. Save locally first: `trainer.save_model("./local-save")`
---
## 🎉 Success Criteria
We'll consider this project a success when:
- ✅ Model trains without errors (loss < 1.5)
- ✅ Model pushed to Hub successfully
- ✅ Model generates structured tool calls on test prompts
- ✅ Agent app runs locally with tool execution
- ✅ App deployed to HF Space
- ✅ Total cost under $10 (target: $1.50)
---
## 🚀 Ready?
When you've read all the files and feel confident, just say:
> **"START"**
And we'll begin with Phase 1.
---
*Learning ML by building real things, one step at a time.*