# 06 - Execution Plan: What We'll Do When You Say "START"
## 🚀 The Plan
When you say **"START"**, here is the EXACT sequence of steps we'll follow.
Each step has a clear goal, estimated time, and cost.
---
## Phase 1: Setup & Validation (15 minutes)
### Step 1.1: Create Training Sandbox
**What:** Set up a GPU sandbox with all dependencies installed
**Why:** Test that everything works before spending money on a real training job
**Time:** 5 minutes
**Cost:** $0
```bash
pip install transformers trl peft datasets accelerate bitsandbytes torch trackio
```
### Step 1.2: Validate Dataset Format
**What:** Load your dataset and verify it works with SFTTrainer
**Why:** Catch format issues BEFORE training starts (saves hours of debugging)
**Time:** 5 minutes
**Cost:** $0
```python
from datasets import load_dataset

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0])  # Peek at the first example
# SFTTrainer expects a "messages" column of chat turns; fail fast if it's missing
assert "messages" in dataset["train"].column_names
```
### Step 1.3: Verify Model Compatibility
**What:** Load Qwen3-1.7B tokenizer and test chat template
**Why:** Make sure the model can process our messages format
**Time:** 5 minutes
**Cost:** $0
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template)  # Should not be None
# Render a sample conversation to confirm the template accepts our messages format
sample = [{"role": "user", "content": "Find all Python files"}]
print(tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True))
```
---
## Phase 2: Training Script Development (30 minutes)
### Step 2.1: Write Training Script
**What:** Create `train.py` with full educational comments
**Why:** Every line documented so you learn as we build
**Time:** 15 minutes
**Cost:** $0
**What the script contains:**
- LoRA configuration (r=16, all-linear, dropout=0.05)
- SFTConfig with all hyperparameters documented
- Trackio monitoring setup
- push_to_hub configuration
- Plain-text logging (no tqdm progress bars)
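A minimal sketch of that script's core, assuming TRL's `SFTTrainer`/`SFTConfig` and PEFT's `LoraConfig`; values not pinned down above (alpha, epochs, learning rate, logging cadence) are illustrative placeholders, not final choices:
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")

# LoRA settings from the plan above; alpha is an illustrative choice (2x rank)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="MCP-Agent-1.7B",
    num_train_epochs=3,                  # illustrative
    learning_rate=2e-4,                  # illustrative
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch = 4 x 4 = 16
    warmup_ratio=0.1,
    gradient_checkpointing=True,
    max_seq_length=2048,                 # `max_length` in newer TRL releases
    push_to_hub=True,
    hub_model_id="muhammadtlha944/MCP-Agent-1.7B",
    report_to="trackio",                 # assumes a transformers version with Trackio support
    disable_tqdm=True,                   # plain-text logs, no progress bars
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    train_dataset=dataset["train"],
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
```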
### Step 2.2: Test Script in Sandbox
**What:** Run the script for 10 steps to catch errors
**Why:** Find bugs NOW before the expensive training job
**Time:** 10 minutes
**Cost:** $0 (sandbox GPU time)
```python
# Smoke test: reuse training_args and trainer from train.py,
# capping the run at 10 optimizer steps before calling train()
training_args.max_steps = 10
trainer.train()
```
### Step 2.3: Review & Fix Issues
**What:** Fix any import errors, API mismatches, or config issues
**Why:** Training jobs are expensive; we only launch when the script is solid
**Time:** 5 minutes
**Cost:** $0
---
## Phase 3: Model Training (2-3 hours)
### Step 3.1: Launch Training Job
**What:** Submit training to HF Jobs on T4 GPU
**Why:** The T4 is the cheapest GPU that fits our model (16GB VRAM)
**Time:** 2-3 hours (automated)
**Cost:** ~$1.20-1.80
**Pre-flight check before launch:**
- ✅ Dataset format validated
- ✅ Script tested in sandbox
- ✅ push_to_hub=True and hub_model_id set
- ✅ Timeout set to 4 hours (plenty of buffer)
- ✅ Trackio monitoring enabled
- ✅ disable_tqdm=True for clean logs
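One way to submit the job, assuming the `hf jobs` CLI from a recent `huggingface_hub`; the flavor and flag names here are assumptions, so confirm them with `hf jobs uv run --help` before launching:
```bash
# Flag names below are assumptions; verify with `hf jobs uv run --help`
#   --flavor t4-small  : 16GB T4 tier
#   --timeout 4h       : buffer over the 2-3 hour estimate
#   --secrets HF_TOKEN : required for push_to_hub
hf jobs uv run --flavor t4-small --timeout 4h --secrets HF_TOKEN train.py
```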
### Step 3.2: Monitor Training
**What:** Watch loss curves via Trackio dashboard
**Why:** Make sure loss is going down (model is learning)
**Time:** Check every 15 minutes
**Cost:** $0 (just watching)
**What to watch for:**
```
Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9
Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
         (Learning very slowly; might need more epochs or a higher LR)
Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
         (Loss going UP; stop immediately, something is wrong)
```
### Step 3.3: Verify Model Pushed to Hub
**What:** Check that the model appears in your HF repo
**Why:** Job storage is ephemeral; if push_to_hub fails, the model is LOST
**Time:** 5 minutes
**Cost:** $0
**Check URL:** https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B
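A quick programmatic check with `huggingface_hub` (exact file names depend on what the trainer pushed; a LoRA run typically uploads adapter files):
```python
from huggingface_hub import list_repo_files

print(list_repo_files("muhammadtlha944/MCP-Agent-1.7B"))
# Expect config and weight files, e.g. adapter_model.safetensors for a LoRA run
```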
---
## Phase 4: Testing & Evaluation (30 minutes)
### Step 4.1: Load Trained Model
**What:** Download the model from Hub and test inference
**Why:** Verify the model actually works after training
**Time:** 10 minutes
**Cost:** $0
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")
messages = [{"role": "user", "content": "Find all Python files in this repo"}]
print(pipe(messages, max_new_tokens=256)[0]["generated_text"])
```
### Step 4.2: Run Test Prompts
**What:** Test the model on real tool-calling scenarios
**Why:** See if training actually worked
**Time:** 10 minutes
**Cost:** $0
**Test cases:**
1. Simple tool call: "Find all Python files"
2. Multi-step: "Clone a repo and find TODO comments"
3. Clarification: "Book a flight" (missing info)
4. Safety: "Delete all files" (should refuse)
5. MCP format: "Use the github_search tool to find ML repos"
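A sketch that runs all five through the `pipe` from Step 4.1 (the prompt wording is illustrative):
```python
test_prompts = [
    "Find all Python files",                        # 1. simple tool call
    "Clone a repo and find TODO comments",          # 2. multi-step
    "Book a flight",                                # 3. should ask for missing info
    "Delete all files",                             # 4. should refuse
    "Use the github_search tool to find ML repos",  # 5. MCP format
]
for prompt in test_prompts:
    result = pipe([{"role": "user", "content": prompt}], max_new_tokens=256)
    print(f"--- {prompt}\n{result[0]['generated_text'][-1]['content']}\n")
```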
### Step 4.3: Document Results
**What:** Save test outputs and observations
**Why:** Track what works and what needs improvement
**Time:** 10 minutes
**Cost:** $0
---
## Phase 5: Agent Harness App (1 hour)
### Step 5.1: Write Agent App
**What:** Create `app.py` with Gradio UI + ReAct loop + tool registry
**Why:** Turn the model into an actual usable agent
**Time:** 30 minutes
**Cost:** $0
**What the app contains:**
- Gradio chat interface
- Agent mode toggle (on/off)
- Tool registry with 7 built-in tools
- ReAct loop (think → act → observe → repeat); see the sketch after this list
- Tool execution log
- Safety filters (block dangerous commands)
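As a sketch of how the ReAct loop, tool registry, and safety filter fit together (the JSON tool-call format, tool names, and blocklist here are assumptions; the real `app.py` will define its own):
```python
import json
import re

BLOCKED = re.compile(r"\brm\s+-rf\b|\bdel\s+/", re.IGNORECASE)  # toy safety filter

# Hypothetical registry: the real app would register 7 tools
TOOLS = {
    "echo": lambda args: args.get("text", ""),
}

def parse_tool_call(text):
    """Pull a {"tool": ..., "args": {...}} JSON object out of model output, if any."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def react_loop(generate, user_query, max_iterations=5):
    """Think -> act -> observe, repeated until the model stops calling tools."""
    history = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        reply = generate(history)                      # think
        history.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None or call["tool"] not in TOOLS:
            return reply                               # final answer, no tool requested
        if BLOCKED.search(json.dumps(call)):
            return "Blocked: dangerous command."       # safety filter
        observation = TOOLS[call["tool"]](call.get("args", {}))  # act
        history.append({"role": "user", "content": f"Observation: {observation}"})  # observe
    return "Stopped: max_iterations reached."
```
A Gradio `gr.ChatInterface` can then wrap `react_loop`, passing the fine-tuned pipeline as `generate` and surfacing each observation in the tool execution log.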
### Step 5.2: Test Agent Locally
**What:** Run the app and test with real user queries
**Why:** Make sure the whole system works end-to-end
**Time:** 15 minutes
**Cost:** $0
### Step 5.3: Deploy to HF Space
**What:** Upload app to a Gradio Space
**Why:** Share with the world!
**Time:** 15 minutes
**Cost:** $0 (Spaces free tier)
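Uploading can be done from the website or programmatically; a sketch with `huggingface_hub` (the Space name `mcp-agent-demo` is a placeholder):
```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "muhammadtlha944/mcp-agent-demo"  # placeholder Space name
api.create_repo(repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
api.upload_file(path_or_fileobj="app.py", path_in_repo="app.py",
                repo_id=repo_id, repo_type="space")
```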
---
## Phase 6: Documentation & Publication (30 minutes)
### Step 6.1: Update Model README
**What:** Write a compelling README for the model card
**Why:** Model cards are how people discover and understand your model
**Time:** 15 minutes
**Cost:** $0
**What to include:**
- What the model does
- How it was trained
- How to use it
- Benchmarks/results
- Limitations
- Citation info
### Step 6.2: Create Dataset Card
**What:** Document the training dataset
**Why:** Transparency is valued in the ML community
**Time:** 10 minutes
**Cost:** $0
### Step 6.3: Share Results
**What:** Post on social media, share with community
**Why:** Get feedback, attract collaborators
**Time:** 5 minutes
**Cost:** $0
---
## 📅 Timeline Summary
| Phase | Steps | Time | Cost | Cumulative |
|-------|-------|------|------|------------|
| 1. Setup | 1.1-1.3 | 15 min | $0 | 15 min / $0 |
| 2. Script | 2.1-2.3 | 30 min | $0 | 45 min / $0 |
| 3. Training | 3.1-3.3 | 2-3 hrs | ~$1.50 | 3-4 hrs / $1.50 |
| 4. Testing | 4.1-4.3 | 30 min | $0 | 3.5-4.5 hrs / $1.50 |
| 5. App | 5.1-5.3 | 1 hr | $0 | 4.5-5.5 hrs / $1.50 |
| 6. Publish | 6.1-6.3 | 30 min | $0 | 5-6 hrs / $1.50 |
**Total time:** ~5-6 hours end to end (training runs unattended for 2-3 of those)
**Total cost:** ~$1.50 (training only)
**Total budget used:** ~15% of the $10 budget ✅
---
## 🎯 Decision Points
At each phase, we'll make decisions based on results:
### After Phase 3 (Training):
**If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
**If training loss > 2.0:** ⚠️ Consider more epochs or a higher LR
**If eval loss >> train loss:** ❌ Overfitting; need more data or a lower rank
**If model didn't push to Hub:** ❌ Stop and fix the push_to_hub configuration
### After Phase 4 (Testing):
**If model generates tool calls correctly:** ✅ Proceed to app
**If model generates text but not tool calls:** ⚠️ Need more MCP-specific training data
**If model hallucinates tools:** ⚠️ Need more diverse tool schemas in the data
**If model refuses everything:** ⚠️ Too much safety data; need balance
### After Phase 5 (App):
**If app works end-to-end:** ✅ Publish and celebrate!
**If tools fail to execute:** ⚠️ Fix the tool implementations
**If model runs out of context:** ⚠️ Reduce max_iterations or use a sliding window
---
## 💡 What You'll Learn During Execution
### During Phase 1:
- How to set up a GPU environment
- How to validate data formats
- How model tokenizers work
### During Phase 2:
- How to write production training scripts
- How LoRA configuration works
- How SFTConfig parameters affect training
### During Phase 3:
- How to submit jobs to cloud GPUs
- How to monitor training in real-time
- How to read loss curves
- How Trackio dashboards work
### During Phase 4:
- How to load fine-tuned models
- How to test models systematically
- How to identify model weaknesses
### During Phase 5:
- How to build agent applications
- How the ReAct pattern works in practice
- How tool registries function
- How to deploy Gradio apps
### During Phase 6:
- How to write effective model cards
- How to share research with the community
---
## 🚨 Contingency Plans
### If Training Fails (OOM Error)
**Symptom:** "CUDA out of memory" error
**Fix:**
1. Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
2. Reduce max_seq_length from 2048 to 1024
3. If it still fails, confirm gradient checkpointing is on (it should already be enabled)
4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)
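In config terms, fixes 1 and 2 amount to the following (the sequence-length field is `max_seq_length` in older TRL releases and `max_length` in newer ones):
```python
training_args.per_device_train_batch_size = 2  # was 4
training_args.gradient_accumulation_steps = 4  # effective batch = 2 x 4 = 8
training_args.max_seq_length = 1024            # was 2048
```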
### If Training Is Too Slow
**Symptom:** Loss barely moving after 1 hour
**Fix:**
1. Check the learning rate; it might be too low
2. Increase warmup ratio from 0.1 to 0.2
3. Reduce gradient accumulation from 4 to 2 (faster but less stable)
### If Model Doesn't Generate Tool Calls
**Symptom:** Model answers questions normally but doesn't use tools
**Fix:**
1. Add more MCP-specific training data
2. Adjust system prompt to emphasize tool use
3. Use a higher temperature (0.9) to encourage creativity
4. Add few-shot examples in the system prompt
### If Push to Hub Fails
**Symptom:** Model trained but not on Hub
**Fix:**
1. Check HF token has write permissions
2. Manually upload: `trainer.push_to_hub()` after training
3. Save locally first: `trainer.save_model("./local-save")`
---
## 🎉 Success Criteria
We'll consider this project a success when:
- ✅ Model trains without errors (loss < 1.5)
- ✅ Model pushed to Hub successfully
- ✅ Model generates structured tool calls on test prompts
- ✅ Agent app runs locally with tool execution
- ✅ App deployed to HF Space
- ✅ Total cost under $10 (target: $1.50)
---
## 🚀 Ready?
When you've read all the files and feel confident, just say:
> **"START"**
And we'll begin with Phase 1.
---
*Learning ML by building real things, one step at a time.*