MCP-Agent-1.7B / docs /06-execution-plan.md

Upload docs/06-execution-plan.md

1ff07c2 verified 10 days ago

preview code

raw

history blame contribute delete

10.4 kB

06 — Execution Plan: What We'll Do When You Say "START"

🚀 The Plan

When you say "START", here is the EXACT sequence of steps we'll follow. Each step has a clear goal, estimated time, and cost.

Phase 1: Setup & Validation (15 minutes)

Step 1.1: Create Training Sandbox

What: Set up a GPU sandbox with all dependencies installed
Why: Test that everything works before spending money on a real training job
Time: 5 minutes
Cost: $0

pip install transformers trl peft datasets accelerate bitsandbytes torch trackio

Step 1.2: Validate Dataset Format

What: Load your dataset and verify it works with SFTTrainer
Why: Catch format issues BEFORE training starts (saves hours of debugging)
Time: 5 minutes
Cost: $0

from datasets import load_dataset
dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0])  # Peek at first example

Step 1.3: Verify Model Compatibility

What: Load Qwen3-1.7B tokenizer and test chat template
Why: Make sure the model can process our messages format
Time: 5 minutes
Cost: $0

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template)  # Should not be None

Phase 2: Training Script Development (30 minutes)

Step 2.1: Write Training Script

What: Create train.py with full educational comments
Why: Every line documented so you learn as we build
Time: 15 minutes
Cost: $0

What the script contains:

LoRA configuration (r=16, all-linear, dropout=0.05)
SFTConfig with all hyperparameters documented
Trackio monitoring setup
push_to_hub configuration
Plain-text logging (no tqdm progress bars)

Step 2.2: Test Script in Sandbox

What: Run the script for 10 steps to catch errors
Why: Find bugs NOW before the expensive training job
Time: 10 minutes
Cost: $0 (sandbox GPU time)

# Run just 10 steps as a smoke test
training_args.max_steps = 10
trainer.train()

Step 2.3: Review & Fix Issues

What: Fix any import errors, API mismatches, or config issues
Why: Training jobs are expensive — we only launch when the script is solid
Time: 5 minutes
Cost: $0

Phase 3: Model Training (2-3 hours)

Step 3.1: Launch Training Job

What: Submit training to HF Jobs on T4 GPU
Why: T4 is cheapest GPU that fits our model (16GB VRAM)
Time: 2-3 hours (automated)
Cost: ~$1.20-1.80

Pre-flight check before launch:

✅ Dataset format validated
✅ Script tested in sandbox
✅ push_to_hub=True and hub_model_id set
✅ Timeout set to 4 hours (plenty of buffer)
✅ Trackio monitoring enabled
✅ disable_tqdm=True for clean logs

Step 3.2: Monitor Training

What: Watch loss curves via Trackio dashboard
Why: Make sure loss is going down (model is learning)
Time: Check every 15 minutes
Cost: $0 (just watching)

What to watch for:

Good:    Step 100: loss=2.5 → Step 500: loss=1.2 → Step 2450: loss=0.9
Warning: Step 100: loss=2.5 → Step 500: loss=2.4 → Step 1000: loss=2.3
  (Learning very slowly — might need more epochs or higher LR)
Bad:     Step 100: loss=2.5 → Step 500: loss=3.0 → Step 1000: loss=3.5
  (Loss going UP — stop immediately, something is wrong)

Step 3.3: Verify Model Pushed to Hub

What: Check that the model appears in your HF repo
Why: Job storage is ephemeral — if push_to_hub fails, model is LOST
Time: 5 minutes
Cost: $0

Check URL: https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B

Phase 4: Testing & Evaluation (30 minutes)

Step 4.1: Load Trained Model

What: Download the model from Hub and test inference
Why: Verify the model actually works after training
Time: 10 minutes
Cost: $0

from transformers import pipeline
pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")

Step 4.2: Run Test Prompts

What: Test the model on real tool-calling scenarios
Why: See if training actually worked
Time: 10 minutes
Cost: $0

Test cases:

Simple tool call: "Find all Python files"
Multi-step: "Clone a repo and find TODO comments"
Clarification: "Book a flight" (missing info)
Safety: "Delete all files" (should refuse)
MCP format: "Use the github_search tool to find ML repos"

Step 4.3: Document Results

What: Save test outputs and observations
Why: Track what works and what needs improvement
Time: 10 minutes
Cost: $0

Phase 5: Agent Harness App (1 hour)

Step 5.1: Write Agent App

What: Create app.py with Gradio UI + ReAct loop + tool registry
Why: Turn the model into an actual usable agent
Time: 30 minutes
Cost: $0

What the app contains:

Gradio chat interface
Agent mode toggle (on/off)
Tool registry with 7 built-in tools
ReAct loop (think → act → observe → repeat)
Tool execution log
Safety filters (block dangerous commands)

Step 5.2: Test Agent Locally

What: Run the app and test with real user queries
Why: Make sure the whole system works end-to-end
Time: 15 minutes
Cost: $0

Step 5.3: Deploy to HF Space

What: Upload app to a Gradio Space
Why: Share with the world!
Time: 15 minutes
Cost: $0 (Spaces free tier)

Phase 6: Documentation & Publication (30 minutes)

Step 6.1: Update Model README

What: Write a compelling README for the model card
Why: Model cards are how people discover and understand your model
Time: 15 minutes
Cost: $0

What to include:

What the model does
How it was trained
How to use it
Benchmarks/results
Limitations
Citation info

Step 6.2: Create Dataset Card

What: Document the training dataset
Why: Transparency is valued in the ML community
Time: 10 minutes
Cost: $0

Step 6.3: Share Results

What: Post on social media, share with community
Why: Get feedback, attract collaborators
Time: 5 minutes
Cost: $0

📅 Timeline Summary

Phase	Steps	Time	Cost	Cumulative
1. Setup	1.1-1.3	15 min	$0	15 min / $0
2. Script	2.1-2.3	30 min	$0	45 min / $0
3. Training	3.1-3.3	2-3 hrs	~$1.50	3-4 hrs / $1.50
4. Testing	4.1-4.3	30 min	$0	3.5-4.5 hrs / $1.50
5. App	5.1-5.3	1 hr	$0	4.5-5.5 hrs / $1.50
6. Publish	6.1-6.3	30 min	$0	5-6 hrs / $1.50

Total time: ~5-6 hours of active work
Total cost: ~$1.50 (training only)
Total budget used: ~15% of $10 budget ✅

🎯 Decision Points

At each phase, we'll make decisions based on results:

After Phase 3 (Training):

If training loss < 1.5 and eval loss < 1.8:** ✅ Proceed to testing
**If training loss > 2.0: ⚠️ Consider more epochs or higher LR
If eval loss >> train loss: ❌ Overfitting — need more data or lower rank
If model didn't push to Hub: ❌ Stop and fix push_to_hub configuration

After Phase 4 (Testing):

If model generates tool calls correctly: ✅ Proceed to app
If model generates text but not tool calls: ⚠️ Need more MCP-specific training data
If model hallucinates tools: ⚠️ Need more diverse tool schemas in data
If model refuses everything: ⚠️ Too much safety data — need balance

After Phase 5 (App):

If app works end-to-end: ✅ Publish and celebrate!
If tools fail to execute: ⚠️ Fix tool implementations
If model runs out of context: ⚠️ Reduce max_iterations or use sliding window

💡 What You'll Learn During Execution

During Phase 1:

How to set up a GPU environment
How to validate data formats
How model tokenizers work

During Phase 2:

How to write production training scripts
How LoRA configuration works
How SFTConfig parameters affect training

During Phase 3:

How to submit jobs to cloud GPUs
How to monitor training in real-time
How to read loss curves
How Trackio dashboards work

During Phase 4:

How to load fine-tuned models
How to test models systematically
How to identify model weaknesses

During Phase 5:

How to build agent applications
How the ReAct pattern works in practice
How tool registries function
How to deploy Gradio apps

During Phase 6:

How to write effective model cards
How to share research with the community

🚨 Contingency Plans

If Training Fails (OOM Error)

Symptom: "CUDA out of memory" error
Fix:

Reduce batch_size from 4 to 2 (keep accumulation at 4 → effective batch = 8)
Reduce max_seq_length from 2048 to 1024
If still fails, use gradient checkpointing (already enabled)
Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)

If Training Is Too Slow

Symptom: Loss barely moving after 1 hour
Fix:

Check learning rate — might be too low
Increase warmup ratio from 0.1 to 0.2
Reduce gradient accumulation from 4 to 2 (faster but less stable)

If Model Doesn't Generate Tool Calls

Symptom: Model answers questions normally but doesn't use tools
Fix:

Add more MCP-specific training data
Adjust system prompt to emphasize tool use
Use higher temperature (0.9) to encourage creativity
Add few-shot examples in the system prompt

If Push to Hub Fails

Symptom: Model trained but not on Hub
Fix:

Check HF token has write permissions
Manually upload: trainer.push_to_hub() after training
Save locally first: trainer.save_model("./local-save")

🎉 Success Criteria

We'll consider this project a success when:

✅ Model trains without errors (loss < 1.5)
✅ Model pushed to Hub successfully
✅ Model generates structured tool calls on test prompts
✅ Agent app runs locally with tool execution
✅ App deployed to HF Space
✅ Total cost under $10 (target: $1.50)

🚀 Ready?

When you've read all the files and feel confident, just say:

"START"

And we'll begin with Phase 1.

Learning ML by building real things — one step at a time.