MCP-Agent-1.7B / docs /06-execution-plan.md
muhammadtlha944's picture
Upload docs/06-execution-plan.md
1ff07c2 verified

06 β€” Execution Plan: What We'll Do When You Say "START"

πŸš€ The Plan

When you say "START", here is the EXACT sequence of steps we'll follow. Each step has a clear goal, estimated time, and cost.


Phase 1: Setup & Validation (15 minutes)

Step 1.1: Create Training Sandbox

What: Set up a GPU sandbox with all dependencies installed
Why: Test that everything works before spending money on a real training job
Time: 5 minutes
Cost: $0

pip install transformers trl peft datasets accelerate bitsandbytes torch trackio

Step 1.2: Validate Dataset Format

What: Load your dataset and verify it works with SFTTrainer
Why: Catch format issues BEFORE training starts (saves hours of debugging)
Time: 5 minutes
Cost: $0

from datasets import load_dataset
dataset = load_dataset("muhammadtlha944/mcp-agent-training-data")
print(dataset["train"][0])  # Peek at first example

Step 1.3: Verify Model Compatibility

What: Load Qwen3-1.7B tokenizer and test chat template
Why: Make sure the model can process our messages format
Time: 5 minutes
Cost: $0

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
print(tokenizer.chat_template)  # Should not be None

Phase 2: Training Script Development (30 minutes)

Step 2.1: Write Training Script

What: Create train.py with full educational comments
Why: Every line documented so you learn as we build
Time: 15 minutes
Cost: $0

What the script contains:

  • LoRA configuration (r=16, all-linear, dropout=0.05)
  • SFTConfig with all hyperparameters documented
  • Trackio monitoring setup
  • push_to_hub configuration
  • Plain-text logging (no tqdm progress bars)

Step 2.2: Test Script in Sandbox

What: Run the script for 10 steps to catch errors
Why: Find bugs NOW before the expensive training job
Time: 10 minutes
Cost: $0 (sandbox GPU time)

# Run just 10 steps as a smoke test
training_args.max_steps = 10
trainer.train()

Step 2.3: Review & Fix Issues

What: Fix any import errors, API mismatches, or config issues
Why: Training jobs are expensive β€” we only launch when the script is solid
Time: 5 minutes
Cost: $0


Phase 3: Model Training (2-3 hours)

Step 3.1: Launch Training Job

What: Submit training to HF Jobs on T4 GPU
Why: T4 is cheapest GPU that fits our model (16GB VRAM)
Time: 2-3 hours (automated)
Cost: ~$1.20-1.80

Pre-flight check before launch:

  • βœ… Dataset format validated
  • βœ… Script tested in sandbox
  • βœ… push_to_hub=True and hub_model_id set
  • βœ… Timeout set to 4 hours (plenty of buffer)
  • βœ… Trackio monitoring enabled
  • βœ… disable_tqdm=True for clean logs

Step 3.2: Monitor Training

What: Watch loss curves via Trackio dashboard
Why: Make sure loss is going down (model is learning)
Time: Check every 15 minutes
Cost: $0 (just watching)

What to watch for:

Good:    Step 100: loss=2.5 β†’ Step 500: loss=1.2 β†’ Step 2450: loss=0.9
Warning: Step 100: loss=2.5 β†’ Step 500: loss=2.4 β†’ Step 1000: loss=2.3
  (Learning very slowly β€” might need more epochs or higher LR)
Bad:     Step 100: loss=2.5 β†’ Step 500: loss=3.0 β†’ Step 1000: loss=3.5
  (Loss going UP β€” stop immediately, something is wrong)

Step 3.3: Verify Model Pushed to Hub

What: Check that the model appears in your HF repo
Why: Job storage is ephemeral β€” if push_to_hub fails, model is LOST
Time: 5 minutes
Cost: $0

Check URL: https://huggingface.co/muhammadtlha944/MCP-Agent-1.7B


Phase 4: Testing & Evaluation (30 minutes)

Step 4.1: Load Trained Model

What: Download the model from Hub and test inference
Why: Verify the model actually works after training
Time: 10 minutes
Cost: $0

from transformers import pipeline
pipe = pipeline("text-generation", model="muhammadtlha944/MCP-Agent-1.7B")

Step 4.2: Run Test Prompts

What: Test the model on real tool-calling scenarios
Why: See if training actually worked
Time: 10 minutes
Cost: $0

Test cases:

  1. Simple tool call: "Find all Python files"
  2. Multi-step: "Clone a repo and find TODO comments"
  3. Clarification: "Book a flight" (missing info)
  4. Safety: "Delete all files" (should refuse)
  5. MCP format: "Use the github_search tool to find ML repos"

Step 4.3: Document Results

What: Save test outputs and observations
Why: Track what works and what needs improvement
Time: 10 minutes
Cost: $0


Phase 5: Agent Harness App (1 hour)

Step 5.1: Write Agent App

What: Create app.py with Gradio UI + ReAct loop + tool registry
Why: Turn the model into an actual usable agent
Time: 30 minutes
Cost: $0

What the app contains:

  • Gradio chat interface
  • Agent mode toggle (on/off)
  • Tool registry with 7 built-in tools
  • ReAct loop (think β†’ act β†’ observe β†’ repeat)
  • Tool execution log
  • Safety filters (block dangerous commands)

Step 5.2: Test Agent Locally

What: Run the app and test with real user queries
Why: Make sure the whole system works end-to-end
Time: 15 minutes
Cost: $0

Step 5.3: Deploy to HF Space

What: Upload app to a Gradio Space
Why: Share with the world!
Time: 15 minutes
Cost: $0 (Spaces free tier)


Phase 6: Documentation & Publication (30 minutes)

Step 6.1: Update Model README

What: Write a compelling README for the model card
Why: Model cards are how people discover and understand your model
Time: 15 minutes
Cost: $0

What to include:

  • What the model does
  • How it was trained
  • How to use it
  • Benchmarks/results
  • Limitations
  • Citation info

Step 6.2: Create Dataset Card

What: Document the training dataset
Why: Transparency is valued in the ML community
Time: 10 minutes
Cost: $0

Step 6.3: Share Results

What: Post on social media, share with community
Why: Get feedback, attract collaborators
Time: 5 minutes
Cost: $0


πŸ“… Timeline Summary

Phase Steps Time Cost Cumulative
1. Setup 1.1-1.3 15 min $0 15 min / $0
2. Script 2.1-2.3 30 min $0 45 min / $0
3. Training 3.1-3.3 2-3 hrs ~$1.50 3-4 hrs / $1.50
4. Testing 4.1-4.3 30 min $0 3.5-4.5 hrs / $1.50
5. App 5.1-5.3 1 hr $0 4.5-5.5 hrs / $1.50
6. Publish 6.1-6.3 30 min $0 5-6 hrs / $1.50

Total time: ~5-6 hours of active work
Total cost: ~$1.50 (training only)
Total budget used: ~15% of $10 budget βœ…


🎯 Decision Points

At each phase, we'll make decisions based on results:

After Phase 3 (Training):

If training loss < 1.5 and eval loss < 1.8:** βœ… Proceed to testing
**If training loss > 2.0:
⚠️ Consider more epochs or higher LR
If eval loss >> train loss: ❌ Overfitting β€” need more data or lower rank
If model didn't push to Hub: ❌ Stop and fix push_to_hub configuration

After Phase 4 (Testing):

If model generates tool calls correctly: βœ… Proceed to app
If model generates text but not tool calls: ⚠️ Need more MCP-specific training data
If model hallucinates tools: ⚠️ Need more diverse tool schemas in data
If model refuses everything: ⚠️ Too much safety data β€” need balance

After Phase 5 (App):

If app works end-to-end: βœ… Publish and celebrate!
If tools fail to execute: ⚠️ Fix tool implementations
If model runs out of context: ⚠️ Reduce max_iterations or use sliding window


πŸ’‘ What You'll Learn During Execution

During Phase 1:

  • How to set up a GPU environment
  • How to validate data formats
  • How model tokenizers work

During Phase 2:

  • How to write production training scripts
  • How LoRA configuration works
  • How SFTConfig parameters affect training

During Phase 3:

  • How to submit jobs to cloud GPUs
  • How to monitor training in real-time
  • How to read loss curves
  • How Trackio dashboards work

During Phase 4:

  • How to load fine-tuned models
  • How to test models systematically
  • How to identify model weaknesses

During Phase 5:

  • How to build agent applications
  • How the ReAct pattern works in practice
  • How tool registries function
  • How to deploy Gradio apps

During Phase 6:

  • How to write effective model cards
  • How to share research with the community

🚨 Contingency Plans

If Training Fails (OOM Error)

Symptom: "CUDA out of memory" error
Fix:

  1. Reduce batch_size from 4 to 2 (keep accumulation at 4 β†’ effective batch = 8)
  2. Reduce max_seq_length from 2048 to 1024
  3. If still fails, use gradient checkpointing (already enabled)
  4. Last resort: upgrade to a10g-small (24GB VRAM, ~$1.20/hr)

If Training Is Too Slow

Symptom: Loss barely moving after 1 hour
Fix:

  1. Check learning rate β€” might be too low
  2. Increase warmup ratio from 0.1 to 0.2
  3. Reduce gradient accumulation from 4 to 2 (faster but less stable)

If Model Doesn't Generate Tool Calls

Symptom: Model answers questions normally but doesn't use tools
Fix:

  1. Add more MCP-specific training data
  2. Adjust system prompt to emphasize tool use
  3. Use higher temperature (0.9) to encourage creativity
  4. Add few-shot examples in the system prompt

If Push to Hub Fails

Symptom: Model trained but not on Hub
Fix:

  1. Check HF token has write permissions
  2. Manually upload: trainer.push_to_hub() after training
  3. Save locally first: trainer.save_model("./local-save")

πŸŽ‰ Success Criteria

We'll consider this project a success when:

  • βœ… Model trains without errors (loss < 1.5)
  • βœ… Model pushed to Hub successfully
  • βœ… Model generates structured tool calls on test prompts
  • βœ… Agent app runs locally with tool execution
  • βœ… App deployed to HF Space
  • βœ… Total cost under $10 (target: $1.50)

πŸš€ Ready?

When you've read all the files and feel confident, just say:

"START"

And we'll begin with Phase 1.


Learning ML by building real things β€” one step at a time.