04 — Training Explained: LoRA, SFT & Hyperparameters

🎓 Why This Chapter Matters

This is where we answer: "How do we actually teach the model to use tools?"

By the end of this chapter, you'll understand:

What LoRA is and why it's magical for budget training
What SFT does step-by-step
What each hyperparameter controls
How to read training logs and know if it's working

🧠 Concept 1: Why Can't We Just Use the Base Model?

Qwen3-1.7B is already a great model. It can chat, answer questions, write code. But it doesn't know how to use tools in a structured way.

What Base Models Know

Base model Qwen3-1.7B:

✅ Understands English, can chat
✅ Can write Python code
✅ Can answer questions about the world
❌ Doesn't know about your specific tool schemas
❌ Doesn't output tool calls in correct JSON-RPC format
❌ Doesn't plan multi-step tool chains
❌ Doesn't ask clarifying questions
❌ Doesn't refuse dangerous requests

What Fine-Tuning Adds

After training on 15,694 tool-calling examples:

✅ Understands tool schemas ("Here's what this tool needs")
✅ Generates correct JSON-RPC tool calls
✅ Plans multi-step sequences ("First A, then B using A's result")
✅ Asks when info is missing
✅ Refuses harmful operations

Think of it like this:

Base model = A smart person who knows how to talk but doesn't know your tools
Fine-tuned model = The same person after reading 15,000 instruction manuals

🧠 Concept 2: LoRA — The Magic of Cheap Fine-Tuning

The Problem: Full Fine-Tuning Is Expensive

To fine-tune all of Qwen3-1.7B's parameters (rounded to 2 billion for the arithmetic below):

Component	Size	Why
Model weights	4 GB	2B params × 2 bytes (fp16)
Gradients	4 GB	Need gradients for every parameter
Optimizer states	16 GB	Adam keeps 2 fp32 states per param (2 × 4 bytes × 2B)
Total	24 GB	Doesn't fit on T4 (16GB)!

And that is before counting activations. You'd need a larger GPU such as an A100 (80GB), which costs roughly $3-4/hour.
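The table's arithmetic can be reproduced in a few lines. This is a rough sketch under the same assumptions as the table (fp16 weights and gradients, two fp32 Adam states per parameter, 2B parameters, activations ignored); the function name is made up for illustration.

```python
def full_finetune_memory_gb(num_params: float) -> dict:
    """Estimate training memory in GB for full fine-tuning with Adam.

    Assumes fp16 weights and gradients (2 bytes each) and two fp32 Adam
    states (4 bytes each) per parameter. Activations are not counted.
    """
    gb = 1e9  # decimal gigabytes, as in the table
    weights = num_params * 2 / gb
    gradients = num_params * 2 / gb
    optimizer = num_params * 2 * 4 / gb  # momentum + variance, both fp32
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


budget = full_finetune_memory_gb(2e9)
for name, size in budget.items():
    print(f"{name:>10}: {size:.0f} GB")
```

Try swapping in your own parameter count to see how quickly the optimizer states dominate.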

The Solution: LoRA (Low-Rank Adaptation)

Instead of updating ALL parameters, we add tiny matrices to each layer:

Original Layer (Frozen — Never Changes)
┌─────────────────────────────┐
│  W (2048 × 2048) = 4.2M    │  ← 4 MILLION parameters
│  parameters                 │     These stay FROZEN
└─────────────────────────────┘
         │
         │ input x
         ▼
    y = W × x
         │
         ▼
    output

LoRA Adapters (Trainable — These Learn)
┌─────────────────────┐    ┌─────────────────────┐
│  A (16 × 2048)      │───▶│  B (2048 × 16)      │
│  = 32K params       │    │  = 32K params       │
│  (initialized       │    │  (initialized to 0) │
│   randomly)         │    │                     │
└─────────────────────┘    └─────────────────────┘
         │                         │
         ▼                         ▼
    h = A × x                  y' = B × h
                                   = B × (A × x)

Final Output:
y = W × x + B × A × x
    ↑           ↑
  frozen     trained

Math:

Original: W is 2048×2048 = 4,194,304 parameters
LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768
Total LoRA: 65,536 parameters (1.6% of original!)
Memory for training: ~5GB total (fits on T4!)
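The parameter count above is easy to check yourself. A minimal sketch, assuming a square 2048-wide layer and rank 16 as in the diagram; `lora_param_count` is a hypothetical helper, not a library function.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA pair: two thin factors of size `rank`."""
    return rank * d_in + d_out * rank


full = 2048 * 2048                            # frozen weight matrix W
lora = lora_param_count(2048, 2048, rank=16)  # A and B together
fraction = lora / full

print(f"full layer : {full:,}")
print(f"LoRA pair  : {lora:,}")
print(f"fraction   : {fraction:.1%}")
```

Doubling the rank doubles the adapter size, so r is a direct knob on trainable parameter count.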

Why This Works

The idea: the weight changes needed during fine-tuning often have low-rank structure. Even though W is 2048×2048, the "important directions" of change can be captured by much smaller matrices.

Think of it like adjusting a steering wheel:

Full fine-tuning = Rebuilding the entire car to turn better
LoRA = Adding a small steering adjustment module (tiny, cheap, effective)

Our LoRA Configuration

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                    # Rank: "resolution" of the adapter
    lora_alpha=32,          # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,      # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",             # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",   # This is a language model
)

r=16: Think of this as the "resolution." Higher = more detail but more memory. For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples)

lora_alpha=32: Scaling factor. Rule of thumb: 2× rank. The LoRA update is multiplied by lora_alpha / r (here 32 / 16 = 2), which controls how much the LoRA output contributes to the final result.

target_modules="all-linear": The "LoRA Without Regret" study reports that applying LoRA to ALL linear layers (not just attention projections) is what lets it match full fine-tuning quality in its experiments. This is our secret sauce.
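To see how r and lora_alpha interact, here is a toy forward pass in plain Python. It is a sketch, not how PEFT is implemented: the sizes are tiny (d=8, r=2, alpha=4) instead of the real 2048/16/32, A is taken as r × d and B as d × r so that B × (A × x) is well-defined, and dropout is left out.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). Only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes chosen for illustration
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

y_before = lora_forward(W, A, B, x, alpha, r)
print(y_before == matvec(W, x))  # True: with B = 0 the adapter changes nothing

B[0][0] = 1.0  # pretend training moved one adapter weight
y_after = lora_forward(W, A, B, x, alpha, r)
print(y_after[0] - y_before[0])  # equals (alpha / r) * (A x)[0]
```

Because B starts at zero, training begins from exactly the base model's behaviour, and the alpha / r factor sets how loudly the adapter speaks once B moves.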

🧠 Concept 3: SFT — Supervised Fine-Tuning

What Is SFT?

SFT = teaching by example. We show the model:

Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}

The model learns to predict the output given the input.

How SFT Works Step by Step

Step 1: Tokenize

Convert text → numbers:

"Find Python files"
↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]

Each number is an index into the tokenizer's vocabulary (roughly 150,000 entries for Qwen3). The IDs shown here are illustrative.
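To make the text → numbers step concrete, here is a toy sketch. It uses a made-up four-word vocabulary and splits on spaces; both are assumptions for illustration only. The real tokenizer works on subword pieces and produces different IDs.

```python
# Toy word-level tokenizer. TOY_VOCAB, encode() and decode() are hypothetical
# and not part of any real tokenizer library.
TOY_VOCAB = {"<unk>": 0, "Find": 1, "Python": 2, "files": 3}
ID_TO_WORD = {index: word for word, index in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to <unk> if unseen."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.split()]


def decode(ids: list[int]) -> str:
    """Map IDs back to words and join them with spaces."""
    return " ".join(ID_TO_WORD[i] for i in ids)


ids = encode("Find Python files")
print(ids)                        # [1, 2, 3]
print(decode(ids))                # Find Python files
print(encode("Find Rust files"))  # [1, 0, 3]  ("Rust" is not in the toy vocab)
```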

Step 2: Forward Pass

The model processes the tokenized input and predicts the next token at EACH position:

Input tokens:  [4921, 12729, 4367, 8921]
                                    │
Predictions:   [?,    ?,    ?,    ?  ] ──▶ next token should be 1023

At each position the model outputs a probability distribution over every token in the vocabulary (roughly 150,000 for Qwen3).
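Where does that probability distribution come from? The model produces one raw score (a logit) per vocabulary entry, and softmax turns those scores into probabilities. A minimal sketch with a made-up 5-token vocabulary:

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    biggest = max(logits)
    exps = [math.exp(z - biggest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for 5 tokens; a real model emits one score per vocabulary entry.
logits = [1.0, 2.0, 0.5, 3.0, -1.0]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print("most likely token index:", predicted)  # 3, the largest logit
```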

Step 3: Compute Loss (Cross-Entropy)

Predicted probabilities:  [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                              ↑                    ↑
                           wrong                 correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80

Lower loss = better prediction.

If the model predicted token 1023 with probability 0.45, loss is 0.80. If it predicted with probability 0.99, loss is 0.01 (much better!).
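You can check those two numbers directly. The sketch below also averages the loss over several positions, which is how a per-sequence loss is usually reported; the probabilities in `per_token` are invented for illustration.

```python
import math


def cross_entropy(prob_correct: float) -> float:
    """Loss for one position: negative log of the probability of the right token."""
    return -math.log(prob_correct)


def sequence_loss(probs_of_correct_tokens) -> float:
    """Average the per-position losses over a sequence."""
    losses = [cross_entropy(p) for p in probs_of_correct_tokens]
    return sum(losses) / len(losses)


print(f"{cross_entropy(0.45):.2f}")  # 0.80
print(f"{cross_entropy(0.99):.2f}")  # 0.01

per_token = [0.45, 0.99, 0.80, 0.60]  # made-up probabilities
print(f"{sequence_loss(per_token):.3f}")
```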

Step 4: Backward Pass (Backpropagation)

Compute gradients: which direction to adjust weights to reduce loss.

For each LoRA parameter:
  gradient = how much changing this parameter would change the loss

This is done automatically by PyTorch's autograd.

Step 5: Update Weights (Adam Optimizer)

new_weight = old_weight - learning_rate × gradient

That is the plain gradient-descent rule. Adam is smarter: it adds momentum and gives each parameter its own adaptive step size.
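Here is a single-parameter sketch of both update rules, so you can see what "adaptive" means. It follows the textbook Adam formulas with common default betas; the class name is made up, and a real run would use the optimizer from your training library instead.

```python
import math


def sgd_step(w: float, grad: float, lr: float) -> float:
    """Plain gradient descent: step size scales with the gradient."""
    return w - lr * grad


class AdamScalar:
    """Textbook Adam for one parameter (illustrative, not a library class)."""

    def __init__(self, lr=2e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients (momentum)
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # step counter for bias correction

    def step(self, w: float, grad: float) -> float:
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


w, big_grad = 1.0, 50.0
w_sgd = sgd_step(w, big_grad, lr=2e-4)
w_adam = AdamScalar(lr=2e-4).step(w, big_grad)

print(w - w_sgd)   # about 0.01: the large gradient causes a large step
print(w - w_adam)  # about 0.0002: the first Adam step is roughly lr in size
```

Dividing by the running gradient magnitude is what keeps Adam's steps at a similar scale even when raw gradients vary a lot between parameters.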

Step 6: Repeat

Do this for ALL examples in the dataset, then repeat for 3 epochs.

🧠 Concept 4: Hyperparameters — The Recipe

Think of training as cooking. Hyperparameters are your recipe.

Learning Rate: 2e-4

What it controls: How big each weight update step is.

Learning Rate
   │
1e-2│  ╳─── Too high: loss oscillates, model never settles
   │   │
2e-4│    ●── Sweet spot for LoRA (10× higher than full fine-tuning)
   │      ╲
1e-5│       ╲── Too low: barely moves, takes forever
   │         ╲
   └─────────────────
        Steps

Why 2e-4 for LoRA?

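Full fine-tuning typically uses around 2e-5
LoRA trains roughly 100× fewer parameters, and the adapter update starts at zero
In practice LoRA works best with a learning rate about 10× higher
So: 2e-5 × 10 = 2e-4 (a rule of thumb, not a derivation)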

Batch Size: 4 × 4 = 16 Effective

What it controls: How many examples the model sees before updating weights.

Without Gradient Accumulation:
Process 4 examples → Compute gradients → Update weights → Next 4

With Gradient Accumulation (what we do):
Process 4 examples → Compute gradients → SAVE gradients (don't update)
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Now update weights (accumulated from 4 × 4 = 16 examples)

Why gradient accumulation?

GPU can only fit 4 examples at once (memory limit)
But effective batch of 16 gives more stable gradients
It's a memory-saving trick

Trade-off: Each weight update now takes 4 forward/backward passes instead of 1, so updates come less often, but each one uses a more stable gradient.
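Why is accumulating 4 micro-batches of 4 the same as one batch of 16? A tiny sketch with a one-parameter model (y = w × x, mean squared error) shows the gradients match. The data and model are invented for illustration, and the equality relies on all micro-batches having the same size.

```python
def grad_mse(w: float, batch) -> float:
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(float(i), 3.0 * i) for i in range(1, 17)]  # 16 toy examples, true w = 3
w = 0.5

# One big batch of 16 (what we would do with enough memory).
full_batch_grad = grad_mse(w, data)

# Four micro-batches of 4, each scaled by 1 / accum_steps and summed.
micro_batch_size, accum_steps = 4, 4
accumulated = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro = data[start:start + micro_batch_size]
    accumulated += grad_mse(w, micro) / accum_steps

print(full_batch_grad, accumulated)  # both -467.5
```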

Epochs: 3

What it controls: How many times the model sees the entire dataset.

Epoch 1: Sees all 15,694 examples → learns basic patterns
Epoch 2: Sees all again → refines understanding
Epoch 3: Sees all again → final tuning

Why 3?

1 epoch: Underfitting (hasn't seen enough)
3 epochs: Sweet spot (learns patterns without memorizing)
10 epochs: Overfitting (memorizes training data, fails on new data)

Warmup Ratio: 0.1 (10%)

What it controls: For the first 10% of training, learning rate starts at 0 and gradually ramps up to the full rate.

Why warmup?

At the start, the LoRA adapters are untrained and the optimizer's running statistics are still noisy
Large updates could push weights in random bad directions
Warmup lets model "get its bearings" first

Cosine LR Schedule

After warmup, learning rate follows a cosine curve:

Learning Rate
   │
2e-4│    ╱──╲
   │   ╱    ╲
   │  ╱      ╲
   │ ╱        ╲
  0 │╱          ╲────
   └─────────────────
     warmup    end

Why cosine?

Highest right after warmup: aggressive learning once the model has its bearings
Low at the end: fine-tuning details, settling into optimal weights
Prevents overshooting at the end of training
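Warmup plus cosine decay fits in one small function. This sketch assumes linear warmup from 0, a single half-cosine down to 0, and the ~2,940 total steps estimated later in this chapter; your training library's scheduler may round the warmup length slightly differently.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 2940  # assumed total step count
for step in (0, 147, 294, 1617, 2940):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

The rate is 0 at step 0, reaches the 2e-4 peak when warmup ends at step 294, passes through half the peak at the midpoint of the decay, and returns to 0 at the final step.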

Max Sequence Length: 2048 tokens

What it controls: Maximum number of tokens per training example.

Example conversation:
  System prompt:  ~500 tokens
  User message:   ~100 tokens
  Assistant reply: ~300 tokens
  Total:          ~900 tokens  ← Fits in 2048 ✓

Why 2048?

Covers all our examples (most are under 1000 tokens)
Fits in T4 memory (longer sequences = more memory)
Standard for instruction-tuned models

Gradient Checkpointing: ON

What it does: Saves memory by recomputing some values during backward pass.

Without checkpointing:
  Forward pass: Store all intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: Store only SOME activations
  Backward pass: Recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)

Trade-off: Slower (needs extra computation) but fits on T4.

📊 Reading Training Logs

What You'll See

Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05

Note: the numbers above are illustrative. With the batch settings in this chapter, the step counter runs to roughly 2,940 (see Training Math below), not 245, and the learning rate peaks once warmup ends, about 10% of the way through training.

How to Interpret

Observation	Meaning
Loss going DOWN	✅ Model is learning
Loss going UP after going down	⚠️ Overfitting — stop early
Loss stuck at ~3.0	❌ Not learning — check data/format
Loss drops fast then plateaus	✅ Normal — model learned basics
Eval loss ≈ Train loss	✅ Good generalization
Eval loss >> Train loss	❌ Overfitting — model memorized training data

Target Numbers (for reference)

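Initial loss: ~2.5-3.5 (the base model already predicts text well but has not seen this tool-call format; pure random guessing over the whole vocabulary would give a loss above 11)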
Final loss: ~0.8-1.2 (decent learning on 16K examples)
Eval loss: Should be within 0.1-0.3 of train loss
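The eval-versus-train rule of thumb can be turned into a tiny check. The function name and the 0.3 default threshold are assumptions taken from the range quoted above, not a standard; treat the output as a hint, not a verdict.

```python
def diagnose(train_loss: float, eval_loss: float, max_gap: float = 0.3) -> str:
    """Flag a suspiciously large gap between eval loss and train loss."""
    gap = eval_loss - train_loss
    if gap > max_gap:
        return "possible overfitting"
    return "gap within expected range"


print(diagnose(train_loss=0.90, eval_loss=1.00))  # gap within expected range
print(diagnose(train_loss=0.50, eval_loss=1.40))  # possible overfitting
```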

🧮 Training Math

How Long Does It Take?

Dataset: 15,694 examples
Batch size: 4 (per device)
Gradient accumulation: 4 steps
Effective batch: 4 × 4 = 16

Steps per epoch: 15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4): ~2-3 seconds
Total time: 2,940 × 2.5s = ~7,350s = ~2 hours

Cost Calculation

T4 GPU on HF Jobs: ~$0.60/hour
Training time: ~2 hours
Total cost: $0.60 × 2 = $1.20

Well under $10! ✅
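The time and cost estimates can be reproduced in a few lines, which makes it easy to plug in your own numbers. The 2.5 s per step and $0.60/hour figures are the rough assumptions used above; rounding the steps per epoch up gives 981 rather than ~980.

```python
import math

examples = 15_694
per_device_batch = 4
grad_accum = 4
epochs = 3
seconds_per_step = 2.5  # assumed midpoint of the 2-3 s range
price_per_hour = 0.60   # assumed T4 rate quoted in the text

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(examples / effective_batch)  # last partial batch counts
total_steps = steps_per_epoch * epochs
hours = total_steps * seconds_per_step / 3600
cost = hours * price_per_hour

print(f"effective batch : {effective_batch}")
print(f"steps per epoch : {steps_per_epoch}")
print(f"total steps     : {total_steps}")
print(f"training time   : {hours:.2f} hours")
print(f"estimated cost  : ${cost:.2f}")
```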

🎓 Summary: Key Training Concepts

Concept	What It Is	Why It Matters
LoRA	Tiny trainable matrices added to frozen layers	Makes training affordable (5GB vs 24GB)
SFT	Teaching model with input→output examples	Gives model tool-calling knowledge
Loss	Measure of how wrong predictions are	Lower = better learning
Learning Rate	Size of weight updates	Too high = chaos, too low = slow
Batch Size	Examples per weight update	More = stable gradients, needs more memory
Gradient Accumulation	Fake larger batch sizes	Memory-saving trick
Epochs	Times model sees full dataset	3 is sweet spot
Warmup	Gradual LR increase at start	Prevents early instability
Cosine Schedule	LR high→low curve	Aggressive middle, gentle end
Gradient Checkpointing	Recompute activations	Saves ~40% memory

🔜 Next Step

Read 05-dataset.md to understand our training data — what we have, what's missing, and how to make it better.