muhammadtlha944
/

MCP-Agent-1.7B

muhammadtlha944 committed 12 days ago

Commit

40d8168

1 Parent(s): 678cba6

Upload docs/04-training.md

docs/04-training.md +435 -0

	@@ -0,0 +1,435 @@

+# 04 — Training Explained: LoRA, SFT & Hyperparameters
+## 🎓 Why This Chapter Matters
+This is where we answer: *"How do we actually teach the model to use tools?"*
+By the end of this chapter, you'll understand:
+- What LoRA is and why it's magical for budget training
+- What SFT does step-by-step
+- What each hyperparameter controls
+- How to read training logs and know if it's working
+---
+## 🧠 Concept 1: Why Can't We Just Use the Base Model?
+**Qwen3-1.7B** is already a great model. It can chat, answer questions, write code.
+But it doesn't know how to use **tools** in a structured way.
+### What Base Models Know
+Base model Qwen3-1.7B:
+- ✅ Understands English, can chat
+- ✅ Can write Python code
+- ✅ Can answer questions about the world
+- ❌ Doesn't know about your specific tool schemas
+- ❌ Doesn't output tool calls in correct JSON-RPC format
+- ❌ Doesn't plan multi-step tool chains
+- ❌ Doesn't ask clarifying questions
+- ❌ Doesn't refuse dangerous requests
+### What Fine-Tuning Adds
+After training on 15,694 tool-calling examples:
+- ✅ Understands tool schemas ("Here's what this tool needs")
+- ✅ Generates correct JSON-RPC tool calls
+- ✅ Plans multi-step sequences ("First A, then B using A's result")
+- ✅ Asks when info is missing
+- ✅ Refuses harmful operations
+**Think of it like this:**
+- Base model = A smart person who knows how to talk but doesn't know your tools
+- Fine-tuned model = The same person after reading 15,000 instruction manuals
+---
+## 🧠 Concept 2: LoRA — The Magic of Cheap Fine-Tuning
+### The Problem: Full Fine-Tuning Is Expensive
+Qwen3-1.7B has about 1.7 billion parameters; the estimate below rounds that up to 2 billion. To fine-tune all of them:
+| Component | Size | Why |
+|-----------|------|-----|
+| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
+| Gradients | 4 GB | One fp16 gradient per parameter |
+| Optimizer states | 16 GB | Adam keeps 2 fp32 values per param (2B × 2 × 4 bytes) |
+| **Total** | **24 GB** | **Doesn't fit on T4 (16GB)!** |
+That's before counting activations. You'd need a larger GPU such as an **A100** (40GB or 80GB), which costs roughly **$3-4/hour**.
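The table's arithmetic can be reproduced in a few lines. This is a rough sketch under the same assumptions as the table (2B parameters, fp16 weights and gradients, two fp32 Adam values per parameter, activations ignored); the function name is illustrative.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough memory estimate for full fine-tuning with Adam (activations ignored)."""
    gb = 1e9  # decimal gigabytes, matching the table
    weights = n_params * 2 / gb        # fp16: 2 bytes per parameter
    gradients = n_params * 2 / gb      # one fp16 gradient per parameter
    optimizer = n_params * 2 * 4 / gb  # Adam: 2 fp32 values (4 bytes each) per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }

estimate = full_finetune_memory_gb(2e9)
for name, size in estimate.items():
    print(f"{name:>10}: {size:.0f} GB")
```

Passing `1.7e9` instead of `2e9` gives a slightly smaller total, but it still exceeds a 16GB T4.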
+### The Solution: LoRA (Low-Rank Adaptation)
+Instead of updating ALL parameters, we add tiny matrices to each layer:
+```
+Original Layer (Frozen — Never Changes)
+┌─────────────────────────────┐
+│  W (2048 × 2048) = 4.2M    │  ← 4 MILLION parameters
+│  parameters                 │     These stay FROZEN
+└─────────────────────────────┘
+         │
+         │ input x
+         ▼
+    y = W × x
+         │
+         ▼
+    output
+LoRA Adapters (Trainable — These Learn)
+┌─────────────────────┐    ┌─────────────────────┐
+│  A (16 × 2048)      │───▶│  B (2048 × 16)      │
+│  = 32K params       │    │  = 32K params       │
+│  (initialized       │    │  (initialized to 0) │
+│   randomly)         │    │                     │
+└─────────────────────┘    └─────────────────────┘
+         │                         │
+         ▼                         ▼
+    h = A × x                  y' = B × h
+                                   = B × (A × x)
+Final Output:
+y = W × x + B × A × x
+    ↑           ↑
+  frozen     trained
+```
+**Math:**
+- Original: W is 2048×2048 = 4,194,304 parameters
+- LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768 (shapes chosen so that B × (A × x) is well-defined)
+- Total LoRA: 65,536 parameters (1.6% of original!)
+- Memory for training: ~5GB total (fits on T4!)
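The parameter count and the forward pass can be checked with plain Python. This is a minimal sketch, not the PEFT implementation: it stores A as 16 × 2048 and B as 2048 × 16 so that B × (A × x) is well-defined, and it uses a tiny 4-dimensional layer (an assumption for readability) to show that a zero-initialized B leaves the output unchanged at the start of training.

```python
import random

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

full = 2048 * 2048
adapter = lora_param_count(2048, 2048, r=16)
print(full, adapter, f"{100 * adapter / full:.1f}%")

# Tiny forward pass (d=4, r=2) to show the adapter starts as a no-op.
random.seed(0)
d, r = 4, 2
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                               # zero init
x = [1.0, 2.0, 3.0, 4.0]

base = matvec(W, x)
delta = matvec(B, matvec(A, x))  # B (A x): the full d x d product is never formed
y = [b + dl for b, dl in zip(base, delta)]
print(y == base)
```

Because B starts at zero, the fine-tuned model begins exactly where the base model is, and training only has to learn the difference.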
+### Why This Works
+The idea: the *change* that fine-tuning makes to a weight matrix often has **low-rank structure**.
+Even though W is 2048×2048, the "important directions" of change can be
+captured by much smaller matrices.
+Think of it like adjusting a steering wheel:
+- Full fine-tuning = Rebuilding the entire car to turn better
+- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)
+### Our LoRA Configuration
+```python
+from peft import LoraConfig
+peft_config = LoraConfig(
+    r=16,                    # Rank: "resolution" of the adapter
+    lora_alpha=32,          # Scaling: how strongly LoRA affects output
+    target_modules="all-linear",  # Apply to ALL linear layers
+    lora_dropout=0.05,      # Dropout: 5% random zeroing (prevents overfitting)
+    bias="none",             # Don't train bias terms (saves memory)
+    task_type="CAUSAL_LM",   # This is a language model
+)
+```
+**r=16:** Think of this as the "resolution." Higher = more detail but more memory.
+For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples)
+**lora_alpha=32:** Scaling factor. Rule of thumb: 2× rank. The adapter output is
+multiplied by lora_alpha / r (here 32 / 16 = 2), which controls how much it contributes to the final result.
+**target_modules="all-linear":** The "LoRA Without Regret" write-up reports that
+applying LoRA to ALL linear layers (not just attention projections) lets it match
+full fine-tuning quality in many fine-tuning setups. This is our secret sauce.
+---
+## 🧠 Concept 3: SFT — Supervised Fine-Tuning
+### What Is SFT?
+SFT = **teaching by example.** We show the model:
+```
+Input:  "Find all Python files"
+Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}
+Input:  "Delete all files"
+Output: "I cannot help with that. Deleting all files is dangerous..."
+Input:  "Clone the repo and find TODOs"
+Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
+```
+The model learns to predict the output given the input.
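One practical consequence of this format is that an output is either a JSON tool call or plain text such as a refusal. The sketch below shows one way to tell them apart, assuming the simplified `{"tool": ..., "arguments": ...}` shape used in the examples above; `parse_tool_call` is a hypothetical helper, not part of this project's code.

```python
import json

def parse_tool_call(output: str):
    """Return the tool call as a dict, or None if the output is plain text (e.g. a refusal)."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    # A usable call needs a tool name and an arguments object.
    if not isinstance(call.get("tool"), str) or not isinstance(call.get("arguments"), dict):
        return None
    return call

good = parse_tool_call('{"tool": "shell_exec", "arguments": {"command": "find . -name \'*.py\'"}}')
refusal = parse_tool_call("I cannot help with that. Deleting all files is dangerous...")
print(good["tool"], refusal)
```

A check like this is also handy when cleaning training data, since a malformed target teaches the model to produce malformed calls.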
+### How SFT Works Step by Step
+#### Step 1: Tokenize
+Convert text → numbers:
+```
+"Find Python files"
+↓ Tokenizer
+[4921, 12729, 4367, 8921, 1023]
+```
+Each number is an index in the model's vocabulary (roughly 150,000 tokens for Qwen3). The IDs shown here are illustrative.
+#### Step 2: Forward Pass
+The model processes the tokenized input and predicts the next token at EACH position:
+```
+Input tokens:  [4921, 12729, 4367, 8921]
+                                    │
+Predictions:   [?,    ?,    ?,    ?  ] ──▶ next token should be 1023
+```
+The model outputs a probability distribution over every token in the vocabulary (~150,000 entries).
+#### Step 3: Compute Loss (Cross-Entropy)
+```
+Predicted probabilities:  [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
+                              ↑                    ↑
+                           wrong                 correct (1023)
+Loss = -log(probability_of_correct_token)
+     = -log(0.45)
+     = 0.80
+```
+**Lower loss = better prediction.**
+If the model predicted token 1023 with probability 0.45, loss is 0.80.
+If it predicted with probability 0.99, loss is 0.01 (much better!).
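The two loss values above can be verified directly. This sketch uses the natural logarithm, as the numbers in this section imply, and adds a softmax over a toy 4-token vocabulary (made up for illustration) to show where the probabilities come from.

```python
import math

def cross_entropy(probs, correct_index):
    """Loss for one prediction: negative natural log of the correct token's probability."""
    return -math.log(probs[correct_index])

def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(round(cross_entropy([0.55, 0.45], 1), 2))  # 0.8
print(round(cross_entropy([0.01, 0.99], 1), 2))  # 0.01

# Toy 4-token vocabulary: the highest logit belongs to the correct token (index 2).
probs = softmax([1.0, 0.5, 3.0, -1.0])
print(round(cross_entropy(probs, 2), 3))
```

During training this loss is averaged over every predicted position in the batch.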
+#### Step 4: Backward Pass (Backpropagation)
+Compute gradients: which direction to adjust weights to reduce loss.
+```
+For each LoRA parameter:
+  gradient = how much changing this parameter would change the loss
+```
+This is done automatically by PyTorch's autograd.
+#### Step 5: Update Weights (Adam Optimizer)
+```
+new_weight = old_weight - learning_rate × gradient
+```
+That one-line rule is plain gradient descent. Adam builds on it with momentum and a separate adaptive step size for each parameter.
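To make the difference concrete, here is a sketch of both update rules for a single scalar parameter. The Adam defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used ones and are an assumption here, as is the toy function being minimized.

```python
import math

def sgd_step(w, grad, lr):
    """Plain gradient descent: the one-line rule shown above."""
    return w - lr * grad

def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_sgd, w_adam, state = 0.0, 0.0, (0.0, 0.0, 0)
for _ in range(200):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3), lr=0.1)
    w_adam, state = adam_step(w_adam, 2 * (w_adam - 3), state, lr=0.1)
print(round(w_sgd, 3), round(w_adam, 3))
```

Note that Adam's very first step has a size close to the learning rate regardless of how large the gradient is, which is one reason its learning rate behaves differently from plain gradient descent.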
+#### Step 6: Repeat
+Do this for ALL examples in the dataset, then repeat for 3 epochs.
+---
+## 🧠 Concept 4: Hyperparameters — The Recipe
+Think of training as cooking. Hyperparameters are your recipe.
+### Learning Rate: 2e-4
+**What it controls:** How big each weight update step is.
+```
+Learning Rate
+   │
+1e-2│  ╳─── Too high: loss oscillates, model never settles
+   │   │
+2e-4│    ●── Sweet spot for LoRA (10× higher than full fine-tuning)
+   │      ╲
+1e-5│       ╲── Too low: barely moves, takes forever
+   │         ╲
+   └─────────────────
+        Steps
+```
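+**Why 2e-4 for LoRA?**
+- Full fine-tuning typically uses around 2e-5
+- LoRA trains roughly 100× fewer parameters
+- In practice, LoRA works best with a learning rate about 10× higher than full fine-tuning (an empirical rule of thumb, not a derivation)
+- So: 2e-5 × 10 = **2e-4**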
+### Batch Size: 4 × 4 = 16 Effective
+**What it controls:** How many examples the model sees before updating weights.
+**Without Gradient Accumulation:**
+- Process 4 examples → Compute gradients → Update weights → Next 4
+**With Gradient Accumulation (what we do):**
+1. Process 4 examples → Compute gradients → SAVE gradients (don't update)
+2. Process 4 examples → Compute gradients → ADD to saved gradients
+3. Process 4 examples → Compute gradients → ADD to saved gradients
+4. Process 4 examples → Compute gradients → ADD to saved gradients
+5. Now update weights (accumulated from 4 × 4 = 16 examples)
+**Why gradient accumulation?**
+- GPU can only fit 4 examples at once (memory limit)
+- But effective batch of 16 gives more stable gradients
+- It's a memory-saving trick
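+**Trade-off:** Each weight update now takes 4 forward/backward passes instead of 1, so updates arrive less often, but the gradients match a true batch of 16 without the memory cost.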
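The claim that four micro-batches of 4 behave like one batch of 16 can be checked on a toy problem. This sketch assumes a one-parameter model `y = w * x` with mean squared error and made-up data; the key detail is that each micro-batch gradient is divided by the number of accumulation steps before being added.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# 16 toy examples following y = 3x; the model starts at w = 0.
data = [(float(x), 3.0 * x) for x in range(1, 17)]
w = 0.0

# One true batch of 16.
full_batch_grad = mse_grad(w, data)

# Four micro-batches of 4: accumulate scaled gradients, update once at the end.
accum_steps = 4
accumulated = 0.0
for i in range(accum_steps):
    micro_batch = data[i * 4:(i + 1) * 4]
    accumulated += mse_grad(w, micro_batch) / accum_steps

print(full_batch_grad, accumulated)
```

Both numbers come out the same, so the optimizer sees an identical gradient either way; only peak memory and wall-clock time per update differ.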
+### Epochs: 3
+**What it controls:** How many times the model sees the entire dataset.
+**Epoch 1:** Sees all 15,694 examples → learns basic patterns
+**Epoch 2:** Sees all again → refines understanding
+**Epoch 3:** Sees all again → final tuning
+**Why 3?**
+- 1 epoch: Underfitting (hasn't seen enough)
+- 3 epochs: Sweet spot (learns patterns without memorizing)
+- 10 epochs: Overfitting (memorizes training data, fails on new data)
+### Warmup Ratio: 0.1 (10%)
+**What it controls:** For the first 10% of training, learning rate starts at 0
+and gradually ramps up to the full rate.
+**Why warmup?**
+- At the start, the adapters are untrained and Adam's running statistics are still noisy
+- Large updates could push weights in random bad directions
+- Warmup lets the model "get its bearings" first
+### Cosine LR Schedule
+After warmup, learning rate follows a cosine curve:
+```
+Learning Rate
+   │
+2e-4│    ╱──╲
+   │   ╱    ╲
+   │  ╱      ╲
+   │ ╱        ╲
+  0 │╱          ╲────
+   └─────────────────
+     warmup    end
+```
+**Why cosine?**
+- Highest right after warmup: aggressive learning while there is the most left to learn
+- Low at the end: fine-tuning details, settling into optimal weights
+- Prevents overshooting at the end of training
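Warmup and cosine decay combine into one function of the step number. The sketch below assumes the common shape (linear warmup from 0, then cosine decay to 0) and roughly 2,940 total steps; the exact schedule inside a given training library may differ in rounding details.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 2940
for step in (0, 147, 294, 1617, 2940):
    print(step, f"{lr_at(step, total):.2e}")
```

The rate peaks at step 294 (the end of warmup), is back down to half the peak by step 1617, and reaches zero on the final step.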
+### Max Sequence Length: 2048 tokens
+**What it controls:** Maximum number of tokens per training example.
+```
+Example conversation:
+  System prompt:  ~500 tokens
+  User message:   ~100 tokens
+  Assistant reply: ~300 tokens
+  Total:          ~900 tokens  ← Fits in 2048 ✓
+```
+**Why 2048?**
+- Covers all our examples (most are under 1000 tokens)
+- Fits in T4 memory (longer sequences = more memory)
+- Standard for instruction-tuned models
+### Gradient Checkpointing: ON
+**What it does:** Saves memory by recomputing some values during backward pass.
+```
+Without checkpointing:
+  Forward pass: Store all intermediate activations → Backward pass uses them
+  Memory: 8 GB
+With checkpointing:
+  Forward pass: Store only SOME activations
+  Backward pass: Recompute missing ones on-the-fly
+  Memory: 5 GB (saves ~40%)
+```
+**Trade-off:** Slower (needs extra computation) but fits on T4.
+---
+## 📊 Reading Training Logs
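+### What You'll See
+The step counts and values below are illustrative. With this chapter's settings, a full run is roughly 2,940 steps rather than 245.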
+```
+Step 10/245: loss=2.847, learning_rate=2.0e-05
+Step 20/245: loss=2.654, learning_rate=4.0e-05
+...
+Step 100/245: loss=1.234, learning_rate=1.8e-04
+...
+Step 245/245: loss=0.876, learning_rate=1.2e-05
+```
+### How to Interpret
+| Observation | Meaning |
+|-------------|---------|
+| Loss going DOWN | ✅ Model is learning |
+| Eval loss going UP while train loss keeps falling | ⚠️ Overfitting — stop early |
+| Loss stuck at ~3.0 | ❌ Not learning — check data/format |
+| Loss drops fast then plateaus | ✅ Normal — model learned basics |
+| Eval loss ≈ Train loss | ✅ Good generalization |
+| Eval loss >> Train loss | ❌ Overfitting — model memorized training data |
+### Target Numbers (for reference)
+- **Initial loss:** ~2.5-3.5 (already far better than random: uniform guessing over a 100K+ token vocabulary would give a loss above 11)
+- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
+- **Eval loss:** Should be within 0.1-0.3 of train loss
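Two quick checks help put these targets in context: the loss of a purely random guesser, and the gap between train and eval loss. This is a sketch; the helper names are hypothetical and the loss values passed to `diagnose` are made-up examples, not results from this project.

```python
import math

def random_guess_loss(vocab_size: int) -> float:
    """Cross-entropy of a uniform guess over the vocabulary: ln(vocab_size)."""
    return math.log(vocab_size)

def diagnose(train_loss: float, eval_loss: float, max_gap: float = 0.3) -> str:
    """Compare final train and eval loss using the 0.3 gap from the targets above."""
    gap = eval_loss - train_loss
    if gap > max_gap:
        return f"possible overfitting (eval is {gap:.2f} above train)"
    return f"generalizing well (gap {gap:.2f})"

print(round(random_guess_loss(100_000), 1))  # 11.5
print(round(random_guess_loss(150_000), 1))  # 11.9
print(diagnose(train_loss=0.88, eval_loss=1.02))
print(diagnose(train_loss=0.60, eval_loss=1.40))
```

A starting loss near 3 is therefore a sign that the pretrained model already understands most of the text, and fine-tuning only has to close the remaining gap.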
+---
+## 🧮 Training Math
+### How Long Does It Take?
+```
+Dataset: 15,694 examples
+Batch size: 4 (per device)
+Gradient accumulation: 4 steps
+Effective batch: 4 × 4 = 16
+Steps per epoch: 15,694 ÷ 16 = ~980 steps
+Total steps (3 epochs): 980 × 3 = ~2,940 steps
+Time per step (T4): ~2-3 seconds
+Total time: 2,940 × 2.5s = ~7,350s = ~2 hours
+```
+### Cost Calculation
+```
+T4 GPU on HF Jobs: ~$0.60/hour
+Training time: ~2 hours
+Total cost: $0.60 × 2 = $1.20
+```
+Well under $10! ✅
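The same estimate as a small script, so the inputs are easy to change. The seconds-per-step and hourly price are the assumptions stated above; rounding the step count up and keeping the fractional hours gives a figure a few cents above the $1.20 shown.

```python
import math

examples = 15_694
per_device_batch = 4
grad_accum = 4
epochs = 3
seconds_per_step = 2.5   # midpoint of the 2-3 s estimate above
price_per_hour = 0.60    # T4 price assumed above

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
hours = total_steps * seconds_per_step / 3600
cost = hours * price_per_hour

print(effective_batch, steps_per_epoch, total_steps)
print(f"{hours:.2f} h, ${cost:.2f}")
```

If steps turn out slower on your hardware, scale `seconds_per_step` accordingly; even at 10 seconds per step the cost stays under $5.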
+---
+## 🎓 Summary: Key Training Concepts
+| Concept | What It Is | Why It Matters |
+|---------|-----------|----------------|
+| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
+| **SFT** | Teaching model with input→output examples | Gives model tool-calling knowledge |
+| **Loss** | Measure of how wrong predictions are | Lower = better learning |
+| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
+| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
+| **Gradient Accumulation** | Simulates a larger batch by summing gradients over several small ones | Memory-saving trick |
+| **Epochs** | Times model sees full dataset | 3 is sweet spot |
+| **Warmup** | Gradual LR increase at start | Prevents early instability |
+| **Cosine Schedule** | LR high→low curve | Aggressive right after warmup, gentle end |
+| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |
+---
+## 🔜 Next Step
+Read `05-dataset.md` to understand our training data — what we have, what's missing, and how to make it better.