# 04 - Training Explained: LoRA, SFT & Hyperparameters
## 🎓 Why This Chapter Matters
This is where we answer: *"How do we actually teach the model to use tools?"*
By the end of this chapter, you'll understand:
- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working
---
## 🧠 Concept 1: Why Can't We Just Use the Base Model?
**Qwen3-1.7B** is already a great model. It can chat, answer questions, write code.
But it doesn't know how to use **tools** in a structured way.
### What Base Models Know
Base model Qwen3-1.7B:
- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests
### What Fine-Tuning Adds
After training on 15,694 tool-calling examples:
- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations
**Think of it like this:**
- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals
---
## 🧠 Concept 2: LoRA - The Magic of Cheap Fine-Tuning
### The Problem: Full Fine-Tuning Is Expensive
To fine-tune all ~1.7 billion parameters of Qwen3-1.7B:
| Component | Size | Why |
|-----------|------|-----|
| Model weights | 3.4 GB | 1.7B params × 2 bytes (fp16) |
| Gradients | 3.4 GB | Need gradients for every parameter |
| Optimizer states | 13.6 GB | Adam keeps 2 fp32 values per param |
| **Total** | **~20 GB** | **Doesn't fit on T4 (16GB)!** |
You'd need an **A100 GPU** (80GB) which costs **$3-4/hour**.
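The table above is a back-of-envelope estimate; it can be checked in a few lines (a rough sketch assuming fp16 weights/gradients and two fp32 Adam moment buffers per parameter — real runs also need memory for activations and framework overhead):

```python
# Rough memory estimate for full fine-tuning ~1.7B parameters with Adam.
params = 1.7e9

weights_gb   = params * 2 / 1e9      # fp16 weights: 2 bytes each
gradients_gb = params * 2 / 1e9      # fp16 gradients: 2 bytes each
adam_gb      = params * 2 * 4 / 1e9  # two fp32 moment buffers: 4 bytes each

total_gb = weights_gb + gradients_gb + adam_gb
print(f"{total_gb:.1f} GB needed vs 16 GB on a T4")  # 20.4 GB needed vs 16 GB on a T4
```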
### The Solution: LoRA (Low-Rank Adaptation)
Instead of updating ALL parameters, we add tiny matrices to each layer:
```
Original Layer (Frozen - Never Changes)
┌─────────────────────────────┐
│  W (2048 × 2048) = 4.2M     │  ← 4 MILLION parameters
│  parameters                 │    These stay FROZEN
└─────────────────────────────┘
              │
              │ input x
              ▼
         y = W × x
              │
              ▼
           output

LoRA Adapters (Trainable - These Learn)
┌─────────────────────┐      ┌─────────────────────┐
│  A (2048 × 16)      │─────▶│  B (16 × 2048)      │
│  = 32K params       │      │  = 32K params       │
│  (initialized       │      │  (initialized to 0) │
│   randomly)         │      │                     │
└─────────────────────┘      └─────────────────────┘
          │                            │
          ▼                            ▼
     h = A × x                    y' = B × h
                                     = B × (A × x)

Final Output:
   y = W × x  +  B × A × x
        ↑            ↑
     frozen       trained
```
**Math:**
- Original: W is 2048×2048 = 4,194,304 parameters
- LoRA: A is 2048×16 = 32,768, B is 16×2048 = 32,768
- Total LoRA: 65,536 parameters (1.6% of original!)
- Memory for training: ~5GB total (fits on T4!)
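The decomposition above can be sketched in a few lines of NumPy. This is a toy illustration of the LoRA forward pass, not the actual `peft` implementation (A is stored here as `(r, d)` so `A @ x` works; it counts the same parameters as the diagram's 2048 × 16):

```python
import numpy as np

# Toy LoRA forward pass: y = W·x + (alpha/r)·B·A·x
d, r, alpha = 2048, 16, 32
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init

x = rng.normal(size=d)
y = W @ x + (alpha / r) * (B @ (A @ x))

# Because B starts at zero, LoRA initially changes nothing:
# training begins from exactly the base model's behavior.
assert np.allclose(y, W @ x)

# Two thin matrices instead of one d×d matrix:
lora_params = A.size + B.size
print(lora_params, f"= {lora_params / W.size:.1%} of the frozen layer")  # 65536 = 1.6% ...
```

The zero-initialized B is why LoRA training is stable: at step 0 the adapters contribute nothing, and they only gradually learn a correction.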
### Why This Works
The idea: neural network weights often have **low-rank structure**.
Even though W is 2048ร—2048, the "important directions" of change can be
captured by much smaller matrices.
Think of it like adjusting a steering wheel:
- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)
### Our LoRA Configuration
```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # Rank: "resolution" of the adapter
    lora_alpha=32,                # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,            # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",                  # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",        # This is a causal language model
)
```
**r=16:** Think of this as the "resolution." Higher = more detail but more memory.
For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples)
**lora_alpha=32:** Scaling factor. Rule of thumb: 2× rank. Controls how much
the LoRA output contributes to the final result.
**target_modules="all-linear":** The "LoRA Without Regret" paper showed that
applying LoRA to ALL linear layers (not just attention projections) matches
full fine-tuning quality. This is our secret sauce.
---
## 🧠 Concept 3: SFT - Supervised Fine-Tuning
### What Is SFT?
SFT = **teaching by example.** We show the model:
```
Input: "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}
Input: "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."
Input: "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
```
The model learns to predict the output given the input.
### How SFT Works Step by Step
#### Step 1: Tokenize
Convert text → numbers:
```
"Find Python files"
       ↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]
```
Each number is an index in a vocabulary of ~150,000 tokens (Qwen3's tokenizer).
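To make this concrete, here is a toy word-level tokenizer. The real Qwen tokenizer uses byte-pair encoding over subwords (which is why 3 words can become 5 tokens), and both its vocabulary and the ids shown above are illustrative here:

```python
# Toy tokenizer: text → ids and back, using a hypothetical 3-word vocabulary.
vocab = {"Find": 4921, "Python": 12729, "files": 4367}

def encode(text: str) -> list[int]:
    # Whole-word lookup; real tokenizers split unknown words into subwords.
    return [vocab[word] for word in text.split()]

def decode(ids: list[int]) -> str:
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

ids = encode("Find Python files")
print(ids)  # [4921, 12729, 4367]
assert decode(ids) == "Find Python files"
```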
#### Step 2: Forward Pass
The model processes the tokenized input and predicts the next token at EACH position:
```
Input tokens:  [4921, 12729, 4367, 8921]
                                      │
Predictions:   [   ?,    ?,    ?,    ? ] ──▶ next token should be 1023
```
The model outputs a probability distribution over all ~150,000 possible tokens.
#### Step 3: Compute Loss (Cross-Entropy)
```
Predicted probabilities: [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                           ↑                       ↑
                         wrong               correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80
```
**Lower loss = better prediction.**
If the model predicted token 1023 with probability 0.45, loss is 0.80.
If it predicted with probability 0.99, loss is 0.01 (much better!).
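These numbers are just the negative natural log; a minimal check:

```python
import math

def token_loss(prob_of_correct: float) -> float:
    """Cross-entropy contribution of one position: -log p(correct token)."""
    return -math.log(prob_of_correct)

print(f"{token_loss(0.45):.2f}")  # 0.80
print(f"{token_loss(0.99):.2f}")  # 0.01
# For scale: pure uniform guessing over a ~150K-token vocabulary would give
print(f"{token_loss(1 / 150_000):.2f}")  # 11.92
```

A pretrained model never starts at the uniform-guessing loss, because it already models language well; fine-tuning pushes the loss down on the specific tool-calling format.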
#### Step 4: Backward Pass (Backpropagation)
Compute gradients: which direction to adjust weights to reduce loss.
```
For each LoRA parameter:
gradient = how much changing this parameter would change the loss
```
This is done automatically by PyTorch's autograd.
#### Step 5: Update Weights (Adam Optimizer)
```
new_weight = old_weight - learning_rate × gradient
```
Adam is smarter: it uses momentum and adaptive learning rates per parameter.
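A single Adam update for one scalar parameter looks like this (a simplified sketch of the same math `torch.optim.Adam` applies across whole tensors; the weight and gradient values are made up):

```python
def adam_step(w, grad, m, v, t, lr=2e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter w at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad         # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)  # adaptive per-parameter step
    return w, m, v

w, m, v = 0.5, 0.0, 0.0
w, m, v = adam_step(w, grad=2.0, m=m, v=v, t=1)
print(w)  # moved against the gradient by roughly one learning-rate step
```

The division by `v_hat ** 0.5` is what makes the step size adaptive: parameters with consistently large gradients get smaller effective steps.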
#### Step 6: Repeat
Do this for ALL examples in the dataset, then repeat for 3 epochs.
---
## 🧠 Concept 4: Hyperparameters - The Recipe
Think of training as cooking. Hyperparameters are your recipe.
### Learning Rate: 2e-4
**What it controls:** How big each weight update step is.
```
Learning Rate
     │
1e-2 │ ╳─── Too high: loss oscillates, model never settles
     │
2e-4 │ ●── Sweet spot for LoRA (10× higher than full fine-tuning)
     │  ╲
1e-5 │   ╲── Too low: barely moves, takes forever
     │    ╲
     └─────────────────
            Steps
```
**Why 2e-4 for LoRA?**
- Full fine-tuning typically uses 2e-5
- LoRA has ~100× fewer trainable parameters
- Each parameter update needs roughly 10× more impact
- So: 2e-5 × 10 = **2e-4**
### Batch Size: 4 × 4 = 16 Effective
**What it controls:** How many examples the model sees before updating weights.
**Without Gradient Accumulation:**
Process 4 examples → Compute gradients → Update weights → Next 4
**With Gradient Accumulation (what we do):**
Process 4 examples → Compute gradients → SAVE gradients (don't update)
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Now update weights (accumulated from 4 × 4 = 16 examples)
**Why gradient accumulation?**
- GPU can only fit 4 examples at once (memory limit)
- But effective batch of 16 gives more stable gradients
- It's a memory-saving trick
**Trade-off:** Each weight update now costs 4 forward/backward passes, but the averaged gradient is more stable.
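The accumulation pattern above, sketched in plain Python (the `fake_grad` function stands in for a real `loss.backward()`; the structure mirrors a PyTorch training loop):

```python
# Four micro-batches of 4 examples → one weight update over 16 examples.
micro_batches = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
accumulation_steps = 4

def fake_grad(batch):
    # Stand-in for a backward pass: pretend the gradient is the batch mean.
    return sum(batch) / len(batch)

accumulated = 0.0
applied_grads = []
for step, batch in enumerate(micro_batches, start=1):
    # Scale each micro-batch gradient so the sum is the 16-example average.
    accumulated += fake_grad(batch) / accumulation_steps
    if step % accumulation_steps == 0:
        applied_grads.append(accumulated)  # optimizer.step() would run here
        accumulated = 0.0                  # optimizer.zero_grad()

print(applied_grads)  # [8.5] — one update, averaged over all 16 examples
```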
### Epochs: 3
**What it controls:** How many times the model sees the entire dataset.
**Epoch 1:** Sees all 15,694 examples → learns basic patterns
**Epoch 2:** Sees all again → refines understanding
**Epoch 3:** Sees all again → final tuning
**Why 3?**
- 1 epoch: Underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Overfitting (memorizes training data, fails on new data)
### Warmup Ratio: 0.1 (10%)
**What it controls:** For the first 10% of training, learning rate starts at 0
and gradually ramps up to the full rate.
**Why warmup?**
- At the start, model knows NOTHING about tool-calling
- Large updates could push weights in random bad directions
- Warmup lets model "get its bearings" first
### Cosine LR Schedule
After warmup, learning rate follows a cosine curve:
```
Learning Rate
     │
2e-4 │    ╱──╲
     │   ╱    ╲
     │  ╱      ╲
     │ ╱        ╲
  0  │╱          ╲────
     └─────────────────
      warmup      end
```
**Why cosine?**
- High in the middle: aggressive learning when model has basic understanding
- Low at the end: fine-tuning details, settling into optimal weights
- Prevents overshooting at the end of training
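The warmup-then-cosine curve can be written as a small function (this mirrors the usual linear-warmup + cosine-decay shape; the step counts assume the ~2,940 total steps worked out in the Training Math section):

```python
import math

def lr_at(step, total_steps=2940, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)  # 294 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(f"{lr_at(100):.1e}")   # 6.8e-05: still warming up
print(f"{lr_at(294):.1e}")   # 2.0e-04: peak, warmup just ended
print(f"{lr_at(2940):.1e}")  # 0.0e+00: fully decayed
```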
### Max Sequence Length: 2048 tokens
**What it controls:** Maximum number of tokens per training example.
```
Example conversation:
System prompt: ~500 tokens
User message: ~100 tokens
Assistant reply: ~300 tokens
Total: ~900 tokens ← Fits in 2048 ✓
```
**Why 2048?**
- Covers all our examples (most are under 1000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models
### Gradient Checkpointing: ON
**What it does:** Saves memory by recomputing some values during backward pass.
```
Without checkpointing:
  Forward pass: store all intermediate activations → backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: store only SOME activations
  Backward pass: recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)
```
**Trade-off:** Slower (needs extra computation) but fits on T4.
---
## 📊 Reading Training Logs
### What You'll See
```
Step  100/2940: loss=2.847, learning_rate=6.8e-05
Step  200/2940: loss=2.654, learning_rate=1.4e-04
...
Step 1000/2940: loss=1.234, learning_rate=1.7e-04
...
Step 2940/2940: loss=0.876, learning_rate=0.0
```
### How to Interpret
| Observation | Meaning |
|-------------|---------|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ Overfitting - stop early |
| Loss stuck at ~3.0 | ❌ Not learning - check data/format |
| Loss drops fast then plateaus | ✅ Normal - model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting - model memorized training data |
### Target Numbers (for reference)
- **Initial loss:** ~2.5-3.5 (the pretrained model is still uncertain about the tool-call format)
- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
- **Eval loss:** Should be within 0.1-0.3 of train loss
---
## 🧮 Training Math
### How Long Does It Take?
```
Dataset:                15,694 examples
Batch size:             4 (per device)
Gradient accumulation:  4 steps
Effective batch:        4 × 4 = 16

Steps per epoch:        15,694 ÷ 16 ≈ 980 steps
Total steps (3 epochs): 980 × 3 = 2,940 steps

Time per step (T4):     ~2-3 seconds
Total time:             2,940 × 2.5s ≈ 7,350s ≈ 2 hours
```
### Cost Calculation
```
T4 GPU on HF Jobs: ~$0.60/hour
Training time: ~2 hours
Total cost: $0.60 × 2 = $1.20
```
Well under $10! ✅
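The arithmetic above can be double-checked in a few lines (time-per-step and GPU price are the rough estimates from this chapter, not measured values):

```python
# Recompute the training-time and cost estimates.
examples, per_device_batch, grad_accum, epochs = 15_694, 4, 4, 3
effective_batch = per_device_batch * grad_accum  # 16

steps_per_epoch = examples // effective_batch    # ~980
total_steps = steps_per_epoch * epochs           # ~2,940

seconds_per_step = 2.5                           # rough T4 estimate
hours = total_steps * seconds_per_step / 3600    # ~2 hours
cost = hours * 0.60                              # at ~$0.60/hour for a T4

print(total_steps, f"{hours:.1f} h", f"~${cost:.2f}")
```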
---
## 🎓 Summary: Key Training Concepts
| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
| **SFT** | Teaching model with input→output examples | Gives model tool-calling knowledge |
| **Loss** | Measure of how wrong predictions are | Lower = better learning |
| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
| **Gradient Accumulation** | Simulates a larger batch size | Memory-saving trick |
| **Epochs** | Times model sees full dataset | 3 is sweet spot |
| **Warmup** | Gradual LR increase at start | Prevents early instability |
| **Cosine Schedule** | LR high→low curve | Aggressive middle, gentle end |
| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |
---
## 🔜 Next Step
Read `05-dataset.md` to understand our training data: what we have, what's missing, and how to make it better.