# 04 — Training Explained: LoRA, SFT & Hyperparameters

## 🎓 Why This Chapter Matters

This is where we answer: *"How do we actually teach the model to use tools?"*

By the end of this chapter, you'll understand:

- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working

---

## 🧠 Concept 1: Why Can't We Just Use the Base Model?

**Qwen3-1.7B** is already a great model. It can chat, answer questions, write code. But it doesn't know how to use **tools** in a structured way.

### What Base Models Know

Base model Qwen3-1.7B:

- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests

### What Fine-Tuning Adds

After training on 15,694 tool-calling examples:

- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations

**Think of it like this:**

- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals

---

## 🧠 Concept 2: LoRA — The Magic of Cheap Fine-Tuning

### The Problem: Full Fine-Tuning Is Expensive

To fine-tune all ~2 billion parameters of Qwen3-1.7B:

| Component | Size | Why |
|-----------|------|-----|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | Need gradients for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 copies per param |
| **Total** | **24 GB** | **Doesn't fit on a T4 (16 GB)!** |

You'd need an **A100 GPU** (80 GB), which costs
**$3-4/hour**.

### The Solution: LoRA (Low-Rank Adaptation)

Instead of updating ALL parameters, we add tiny matrices to each layer:

```
Original Layer (Frozen — Never Changes)
┌──────────────────────────────┐
│ W (2048 × 2048) = 4.2M       │  ← 4 MILLION parameters
│ parameters                   │    These stay FROZEN
└──────────────────────────────┘
          │
  input x │
          ▼
      y = W × x
          │
          ▼
       output

LoRA Adapters (Trainable — These Learn)
┌─────────────────────┐     ┌─────────────────────┐
│ A (16 × 2048)       │────▶│ B (2048 × 16)       │
│ = 32K params        │     │ = 32K params        │
│ (initialized        │     │ (initialized to 0)  │
│  randomly)          │     │                     │
└─────────────────────┘     └─────────────────────┘
          │                           │
          ▼                           ▼
     h = A × x              y' = B × h = B × (A × x)

Final Output:  y = W × x + B × A × x
                   ↑           ↑
                 frozen      trained
```

**Math:**

- Original: W is 2048×2048 = 4,194,304 parameters
- LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768
- Total LoRA: 65,536 parameters (1.6% of the original!)
- Memory for training: ~5 GB total (fits on a T4!)

### Why This Works

The idea: neural network weights often have **low-rank structure**. Even though W is 2048×2048, the "important directions" of change can be captured by much smaller matrices.
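The parameter arithmetic above can be checked in a few lines. The layer size and rank are the illustrative values from the diagram, not numbers read out of the actual checkpoint:

```python
# Illustrative layer size (2048x2048) and LoRA rank (16) from the
# diagram above — assumptions, not measured from the real model.
d, r = 2048, 16

full_params = d * d              # frozen weight W
lora_params = r * d + d * r      # adapter A (r x d) plus adapter B (d x r)

print(full_params)                          # 4194304
print(lora_params)                          # 65536
print(f"{lora_params / full_params:.1%}")   # 1.6%
```

The same ratio holds for every linear layer LoRA is attached to, which is why the total trainable fraction of the model stays tiny.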
Think of it like adjusting a steering wheel:

- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)

### Our LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # Rank: "resolution" of the adapter
    lora_alpha=32,                # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,            # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",                  # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",        # This is a language model
)
```

**r=16:** Think of this as the "resolution." Higher = more detail but more memory. For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples.)

**lora_alpha=32:** Scaling factor. Rule of thumb: 2× the rank. Controls how much the LoRA output contributes to the final result.

**target_modules="all-linear":** The "LoRA Without Regret" paper showed that applying LoRA to ALL linear layers (not just attention projections) can match full fine-tuning quality. This is our secret sauce.

---

## 🧠 Concept 3: SFT — Supervised Fine-Tuning

### What Is SFT?

SFT = **teaching by example.** We show the model:

```
Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
```

The model learns to predict the output given the input.

### How SFT Works Step by Step

#### Step 1: Tokenize

Convert text → numbers:

```
"Find Python files"
        ↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]
```

Each number is an index into a vocabulary of ~100,000 tokens.
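A toy version of this lookup makes the idea concrete. The vocabulary and ids below are made up for illustration; a real tokenizer like Qwen's splits text into subword pieces rather than whole words:

```python
# Hypothetical word-level vocabulary — these ids are invented for
# illustration and are NOT real Qwen token ids.
vocab = {"Find": 4921, "Python": 12729, "files": 4367}

def toy_tokenize(text: str) -> list[int]:
    # Real tokenizers use subword units and handle unknown words;
    # this toy version just looks up whole words.
    return [vocab[word] for word in text.split()]

print(toy_tokenize("Find Python files"))  # [4921, 12729, 4367]
```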
#### Step 2: Forward Pass

The model processes the tokenized input and predicts the next token at EACH position:

```
Input tokens:  [4921, 12729, 4367, 8921]
                                     │
Predictions:   [?,    ?,     ?,    ? ] ──▶ next token should be 1023
```

The model outputs a probability distribution over all ~100,000 possible tokens.

#### Step 3: Compute Loss (Cross-Entropy)

```
Predicted probabilities: [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                           ↑                       ↑
                         wrong                 correct (1023)

Loss = -log(probability_of_correct_token) = -log(0.45) = 0.80
```

**Lower loss = better prediction.** If the model predicted token 1023 with probability 0.45, loss is 0.80. If it predicted with probability 0.99, loss is 0.01 (much better!).

#### Step 4: Backward Pass (Backpropagation)

Compute gradients: which direction to adjust weights to reduce loss.

```
For each LoRA parameter:
    gradient = how much changing this parameter would change the loss
```

This is done automatically by PyTorch's autograd.

#### Step 5: Update Weights (Adam Optimizer)

```
new_weight = old_weight - learning_rate × gradient
```

Adam is smarter — it uses momentum and adaptive learning rates per parameter.

#### Step 6: Repeat

Do this for ALL examples in the dataset, then repeat for 3 epochs.

---

## 🧠 Concept 4: Hyperparameters — The Recipe

Think of training as cooking. Hyperparameters are your recipe.

### Learning Rate: 2e-4

**What it controls:** How big each weight update step is.
```
Learning Rate
      │
1e-2  │ ╳─── Too high: loss oscillates, model never settles
      │
2e-4  │ ●── Sweet spot for LoRA (10× higher than full fine-tuning)
      │  ╲
1e-5  │   ╲── Too low: barely moves, takes forever
      │    ╲
      └───────────────── Steps
```

**Why 2e-4 for LoRA?**

- Full fine-tuning typically uses 2e-5
- LoRA has 100× fewer parameters
- Each parameter update needs ~10× more impact
- So: 2e-5 × 10 = **2e-4**

### Batch Size: 4 × 4 = 16 Effective

**What it controls:** How many examples the model sees before updating weights.

**Without Gradient Accumulation:**

```
Process 4 examples → Compute gradients → Update weights → Next 4
```

**With Gradient Accumulation (what we do):**

```
Process 4 examples → Compute gradients → SAVE gradients (don't update)
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Now update weights (accumulated from 4 × 4 = 16 examples)
```

**Why gradient accumulation?**

- GPU can only fit 4 examples at once (memory limit)
- But an effective batch of 16 gives more stable gradients
- It's a memory-saving trick

**Trade-off:** Each weight update takes 4× as many forward passes, but the accumulated gradients are more stable.

### Epochs: 3

**What it controls:** How many times the model sees the entire dataset.

- **Epoch 1:** Sees all 15,694 examples → learns basic patterns
- **Epoch 2:** Sees all again → refines understanding
- **Epoch 3:** Sees all again → final tuning

**Why 3?**

- 1 epoch: Underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Overfitting (memorizes training data, fails on new data)

### Warmup Ratio: 0.1 (10%)

**What it controls:** For the first 10% of training, the learning rate starts at 0 and gradually ramps up to the full rate.
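That ramp can be written as a tiny function of the step number. This is a sketch of linear warmup only, not the trainer's exact scheduler (the step count comes from the training-math section later in this chapter):

```python
def warmup_lr(step: int, peak_lr: float = 2e-4,
              total_steps: int = 2940, warmup_ratio: float = 0.1) -> float:
    """Linear warmup: 0 -> peak_lr over the first 10% of steps.

    Sketch only — after warmup the real schedule decays the rate,
    while this toy version just holds the peak for illustration.
    """
    warmup_steps = int(total_steps * warmup_ratio)  # 294 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

print(warmup_lr(0))    # 0.0
print(warmup_lr(147))  # 0.0001 (halfway through warmup)
print(warmup_lr(294))  # 0.0002 (full rate)
```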
**Why warmup?**

- At the start, the model knows NOTHING about tool-calling
- Large updates could push weights in random bad directions
- Warmup lets the model "get its bearings" first

### Cosine LR Schedule

After warmup, the learning rate follows a cosine curve:

```
Learning Rate
      │
2e-4  │    ╱──╲
      │   ╱    ╲
      │  ╱      ╲
      │ ╱        ╲
  0   │╱          ╲────
      └──────────────────
       warmup         end
```

**Why cosine?**

- High in the middle: aggressive learning once the model has a basic understanding
- Low at the end: fine-tuning details, settling into optimal weights
- Prevents overshooting at the end of training

### Max Sequence Length: 2048 tokens

**What it controls:** Maximum number of tokens per training example.

```
Example conversation:
  System prompt:    ~500 tokens
  User message:     ~100 tokens
  Assistant reply:  ~300 tokens
  Total:            ~900 tokens  ← Fits in 2048 ✓
```

**Why 2048?**

- Covers all our examples (most are under 1000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models

### Gradient Checkpointing: ON

**What it does:** Saves memory by recomputing some values during the backward pass.

```
Without checkpointing:
  Forward pass: Store all intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: Store only SOME activations
  Backward pass: Recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)
```

**Trade-off:** Slower (needs extra computation) but fits on a T4.

---

## 📊 Reading Training Logs

### What You'll See

```
Step  10/245: loss=2.847, learning_rate=2.0e-05
Step  20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05
```

### How to Interpret

| Observation | Meaning |
|-------------|---------|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ Overfitting — stop early |
| Loss stuck at ~3.0 | ❌ Not learning — check data/format |
| Loss drops fast then plateaus | ✅ Normal — model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting — model memorized training data |

### Target Numbers (for reference)

- **Initial loss:** ~2.5-3.5 (random guessing among many tokens)
- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
- **Eval loss:** Should be within 0.1-0.3 of train loss

---

## 🧮 Training Math

### How Long Does It Take?

```
Dataset:                15,694 examples
Batch size:             4 (per device)
Gradient accumulation:  4 steps
Effective batch:        4 × 4 = 16

Steps per epoch:        15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4):     ~2-3 seconds
Total time:             2,940 × 2.5s = ~7,350s = ~2 hours
```

### Cost Calculation

```
T4 GPU on HF Jobs:  ~$0.60/hour
Training time:      ~2 hours
Total cost:         $0.60 × 2 = $1.20
```

Well under $10! ✅
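The arithmetic above can be reproduced in a few lines. The per-step time and hourly rate are this chapter's rough estimates, not measured values:

```python
import math

# Rough estimates from this chapter — not measured values.
examples        = 15_694
effective_batch = 4 * 4      # per-device batch x gradient accumulation
epochs          = 3

steps_per_epoch = math.ceil(examples / effective_batch)
total_steps     = steps_per_epoch * epochs

seconds_per_step = 2.5       # midpoint of the ~2-3 s/step estimate
hours            = total_steps * seconds_per_step / 3600
cost             = hours * 0.60   # assumed ~$0.60/hour T4 rate

print(steps_per_epoch)            # 981
print(total_steps)                # 2943
print(f"{hours:.1f} h, ${cost:.2f}")
```

Small rounding differences aside (981 vs. the "~980" above), the conclusion is the same: roughly two hours and a little over a dollar.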
---

## 🎓 Summary: Key Training Concepts

| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (5 GB vs 24 GB) |
| **SFT** | Teaching model with input→output examples | Gives model tool-calling knowledge |
| **Loss** | Measure of how wrong predictions are | Lower = better learning |
| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
| **Gradient Accumulation** | Simulates larger batch sizes | Memory-saving trick |
| **Epochs** | Times model sees full dataset | 3 is the sweet spot |
| **Warmup** | Gradual LR increase at start | Prevents early instability |
| **Cosine Schedule** | LR high→low curve | Aggressive middle, gentle end |
| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |

---

## 🔜 Next Step

Read `05-dataset.md` to understand our training data — what we have, what's missing, and how to make it better.