# 04 - Training Explained: LoRA, SFT & Hyperparameters


## Why This Chapter Matters


This is where we answer: *"How do we actually teach the model to use tools?"*


By the end of this chapter, you'll understand:
- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working


---

## 🧠 Concept 1: Why Can't We Just Use the Base Model?


**Qwen3-1.7B** is already a great model. It can chat, answer questions, write code.
But it doesn't know how to use **tools** in a structured way.


### What Base Models Know


Base model Qwen3-1.7B:
- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests


### What Fine-Tuning Adds


After training on 15,694 tool-calling examples:
- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations


**Think of it like this:**
- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals


---

## 🧠 Concept 2: LoRA - The Magic of Cheap Fine-Tuning


### The Problem: Full Fine-Tuning Is Expensive


To fine-tune all ~2 billion parameters of Qwen3-1.7B (1.7B, rounded up for the estimates below):


| Component | Size | Why |
|-----------|------|-----|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | Need gradients for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 copies per param |
| **Total** | **24 GB** | **Doesn't fit on a T4 (16 GB)!** |


You'd need an **A100 GPU** (80 GB), which costs **$3-4/hour**.


### The Solution: LoRA (Low-Rank Adaptation)


Instead of updating ALL parameters, we add tiny trainable matrices to each layer:

```
Original Layer (Frozen - Never Changes)
┌─────────────────────────────┐
│  W (2048 × 2048) = 4.2M     │  ← 4 MILLION parameters
│  parameters                 │     These stay FROZEN
└─────────────────────────────┘
         │
         │ input x
         ▼
     y = W × x
         │
         ▼
      output

LoRA Adapters (Trainable - These Learn)
┌─────────────────────┐     ┌─────────────────────┐
│  A (2048 × 16)      │────▶│  B (16 × 2048)      │
│  = 32K params       │     │  = 32K params       │
│  (initialized       │     │  (initialized to 0) │
│   randomly)         │     │                     │
└─────────────────────┘     └─────────────────────┘
         │                           │
         ▼                           ▼
    h = A × x                   y' = B × h
                                   = B × (A × x)

Final Output:
    y = W × x  +  B × A × x
          │            │
        frozen       trained
```


**Math:**
- Original: W is 2048 × 2048 = 4,194,304 parameters
- LoRA: A is 2048 × 16 = 32,768, B is 16 × 2048 = 32,768
- Total LoRA: 65,536 parameters (1.6% of the original!)
- Memory for training: ~5 GB total (fits on a T4!)
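
The math above, and the zero-initialization trick, can be checked in a tiny NumPy sketch. (Note: the shapes here follow the usual LoRA convention, A: r×d and B: d×r, so the matrix-vector products line up.)

```python
import numpy as np

d, r = 2048, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base weight: 4,194,304 params
A = rng.normal(size=(r, d)) * 0.01   # adapter A: random init, 32,768 params
B = np.zeros((d, r))                 # adapter B: zero init, 32,768 params
x = rng.normal(size=d)               # an input vector

y = W @ x + B @ (A @ x)              # final output: y = Wx + BAx

# Because B starts at all zeros, LoRA changes nothing at step 0 -
# training begins exactly from the base model's behavior:
assert np.allclose(y, W @ x)

print(A.size + B.size)               # 65536 trainable parameters
```

Starting B at zero is what makes LoRA safe to bolt on: the adapted model is identical to the base model until training moves B away from zero.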


### Why This Works


The key idea: the *changes* fine-tuning makes to neural network weights often have
**low-rank structure**. Even though W is 2048 × 2048, the "important directions"
of change can be captured by much smaller matrices.


Think of it like adjusting a steering wheel:
- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)


### Our LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # Rank: "resolution" of the adapter
    lora_alpha=32,                # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,            # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",                  # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",        # This is a causal language model
)
```


**r=16:** Think of this as the "resolution." Higher = more detail but more memory.
For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples.)


**lora_alpha=32:** Scaling factor. Rule of thumb: 2× the rank. Controls how much
the LoRA output contributes to the final result.

**target_modules="all-linear":** The "LoRA Without Regret" paper showed that
applying LoRA to ALL linear layers (not just the attention projections) can match
full fine-tuning quality. This is our secret sauce.


---


## 🧠 Concept 3: SFT (Supervised Fine-Tuning)


### What Is SFT?


SFT = **teaching by example.** We show the model:


```
Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
```


The model learns to predict the output given the input.


### How SFT Works Step by Step


#### Step 1: Tokenize


Convert text → numbers:


```
"Find Python files"
      → Tokenizer
[4921, 12729, 4367, 8921, 1023]
```


Each number is an index into a vocabulary of roughly 150,000 tokens (for Qwen3).
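
A toy, word-level version of this mapping (real tokenizers learn subword pieces, and the vocabulary and IDs below are invented for illustration):

```python
# Hypothetical mini-vocabulary; a real model maps ~150K learned subwords to IDs.
vocab = {"Find": 4921, "Python": 12729, "files": 4367}

def tokenize(text: str) -> list[int]:
    # Word-level lookup; real tokenizers split unknown words into subwords.
    return [vocab[word] for word in text.split()]

print(tokenize("Find Python files"))   # [4921, 12729, 4367]
```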


#### Step 2: Forward Pass


The model processes the tokenized input and predicts the next token at EACH position:


```
Input tokens:  [4921, 12729, 4367, 8921]
                                     │
Predictions:   [  ?,     ?,    ?,    ?]  ──▶ next token should be 1023
```


The model outputs a probability distribution over all ~150,000 possible tokens.


#### Step 3: Compute Loss (Cross-Entropy)


```
Predicted probabilities: [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                                  │                 │
                                wrong            correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80
```


**Lower loss = better prediction.**


If the model predicted token 1023 with probability 0.45, the loss is 0.80.
If it predicted it with probability 0.99, the loss is 0.01 (much better!).
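
You can verify those numbers directly; this sketch computes the cross-entropy at one position from the probability assigned to the correct token:

```python
import math

def token_loss(p_correct: float) -> float:
    # Cross-entropy at one position = -log(probability of the true token)
    return -math.log(p_correct)

print(round(token_loss(0.45), 2))   # 0.8  - a mediocre prediction
print(round(token_loss(0.99), 2))   # 0.01 - near-certain and correct
```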


#### Step 4: Backward Pass (Backpropagation)


Compute gradients: which direction to adjust each weight to reduce the loss.


```
For each LoRA parameter:
    gradient = how much changing this parameter would change the loss
```


This is done automatically by PyTorch's autograd.


#### Step 5: Update Weights (Adam Optimizer)


```
new_weight = old_weight - learning_rate × gradient
```


That's plain gradient descent; Adam is smarter - it adds momentum and adaptive learning rates per parameter.


#### Step 6: Repeat


Do this for ALL examples in the dataset, then repeat for 3 epochs.
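
The whole loop (Steps 2-6) can be seen on a toy one-parameter model. This sketch learns y = 3x with plain SGD; Adam layers momentum and per-parameter scaling on top of the same skeleton:

```python
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]  # toy dataset: y = 3x
w, lr = 0.0, 0.02                              # one weight, small learning rate

for epoch in range(3):                # Step 6: repeat over the dataset
    for x, y in data:
        pred = w * x                  # Step 2: forward pass
        loss = (pred - y) ** 2        # Step 3: loss (squared error here)
        grad = 2 * (pred - y) * x     # Step 4: backward pass (d loss / d w)
        w -= lr * grad                # Step 5: update the weight

print(round(w, 2))                    # close to 3.0 after 3 epochs
```

Same skeleton, different scale: real SFT swaps the toy model for a transformer, squared error for cross-entropy, and SGD for Adam.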


---


## 🧠 Concept 4: Hyperparameters - The Recipe


Think of training as cooking. Hyperparameters are your recipe.

### Learning Rate: 2e-4


**What it controls:** How big each weight update step is.


```
Learning Rate
    │
1e-2┤ ╳─── Too high: loss oscillates, model never settles
    │ │
2e-4┤ ─── Sweet spot for LoRA (10× higher than full fine-tuning)
    │   ╲
1e-5┤ ╲── Too low: barely moves, takes forever
    │   ╲
    └─────────────────
          Steps
```


**Why 2e-4 for LoRA?**
- Full fine-tuning typically uses 2e-5
- LoRA has ~100× fewer parameters
- Each parameter update needs roughly 10× more impact
- So: 2e-5 × 10 = **2e-4**

### Batch Size: 4 × 4 = 16 Effective


**What it controls:** How many examples the model sees before updating weights.


**Without Gradient Accumulation:**
Process 4 examples → Compute gradients → Update weights → Next 4


**With Gradient Accumulation (what we do):**
Process 4 examples → Compute gradients → SAVE gradients (don't update)
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Now update weights (accumulated from 4 × 4 = 16 examples)


**Why gradient accumulation?**
- The GPU can only fit 4 examples at once (memory limit)
- But an effective batch of 16 gives more stable gradients
- It's a memory-saving trick


**Trade-off:** Each weight update takes 4× more forward passes, but the gradients are more stable and quality is better.
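
The accumulation pattern above, sketched with gradients as plain numbers (in real training these are tensors, and the update is an optimizer step):

```python
accum_steps = 4
saved = 0.0       # running sum of gradients
updates = 0

# Pretend each micro-batch of 4 examples produced this gradient value:
micro_batch_grads = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.2, 1.0]

for step, g in enumerate(micro_batch_grads, start=1):
    saved += g                        # ADD to saved gradients (no update yet)
    if step % accum_steps == 0:
        avg = saved / accum_steps     # gradient averaged over 16 examples
        updates += 1                  # ...apply ONE weight update here
        saved = 0.0

print(updates)                        # 2 updates for 8 micro-batches
```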


### Epochs: 3


**What it controls:** How many times the model sees the entire dataset.


**Epoch 1:** Sees all 15,694 examples → learns basic patterns
**Epoch 2:** Sees all again → refines understanding
**Epoch 3:** Sees all again → final tuning


**Why 3?**
- 1 epoch: Underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Overfitting (memorizes training data, fails on new data)


### Warmup Ratio: 0.1 (10%)


**What it controls:** For the first 10% of training steps, the learning rate starts at 0
and gradually ramps up to the full rate.


**Why warmup?**
- At the start, the model knows NOTHING about tool-calling
- Large updates could push weights in random bad directions
- Warmup lets the model "get its bearings" first

### Cosine LR Schedule


After warmup, the learning rate follows a cosine curve:
```
Learning Rate
    │
2e-4┤    ╱──╲
    │   ╱    ╲
    │  ╱      ╲
    │ ╱        ╲
 0  ┤╱          ╲────
    └─────────────────
     warmup       end
```


**Why cosine?**
- High in the middle: aggressive learning once the model has a basic understanding
- Low at the end: fine-tuning details, settling into good weights
- Prevents overshooting at the end of training
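
One common implementation of warmup-then-cosine, as a sketch (the exact schedule your trainer uses may differ slightly at the endpoints):

```python
import math

def lr_at(step: int, total_steps: int,
          peak_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0, 1000))      # 0.0 at the very start of warmup
print(lr_at(100, 1000))    # 0.0002 (the peak, right at the end of warmup)
print(lr_at(550, 1000))    # half of peak, mid-decay
```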


### Max Sequence Length: 2048 tokens


**What it controls:** Maximum number of tokens per training example.


```
Example conversation:
  System prompt:    ~500 tokens
  User message:     ~100 tokens
  Assistant reply:  ~300 tokens
  Total:            ~900 tokens → Fits in 2048 ✓
```


**Why 2048?**
- Covers all our examples (most are under 1,000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models

### Gradient Checkpointing: ON


**What it does:** Saves memory by recomputing some activations during the backward pass.


```
Without checkpointing:
  Forward pass: Store ALL intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: Store only SOME activations
  Backward pass: Recompute the missing ones on the fly
  Memory: 5 GB (saves ~40%)
```


**Trade-off:** Slower (needs extra computation) but fits on a T4.
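
Putting the whole recipe together, a sketch of the training config using trl's `SFTConfig`. The argument names follow recent trl/transformers releases, and the output path is a placeholder; verify both against your installed versions (some fields have been renamed across trl releases):

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen3-tool-lora",     # hypothetical output path
    learning_rate=2e-4,               # 10x the usual full fine-tuning LR
    per_device_train_batch_size=4,    # what fits in T4 memory
    gradient_accumulation_steps=4,    # 4 x 4 = 16 effective batch
    num_train_epochs=3,               # the sweet spot discussed above
    warmup_ratio=0.1,                 # ramp the LR over the first 10% of steps
    lr_scheduler_type="cosine",       # cosine decay after warmup
    max_seq_length=2048,              # covers our examples
    gradient_checkpointing=True,      # ~40% activation-memory savings
    fp16=True,                        # T4 supports fp16 but not bf16
)
```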


---


## Reading Training Logs


### What You'll See


Logs look something like this (the step counts here are illustrative):

```
Step  10/245: loss=2.847, learning_rate=2.0e-05
Step  20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05
```


### How to Interpret


| Observation | Meaning |
|-------------|---------|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ Overfitting → stop early |
| Loss stuck at ~3.0 | ❌ Not learning → check data/format |
| Loss drops fast then plateaus | ✅ Normal → model learned the basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting → model memorized training data |


### Target Numbers (for reference)


- **Initial loss:** ~2.5-3.5 (the pretrained model already speaks fluent English; it just hasn't seen our tool-call format yet)
- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
- **Eval loss:** Should be within 0.1-0.3 of train loss


---


## 🧮 Training Math


### How Long Does It Take?


```
Dataset:                15,694 examples
Batch size:             4 (per device)
Gradient accumulation:  4 steps
Effective batch:        4 × 4 = 16

Steps per epoch:        15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4):     ~2-3 seconds
Total time:             2,940 × 2.5s = ~7,350s = ~2 hours
```
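
The same arithmetic, checkable in a few lines (2.5 s/step is an assumed midpoint of the 2-3 s range):

```python
import math

examples, effective_batch, epochs = 15_694, 16, 3

steps_per_epoch = math.ceil(examples / effective_batch)   # ~980
total_steps = steps_per_epoch * epochs                    # ~2,940
hours = total_steps * 2.5 / 3600                          # assumed 2.5 s/step

print(steps_per_epoch, total_steps, round(hours, 1))      # 981 2943 2.0
```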


### Cost Calculation


```
T4 GPU on HF Jobs: ~$0.60/hour
Training time:     ~2 hours
Total cost:        $0.60 × 2 = $1.20
```


Well under $10! ✅


---



## Summary: Key Training Concepts


| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (~5 GB vs ~24 GB) |
| **SFT** | Teaching the model with input→output examples | Gives the model tool-calling knowledge |
| **Loss** | Measure of how wrong predictions are | Lower = better learning |
| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
| **Gradient Accumulation** | Simulates larger batch sizes | Memory-saving trick |
| **Epochs** | Times the model sees the full dataset | 3 is the sweet spot |
| **Warmup** | Gradual LR increase at the start | Prevents early instability |
| **Cosine Schedule** | LR high→low curve | Aggressive middle, gentle end |
| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |


---


## Next Step


Read `05-dataset.md` to understand our training data - what we have, what's missing, and how to make it better.
|
|