04 - Training Explained: LoRA, SFT & Hyperparameters
Why This Chapter Matters
This is where we answer: "How do we actually teach the model to use tools?"
By the end of this chapter, you'll understand:
- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working
Concept 1: Why Can't We Just Use the Base Model?
Qwen3-1.7B is already a great model. It can chat, answer questions, and write code. But it doesn't know how to use tools in a structured way.
What Base Models Know
Base model Qwen3-1.7B:
- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests
What Fine-Tuning Adds
After training on 15,694 tool-calling examples:
- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations
Think of it like this:
- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals
Concept 2: LoRA, the Magic of Cheap Fine-Tuning
The Problem: Full Fine-Tuning Is Expensive
To fine-tune all ~1.7 billion parameters of Qwen3-1.7B (rounded up to 2 billion here to keep the math simple):
| Component | Size | Why |
|---|---|---|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | One fp16 gradient for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 values per param (2B × 4 bytes × 2) |
| Total | 24 GB | Doesn't fit on T4 (16GB)! |
You'd need an A100 GPU (80GB), which costs $3-4/hour.
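The arithmetic behind the table can be reproduced with a few lines of Python. This is a rough sketch only: the function name and its defaults are illustrative, it assumes fp16 weights and gradients plus two fp32 Adam values per parameter, and it ignores activation memory entirely.

```python
def full_finetune_memory_gb(params_billion=2.0, weight_bytes=2, grad_bytes=2,
                            optimizer_values=2, optimizer_bytes=4):
    """Rough estimate in GB (1 billion params x 1 byte = 1 GB). Ignores activations."""
    weights = params_billion * weight_bytes
    gradients = params_billion * grad_bytes
    # Adam tracks two running statistics per parameter (assumed fp32 here).
    optimizer = params_billion * optimizer_values * optimizer_bytes
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb()
for name, gigabytes in estimate.items():
    print(f"{name:>10}: {gigabytes:.0f} GB")
```

Passing `params_billion=1.7` gives a slightly smaller total that still exceeds a 16GB T4.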
The Solution: LoRA (Low-Rank Adaptation)
Instead of updating ALL parameters, we add tiny matrices to each layer:
Original Layer (Frozen: Never Changes)
W (2048 × 2048) = 4.2M parameters (about 4 MILLION; these stay FROZEN)
input x → y = W × x → output
LoRA Adapters (Trainable: These Learn)
A (16 × 2048) = 32K params (initialized randomly)
B (2048 × 16) = 32K params (initialized to 0)
h = A × x
y' = B × h = B × (A × x)
Final Output:
y = W × x + B × A × x
(W × x is frozen; B × A × x is trained)
Math:
- Original: W is 2048×2048 = 4,194,304 parameters
- LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768
- Total LoRA: 65,536 parameters (1.6% of original!)
- Memory for training: ~5GB total (fits on T4!)
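The parameter counts above are easy to check. The helper below is a hypothetical name used only for this illustration; it counts the two adapter matrices for one linear layer of a given size and rank.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter pair (A: rank x d_in, B: d_out x rank)."""
    return rank * d_in + d_out * rank


full = 2048 * 2048                      # frozen weight matrix W
lora = lora_param_count(2048, 2048, 16)  # trainable adapter parameters
ratio = lora / full

print(f"full layer: {full:,} params")
print(f"LoRA r=16:  {lora:,} params ({ratio:.1%} of the layer)")
```

Doubling the rank doubles the adapter size, which is why rank is the main knob for trading memory against capacity.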
Why This Works
The idea: the weight changes a model needs during fine-tuning often have low-rank structure. Even though W is 2048×2048, the "important directions" of change can be captured by much smaller matrices.
Think of it like adjusting a steering wheel:
- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)
Our LoRA Configuration
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # Rank: "resolution" of the adapter
    lora_alpha=32,                # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,            # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",                  # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",        # This is a language model
)
r=16: Think of this as the "resolution." Higher = more detail but more memory. For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples.)
lora_alpha=32: Scaling factor. Rule of thumb: 2× rank. The adapter output is multiplied by lora_alpha / r (here 32 / 16 = 2), which controls how much the LoRA output contributes to the final result.
target_modules="all-linear": "LoRA Without Regret" reports that applying LoRA to ALL linear layers (not just attention projections) can match full fine-tuning quality in settings like ours. This is our secret sauce.
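To make the roles of the rank, `lora_alpha`, and the zero-initialized B matrix concrete, here is a minimal pure-Python sketch of a LoRA forward pass on a tiny layer. It assumes the adapter output is scaled by `alpha / rank`, which is how PEFT behaves by default; the sizes and helper names are made up for illustration and are not the real model's.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, rank, alpha = 8, 2, 4  # toy sizes (the chapter uses 2048, 16, 32)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]        # frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(rank)]  # rank x d, random
B = [[0.0] * rank for _ in range(d)]                                  # d x rank, zeros
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing: the model starts unchanged.
y_before = lora_forward(W, A, B, x, alpha, rank)
print(y_before == matvec(W, x))

# Pretend training moved one entry of B: the output shifts by scale * (A x)[0].
h = matvec(A, x)
B[0][0] = 1.0
y_after = lora_forward(W, A, B, x, alpha, rank)
print(y_after[0] - y_before[0], (alpha / rank) * h[0])
```

Because B starts at zero, the fine-tuned model begins as an exact copy of the base model and drifts away only as training updates the adapters.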
Concept 3: SFT (Supervised Fine-Tuning)
What Is SFT?
SFT = teaching by example. We show the model:
Input: "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}
Input: "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."
Input: "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
The model learns to predict the output given the input.
How SFT Works Step by Step
Step 1: Tokenize
Convert text → numbers:
"Find Python files"
↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]
Each number is an index in a vocabulary of ~150,000 tokens (Qwen3's vocabulary has roughly 151K entries).
Step 2: Forward Pass
The model processes the tokenized input and predicts the next token at EACH position:
Input tokens: [4921, 12729, 4367, 8921]
↓
Predictions: [?, ?, ?, ?] → the prediction at the last position should be 1023
At each position, the model outputs a probability distribution over all ~150,000 possible tokens.
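"Predict the next token at each position" means the training targets are the input sequence shifted by one. A small sketch using the example token IDs from Step 1 (the IDs are illustrative, not real tokenizer output):

```python
tokens = [4921, 12729, 4367, 8921, 1023]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it should predict at each position

pairs = list(zip(inputs, targets))
for position, (seen, target) in enumerate(pairs):
    context = tokens[: position + 1]
    print(f"position {position}: context={context} -> target={target}")
```

Every position yields one training signal, so a single 5-token example produces 4 next-token predictions.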
Step 3: Compute Loss (Cross-Entropy)
Predicted probabilities: [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
(0.01 is the probability given to a wrong token; 0.45 is the probability given to the correct token, 1023)
Loss = -log(probability_of_correct_token)
= -log(0.45)
= 0.80
Lower loss = better prediction.
If the model gives token 1023 a probability of 0.45, the loss is 0.80 (using the natural log). If it gives a probability of 0.99, the loss is 0.01 (much better!).
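The loss numbers above can be checked directly. The sketch below also shows how raw model scores (logits) become probabilities through softmax; the three-token logits are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, correct_index):
    """Negative natural log of the probability assigned to the correct token."""
    return -math.log(probs[correct_index])


loss_045 = -math.log(0.45)
loss_099 = -math.log(0.99)
print(f"p=0.45 -> loss {loss_045:.2f}")
print(f"p=0.99 -> loss {loss_099:.2f}")

# Toy 3-token vocabulary: the correct token is index 0.
probs = softmax([2.0, 1.0, 0.1])
toy_loss = cross_entropy(probs, 0)
print(f"toy probs {[round(p, 3) for p in probs]} -> loss {toy_loss:.2f}")
```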
Step 4: Backward Pass (Backpropagation)
Compute gradients: which direction to adjust weights to reduce loss.
For each LoRA parameter:
gradient = how much changing this parameter would change the loss
This is done automatically by PyTorch's autograd.
Step 5: Update Weights (Adam Optimizer)
new_weight = old_weight - learning_rate × gradient
That is the plain gradient descent rule. Adam is smarter: it uses momentum and adaptive learning rates per parameter.
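A minimal sketch of a single Adam update for one scalar weight, using the commonly cited default betas and epsilon (assumptions here, not values taken from this project's configuration). It illustrates the "adaptive" part: on the first step, the size of the update is roughly the learning rate no matter how large the gradient is.

```python
import math


def adam_step(weight, grad, state, lr=2e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; `state` holds t, m, v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # momentum
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # squared-gradient average
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return weight - lr * m_hat / (math.sqrt(v_hat) + eps)


fresh = {"t": 0, "m": 0.0, "v": 0.0}
w_small = adam_step(1.0, 0.001, dict(fresh))  # tiny gradient
w_large = adam_step(1.0, 100.0, dict(fresh))  # huge gradient

print(f"Adam step with grad=0.001: {1.0 - w_small:.6f}")
print(f"Adam step with grad=100:   {1.0 - w_large:.6f}")
print(f"Plain SGD steps would be:  {2e-4 * 0.001:.7f} and {2e-4 * 100.0:.2f}")
```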
Step 6: Repeat
Do this for every batch in the dataset (one full pass is an epoch), and run 3 epochs in total.
Concept 4: Hyperparameters, the Recipe
Think of training as cooking. Hyperparameters are your recipe.
Learning Rate: 2e-4
What it controls: How big each weight update step is.
Learning rate scale (from the original chart of learning rate vs. steps):
1e-2 → Too high: loss oscillates, model never settles
2e-4 → Sweet spot for LoRA (10× higher than full fine-tuning)
1e-5 → Too low: barely moves, takes forever
Why 2e-4 for LoRA?
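- Full fine-tuning typically uses around 2e-5
- LoRA trains about 100× fewer parameters
- In practice, those few parameters need a learning rate about 10× higher to have a comparable effect
- So: 2e-5 × 10 = 2e-4 (a rule of thumb, not an exact derivation)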
Batch Size: 4 × 4 = 16 Effective
What it controls: How many examples the model sees before updating weights.
Without Gradient Accumulation:
Process 4 examples → Compute gradients → Update weights → Next 4
With Gradient Accumulation (what we do):
Process 4 examples → Compute gradients → SAVE gradients (don't update)
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Process 4 examples → Compute gradients → ADD to saved gradients
Now update weights (accumulated from 4 × 4 = 16 examples)
Why gradient accumulation?
- GPU can only fit 4 examples at once (memory limit)
- But effective batch of 16 gives more stable gradients
- It's a memory-saving trick
Trade-off: Each weight update is slower (4× more forward and backward passes per update), but the gradients are more stable.
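The claim that accumulation "fakes" a larger batch can be verified on a toy problem. The sketch below uses a one-weight linear model with squared-error loss (an assumption chosen to keep the math visible) and shows that averaging the gradients of 4 micro-batches of 4 gives the same gradient as one batch of 16.

```python
def mean_gradient(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# 16 toy (x, y) examples and an arbitrary starting weight.
data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
w = 0.5
micro_batch_size, accumulation_steps = 4, 4

accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    # Divide by the number of accumulation steps so the sum is an average.
    accumulated += mean_gradient(w, micro_batch) / accumulation_steps

full_batch = mean_gradient(w, data)
print(f"accumulated gradient: {accumulated}")
print(f"full-batch gradient:  {full_batch}")
```

Only one micro-batch of 4 has to be held in memory at a time, which is the whole point of the trick.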
Epochs: 3
What it controls: How many times the model sees the entire dataset.
Epoch 1: Sees all 15,694 examples → learns basic patterns
Epoch 2: Sees all again → refines understanding
Epoch 3: Sees all again → final tuning
Why 3?
- 1 epoch: Risk of underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Risk of overfitting (memorizes training data, fails on new data)
Warmup Ratio: 0.1 (10%)
What it controls: For the first 10% of training, learning rate starts at 0 and gradually ramps up to the full rate.
Why warmup?
- At the start, the model knows nothing about tool-calling
- Large updates could push weights in random bad directions
- Warmup lets the model "get its bearings" first
Cosine LR Schedule
After warmup, learning rate follows a cosine curve:
Learning rate over training (schematic): starts at 0, ramps up to the 2e-4 peak by the end of warmup, then decays along a cosine curve to about 0 at the end of training.
Why cosine?
- High right after warmup: aggressive learning once the model has a basic understanding
- Low at the end: fine-tuning details, settling into optimal weights
- Prevents overshooting at the end of training
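Warmup and cosine decay together can be written as one small function. This is a sketch of the usual formula (linear warmup, then half a cosine wave down to zero), not the trainer's exact code; the 2,940-step total is the rough figure from the Training Math section, and the function name is hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 2940
for step in (0, 147, 294, 1617, 2940):
    print(f"step {step:>4}: lr = {lr_at(step, total):.2e}")
```

The peak is reached exactly when warmup ends (step 294 here), and the rate is back down to half the peak midway through the decay phase.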
Max Sequence Length: 2048 tokens
What it controls: Maximum number of tokens per training example.
Example conversation:
System prompt: ~500 tokens
User message: ~100 tokens
Assistant reply: ~300 tokens
Total: ~900 tokens → Fits in 2048 ✅
Why 2048?
- Covers all our examples (most are under 1000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models
Gradient Checkpointing: ON
What it does: Saves memory by recomputing some values during the backward pass.
Without checkpointing:
Forward pass: Store all intermediate activations → Backward pass uses them
Memory: 8 GB
With checkpointing:
Forward pass: Store only SOME activations
Backward pass: Recompute missing ones on-the-fly
Memory: 5 GB (saves ~40%)
Trade-off: Slower (needs extra computation) but fits on T4.
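The store-versus-recompute trade-off can be illustrated on a toy chain of scalar "layers". This is a conceptual sketch, not how PyTorch implements checkpointing: each layer is `tanh(w * a)`, the segment length of 4 is an arbitrary choice, and all names are hypothetical. Both versions produce the same gradient, but the checkpointed one holds fewer activations at any moment.

```python
import math


def layer(w, a):
    return math.tanh(w * a)


def grad_full(weights, x):
    """Store every activation during the forward pass, then walk backward."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= weights[i] * (1.0 - acts[i + 1] ** 2)  # d tanh(w*a)/da
    return grad, len(acts)


def grad_checkpointed(weights, x, segment=4):
    """Store only every `segment`-th activation; recompute the rest when needed."""
    checkpoints = {0: x}
    a = x
    for i, w in enumerate(weights):
        a = layer(w, a)
        if (i + 1) % segment == 0 and (i + 1) < len(weights):
            checkpoints[i + 1] = a
    grad, peak_stored = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        acts = [checkpoints[start]]            # recompute this segment only
        for i in range(start, end):
            acts.append(layer(weights[i], acts[-1]))
        peak_stored = max(peak_stored, len(checkpoints) + len(acts) - 1)
        for i in reversed(range(start, end)):
            grad *= weights[i] * (1.0 - acts[i - start + 1] ** 2)
    return grad, peak_stored


weights = [0.9 + 0.01 * i for i in range(16)]
g_full, stored_full = grad_full(weights, 0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.5)
print(f"full:         grad={g_full:.6e}, activations held={stored_full}")
print(f"checkpointed: grad={g_ckpt:.6e}, activations held={stored_ckpt}")
```

The extra cost is one additional forward pass through each segment, which matches the "slower but fits in memory" trade-off described above.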
Reading Training Logs
What You'll See
Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
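Step 245/245: loss=0.876, learning_rate=1.2e-05
(Illustrative output: the step counts and values in your run will differ. With the settings in this chapter, a full run is closer to 2,940 steps than 245.)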
How to Interpret
| Observation | Meaning |
|---|---|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ If eval loss rises while train loss keeps falling: overfitting, stop early. If train loss itself rises: training is unstable, try a lower learning rate |
| Loss stuck at ~3.0 | ❌ Not learning: check data/format |
| Loss drops fast then plateaus | ✅ Normal: model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting: model memorized training data |
Target Numbers (for reference)
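- Initial loss: ~2.5-3.5 (the pretrained model already predicts text well but hasn't seen our tool-call format; pure random guessing over the whole vocabulary would give a loss above 11)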
- Final loss: ~0.8-1.2 (decent learning on 16K examples)
- Eval loss: Should be within 0.1-0.3 of train loss
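A few of these checks can be automated. The sketch below assumes log lines in exactly the format shown under "What You'll See"; real trainer output may be formatted differently, and the helper names and the 0.3 gap threshold (taken from the target above) are illustrative.

```python
import re

LOG_PATTERN = re.compile(
    r"Step (\d+)/(\d+): loss=([\d.]+), learning_rate=([\d.eE+-]+)"
)


def parse_log_line(line):
    """Return a dict for a matching log line, or None if it doesn't match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    step, total, loss, lr = match.groups()
    return {"step": int(step), "total": int(total),
            "loss": float(loss), "lr": float(lr)}


def loss_is_decreasing(records):
    losses = [record["loss"] for record in records]
    return all(later < earlier for earlier, later in zip(losses, losses[1:]))


def eval_gap_ok(train_loss, eval_loss, max_gap=0.3):
    """True when eval loss is within max_gap of train loss."""
    return (eval_loss - train_loss) <= max_gap


lines = [
    "Step 10/245: loss=2.847, learning_rate=2.0e-05",
    "Step 20/245: loss=2.654, learning_rate=4.0e-05",
    "Step 100/245: loss=1.234, learning_rate=1.8e-04",
    "Step 245/245: loss=0.876, learning_rate=1.2e-05",
]
records = [parse_log_line(line) for line in lines]
print("loss decreasing:", loss_is_decreasing(records))
print("eval gap ok (eval=1.0):", eval_gap_ok(records[-1]["loss"], 1.0))
```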
Training Math
How Long Does It Take?
Dataset: 15,694 examples
Batch size: 4 (per device)
Gradient accumulation: 4 steps
Effective batch: 4 × 4 = 16
Steps per epoch: 15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps
Time per step (T4): ~2-3 seconds
Total time: 2,940 × 2.5s = ~7,350s = ~2 hours
Cost Calculation
T4 GPU on HF Jobs: ~$0.60/hour
Training time: ~2 hours
Total cost: $0.60 × 2 = $1.20
Well under $10! ✅
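The same estimate as a reusable calculation, so you can plug in your own dataset size or GPU price. The function name is hypothetical, and the per-step time and hourly price are the rough figures quoted above, not measurements. Rounding steps up instead of down gives numbers slightly above the back-of-the-envelope values.

```python
import math


def training_estimate(examples=15_694, per_device_batch=4, grad_accum=4,
                      epochs=3, seconds_per_step=2.5, dollars_per_hour=0.60):
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(examples / effective_batch)  # last partial batch counts
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "hours": hours,
        "cost": hours * dollars_per_hour,
    }


plan = training_estimate()
print(f"steps: {plan['total_steps']}, "
      f"time: {plan['hours']:.2f} h, cost: ${plan['cost']:.2f}")
```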
Summary: Key Training Concepts
| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
| SFT | Teaching the model with input→output examples | Gives model tool-calling knowledge |
| Loss | Measure of how wrong predictions are | Lower = better learning |
| Learning Rate | Size of weight updates | Too high = chaos, too low = slow |
| Batch Size | Examples per weight update | More = stable gradients, needs more memory |
| Gradient Accumulation | Fake larger batch sizes | Memory-saving trick |
| Epochs | Times model sees full dataset | 3 is the sweet spot |
| Warmup | Gradual LR increase at start | Prevents early instability |
| Cosine Schedule | LR high→low curve | Aggressive right after warmup, gentle end |
| Gradient Checkpointing | Recompute activations | Saves ~40% memory |
Next Step
Read 05-dataset.md to understand our training data: what we have, what's missing, and how to make it better.