# 04 - Training Explained: LoRA, SFT & Hyperparameters

## 🎓 Why This Chapter Matters

This is where we answer: *"How do we actually teach the model to use tools?"*

By the end of this chapter, you'll understand:
- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working

---

## 🧠 Concept 1: Why Can't We Just Use the Base Model?

**Qwen3-1.7B** is already a great model. It can chat, answer questions, write code. 
But it doesn't know how to use **tools** in a structured way.

### What Base Models Know

Base model Qwen3-1.7B:
- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests

### What Fine-Tuning Adds

After training on 15,694 tool-calling examples:
- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations

**Think of it like this:**
- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals

---

## 🧠 Concept 2: LoRA - The Magic of Cheap Fine-Tuning

### The Problem: Full Fine-Tuning Is Expensive

To fine-tune all parameters of Qwen3-1.7B (rounded up to 2 billion for easy arithmetic):

| Component | Size | Why |
|-----------|------|-----|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | One fp16 gradient for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 states (momentum and variance) per param: 2B × 2 × 4 bytes |
| **Total** | **24 GB** | **Doesn't fit on T4 (16GB)!** |

You'd need an **A100 GPU** (80GB) which costs **$3-4/hour**.
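
The table's arithmetic can be reproduced in a few lines. This is only a rough sketch under the table's own assumptions (parameters rounded to 2B, fp16 weights and gradients, two fp32 Adam states per parameter). It ignores activations and framework overhead, and the function name `full_finetune_memory_gb` is made up for illustration.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough memory estimate for full fine-tuning with Adam (activations ignored)."""
    gb = 1e9  # decimal gigabytes, matching the table's round numbers
    weights = n_params * 2 / gb        # fp16: 2 bytes per parameter
    gradients = n_params * 2 / gb      # one fp16 gradient per parameter
    optimizer = n_params * 2 * 4 / gb  # Adam: 2 fp32 states (4 bytes each) per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(2e9)
for name, size in estimate.items():
    print(f"{name:>10}: {size:.0f} GB")
```

Real runs need extra room for activations, so treat the 24 GB total as a lower bound rather than a precise requirement.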

### The Solution: LoRA (Low-Rank Adaptation)

Instead of updating ALL parameters, we add tiny matrices to each layer:

```
Original Layer (Frozen: Never Changes)
┌─────────────────────────────┐
│  W (2048 × 2048) = 4.2M     │  ← 4 MILLION parameters
│  parameters                 │     These stay FROZEN
└─────────────────────────────┘
         │
         │ input x
         ▼
    y = W × x
         │
         ▼
    output

LoRA Adapters (Trainable: These Learn)
┌─────────────────────┐    ┌─────────────────────┐
│  A (16 × 2048)      │───▶│  B (2048 × 16)      │
│  = 32K params       │    │  = 32K params       │
│  (initialized       │    │  (initialized to 0) │
│   randomly)         │    │                     │
└─────────────────────┘    └─────────────────────┘
         │                         │
         ▼                         ▼
    h = A × x                  y' = B × h
                                   = B × (A × x)

Final Output:
y = W × x + B × A × x
    ↑           ↑
  frozen     trained
```

**Math:**
- Original: W is 2048×2048 = 4,194,304 parameters
- LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768
- Total LoRA: 65,536 parameters (1.6% of original!)
- Memory for training: ~5GB total (fits on T4!)
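
To make the formula `y = W × x + B × A × x` concrete, here is a minimal pure-Python sketch. It stores A as an r × d matrix (16 × 2048 at full size) so that `A × x` is well defined, and uses tiny toy dimensions so it runs instantly. The helper names (`matvec`, `lora_forward`, `lora_param_count`) are illustrative, not part of any library. Note that the `peft` library additionally scales the adapter path by `lora_alpha / r`, which is left out here to match the formula above.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """y = W x + B (A x): frozen path plus the low-rank trainable path."""
    h = matvec(A, x)      # project down to r values
    delta = matvec(B, h)  # project back up to d values
    return [w + d for w, d in zip(matvec(W, x), delta)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


random.seed(0)
d, r = 8, 2  # toy sizes; the chapter's layer uses d=2048, r=16
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
assert lora_forward(W, A, B, x) == matvec(W, x)

print(lora_param_count(2048, 2048, 16))            # 65536
print(lora_param_count(2048, 2048, 16) / 2048**2)  # 0.015625
```

Initializing B to zero is what lets training begin exactly at the base model's behavior: the adapter only starts to matter once gradients move B away from zero.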

### Why This Works

The idea: the weight **updates** needed for fine-tuning often have **low-rank structure**.
Even though W is 2048×2048, the "important directions" of change can be 
captured by much smaller matrices.

Think of it like adjusting a steering wheel:
- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)

### Our LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                    # Rank: "resolution" of the adapter
    lora_alpha=32,          # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,      # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",             # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",   # This is a language model
)
```

**r=16:** Think of this as the "resolution." Higher = more detail but more memory. 
For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples)

**lora_alpha=32:** Scaling factor. Rule of thumb: 2× rank. The adapter output is 
scaled by `lora_alpha / r` (here 32 / 16 = 2), which controls how much the LoRA 
path contributes to the final result.

**target_modules="all-linear":** The "LoRA Without Regret" study reports that 
applying LoRA to ALL linear layers (not just attention projections) can match 
full fine-tuning quality in many settings. This is our secret sauce.

---

## 🧠 Concept 3: SFT - Supervised Fine-Tuning

### What Is SFT?

SFT = **teaching by example.** We show the model:

```
Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
```

The model learns to predict the output given the input.

### How SFT Works Step by Step

#### Step 1: Tokenize

Convert text → numbers:

```
"Find Python files"
↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]
```

Each number is an index in a vocabulary of ~150,000 tokens (the IDs shown here are made up for illustration).

#### Step 2: Forward Pass

The model processes the tokenized input and predicts the next token at EACH position:

```
Input tokens:  [4921, 12729, 4367, 8921]
                                    │
Predictions:   [?,    ?,    ?,    ?  ] ──▶ next token should be 1023
```

At each position, the model outputs a probability distribution over all ~150,000 possible tokens.

#### Step 3: Compute Loss (Cross-Entropy)

```
Predicted probabilities:  [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                              ↑                    ↑
                           wrong                 correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80
```

**Lower loss = better prediction.**

If the model predicted token 1023 with probability 0.45, loss is 0.80.
If it predicted with probability 0.99, loss is 0.01 (much better!).
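
The forward pass and loss can be sketched in plain Python. This toy example assumes a 4-token vocabulary and made-up logits, purely for illustration; a real model does the same thing over its full vocabulary at every position. The function names are illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss for one position: negative log-probability of the correct token."""
    return -math.log(probs[target_index])


# Toy 4-token vocabulary; index 2 is the correct next token (made-up logits).
probs = softmax([1.0, 0.5, 2.0, -1.0])
loss = cross_entropy(probs, target_index=2)

# The two numbers quoted in the text:
print(round(-math.log(0.45), 2))  # 0.8
print(round(-math.log(0.99), 2))  # 0.01
```

During training, this per-token loss is averaged over the positions in the batch that contribute to the loss.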

#### Step 4: Backward Pass (Backpropagation)

Compute gradients: which direction to adjust weights to reduce loss.

```
For each LoRA parameter:
  gradient = how much changing this parameter would change the loss
```

This is done automatically by PyTorch's autograd.

#### Step 5: Update Weights (Adam Optimizer)

```
new_weight = old_weight - learning_rate × gradient
```

That is plain gradient descent. Adam builds on it with momentum and a separate adaptive step size for each parameter.
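
To see the difference, here is a minimal sketch of both update rules for a single scalar weight. The Adam defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used values, and the toy objective `(w - 3)^2` is an assumption chosen only because its minimum is obvious. Real training applies the same idea to every LoRA parameter at once.

```python
import math


def sgd_step(w, grad, lr):
    """Plain gradient descent: step against the gradient."""
    return w - lr * grad


def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight.

    m and v are running averages of the gradient and squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)  # ends close to the minimum at 3
```

The two extra values `m` and `v` that Adam carries for each parameter are the optimizer states that make full fine-tuning so memory-hungry.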

#### Step 6: Repeat

Do this for ALL examples in the dataset, then repeat for 3 epochs.

---

## 🧠 Concept 4: Hyperparameters - The Recipe

Think of training as cooking. Hyperparameters are your recipe.

### Learning Rate: 2e-4

**What it controls:** How big each weight update step is.

```
Learning Rate
    │
1e-2│  ╳─── Too high: loss oscillates, model never settles
    │   │
2e-4│    ●── Sweet spot for LoRA (10× higher than full fine-tuning)
    │      ╲
1e-5│       ╲── Too low: barely moves, takes forever
    │         ╲
    └─────────────────
        Steps
```

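**Why 2e-4 for LoRA?**
- Full fine-tuning typically uses around 2e-5
- LoRA trains roughly 100× fewer parameters
- In practice LoRA works best with a learning rate about 10× higher than full fine-tuning
- So: 2e-5 × 10 = **2e-4** (an empirical rule of thumb, not a derivation)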

### Batch Size: 4 × 4 = 16 Effective

**What it controls:** How many examples the model sees before updating weights.

**Without Gradient Accumulation:**
- Process 4 examples → Compute gradients → Update weights → Next 4

**With Gradient Accumulation (what we do):**
- Process 4 examples → Compute gradients → SAVE gradients (don't update)
- Process 4 examples → Compute gradients → ADD to saved gradients
- Process 4 examples → Compute gradients → ADD to saved gradients
- Process 4 examples → Compute gradients → ADD to saved gradients
- Now update weights (accumulated from 4 × 4 = 16 examples)

**Why gradient accumulation?**
- GPU can only fit 4 examples at once (memory limit)
- But effective batch of 16 gives more stable gradients
- It's a memory-saving trick

**Trade-off:** Slower per update (4× more forward and backward passes per weight update) but more stable training.
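
The claim that 4 accumulated micro-batches behave like one batch of 16 can be checked on a toy problem. The sketch below assumes a one-weight model `y = w * x` with a squared-error loss, and all names are illustrative. The key detail is dividing each micro-batch gradient by the number of accumulation steps, so the sum becomes an average.

```python
def grad_mse(w, batch):
    """Mean gradient of (w * x - y)^2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size, accum_steps):
    """Sum the scaled gradients of `accum_steps` micro-batches before one update."""
    total = 0.0
    for step in range(accum_steps):
        start = step * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        # Divide by accum_steps so the sum is an average, not 4x too large.
        total += grad_mse(w, micro_batch) / accum_steps
    return total


# 16 toy examples drawn from y = 2x.
data = [(float(x), 2.0 * x) for x in range(1, 17)]
w = 0.5

g_accumulated = accumulated_grad(w, data, micro_batch_size=4, accum_steps=4)
g_full_batch = grad_mse(w, data)
print(abs(g_accumulated - g_full_batch) < 1e-9)  # True
```

Only 4 examples need to be in memory at any moment, yet the weight update sees the same gradient a batch of 16 would produce.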

### Epochs: 3

**What it controls:** How many times the model sees the entire dataset.

- **Epoch 1:** Sees all 15,694 examples → learns basic patterns
- **Epoch 2:** Sees all again → refines understanding
- **Epoch 3:** Sees all again → final tuning

**Why 3?**
- 1 epoch: Underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Overfitting (memorizes training data, fails on new data)

### Warmup Ratio: 0.1 (10%)

**What it controls:** For the first 10% of training, learning rate starts at 0 
and gradually ramps up to the full rate.

**Why warmup?**
- At the start, model knows NOTHING about tool-calling
- Large updates could push weights in random bad directions
- Warmup lets model "get its bearings" first

### Cosine LR Schedule

After warmup, learning rate follows a cosine curve:
```
Learning Rate
    │
2e-4│    ╱──╲
    │   ╱    ╲
    │  ╱      ╲
    │ ╱        ╲
   0│╱          ╲────
    └─────────────────
      warmup    end
```

**Why cosine?**
- High in the middle: aggressive learning when model has basic understanding
- Low at the end: fine-tuning details, settling into optimal weights
- Prevents overshooting at the end of training
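
Warmup and cosine decay together fit in one small function. This is a sketch of the general shape, assuming linear warmup from 0 and decay all the way back to 0. Training libraries ship their own schedulers, which may round the warmup steps slightly differently. The name `lr_at_step` is illustrative, and the 2,940-step total comes from the estimate in the Training Math section.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 2940  # roughly 3 epochs at an effective batch size of 16
for step in (0, 147, 294, 1617, 2940):
    print(step, f"{lr_at_step(step, total):.2e}")
```

The printed values trace the curve: zero at the start, half of the peak midway through warmup, the full 2e-4 when warmup ends, half of the peak again at the midpoint of the decay, and zero at the final step.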

### Max Sequence Length: 2048 tokens

**What it controls:** Maximum number of tokens per training example.

```
Example conversation:
  System prompt:  ~500 tokens
  User message:   ~100 tokens
  Assistant reply: ~300 tokens
  Total:          ~900 tokens  ← Fits in 2048 ✓
```

**Why 2048?**
- Covers all our examples (most are under 1000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models

### Gradient Checkpointing: ON

**What it does:** Saves memory by recomputing some values during backward pass.

```
Without checkpointing:
  Forward pass: Store all intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: Store only SOME activations
  Backward pass: Recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)
```

**Trade-off:** Slower (needs extra computation) but fits on T4.

---

## 📊 Reading Training Logs

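### What You'll See

The excerpt below is illustrative: its step counts and learning rates do not match the ~2,940-step run estimated in the Training Math section, so expect different numbers in your own logs.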

```
Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05
```

### How to Interpret

| Observation | Meaning |
|-------------|---------|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ Overfitting: stop early |
| Loss stuck at ~3.0 | ❌ Not learning: check data/format |
| Loss drops fast then plateaus | ✅ Normal: model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting: model memorized training data |

### Target Numbers (for reference)

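- **Initial loss:** ~2.5-3.5 (the pretrained model has not yet learned the tool-call format; truly random guessing over the full vocabulary would give a loss above 11)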
- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
- **Eval loss:** Should be within 0.1-0.3 of train loss
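
These rules of thumb can be turned into a quick sanity check. The helper below is a hypothetical sketch, not part of any training library: the 0.1 tolerances are arbitrary assumptions, `max_gap=0.3` comes from the eval-loss guideline above, and the `eval_loss=1.0` in the example call is a made-up value.

```python
def diagnose(train_losses, eval_loss=None, max_gap=0.3):
    """Apply the rules of thumb above to a list of logged training losses."""
    notes = []
    first, last = train_losses[0], train_losses[-1]
    if last >= first - 0.1:
        notes.append("not learning: loss has barely moved; check data/format")
    elif last > min(train_losses) + 0.1:
        notes.append("loss is rising again after a low point: consider stopping early")
    else:
        notes.append("loss is decreasing")
    if eval_loss is not None and eval_loss - last > max_gap:
        notes.append("eval loss is far above train loss: likely overfitting")
    return notes


# Training losses taken from the sample log above; eval_loss is a made-up value.
print(diagnose([2.847, 2.654, 1.234, 0.876], eval_loss=1.0))
```

A check like this is no substitute for looking at the loss curve, but it catches the obvious failure modes early in a run.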

---

## 🧮 Training Math

### How Long Does It Take?

```
Dataset: 15,694 examples
Batch size: 4 (per device)
Gradient accumulation: 4 steps
Effective batch: 4 × 4 = 16

Steps per epoch: 15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4): ~2-3 seconds
Total time: 2,940 × 2.5s = ~7,350s = ~2 hours
```

### Cost Calculation

```
T4 GPU on HF Jobs: ~$0.60/hour
Training time: ~2 hours
Total cost: $0.60 × 2 = $1.20
```

Well under $10! ✅
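
To rerun this estimate with your own numbers, the arithmetic fits in one small function. It is a back-of-the-envelope sketch: the function name is illustrative, the 2.5 seconds per step and $0.60 per hour are the rough figures quoted above, and it assumes the last partial batch of each epoch is kept (rounded up), which depends on your trainer settings.

```python
import math


def training_estimate(n_examples, per_device_batch, grad_accum, epochs,
                      seconds_per_step, dollars_per_hour):
    """Back-of-the-envelope steps, hours, and cost for one training run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)  # keep the partial batch
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return total_steps, hours, hours * dollars_per_hour


steps, hours, cost = training_estimate(
    n_examples=15_694, per_device_batch=4, grad_accum=4, epochs=3,
    seconds_per_step=2.5, dollars_per_hour=0.60,
)
print(f"{steps} steps, {hours:.2f} h, ${cost:.2f}")
```

With these inputs the estimate comes to 2,943 steps, about 2.04 hours, and roughly $1.23. That is slightly above the rounded figures in the blocks above, because 980.875 steps per epoch is rounded up to 981.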

---

## 🎓 Summary: Key Training Concepts

| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
| **SFT** | Teaching model with input→output examples | Gives model tool-calling knowledge |
| **Loss** | Measure of how wrong predictions are | Lower = better learning |
| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
| **Gradient Accumulation** | Fake larger batch sizes | Memory-saving trick |
| **Epochs** | Times model sees full dataset | 3 is sweet spot |
| **Warmup** | Gradual LR increase at start | Prevents early instability |
| **Cosine Schedule** | LR high→low curve | Aggressive middle, gentle end |
| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |

---

## 🔜 Next Step

Read `05-dataset.md` to understand our training data: what we have, what's missing, and how to make it better.