muhammadtlha944 committed
Commit 40d8168 · verified · 1 Parent(s): 678cba6

Upload docs/04-training.md

  1. docs/04-training.md +435 -0
docs/04-training.md ADDED
# 04 - Training Explained: LoRA, SFT & Hyperparameters

## 🎓 Why This Chapter Matters

This is where we answer: *"How do we actually teach the model to use tools?"*

By the end of this chapter, you'll understand:
- What LoRA is and why it's magical for budget training
- What SFT does step-by-step
- What each hyperparameter controls
- How to read training logs and know if it's working

---

## 🧠 Concept 1: Why Can't We Just Use the Base Model?

**Qwen3-1.7B** is already a great model. It can chat, answer questions, and write code.
But it doesn't know how to use **tools** in a structured way.

### What Base Models Know

Base model Qwen3-1.7B:
- ✅ Understands English, can chat
- ✅ Can write Python code
- ✅ Can answer questions about the world
- ❌ Doesn't know about your specific tool schemas
- ❌ Doesn't output tool calls in correct JSON-RPC format
- ❌ Doesn't plan multi-step tool chains
- ❌ Doesn't ask clarifying questions
- ❌ Doesn't refuse dangerous requests

### What Fine-Tuning Adds

After training on 15,694 tool-calling examples:
- ✅ Understands tool schemas ("Here's what this tool needs")
- ✅ Generates correct JSON-RPC tool calls
- ✅ Plans multi-step sequences ("First A, then B using A's result")
- ✅ Asks when info is missing
- ✅ Refuses harmful operations

**Think of it like this:**
- Base model = A smart person who knows how to talk but doesn't know your tools
- Fine-tuned model = The same person after reading 15,000 instruction manuals

---

## 🧠 Concept 2: LoRA, the Magic of Cheap Fine-Tuning

### The Problem: Full Fine-Tuning Is Expensive

To fine-tune all ~2 billion parameters of Qwen3-1.7B (1.7B, rounded up to keep the arithmetic simple):

| Component | Size | Why |
|-----------|------|-----|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | One fp16 gradient for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 values (momentum and variance) per param: 2B × 2 × 4 bytes |
| **Total** | **24 GB** | **Doesn't fit on T4 (16GB)!** |

Activations come on top of this. You'd need a larger GPU such as an **A100** (80GB), which costs roughly **$3-4/hour**.
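
The arithmetic behind the table can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes the rounded 2B parameter count, fp16 weights and gradients, and two fp32 Adam states per parameter, and it ignores activations. The function name `full_finetune_memory_gb` is made up for illustration.

```python
# Rough memory estimate for full fine-tuning with Adam.
# Assumptions (matching the table): 2e9 parameters, fp16 weights and
# gradients, two fp32 Adam states per parameter, activations ignored.

GB = 1e9  # decimal gigabytes, matching the table's rounding


def full_finetune_memory_gb(num_params: float) -> dict:
    weights = num_params * 2 / GB        # fp16: 2 bytes per parameter
    gradients = num_params * 2 / GB      # one fp16 gradient per parameter
    optimizer = num_params * 2 * 4 / GB  # Adam: 2 states x 4 bytes (fp32)
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(2e9)
print(estimate)
```

Changing `num_params` to `1.7e9` shows how much of the 24 GB figure comes from rounding up.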

### The Solution: LoRA (Low-Rank Adaptation)

Instead of updating ALL parameters, we freeze them and add a pair of tiny trainable matrices next to each layer:

```
Original Layer (Frozen: Never Changes)
┌─────────────────────────────┐
│ W (2048 × 2048) = 4.2M      │  ← 4 MILLION parameters
│ parameters                  │    These stay FROZEN
└─────────────────────────────┘
              │
              │ input x
              ▼
          y = W × x
              │
              ▼
           output

LoRA Adapters (Trainable: These Learn)
┌─────────────────────┐    ┌─────────────────────┐
│ A (16 × 2048)       │───▶│ B (2048 × 16)       │
│ = 32K params        │    │ = 32K params        │
│ (initialized        │    │ (initialized to 0)  │
│ randomly)           │    │                     │
└─────────────────────┘    └─────────────────────┘
          │                          │
          ▼                          ▼
      h = A × x                 y' = B × h
                                   = B × (A × x)

Final Output:
y = W × x + B × A × x
      ↑         ↑
    frozen   trained
```

**Math:**
- Original: W is 2048×2048 = 4,194,304 parameters
- LoRA: A is 16×2048 = 32,768 (maps the 2048-dim input down to rank 16), B is 2048×16 = 32,768 (maps it back up)
- Total LoRA: 65,536 parameters (1.6% of original!)
- Memory for training: ~5GB total (fits on T4!)
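
The forward pass above can be sketched in plain Python. This is a toy illustration, not the PEFT implementation: the dimensions are shrunk (8 and 2 stand in for 2048 and 16), `matvec`, `lora_forward`, and `lora_param_count` are hypothetical helper names, and A has shape r × d with B of shape d × r so that B × (A × x) is well defined. The `alpha / r` scale follows the standard LoRA formulation; the diagram leaves it out.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen path: W x
    update = matvec(B, matvec(A, x))  # trainable path: B (A x)
    scale = alpha / r                 # standard LoRA scaling factor
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    return r * d_in + d_out * r       # A is r x d_in, B is d_out x r


random.seed(0)
d, r, alpha = 8, 2, 4  # tiny stand-ins for 2048, 16, 32
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing, so training starts
# from exactly the base model's behaviour.
y_start = lora_forward(W, A, B, x, alpha, r)
y_base = matvec(W, x)

full_size_count = lora_param_count(2048, 2048, 16)
fraction = full_size_count / (2048 * 2048)
print(full_size_count, fraction)
```

Because B starts at zero, the first training step changes only B; A begins to matter once B has moved away from zero.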

### Why This Works

The idea: the *change* a model needs during fine-tuning often has **low-rank structure**.
Even though W is 2048×2048, the "important directions" of change can be
captured by much smaller matrices.

Think of it like adjusting a steering wheel:
- Full fine-tuning = Rebuilding the entire car to turn better
- LoRA = Adding a small steering adjustment module (tiny, cheap, effective)

### Our LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                         # Rank: "resolution" of the adapter
    lora_alpha=32,                # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to all linear layers (PEFT skips the output head)
    lora_dropout=0.05,            # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",                  # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",        # This is a language model
)
```

**r=16:** Think of this as the "resolution." Higher = more detail but more memory.
For ~16K training examples, we treat r=16 as the sweet spot. (TinyAgent used r=64 for 80K examples)

**lora_alpha=32:** Scaling factor. The adapter output is multiplied by `lora_alpha / r`
(here 32 / 16 = 2), so it controls how much the LoRA output contributes to the final
result. Rule of thumb: 2× rank.

**target_modules="all-linear":** The "LoRA Without Regret" study reported that
applying LoRA to ALL linear layers (not just attention projections) can match
full fine-tuning quality in typical post-training settings. This is our secret sauce.

---

## 🧠 Concept 3: SFT (Supervised Fine-Tuning)

### What Is SFT?

SFT = **teaching by example.** We show the model:

```
Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}
```

The model learns to predict the output given the input.
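
One common way to make "predict the output given the input" concrete is to compute the loss only on the output tokens. The sketch below shows that idea with made-up token IDs. Whether this project's trainer masks the prompt this way is an assumption, and `build_example` is a hypothetical helper; the value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; hide the prompt from the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Made-up IDs: three prompt tokens followed by two completion tokens.
example = build_example([11, 12, 13], [21, 22])
print(example)
```

The model still reads the prompt tokens as context; they just do not contribute to the loss.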

### How SFT Works Step by Step

#### Step 1: Tokenize

Convert text → numbers:

```
"Find Python files"
        ↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]
```

Each number is an index into the tokenizer's vocabulary (roughly 150,000 entries for Qwen3). The IDs shown here are illustrative.

#### Step 2: Forward Pass

The model processes the tokenized input and predicts the next token at EACH position:

```
Input tokens: [4921, 12729, 4367, 8921]
                                   │
Predictions:  [ ?,     ?,    ?,    ? ] ──▶ next token should be 1023
```

At each position the model outputs a probability distribution over the whole vocabulary.

#### Step 3: Compute Loss (Cross-Entropy)

```
Predicted probabilities: [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                           ↑                       ↑
                         wrong                  correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80
```

**Lower loss = better prediction.** Here `log` is the natural logarithm, and the loss reported during training is this value averaged over the target tokens in a batch.

If the model predicted token 1023 with probability 0.45, loss is 0.80.
If it predicted with probability 0.99, loss is 0.01 (much better!).
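
The two loss values quoted above can be checked directly. This is a minimal sketch using only the standard library; `softmax` and `cross_entropy` are written out by hand here for illustration, whereas a real training loop would use the framework's built-in loss.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(prob_of_correct_token: float) -> float:
    """Loss for one position: negative natural log of the correct token's probability."""
    return -math.log(prob_of_correct_token)


loss_045 = cross_entropy(0.45)
loss_099 = cross_entropy(0.99)
print(round(loss_045, 2), round(loss_099, 2))  # 0.8 0.01

# Two equally scored tokens each get probability 0.5.
probs = softmax([0.0, 0.0])
```

A probability of 1.0 for the correct token gives a loss of exactly 0, which is the floor.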

#### Step 4: Backward Pass (Backpropagation)

Compute gradients: which direction to adjust weights to reduce loss.

```
For each LoRA parameter:
    gradient = how much changing this parameter would change the loss
```

This is done automatically by PyTorch's autograd.

#### Step 5: Update Weights (Adam Optimizer)

The basic gradient-descent update is:

```
new_weight = old_weight - learning_rate × gradient
```

Adam is smarter: it keeps a running average of past gradients (momentum) and scales each parameter's step by a running estimate of its gradient magnitude, so every parameter gets its own adaptive learning rate.
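
To make the difference concrete, here is a single-parameter sketch of both update rules. It implements plain Adam with its usual default constants and no weight decay. Real trainers often use the AdamW variant, so treat this as an illustration of the idea rather than the optimizer this project runs. The names `sgd_step` and `adam_step` are made up.

```python
import math


def sgd_step(w, grad, lr):
    """Plain gradient descent: step proportional to the raw gradient."""
    return w - lr * grad


def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # momentum
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # squared-gradient average
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
w_sgd = sgd_step(1.0, grad=0.5, lr=2e-4)
w_adam = adam_step(1.0, grad=0.5, state=state, lr=2e-4)
print(w_sgd, w_adam)
```

On the first step Adam moves by roughly `lr` regardless of the gradient's size, while plain gradient descent moves by `lr × gradient`.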

#### Step 6: Repeat

Do this for every batch until all examples have been seen once (one epoch), then repeat for 3 epochs in total.

---

## 🧠 Concept 4: Hyperparameters, the Recipe

Think of training as cooking. Hyperparameters are your recipe.

### Learning Rate: 2e-4

**What it controls:** How big each weight update step is.

```
Learning Rate
    │
1e-2│ ╳─── Too high: loss oscillates, model never settles
    │ │
2e-4│ ●── Sweet spot for LoRA (10× higher than full fine-tuning)
    │  ╲
1e-5│   ╲── Too low: barely moves, takes forever
    │    ╲
    └─────────────────
          Steps
```

**Why 2e-4 for LoRA?**
- Full fine-tuning typically uses around 2e-5
- LoRA trains far fewer parameters (on the order of 100× fewer), and the adapters start from a zero update
- As a rule of thumb, LoRA works best with a learning rate about 10× higher than full fine-tuning
- So: 2e-5 × 10 = **2e-4**

### Batch Size: 4 × 4 = 16 Effective

**What it controls:** How many examples the model sees before updating weights.

**Without Gradient Accumulation:**
- Process 4 examples → Compute gradients → Update weights → Next 4

**With Gradient Accumulation (what we do):**
- Process 4 examples → Compute gradients → SAVE gradients (don't update)
- Process 4 examples → Compute gradients → ADD to saved gradients
- Process 4 examples → Compute gradients → ADD to saved gradients
- Process 4 examples → Compute gradients → ADD to saved gradients
- Now update weights (accumulated from 4 × 4 = 16 examples)

**Why gradient accumulation?**
- GPU can only fit 4 examples at once (memory limit)
- But effective batch of 16 gives more stable gradients
- It's a memory-saving trick

**Trade-off:** Each weight update takes 4 forward/backward passes instead of 1, so there are 4× fewer updates per epoch, but each update is based on a less noisy gradient.
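
The claim that four accumulated micro-batches behave like one batch of 16 can be checked on a toy problem. The sketch below uses a one-parameter linear model with mean squared error, chosen only because its gradient is easy to write by hand; `grad_mse` and the data are made up for illustration. Each micro-batch gradient is divided by the number of accumulation steps so the sum equals the full-batch average.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(float(i), 3.0 * i + 1.0) for i in range(16)]  # 16 toy examples
w = 0.5
micro_batch_size, accumulation_steps = 4, 4

accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro = data[start:start + micro_batch_size]
    # Scale each micro-batch so the total is an average, not a sum.
    accumulated += grad_mse(w, micro) / accumulation_steps

full_batch = grad_mse(w, data)
print(accumulated, full_batch)
```

For language models the match is only approximate when micro-batches contain different numbers of target tokens, since the loss is averaged per token.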

### Epochs: 3

**What it controls:** How many times the model sees the entire dataset.

- **Epoch 1:** Sees all 15,694 examples → learns basic patterns
- **Epoch 2:** Sees all again → refines understanding
- **Epoch 3:** Sees all again → final tuning

**Why 3?**
- 1 epoch: Risk of underfitting (hasn't seen enough)
- 3 epochs: Sweet spot (learns patterns without memorizing)
- 10 epochs: Risk of overfitting (memorizes training data, fails on new data)

### Warmup Ratio: 0.1 (10%)

**What it controls:** For the first 10% of training steps, the learning rate starts at 0
and gradually ramps up to the full rate.

**Why warmup?**
- At the start, the adapters are untrained and the model hasn't adapted to tool-calling yet
- Large early updates could push weights in random bad directions
- Warmup lets the model "get its bearings" first

### Cosine LR Schedule

After warmup, the learning rate decays along a cosine curve:
```
Learning Rate
    │
2e-4│    ╱──╲
    │   ╱    ╲
    │  ╱      ╲
    │ ╱        ╲
 0  │╱          ╲────
    └─────────────────
     warmup      end
```

**Why cosine?**
- High right after warmup: aggressive learning once the model has a basic understanding
- Low at the end: fine-tuning details, settling into good weights
- Prevents overshooting at the end of training
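
Warmup and cosine decay combine into one function of the step number. The sketch below assumes linear warmup followed by a half-cosine decay to zero, which is the usual shape of a cosine-with-warmup schedule. The function name `lr_at_step` and the 1,000-step run are illustrative choices, not values from this project.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative run length
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total))
```

The peak is reached exactly when warmup ends, and the rate falls to half the peak at the midpoint of the decay phase.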

### Max Sequence Length: 2048 tokens

**What it controls:** Maximum number of tokens per training example.

```
Example conversation:
  System prompt:   ~500 tokens
  User message:    ~100 tokens
  Assistant reply: ~300 tokens
  Total:           ~900 tokens  ← Fits in 2048 ✓
```

**Why 2048?**
- Covers all our examples (most are under 1000 tokens)
- Fits in T4 memory (longer sequences = more memory)
- Standard for instruction-tuned models

### Gradient Checkpointing: ON

**What it does:** Saves memory by recomputing some values during backward pass.

```
Without checkpointing:
  Forward pass:  Store all intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass:  Store only SOME activations
  Backward pass: Recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)
```

**Trade-off:** Slower (needs extra computation) but fits on T4.

---

## 📊 Reading Training Logs

### What You'll See

The log below shows the general shape of the output. Its step total and values are illustrative; with the settings in this chapter a full run has roughly 2,940 steps, not 245.

```
Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05
```

### How to Interpret

| Observation | Meaning |
|-------------|---------|
| Loss going DOWN | ✅ Model is learning |
| Eval loss going UP while train loss keeps falling | ⚠️ Overfitting: stop early |
| Loss stuck at ~3.0 | ❌ Not learning: check data/format |
| Loss drops fast then plateaus | ✅ Normal: model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting: model memorized training data |
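
A quick way to apply these checks is to pull the numbers out of the log and test the trend. The sketch below parses the illustrative line format shown in this chapter. That format is an assumption, since real trainer output may look different, and `parse_log` and `loss_is_decreasing` are hypothetical helpers.

```python
import re

# Matches lines like: "Step 10/245: loss=2.847, learning_rate=2.0e-05"
LOG_PATTERN = re.compile(
    r"Step (\d+)/(\d+): loss=([\d.]+), learning_rate=([\d.eE+-]+)"
)


def parse_log(text):
    """Extract step, total, loss, and learning rate from each matching line."""
    rows = []
    for match in LOG_PATTERN.finditer(text):
        step, total, loss, lr = match.groups()
        rows.append(
            {"step": int(step), "total": int(total), "loss": float(loss), "lr": float(lr)}
        )
    return rows


def loss_is_decreasing(rows):
    """True if every logged loss is lower than the one before it."""
    losses = [row["loss"] for row in rows]
    return all(later < earlier for earlier, later in zip(losses, losses[1:]))


sample = """\
Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
Step 100/245: loss=1.234, learning_rate=1.8e-04
Step 245/245: loss=0.876, learning_rate=1.2e-05
"""
rows = parse_log(sample)
print(len(rows), loss_is_decreasing(rows))
```

Real loss curves are noisy, so a strict step-by-step decrease is too demanding in practice; comparing averages over a window of steps is a gentler check.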

### Target Numbers (for reference)

- **Initial loss:** ~2.5-3.5 (the pretrained model already predicts text reasonably well but hasn't seen our tool-call format; purely random guessing over the vocabulary would give a loss above 11)
- **Final loss:** ~0.8-1.2 (decent learning on 16K examples)
- **Eval loss:** Should be within 0.1-0.3 of train loss

---

## 🧮 Training Math

### How Long Does It Take?

```
Dataset:               15,694 examples
Batch size:            4 (per device)
Gradient accumulation: 4 steps
Effective batch:       4 × 4 = 16

Steps per epoch:        15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4): ~2-3 seconds
Total time:         2,940 × 2.5s = ~7,350s = ~2 hours
```

### Cost Calculation

```
T4 GPU on HF Jobs: ~$0.60/hour
Training time:     ~2 hours
Total cost:        $0.60 × 2 = $1.20
```

Well under $10! ✅
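
The same estimate can be wrapped in a small function to try other settings. This sketch plugs in the figures quoted above: 2.5 seconds per step and $0.60 per hour are the chapter's rough estimates, not measurements. It assumes the final partial batch counts as a full step, and `training_estimate` is a made-up name.

```python
import math


def training_estimate(num_examples, per_device_batch, grad_accum, epochs,
                      seconds_per_step, dollars_per_hour):
    """Estimate optimizer steps, wall-clock hours, and cost for one run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)  # partial batch counts
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "hours": hours,
        "cost": hours * dollars_per_hour,
    }


est = training_estimate(15_694, 4, 4, 3, seconds_per_step=2.5, dollars_per_hour=0.60)
print(est)
```

Without rounding, the result is 2,943 steps and about $1.23, in line with the rounded figures above.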

---

## 🎓 Summary: Key Training Concepts

| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **LoRA** | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
| **SFT** | Teaching model with input→output examples | Gives model tool-calling knowledge |
| **Loss** | Measure of how wrong predictions are | Lower = better learning |
| **Learning Rate** | Size of weight updates | Too high = chaos, too low = slow |
| **Batch Size** | Examples per weight update | More = stable gradients, needs more memory |
| **Gradient Accumulation** | Simulates larger batch sizes | Memory-saving trick |
| **Epochs** | Times model sees full dataset | 3 is our sweet spot |
| **Warmup** | Gradual LR increase at start | Prevents early instability |
| **Cosine Schedule** | LR high→low curve | Aggressive early, gentle end |
| **Gradient Checkpointing** | Recompute activations | Saves ~40% memory |

---

## 🔜 Next Step

Read `05-dataset.md` to understand our training data: what we have, what's missing, and how to make it better.