
04 - Training Explained: LoRA, SFT & Hyperparameters

🎓 Why This Chapter Matters

This is where we answer: "How do we actually teach the model to use tools?"

By the end of this chapter, you'll understand:

  • What LoRA is and why it's magical for budget training
  • What SFT does step-by-step
  • What each hyperparameter controls
  • How to read training logs and know if it's working

🧠 Concept 1: Why Can't We Just Use the Base Model?

Qwen3-1.7B is already a great model. It can chat, answer questions, write code. But it doesn't know how to use tools in a structured way.

What Base Models Know

Base model Qwen3-1.7B:

  • ✅ Understands English, can chat
  • ✅ Can write Python code
  • ✅ Can answer questions about the world
  • ❌ Doesn't know about your specific tool schemas
  • ❌ Doesn't output tool calls in correct JSON-RPC format
  • ❌ Doesn't plan multi-step tool chains
  • ❌ Doesn't ask clarifying questions
  • ❌ Doesn't refuse dangerous requests

What Fine-Tuning Adds

After training on 15,694 tool-calling examples:

  • ✅ Understands tool schemas ("Here's what this tool needs")
  • ✅ Generates correct JSON-RPC tool calls
  • ✅ Plans multi-step sequences ("First A, then B using A's result")
  • ✅ Asks when info is missing
  • ✅ Refuses harmful operations

Think of it like this:

  • Base model = A smart person who knows how to talk but doesn't know your tools
  • Fine-tuned model = The same person after reading 15,000 instruction manuals

🧠 Concept 2: LoRA - The Magic of Cheap Fine-Tuning

The Problem: Full Fine-Tuning Is Expensive

To fine-tune all ~1.7 billion parameters of Qwen3-1.7B (rounded up to 2B below to keep the arithmetic simple):

| Component | Size | Why |
|---|---|---|
| Model weights | 4 GB | 2B params × 2 bytes (fp16) |
| Gradients | 4 GB | One fp16 gradient for every parameter |
| Optimizer states | 16 GB | Adam keeps 2 fp32 values per param (2B × 2 × 4 bytes) |
| Total | 24 GB | Doesn't fit on T4 (16GB)! |

You'd need a larger GPU such as an A100 (80GB), which costs roughly $3-4/hour.
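
The table's arithmetic can be reproduced in a few lines. This is a rough sketch that assumes the rounded 2B parameter count, fp16 weights and gradients, and two fp32 Adam values per parameter; it ignores activations and framework overhead, so real usage would be higher.

```python
# Rough memory estimate for full fine-tuning.
# Assumptions: 2B params, fp16 weights/gradients, two fp32 Adam states per param.
PARAMS = 2_000_000_000
GB = 1_000_000_000  # decimal gigabytes, matching the table's round numbers

weights_gb = PARAMS * 2 / GB        # 2 bytes per fp16 weight
gradients_gb = PARAMS * 2 / GB      # one fp16 gradient per weight
optimizer_gb = PARAMS * 2 * 4 / GB  # Adam: momentum + variance, 4 bytes each
total_gb = weights_gb + gradients_gb + optimizer_gb

print(f"weights={weights_gb:.0f} GB, gradients={gradients_gb:.0f} GB, "
      f"optimizer={optimizer_gb:.0f} GB, total={total_gb:.0f} GB")
print("fits on a 16 GB T4:", total_gb <= 16)
```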

The Solution: LoRA (Low-Rank Adaptation)

Instead of updating ALL parameters, we add tiny matrices to each layer:

Original Layer (Frozen: Never Changes)
┌─────────────────────────────┐
│  W (2048 × 2048) = 4.2M     │  ← 4 MILLION parameters
│  parameters                 │     These stay FROZEN
└─────────────────────────────┘
         │
         │ input x
         ▼
    y = W × x
         │
         ▼
    output

LoRA Adapters (Trainable: These Learn)
┌─────────────────────┐    ┌─────────────────────┐
│  A (16 × 2048)      │───▶│  B (2048 × 16)      │
│  = 32K params       │    │  = 32K params       │
│  (initialized       │    │  (initialized to 0) │
│   randomly)         │    │                     │
└─────────────────────┘    └─────────────────────┘
         │                         │
         ▼                         ▼
    h = A × x                  y' = B × h
                                   = B × (A × x)

Final Output:
y = W × x + B × A × x
    ↑           ↑
  frozen     trained

Math:

  • Original: W is 2048×2048 = 4,194,304 parameters
  • LoRA: A is 16×2048 = 32,768, B is 2048×16 = 32,768
  • Total LoRA: 65,536 parameters (1.6% of original!)
  • Memory for training: ~5GB total (fits on T4!)
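
A small NumPy sketch of the same idea. It uses the standard LoRA shape convention (A is r × d and B is d × r, so that B × (A × x) is well defined) with random stand-in values; real adapters also scale the update by lora_alpha / r, which is left out here to match the formula above.

```python
import numpy as np

d, r = 2048, 16                          # hidden size and LoRA rank from the text
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight (stand-in values)
A = rng.standard_normal((r, d)) * 0.01   # trainable, random init
B = np.zeros((d, r))                     # trainable, zero init
x = rng.standard_normal(d)               # one input vector

y_base = W @ x
y_lora = W @ x + B @ (A @ x)             # y = W x + B A x

# Because B starts at zero, the adapter changes nothing before training.
print("identical before training:", np.allclose(y_base, y_lora))

lora_params = A.size + B.size
print(f"LoRA params: {lora_params:,} ({lora_params / W.size:.1%} of W)")
```

Starting B at zero means training begins from exactly the base model's behavior, and only the 65,536 adapter values receive gradients.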

Why This Works

The idea: the weight changes needed during fine-tuning often have low-rank structure. Even though W is 2048×2048, the "important directions" of change can be captured by much smaller matrices.

Think of it like adjusting a steering wheel:

  • Full fine-tuning = Rebuilding the entire car to turn better
  • LoRA = Adding a small steering adjustment module (tiny, cheap, effective)

Our LoRA Configuration

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                    # Rank: "resolution" of the adapter
    lora_alpha=32,          # Scaling: how strongly LoRA affects output
    target_modules="all-linear",  # Apply to ALL linear layers
    lora_dropout=0.05,      # Dropout: 5% random zeroing (prevents overfitting)
    bias="none",             # Don't train bias terms (saves memory)
    task_type="CAUSAL_LM",   # This is a language model
)

r=16: Think of this as the "resolution." Higher = more detail but more memory. For ~16K training examples, r=16 is the sweet spot. (TinyAgent used r=64 for 80K examples)

lora_alpha=32: Scaling factor. Rule of thumb: 2× rank. The LoRA output is multiplied by lora_alpha / r (here 32 / 16 = 2), so this controls how much the adapter contributes to the final result.

target_modules="all-linear": The "LoRA Without Regret" write-up found that applying LoRA to ALL linear layers (not just attention projections) can match full fine-tuning quality in its experiments. This is our secret sauce.


🧠 Concept 3: SFT - Supervised Fine-Tuning

What Is SFT?

SFT = teaching by example. We show the model:

Input:  "Find all Python files"
Output: {"tool": "shell_exec", "arguments": {"command": "find . -name '*.py'"}}

Input:  "Delete all files"
Output: "I cannot help with that. Deleting all files is dangerous..."

Input:  "Clone the repo and find TODOs"
Output: {"tool": "shell_exec", "arguments": {"command": "git clone https://... && grep -r 'TODO' ."}}

The model learns to predict the output given the input.

How SFT Works Step by Step

Step 1: Tokenize

Convert text → numbers:

"Find Python files"
↓ Tokenizer
[4921, 12729, 4367, 8921, 1023]

Each number is an index into the tokenizer's vocabulary, which has roughly 150,000 entries for Qwen3. (The IDs shown here are made up for illustration.)

Step 2: Forward Pass

The model processes the tokenized input and predicts the next token at EACH position:

Input tokens:  [4921, 12729, 4367, 8921]
                                    │
Predictions:   [?,    ?,    ?,    ?  ] ──▶ next token should be 1023

At each position the model outputs a probability distribution over all ~150,000 possible tokens.

Step 3: Compute Loss (Cross-Entropy)

Predicted probabilities:  [0.01, 0.03, 0.001, ..., 0.45, ..., 0.002]
                              ↑                    ↑
                           wrong                 correct (1023)

Loss = -log(probability_of_correct_token)
     = -log(0.45)
     = 0.80

Lower loss = better prediction.

If the model predicted token 1023 with probability 0.45, loss is 0.80. If it predicted with probability 0.99, loss is 0.01 (much better!).
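
The two loss values above can be checked directly. This is a minimal sketch for a single position; the function name is only for illustration, and real training averages this quantity over every target token in the batch.

```python
import math

def cross_entropy(prob_of_correct_token: float) -> float:
    """Loss for one position: negative log of the probability given to the right token."""
    return -math.log(prob_of_correct_token)

for p in (0.45, 0.99):
    print(f"p={p:.2f} -> loss={cross_entropy(p):.2f}")
```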

Step 4: Backward Pass (Backpropagation)

Compute gradients: which direction to adjust weights to reduce loss.

For each LoRA parameter:
  gradient = how much changing this parameter would change the loss

This is done automatically by PyTorch's autograd.

Step 5: Update Weights (Adam Optimizer)

new_weight = old_weight - learning_rate × gradient

That is the plain gradient-descent rule. Adam is smarter: it uses momentum and adaptive learning rates per parameter.
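
To make the difference concrete, here is a single-parameter sketch of both update rules. The Adam defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used values and are an assumption here; real training applies the same logic to every LoRA parameter through the optimizer library.

```python
import math

def sgd_step(w, grad, lr):
    """Plain gradient descent: the update rule shown above."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; returns (new_w, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = sgd_step(1.0, grad=0.5, lr=2e-4)
w_adam, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=2e-4)
print(f"SGD:  {w_sgd:.6f}")
print(f"Adam: {w_adam:.6f}")
```

The running values m and v are the two extra values per parameter that make up the optimizer-state row in the memory table earlier in this chapter.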

Step 6: Repeat

Do this for ALL examples in the dataset, then repeat for 3 epochs.


🧠 Concept 4: Hyperparameters - The Recipe

Think of training as cooking. Hyperparameters are your recipe.

Learning Rate: 2e-4

What it controls: How big each weight update step is.

Learning Rate
    │
1e-2│  ╳─── Too high: loss oscillates, model never settles
    │   │
2e-4│    ●── Sweet spot for LoRA (10× higher than full fine-tuning)
    │      ╲
1e-5│       ╲── Too low: barely moves, takes forever
    │         ╲
    └─────────────────
         Steps

Why 2e-4 for LoRA?

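  • Full fine-tuning typically uses around 2e-5
  • LoRA trains roughly 100× fewer parameters, and the adapters start with zero effect (B = 0)
  • As a rule of thumb, LoRA works best with a learning rate about 10× higher than full fine-tuning
  • So: 2e-5 × 10 = 2e-4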

Batch Size: 4 × 4 = 16 Effective

What it controls: How many examples the model sees before updating weights.

Without Gradient Accumulation:
  Process 4 examples → Compute gradients → Update weights → Next 4

With Gradient Accumulation (what we do):
  Process 4 examples → Compute gradients → SAVE gradients (don't update)
  Process 4 examples → Compute gradients → ADD to saved gradients
  Process 4 examples → Compute gradients → ADD to saved gradients
  Process 4 examples → Compute gradients → ADD to saved gradients
  Now update weights (accumulated from 4 × 4 = 16 examples)

Why gradient accumulation?

  • GPU can only fit 4 examples at once (memory limit)
  • But effective batch of 16 gives more stable gradients
  • It's a memory-saving trick

Trade-off: Each weight update takes 4 forward/backward passes instead of 1, so updates happen less often, but each one uses a more stable gradient.
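
A toy check that accumulation really reproduces the larger batch. This sketch uses a made-up one-weight model (y = w × x) with mean squared error, not the real training loop; each micro-batch gradient is divided by the number of accumulation steps, which is how the accumulated sum ends up equal to the average over all 16 examples.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [float(i) for i in range(1, 17)]   # 16 toy inputs
ys = [3.0 * x for x in xs]              # targets from a "true" weight of 3

# One big batch of 16 (what we would do with enough memory).
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 4, gradients accumulated before a single update.
micro_batch, accum_steps = 4, 4
accumulated = 0.0
for step in range(accum_steps):
    lo, hi = step * micro_batch, (step + 1) * micro_batch
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(full_grad, accumulated)
```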

Epochs: 3

What it controls: How many times the model sees the entire dataset.

Epoch 1: Sees all 15,694 examples → learns basic patterns
Epoch 2: Sees all again → refines understanding
Epoch 3: Sees all again → final tuning

Why 3?

  • 1 epoch: Underfitting (hasn't seen enough)
  • 3 epochs: Sweet spot (learns patterns without memorizing)
  • 10 epochs: Overfitting (memorizes training data, fails on new data)

Warmup Ratio: 0.1 (10%)

What it controls: For the first 10% of training, learning rate starts at 0 and gradually ramps up to the full rate.

Why warmup?

  • At the start, the model knows NOTHING about tool-calling
  • Large updates could push weights in random bad directions
  • Warmup lets the model "get its bearings" first

Cosine LR Schedule

After warmup, learning rate follows a cosine curve:

Learning Rate
    │
2e-4│    ╱──╲
    │   ╱    ╲
    │  ╱      ╲
    │ ╱        ╲
   0│╱          ╲────
    └─────────────────
      warmup    end

Why cosine?

  • High right after warmup: aggressive learning once the model has a basic understanding
  • Low at the end: fine-tuning details, settling into optimal weights
  • Prevents overshooting at the end of training
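
The warmup and cosine phases can be written as one small function. This is a sketch of the shape described above (linear ramp from 0, then cosine decay to 0), not the exact code of any training library; the function name and the 2,940-step total (taken from the Training Math section) are used for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 2940  # ~980 steps per epoch x 3 epochs
for step in (0, 147, 294, 1617, 2940):
    print(f"step {step:>4}: lr = {lr_at(step, total):.2e}")
```

The rate peaks exactly when warmup ends (step 294 here) and is already back to half the peak by the midpoint of the decay.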

Max Sequence Length: 2048 tokens

What it controls: Maximum number of tokens per training example.

Example conversation:
  System prompt:  ~500 tokens
  User message:   ~100 tokens
  Assistant reply: ~300 tokens
  Total:          ~900 tokens  ← Fits in 2048 ✓

Why 2048?

  • Covers all our examples (most are under 1000 tokens)
  • Fits in T4 memory (longer sequences = more memory)
  • Standard for instruction-tuned models

Gradient Checkpointing: ON

What it does: Saves memory by recomputing some values during backward pass.

Without checkpointing:
  Forward pass: Store all intermediate activations → Backward pass uses them
  Memory: 8 GB

With checkpointing:
  Forward pass: Store only SOME activations
  Backward pass: Recompute missing ones on-the-fly
  Memory: 5 GB (saves ~40%)

Trade-off: Slower (needs extra computation) but fits on T4.
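
For reference, here are the settings from this section collected in one place. The keys follow argument names commonly used with `transformers.TrainingArguments`; the actual training script is not shown in this chapter, so treat the dict as an assumed sketch rather than the project's real configuration. The sequence-length option is kept separate because its name varies between trainer versions.

```python
# Hyperparameters from this chapter, keyed by transformers.TrainingArguments-style names.
# Assumption: the real training script passes something equivalent to its trainer.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "cosine",
    "gradient_checkpointing": True,
}
max_sequence_length = 2048  # option name depends on the trainer version in use

effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
print("effective batch size:", effective_batch)
```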


📊 Reading Training Logs

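What You'll See

The log below is illustrative: the step counter and values depend on your dataset size and batch settings, so the total will not necessarily be 245 (the Training Math section estimates ~2,940 steps for the full run).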

Step 10/245: loss=2.847, learning_rate=2.0e-05
Step 20/245: loss=2.654, learning_rate=4.0e-05
...
Step 100/245: loss=1.234, learning_rate=1.8e-04
...
Step 245/245: loss=0.876, learning_rate=1.2e-05

How to Interpret

| Observation | Meaning |
|---|---|
| Loss going DOWN | ✅ Model is learning |
| Loss going UP after going down | ⚠️ Training is diverging or overfitting: stop early and check the learning rate |
| Loss stuck at ~3.0 | ❌ Not learning: check data/format |
| Loss drops fast then plateaus | ✅ Normal: model learned basics |
| Eval loss ≈ Train loss | ✅ Good generalization |
| Eval loss >> Train loss | ❌ Overfitting: model memorized training data |

Target Numbers (for reference)

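  • Initial loss: ~2.5-3.5 (the pretrained model already predicts text well but has not learned the tool-call format; truly random guessing over ~150,000 tokens would give a loss near 11.9)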
  • Final loss: ~0.8-1.2 (decent learning on 16K examples)
  • Eval loss: Should be within 0.1-0.3 of train loss
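
Two quick sanity checks for these numbers, as a hedged sketch. Both function names are hypothetical helpers, the 150,000 vocabulary size is an approximation, and the loss values passed in are made up for the example rather than taken from a real run.

```python
import math

def uniform_guess_loss(vocab_size: int) -> float:
    """Cross-entropy of a model that spreads probability evenly over the vocabulary."""
    return math.log(vocab_size)

def check_gap(train_loss: float, eval_loss: float, max_gap: float = 0.3) -> str:
    """Compare final train and eval loss using the 0.1-0.3 rule of thumb above."""
    gap = eval_loss - train_loss
    if gap > max_gap:
        return f"gap {gap:.2f}: eval loss well above train loss, likely overfitting"
    return f"gap {gap:.2f}: within the expected range"

print(f"random-guess loss: {uniform_guess_loss(150_000):.1f}")
print(check_gap(0.88, 1.02))
print(check_gap(0.40, 1.10))
```

A starting loss near the random-guess value would point to a broken setup (wrong tokenizer or chat template) rather than a model that simply has not seen tool calls yet.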

🧮 Training Math

How Long Does It Take?

Dataset: 15,694 examples
Batch size: 4 (per device)
Gradient accumulation: 4 steps
Effective batch: 4 × 4 = 16

Steps per epoch: 15,694 ÷ 16 = ~980 steps
Total steps (3 epochs): 980 × 3 = ~2,940 steps

Time per step (T4): ~2-3 seconds
Total time: 2,940 × 2.5s = ~7,350s = ~2 hours

Cost Calculation

T4 GPU on HF Jobs: ~$0.60/hour
Training time: ~2 hours
Total cost: $0.60 × 2 = $1.20

Well under $10! ✅
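
The same estimate as a script, so you can plug in your own dataset size or GPU price. It assumes the last partial batch of each epoch counts as a step (hence the ceiling) and uses the 2.5 s midpoint of the per-step estimate, so the exact figures come out slightly above the rounded ones in the text.

```python
import math

examples = 15_694
per_device_batch = 4
grad_accum = 4
epochs = 3
seconds_per_step = 2.5   # midpoint of the ~2-3 s estimate for a T4
price_per_hour = 0.60    # approximate T4 price quoted above

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(examples / effective_batch)  # last partial batch counts
total_steps = steps_per_epoch * epochs
hours = total_steps * seconds_per_step / 3600
cost = hours * price_per_hour

print(f"{steps_per_epoch} steps/epoch, {total_steps} total steps")
print(f"~{hours:.1f} hours, ~${cost:.2f}")
```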


🎓 Summary: Key Training Concepts

| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA | Tiny trainable matrices added to frozen layers | Makes training affordable (5GB vs 24GB) |
| SFT | Teaching model with input→output examples | Gives model tool-calling knowledge |
| Loss | Measure of how wrong predictions are | Lower = better learning |
| Learning Rate | Size of weight updates | Too high = chaos, too low = slow |
| Batch Size | Examples per weight update | More = stable gradients, needs more memory |
| Gradient Accumulation | Fake larger batch sizes | Memory-saving trick |
| Epochs | Times model sees full dataset | 3 is sweet spot |
| Warmup | Gradual LR increase at start | Prevents early instability |
| Cosine Schedule | LR high→low curve | Aggressive after warmup, gentle end |
| Gradient Checkpointing | Recompute activations | Saves ~40% memory |

🔜 Next Step

Read 05-dataset.md to understand our training data: what we have, what's missing, and how to make it better.