Master Deep Learning
from Neurons to Production
A comprehensive, hands-on reference covering neural network theory, architectures, training techniques, and real-world deployment — from first principles to state-of-the-art models.
Architecture Comparison
| Architecture | Best For | Key Innovation | Parameters | Year |
|---|---|---|---|---|
| MLP | Tabular data, classification | Universal approximator | Thousands | 1986 |
| CNN | Image, video, audio spectrograms | Weight sharing, local connectivity | Millions | 1998 |
| LSTM | Time series, NLP sequences | Gated memory cells | Millions | 1997 |
| Transformer | NLP, vision, multimodal | Self-attention, parallelisation | Billions | 2017 |
| GAN | Image synthesis, data augmentation | Adversarial training | Millions–billions | 2014 |
| Diffusion | Image/video/audio generation | Denoising score matching | Billions | 2020 |
Mathematical Foundations
The core maths every deep learning practitioner must understand — from tensors to gradients.
Tensors
Tensors are the fundamental data structure in deep learning — generalisations of scalars, vectors, and matrices to arbitrary dimensions (ranks).
| Rank | Name | Example Shape | DL Use |
|---|---|---|---|
| 0 | Scalar | () | Loss value, learning rate |
| 1 | Vector | (512,) | Embedding, bias |
| 2 | Matrix | (64, 512) | Weight matrix, batch |
| 3 | 3D Tensor | (32, 128, 512) | Batch of sequences |
| 4 | 4D Tensor | (32, 3, 224, 224) | Batch of images (NCHW) |
Essential Operations
Matrix multiplication: C = A @ B → shape (m,n) @ (n,p) = (m,p)
import torch

# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from list
zeros = torch.zeros(3, 4)                    # shape (3,4)
rand = torch.randn(32, 512)                  # normal dist

# Fundamental ops
W = torch.randn(512, 256)
b = torch.zeros(256)
out = rand @ W + b  # (32,512)@(512,256)+(256,) → (32,256)

# Reshape, transpose, squeeze
t = torch.arange(24).reshape(2, 3, 4)
t_T = t.transpose(1, 2)  # (2,4,3)
flat = t.flatten(1)      # (2,12)

# GPU transfer
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
The Chain Rule — Heart of Backprop
For a composition f(g(x)):
df/dx = (df/dg) · (dg/dx)
Gradient Descent Update
θ ← θ − η·∇_θ L
where η = learning rate, ∇_θ L = gradient of loss w.r.t. θ
Partial Derivatives in Layers
For a linear layer y = Wx + b and loss L:
∂L/∂W = (∂L/∂y) · xᵀ
∂L/∂b = ∂L/∂y
∂L/∂x = Wᵀ · (∂L/∂y)
The Jacobian
For vector → vector functions, the Jacobian Jᵢⱼ = ∂yᵢ/∂xⱼ has shape (dim_y × dim_x).
import torch

# Automatic differentiation with requires_grad
x = torch.tensor([2.0, 3.0], requires_grad=True)
W = torch.randn(2, 2, requires_grad=True)

# Forward pass — builds computation graph
y = x @ W        # (2,) @ (2,2) → (2,)
loss = y.sum()   # scalar loss

# Backward pass — computes gradients via chain rule
loss.backward()
print(x.grad)  # ∂loss/∂x
print(W.grad)  # ∂loss/∂W

# Manual gradient descent step
with torch.no_grad():
    W -= 0.01 * W.grad  # gradient descent step
    W.grad.zero_()      # must zero before next backward()
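As a sanity check, the analytic linear-layer gradient formulas above can be compared against autograd; a small sketch (shapes chosen arbitrarily, no bias for brevity):

```python
import torch

# Verify ∂L/∂x = Wᵀ·(∂L/∂y) and ∂L/∂W = (∂L/∂y)·xᵀ against autograd
x = torch.randn(4, requires_grad=True)
W = torch.randn(3, 4, requires_grad=True)
y = W @ x        # linear layer, y = Wx
loss = y.sum()   # scalar loss → ∂L/∂y is all-ones
loss.backward()

grad_y = torch.ones(3)
assert torch.allclose(x.grad, W.detach().T @ grad_y)            # ∂L/∂x
assert torch.allclose(W.grad, torch.outer(grad_y, x.detach()))  # ∂L/∂W
```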
Key Distributions in DL
| Distribution | Use in DL |
|---|---|
| Normal N(μ,σ²) | Weight init, noise injection, VAE latent |
| Bernoulli | Binary classification output, dropout |
| Categorical | Multi-class softmax output, token prediction |
| Uniform | Xavier init, random sampling |
| Dirichlet | Topic models, mixture models |
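The table above maps directly onto `torch.distributions`; a quick sketch of sampling and scoring with the most common ones:

```python
import torch
from torch import distributions as D

normal = D.Normal(loc=0.0, scale=1.0)        # weight init noise, VAE prior
bern = D.Bernoulli(probs=0.5)                # dropout masks, binary labels
cat = D.Categorical(logits=torch.randn(10))  # softmax / token prediction

z = normal.sample((3,))      # three draws from N(0, 1)
mask = bern.sample((5,))     # 0/1 mask, dropout-style
token = cat.sample()         # a class index in [0, 10)
log_p = cat.log_prob(token)  # log-likelihood of that index
```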
Loss Functions as Likelihoods
Cross-entropy = −log P(true class | input)
MSE = maximum likelihood estimation under a Gaussian noise assumption
Information Theory
Entropy H(X): measures uncertainty / information content
Mutual information I(X; Y): how much Y tells us about X
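The cross-entropy-as-likelihood identity can be checked in a couple of lines of PyTorch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one sample, three classes
target = torch.tensor([0])                 # true class index

ce = F.cross_entropy(logits, target)        # library cross-entropy
nll = -F.log_softmax(logits, dim=-1)[0, 0]  # −log P(true class | input)
assert torch.allclose(ce, nll)              # identical by definition
```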
Neural Networks
From the biological neuron to deep multi-layer perceptrons — theory, math, and interactive visualisation.
Architecture
1. Input Layer: receives raw features. No computation — passes values forward. Each node = one feature.
2. Hidden Layers: each neuron computes z = Wx + b, then applies activation σ(z). Multiple hidden layers = "deep" network.
3. Output Layer: produces predictions. Activation depends on task: sigmoid (binary), softmax (multiclass), linear (regression).
4. Forward Pass: data flows input→output. Loss is computed comparing prediction to ground truth.
5. Backpropagation: gradients flow output→input via chain rule. Each weight updated: w ← w − η·∂L/∂w.
Activation Functions
| Function | Formula | Range | Use Case | Drawback |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0,1) | Binary output | Vanishing gradient |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | Hidden layers (old) | Vanishing gradient |
| ReLU | max(0, x) | [0,∞) | Most hidden layers | Dying ReLU |
| Leaky ReLU | max(αx, x) | (-∞,∞) | Fixes dying ReLU | Extra hyperparameter |
| GELU | x·Φ(x) | ≈(-0.17,∞) | Transformers (BERT, GPT) | More compute |
| Swish | x·sigmoid(x) | (-∞,∞) | EfficientNet | More compute |
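Every entry in the table is a one-liner in PyTorch; a sketch evaluating them on the same inputs:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
acts = {
    "sigmoid": torch.sigmoid(x),          # (0, 1)
    "tanh": torch.tanh(x),                # (-1, 1)
    "relu": F.relu(x),                    # [0, ∞)
    "leaky_relu": F.leaky_relu(x, 0.01),  # negative slope α = 0.01
    "gelu": F.gelu(x),                    # x·Φ(x)
    "swish": x * torch.sigmoid(x),        # a.k.a. SiLU
}
assert (acts["relu"] >= 0).all()
assert torch.allclose(acts["swish"], F.silu(x))  # swish == SiLU
```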
Complete MLP Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ─── Define MLP ───────────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(dims) - 1):
            layers += [
                nn.Linear(dims[i], dims[i+1]),
                nn.BatchNorm1d(dims[i+1]),  # normalise activations
                nn.GELU(),                  # smooth non-linearity
                nn.Dropout(dropout),        # regularisation
            ]
        layers.append(nn.Linear(dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# ─── Training Loop ─────────────────────────────────────────
def train(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()        # clear previous gradients
        logits = model(X)            # forward pass
        loss = criterion(logits, y)  # compute loss
        loss.backward()              # backpropagation
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
        optimizer.step()             # update weights
        total_loss += loss.item()
    return total_loss / len(loader)

# ─── Instantiate and run ────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(input_dim=784, hidden_dims=[512, 256, 128], output_dim=10).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()
Convolutional Neural Networks
Spatial pattern recognition through learned filters — the foundation of computer vision.
Core Concepts
Convolution output size = ⌊(N + 2P − F)/S⌋ + 1
where N = input size, P = padding, F = filter size, S = stride
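The output-size formula can be verified directly against nn.Conv2d (the dimensions here are just an example):

```python
import torch
import torch.nn as nn

N, P, F_, S = 224, 1, 3, 2  # input, padding, filter, stride
conv = nn.Conv2d(3, 16, kernel_size=F_, stride=S, padding=P)
out = conv(torch.randn(1, 3, N, N))

expected = (N + 2 * P - F_) // S + 1  # ⌊(N + 2P − F)/S⌋ + 1 = 112
assert out.shape == (1, 16, expected, expected)
```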
Architecture Milestones
ResNet Skip Connection
F(x) = Conv → BN → ReLU → Conv → BN
Output: y = F(x) + x (identity shortcut)
Gradient: ∂L/∂x = ∂L/∂y · (∂F/∂x + 1) — the identity term gives gradients a direct path back, preventing vanishing
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Shortcut if stride changes spatial dims
        self.shortcut = nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(channels)
        ) if stride != 1 else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # ← skip connection
        return self.relu(out)

# Transfer learning with pretrained ResNet
import torchvision.models as models
num_classes = 10  # set to your task's class count
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(2048, num_classes)  # replace head

# Freeze backbone, fine-tune head only
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.fc.parameters():
    p.requires_grad = True
RNNs & LSTMs
Modelling sequential dependencies — from simple recurrent nets to gated memory architectures.
Recurrent Networks
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = Wᵧ·hₜ + bᵧ
where hₜ = hidden state at time t, xₜ = input at time t
GRU (Gated Recurrent Unit)
zₜ = σ(Wz·[hₜ₋₁, xₜ]) — update gate
rₜ = σ(Wr·[hₜ₋₁, xₜ]) — reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) — candidate
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
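nn.GRU implements these equations directly; a minimal shape check (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(8, 20, 32)       # (batch, time, features)
out, h_n = gru(x)

assert out.shape == (8, 20, 64)  # hₜ at every timestep
assert h_n.shape == (1, 8, 64)   # final hidden state per layer
```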
LSTM Architecture
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — forget gate
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) — candidate
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ — cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — output gate
hₜ = oₜ⊙tanh(Cₜ)
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,  # forward + backward
            dropout=0.3
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),  # *2 for bidir
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths):
        emb = self.embedding(x)  # (B, T, E)
        # Pack for variable-length sequences
        packed = nn.utils.rnn.pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
        out, (hn, _) = self.lstm(packed)
        # Concat last forward + backward hidden states
        last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (B, H*2)
        return self.classifier(last_hidden)
Transformers
The architecture that redefined AI — self-attention, positional encoding, and the models built on top.
Self-Attention
Q = X·W^Q, K = X·W^K, V = X·W^V
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V
where dₖ = key dimension (scaling by √dₖ prevents vanishingly small softmax gradients)
Multi-Head Attention
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
MultiHead(Q, K, V) = Concat(head₁, …, headₕ)·W^O
Each head learns different relationship types
Encoder-Decoder Structure
1. Input Embedding + PE: token IDs → embeddings. Add sinusoidal or learnable positional encoding to inject sequence order.
2. Encoder Block: Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Repeated N times.
3. Decoder Block: Masked Self-Attention → Cross-Attention (attends to encoder) → FFN. Generates one token at a time.
4. Output Projection: Linear + Softmax over vocabulary. At inference: greedy / beam search / nucleus sampling.
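The sinusoidal positional encoding from step 1 can be sketched in a few lines (the 10000 base follows the original Transformer formulation):

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dims: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims: cosine
    return pe

pe = positional_encoding(128, 512)
assert pe.shape == (128, 512)
```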
Popular Variants
| Model | Type | Params | Key Use | Innovation |
|---|---|---|---|---|
| BERT | Encoder-only | 110M–340M | Classification, NER, QA | Masked language modelling (MLM) |
| GPT-4 | Decoder-only | ~1.8T | Text generation, chat | RLHF + MoE scaling |
| T5 | Encoder-Decoder | 11B | Summarisation, translation | Text-to-text framing |
| ViT | Encoder-only | 86M–632M | Image classification | Patch embeddings replace CNN |
| Llama 3 | Decoder-only | 8B–70B | Open-source LLM | GQA, RoPE, SwiGLU |
| Whisper | Encoder-Decoder | 39M–1.5B | Speech recognition | Multitask audio transformer |
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.h = num_heads
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, T, D = q.shape
        Q = self.Wq(q).view(B, T, self.h, self.d_k).transpose(1, 2)
        K = self.Wk(k).view(B, -1, self.h, self.d_k).transpose(1, 2)
        V = self.Wv(v).view(B, -1, self.h, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, T, D)
        return self.Wo(out), attn
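For comparison, PyTorch ships the same mechanism as nn.MultiheadAttention; a self-attention shape check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)
out, attn = mha(x, x, x)  # self-attention: q = k = v

assert out.shape == (2, 10, 512)
assert attn.shape == (2, 10, 10)  # attention weights, averaged over heads
```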
GANs & Generative Models
Adversarial training, VAEs, and diffusion — teaching machines to create.
GAN Framework
min_G max_D V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))]
Discriminator: D(x) → 1 for real, 0 for fake
Generator: G(z) tries to fool D so that D(G(z)) → 1
GAN Variants
| Variant | Innovation |
|---|---|
| DCGAN | Conv layers, batch norm — stable training |
| WGAN-GP | Wasserstein loss + gradient penalty |
| StyleGAN 3 | Alias-free generation, style mixing |
| CycleGAN | Unpaired image translation |
| Pix2Pix | Paired image-to-image translation |
VAE vs. GAN vs. Diffusion
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        def block(in_c, out_c, stride=2, padding=1):
            return [
                nn.ConvTranspose2d(in_c, out_c, 4, stride, padding, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(True)
            ]
        self.net = nn.Sequential(
            # 1×1 → 4×4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),
            *block(512, 256),  # 8×8
            *block(256, 128),  # 16×16
            *block(128, 64),   # 32×32
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()  # 64×64, range [-1,1]
        )

    def forward(self, z):
        return self.net(z.view(-1, z.shape[1], 1, 1))
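A minimal adversarial training step for the minimax objective, using toy stand-in networks (D and G here are illustrative placeholders, not the DCGAN modules above):

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(64, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))  # toy critic
G = nn.Sequential(nn.Linear(16, 64))                                       # toy generator
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

real = torch.randn(8, 64)
z = torch.randn(8, 16)

# 1) Discriminator step: push D(real) → 1, D(G(z)) → 0 (detach so G is untouched)
opt_d.zero_grad()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
loss_d.backward()
opt_d.step()

# 2) Generator step: fool D into D(G(z)) → 1 (non-saturating loss)
opt_g.zero_grad()
loss_g = bce(D(G(z)), torch.ones(8, 1))
loss_g.backward()
opt_g.step()
```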
Training Deep Learning Models
Optimisers, regularisation, hyperparameter tuning, and tricks that separate good models from great ones.
Optimisers
Regularisation Techniques
Effectiveness score (higher = more commonly beneficial across task types)
Batch vs. Layer Normalisation
| Type | Normalises Over | Best For |
|---|---|---|
| BatchNorm | Batch dimension | CNNs, large batches |
| LayerNorm | Feature dimension | Transformers, NLP, RNNs |
| GroupNorm | Groups of channels | Small batch sizes |
| RMSNorm | Feature dim (simpler) | Modern LLMs (Llama) |
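The "normalises over" column can be made concrete: LayerNorm zero-centres each position's feature vector, while BatchNorm zero-centres each channel across the batch (the shapes below are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)  # (batch, channels/positions, features)

ln = nn.LayerNorm(32)       # per-position, over the feature dim
y_ln = ln(x)
assert torch.allclose(y_ln.mean(-1), torch.zeros(8, 16), atol=1e-5)

bn = nn.BatchNorm1d(16)     # per-channel, over batch and length
y_bn = bn(x)                # input interpreted as (N, C, L)
assert torch.allclose(y_bn.mean(dim=(0, 2)), torch.zeros(16), atol=1e-5)
```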
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # handles FP16 loss scaling

def train_step(model, batch, optimizer, criterion):
    X, y = batch
    optimizer.zero_grad()

    with autocast():  # FP16 forward pass
        logits = model(X)
        loss = criterion(logits, y)

    scaler.scale(loss).backward()  # scaled gradients
    scaler.unscale_(optimizer)     # unscale before clip
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)         # update weights
    scaler.update()                # adjust scale factor
    return loss.item()

# LR warmup + inverse-sqrt decay (original Transformer schedule)
import math

def get_lr(step, d_model, warmup_steps):
    if step == 0:
        return 0.0
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return d_model ** -0.5 * scale  # original transformer formula
Deployment & MLOps
Getting models from experiment to production — export, serve, containerise, and monitor.
Export Pipeline
1. Train & Validate: achieve target metrics. Save checkpoint with torch.save() or Hugging Face safetensors.
2. Export to ONNX: torch.onnx.export() converts the model to a framework-agnostic graph for cross-platform inference.
3. Optimise with TensorRT: trtexec or ONNX-TensorRT converts ONNX to a TensorRT engine. 3–10× faster on NVIDIA GPUs.
4. Serve via FastAPI: wrap inference in a REST endpoint. Use ONNX Runtime for lightweight CPU/GPU serving.
5. Containerise: Docker image with model weights + FastAPI. Push to ECR / ACR / GCR and deploy to Kubernetes.
6. Monitor with MLflow: track experiments, model versions, metrics drift. Set up alerts for data/concept drift.
Optimisation Techniques
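One widely used optimisation is post-training dynamic quantisation, which converts Linear weights to int8 for faster CPU inference; a sketch with a toy placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(224, 128), nn.ReLU(), nn.Linear(128, 10))
quantised = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights, dynamic activation scales
)
out = quantised(torch.randn(1, 224))
assert out.shape == (1, 10)  # same interface, smaller and faster on CPU
```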
import torch
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

# ─── Export to ONNX ─────────────────────────────────────────
model.eval()  # `model` is the trained torch.nn.Module to export
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}}  # variable batch size
)

# ─── ONNX Runtime Inference ─────────────────────────────────
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=sess_opts,
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ─── FastAPI endpoint ────────────────────────────────────────
app = FastAPI(title="Deep Learning Model API")

class PredictRequest(BaseModel):
    image: list[list[list[list[float]]]]  # NCHW float array

@app.post("/predict")
async def predict(req: PredictRequest):
    x = np.array(req.image, dtype=np.float32)
    logits = session.run(["logits"], {"image": x})[0]
    z = logits - logits.max(-1, keepdims=True)  # stabilise softmax against overflow
    probs = np.exp(z) / np.exp(z).sum(-1, keepdims=True)
    top_k = np.argsort(probs[0])[::-1][:5]
    return {"top5_classes": top_k.tolist(), "probs": probs[0][top_k].tolist()}
Code Lab
Complete, production-quality code examples for the most common deep learning tasks.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ─── Data ────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST("./data", train=False, transform=transform)
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
test_dl = DataLoader(test_ds, batch_size=256)

# ─── Model ────────────────────────────────────────────────────
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),  # 28→14
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvNet().to(device)
opt = optim.AdamW(model.parameters(), lr=1e-3)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, steps_per_epoch=len(train_dl), epochs=10)
crit = nn.CrossEntropyLoss()

# ─── Train ────────────────────────────────────────────────────
for epoch in range(10):
    model.train()
    for X, y in train_dl:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss = crit(model(X), y)
        loss.backward()
        opt.step()
        sched.step()
    model.eval()
    with torch.no_grad():  # no gradients needed for evaluation
        correct = sum((model(X.to(device)).argmax(1) == y.to(device)).sum().item() for X, y in test_dl)
    print(f"Epoch {epoch+1}: acc={correct/len(test_ds)*100:.2f}%")
import torch
import torchvision.models as models
from torchvision import transforms, datasets
from torch.utils.data import DataLoader
import torch.nn as nn

# ─── Augmentation pipeline ────────────────────────────────────
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ─── Load pretrained EfficientNet-B2 ─────────────────────────
num_classes = 10  # set to your task's class count
backbone = models.efficientnet_b2(weights="IMAGENET1K_V1")
num_ftrs = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(num_ftrs, num_classes)
)

# Phase 1: train only head (frozen backbone)
for p in backbone.features.parameters():
    p.requires_grad = False
opt1 = torch.optim.AdamW(backbone.classifier.parameters(), lr=3e-3)

# Phase 2: unfreeze and fine-tune all layers
for p in backbone.parameters():
    p.requires_grad = True
opt2 = torch.optim.AdamW(backbone.parameters(), lr=3e-5)  # low LR!
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ─── Load model & tokeniser ───────────────────────────────────
model_name = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# ─── Tokenise dataset ─────────────────────────────────────────
dataset = load_dataset("imdb")
def tokenise(batch):
    return tokeniser(batch["text"], truncation=True, max_length=512, padding="max_length")
dataset = dataset.map(tokenise, batched=True)

# ─── Training ─────────────────────────────────────────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # low LR for fine-tuning
    warmup_ratio=0.06,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    fp16=True,  # mixed precision
    logging_steps=100,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics)
trainer.train()
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd, os

class ImageDataset(Dataset):
    def __init__(self, csv_path, img_dir, transform=None):
        self.df = pd.read_csv(csv_path)  # columns: filename, label
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(os.path.join(self.img_dir, row.filename)).convert("RGB")
        label = row.label
        if self.transform:
            img = self.transform(img)
        return img, label

# Heavy augmentation for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

ds = ImageDataset("train.csv", "./images", transform=train_transform)
loader = DataLoader(ds, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)