diff --git "a/DL/DeepLearning.html" "b/DL/DeepLearning.html" new file mode 100644--- /dev/null +++ "b/DL/DeepLearning.html" @@ -0,0 +1,1745 @@ + + + + + + Deep Learning — In-Depth Tutorial + + + + + + + +
+ + + + + + + + +
+ + +
+
+
Complete Deep Learning Curriculum
+

Master Deep Learning
from Neurons to Production

+

A comprehensive, hands-on reference covering neural network theory, architectures, training techniques, and real-world deployment — from first principles to state-of-the-art models.

+
+
10
Modules
+
50+
Code Examples
+
8
Architectures
+
Learning
+
+
+ +
+
+

Learning Path

+
+
1. Foundations
Linear algebra, calculus, probability — the math powering every model
+
2. Neural Networks
Perceptrons → MLPs, activation functions, forward/backprop
+
3. CNNs
Convolutions, pooling, ResNet, EfficientNet for vision tasks
+
4. RNNs & LSTMs
Sequence modeling, vanishing gradients, gated architectures
+
5. Transformers
Attention mechanism, BERT, GPT, ViT — modern AI backbone
+
6. GANs & Diffusion
Generative models, adversarial training, image synthesis
+
7. Training Mastery
Optimizers, regularisation, hyperparameter tuning, mixed precision
+
8. Production
ONNX, TensorRT, FastAPI, Docker, MLflow, monitoring
+
+
+
+

What You'll Learn

+
+
🧮
Mathematical Foundations
Tensors, matrix ops, chain rule, probability distributions — the core maths behind every DL algorithm.
+
⚙️
Architecture Design
When to choose CNN vs. RNN vs. Transformer. Design intuitions with real trade-off tables.
+
🔬
Training Techniques
Adam, batch norm, dropout, learning-rate scheduling, gradient clipping and more.
+
🚀
Production Deployment
Export models to ONNX, serve with TorchServe/FastAPI, containerise with Docker, monitor with MLflow.
+
+
+
+ +

Architecture Comparison

+
+ + + + + + + + + + +
ArchitectureBest ForKey InnovationParametersYear
MLPTabular data, classificationUniversal approximatorThousands1986
CNNImage, video, audio spectrogramsWeight sharing, local connectivityMillions1998
LSTMTime series, NLP sequencesGated memory cellsMillions1997
TransformerNLP, vision, multimodalSelf-attention, parallelisationBillions2017
GANImage synthesis, data augmentationAdversarial trainingMillions–billions2014
DiffusionImage/video/audio generationDenoising score matchingBillions2020
+
+
+ + +
+

Mathematical Foundations

+

The core maths every deep learning practitioner must understand — from tensors to gradients.

+ +
+ + + + +
+ +
+
+
+

Tensors

+

Tensors are the fundamental data structure in deep learning — generalisations of scalars, vectors, and matrices to arbitrary dimensions (ranks).

+
+ + + + + + + + + +
RankNameExample ShapeDL Use
0Scalar()Loss value, learning rate
1Vector(512,)Embedding, bias
2Matrix(64, 512)Weight matrix, batch
33D Tensor(32, 128, 512)Batch of sequences
44D Tensor(32, 3, 224, 224)Batch of images (NCHW)
+
+
+
+

Essential Operations

+
+
Matrix Multiplication
+ C[i,j] = Σ_k A[i,k] · B[k,j]
+ C = A @ B → shape: (m,n) @ (n,p) = (m,p) +
+
+
Dot Product / Inner Product
+ a · b = Σᵢ aᵢbᵢ = |a||b|cos(θ) +
+
+
Hadamard (Element-wise)
+ (A ⊙ B)[i,j] = A[i,j] · B[i,j] +
+
+
Broadcast Rule
+ Dims aligned right; size-1 dims expand to match +
+
+
+
+
Python · PyTorch
+
import torch
+
+# Creating tensors
+x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from list
+zeros = torch.zeros(3, 4)                    # shape (3,4)
+rand  = torch.randn(32, 512)                # normal dist
+
+# Fundamental ops
+W = torch.randn(512, 256)
+b = torch.zeros(256)
+out = rand @ W + b   # (32,512)@(512,256)+(256,) → (32,256)
+
+# Reshape, transpose, squeeze
+t = torch.arange(24).reshape(2, 3, 4)
+t_T = t.transpose(1, 2)           # (2,4,3)
+flat = t.flatten(1)               # (2,12)
+
+# GPU transfer
+device = "cuda" if torch.cuda.is_available() else "cpu"
+x = x.to(device)
+
+
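The broadcast rule stated above (dims aligned right; size-1 dims expand) can be checked directly — a minimal sketch:

```python
import torch

# A (512,) bias broadcasts across the batch dimension of a (32, 512) matrix
x = torch.randn(32, 512)
b = torch.zeros(512)
assert (x + b).shape == (32, 512)

# A (3,1) column and a (4,) row expand to a common (3,4) shape
col = torch.arange(3).reshape(3, 1)   # shape (3, 1)
row = torch.arange(4)                 # shape (4,)
print((col + row).shape)              # torch.Size([3, 4])
```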
+ +
+
+
+

The Chain Rule — Heart of Backprop

+
+
Chain Rule
+ dL/dw = (dL/dy) · (dy/dw)

+ For composition f(g(x)):
+ df/dx = (df/dg) · (dg/dx) +
+
+
Gradient Descent Update
+ θ ← θ − η · ∇_θ L(θ)

+ where η = learning rate
+ ∇_θ L = gradient of loss w.r.t. θ +
+
💡
Key Insight: The gradient points in the direction of steepest ascent in loss space. We subtract it to descend toward lower loss.
+
+
+

Partial Derivatives in Layers

+

For a linear layer y = Wx + b and loss L:

+
+
Gradients of Linear Layer
+ ∂L/∂W = (∂L/∂y) · xᵀ
+ ∂L/∂b = ∂L/∂y
+ ∂L/∂x = Wᵀ · (∂L/∂y) +
+
+
Jacobian Matrix
+ J[i,j] = ∂yᵢ/∂xⱼ

+ For vector → vector functions
+ Shape: (dim_y × dim_x) +
+
+
+
+
Python · Autograd
+
import torch
+
+# Automatic differentiation with requires_grad
+x = torch.tensor([2.0, 3.0], requires_grad=True)
+W = torch.randn(2, 2, requires_grad=True)
+
+# Forward pass — builds computation graph
+y = x @ W            # (2,) @ (2,2) → (2,)
+loss = y.sum()       # scalar loss
+
+# Backward pass — computes gradients via chain rule
+loss.backward()
+print(x.grad)        # ∂loss/∂x
+print(W.grad)        # ∂loss/∂W
+
+# Manual gradient check
+with torch.no_grad():
+    W -= 0.01 * W.grad  # gradient descent step
+    W.grad.zero_()       # must zero before next backward()
+
+
+ +
+
+
+

Key Distributions in DL

+
+ + + + + + + + + +
DistributionUse in DL
Normal N(μ,σ²)Weight init, noise injection, VAE latent
BernoulliBinary classification output, dropout
CategoricalMulti-class softmax output, token prediction
UniformXavier init, random sampling
DirichletTopic models, mixture models
+
+
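The distributions in the table are all available in `torch.distributions` — a quick sketch of the first three (shapes and parameters here are illustrative):

```python
import torch
from torch import distributions as D

normal = D.Normal(loc=0.0, scale=1.0)            # weight init, VAE latent prior
bern   = D.Bernoulli(probs=0.5)                  # dropout masks, binary labels
cat    = D.Categorical(logits=torch.randn(10))   # sampling from a softmax output

z    = normal.sample((32, 64))   # batch of latent vectors
mask = bern.sample((32, 512))    # dropout-style 0/1 mask
tok  = cat.sample()              # one class index in [0, 10)
print(z.shape, mask.shape, tok.item())
```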
+
+

Loss Functions as Likelihoods

+
+
Cross-Entropy Loss (Classification)
+ L = −Σᵢ yᵢ · log(ŷᵢ)

+ = −log P(true class | input) +
+
+
MSE Loss (Regression)
+ L = (1/n) Σᵢ (yᵢ − ŷᵢ)²

+ = MLE under Gaussian noise assumption +
+
+
KL Divergence (VAE)
+ KL(P‖Q) = Σₓ P(x) log[P(x)/Q(x)] +
+
+
+
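The likelihood view above can be verified numerically — cross-entropy on a one-hot target is exactly the negative log-probability of the true class (a minimal sketch with made-up logits):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one sample, three classes
target = torch.tensor([0])                  # true class index

ce  = F.cross_entropy(logits, target)                 # library loss
nll = -torch.log_softmax(logits, dim=-1)[0, 0]        # −log P(true class | input)
assert torch.allclose(ce, nll)

# MSE is the Gaussian-noise NLL up to constants: mean of (y − ŷ)²
y, y_hat = torch.tensor([1.0, 2.0]), torch.tensor([1.5, 1.5])
mse = F.mse_loss(y_hat, y)
print(ce.item(), mse.item())
```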
+ +
+
+
+

Information Theory

+
+
Shannon Entropy
+ H(X) = −Σᵢ P(xᵢ) · log₂P(xᵢ)
+ Measures uncertainty / information content +
+
+
Mutual Information
+ I(X;Y) = H(X) − H(X|Y)
+ How much Y tells us about X +
+
+
+
ℹ️
Why It Matters: Cross-entropy loss is just the negative log-likelihood, and minimising it minimises the KL divergence between the predicted and true distributions — directly rooted in information theory.
+
🔑
Softmax Temperature: Dividing logits by temperature T before softmax controls sharpness: T→0 approaches argmax, T→∞ approaches uniform. Used in knowledge distillation and when sampling from LLMs.
+
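The temperature behaviour described above, sketched numerically (the logit values are arbitrary):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])

def softmax_T(logits, T):
    """Temperature-scaled softmax: divide logits by T before normalising."""
    return torch.softmax(logits / T, dim=-1)

sharp = softmax_T(logits, T=0.1)    # low T → nearly one-hot argmax
flat  = softmax_T(logits, T=100.0)  # high T → nearly uniform

print(sharp.max().item())   # close to 1.0
print(flat.max().item())    # close to 1/3
```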
+
+
+
+ + +
+

Neural Networks

+

From the biological neuron to deep multi-layer perceptrons — theory, math, and interactive visualisation.

+ +
+
+

Architecture

+
    +
  • 1
    Input Layer
    Receives raw features. No computation — passes values forward. Each node = one feature.
  • +
  • 2
    Hidden Layers
    Each neuron computes z = Wx + b, then applies activation σ(z). Multiple hidden layers = "deep" network.
  • +
  • 3
    Output Layer
    Produces predictions. Activation depends on task: sigmoid (binary), softmax (multiclass), linear (regression).
  • +
  • 4
    Forward Pass
    Data flows input→output. Loss is computed comparing prediction to ground truth.
  • +
  • 5
    Backpropagation
    Gradients flow output→input via chain rule. Each weight updated: w ← w − η·∂L/∂w.
  • +
+
+
+
+
+ Interactive Neural Network — click neurons + 4-4-3-2 layers +
+ +
+
+
+ +

Activation Functions

+
+
Activation Functions Comparison · Visualisation
+ +
+
+ + + + + + + + + + +
FunctionFormulaRangeUse CaseDrawback
Sigmoid1/(1+e⁻ˣ)(0,1)Binary outputVanishing gradient
Tanh(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)(-1,1)Hidden layers (old)Vanishing gradient
ReLUmax(0, x)[0,∞)Most hidden layersDying ReLU
Leaky ReLUmax(αx, x)(-∞,∞)Fixes dying ReLUExtra hyperparameter
GELUx·Φ(x)≈(-0.17,∞)Transformers (BERT, GPT)More compute
Swishx·sigmoid(x)(-∞,∞)EfficientNetMore compute
+
+ +

Complete MLP Implementation

+
+
Python · PyTorch
+
import torch
+import torch.nn as nn
+import torch.optim as optim
+from torch.utils.data import DataLoader, TensorDataset
+
+# ─── Define MLP ───────────────────────────────────────────
+class MLP(nn.Module):
+    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
+        super().__init__()
+        layers = []
+        dims = [input_dim] + hidden_dims
+        for i in range(len(dims) - 1):
+            layers += [
+                nn.Linear(dims[i], dims[i+1]),
+                nn.BatchNorm1d(dims[i+1]),  # normalise activations
+                nn.GELU(),                  # smooth non-linearity
+                nn.Dropout(dropout),        # regularisation
+            ]
+        layers.append(nn.Linear(dims[-1], output_dim))
+        self.net = nn.Sequential(*layers)
+
+    def forward(self, x):
+        return self.net(x)
+
+# ─── Training Loop ─────────────────────────────────────────
+def train(model, loader, criterion, optimizer, device):
+    model.train()
+    total_loss = 0
+    for X, y in loader:
+        X, y = X.to(device), y.to(device)
+        optimizer.zero_grad()        # clear previous gradients
+        logits = model(X)            # forward pass
+        loss = criterion(logits, y)  # compute loss
+        loss.backward()             # backpropagation
+        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
+        optimizer.step()            # update weights
+        total_loss += loss.item()
+    return total_loss / len(loader)
+
+# ─── Instantiate and run ────────────────────────────────────
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = MLP(input_dim=784, hidden_dims=[512, 256, 128], output_dim=10).to(device)
+optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
+scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
+criterion = nn.CrossEntropyLoss()
+
+
+ + +
+

Convolutional Neural Networks

+

Spatial pattern recognition through learned filters — the foundation of computer vision.

+ +
+
+

Core Concepts

+
+
2D Convolution
+ (I * K)[i,j] = Σₘ Σₙ I[i+m, j+n] · K[m,n]

+ Output size = ⌊(N + 2P − F)/S⌋ + 1
+ N=input, P=padding, F=filter, S=stride +
+
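The output-size formula above can be checked against `nn.Conv2d` — a minimal sketch with example values:

```python
import torch
import torch.nn as nn

N, P, F_, S = 224, 1, 3, 2   # input size, padding, filter size, stride
conv = nn.Conv2d(3, 64, kernel_size=F_, stride=S, padding=P)

x = torch.randn(1, 3, N, N)          # one RGB image, NCHW
out = conv(x)

expected = (N + 2 * P - F_) // S + 1  # ⌊(N + 2P − F)/S⌋ + 1
print(out.shape, expected)            # (1, 64, 112, 112), 112
assert out.shape[-1] == expected
```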
+
+ +
Small weight matrices (e.g. 3×3, 5×5) that slide over the input, computing dot products. Each filter learns to detect a specific pattern — edges, textures, shapes. Stacking multiple filters creates channels (depth) in the feature map.
+
+
+ +
Max pooling keeps the strongest activation per region. Average pooling takes the mean. Both reduce spatial dimensions while retaining important features. Modern CNNs use global average pooling before the classifier head.
+
+
+ +
The region of input space that a neuron "sees". Stacking 3×3 convolutions: 1 layer → 3×3, 2 layers → 5×5, 3 layers → 7×7. Deep CNNs build massive receptive fields from small kernels — efficient and powerful.
+
+
+ +
Layer 1: edges, gradients. Layer 2: textures, corners. Layer 3: object parts. Layer 4-5: entire objects, semantic concepts. This hierarchical representation is why CNNs transfer well across domains.
+
+
+
+
+

Architecture Milestones

+
+
LeNet-5 (1998)
First practical CNN — handwritten digit recognition. Established conv→pool→fc pattern.
+
AlexNet (2012)
Sparked the DL revolution. ReLU, dropout, GPU training. ImageNet top-5: 15.3% error.
+
VGG-16 (2014)
Deep, uniform 3×3 conv stacks. Simple and effective. Still popular for transfer learning.
+
ResNet (2015)
Residual connections solved vanishing gradients. Enabled 152-layer networks. Skip connections = game changer.
+
EfficientNet (2019)
Compound scaling of width, depth, resolution. SOTA accuracy at release, matching ResNet-50 with roughly 5× fewer params.
+
ConvNeXt (2022)
Modernised ResNet design inspired by Transformers. Competitive with ViT on ImageNet.
+
+
+
+ +

ResNet Skip Connection

+
+
Residual Block
+ y = F(x, {Wᵢ}) + x

+ F(x) = Conv → BN → ReLU → Conv → BN
+ Output = F(x) + x (identity shortcut)
+ Gradient: ∂L/∂x = ∂L/∂y · (∂F/∂x + 1) — the +1 identity term gives gradients a direct path to earlier layers, preventing vanishing +
+ +
+
Python · ResNet Block
+
import torch.nn as nn
+
+class ResidualBlock(nn.Module):
+    def __init__(self, channels, stride=1):
+        super().__init__()
+        self.conv1 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False)
+        self.bn1   = nn.BatchNorm2d(channels)
+        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
+        self.bn2   = nn.BatchNorm2d(channels)
+        self.relu  = nn.ReLU(inplace=True)
+        # Shortcut if stride changes spatial dims
+        self.shortcut = nn.Sequential(
+            nn.Conv2d(channels, channels, 1, stride=stride, bias=False),
+            nn.BatchNorm2d(channels)
+        ) if stride != 1 else nn.Identity()
+
+    def forward(self, x):
+        out = self.relu(self.bn1(self.conv1(x)))
+        out = self.bn2(self.conv2(out))
+        out += self.shortcut(x)   # ← skip connection
+        return self.relu(out)
+
+# Transfer learning with pretrained ResNet
+import torchvision.models as models
+backbone = models.resnet50(weights="IMAGENET1K_V1")
+backbone.fc = nn.Linear(2048, num_classes)   # replace head
+
+# Freeze backbone, fine-tune head only
+for p in backbone.parameters():
+    p.requires_grad = False
+for p in backbone.fc.parameters():
+    p.requires_grad = True
+
+
+ + +
+

RNNs & LSTMs

+

Modelling sequential dependencies — from simple recurrent nets to gated memory architectures.

+ +
+
+

Recurrent Networks

+
+
Vanilla RNN
+ hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
+ yₜ = Wᵧ·hₜ + bᵧ

+ hₜ = hidden state at time t
+ xₜ = input at time t +
+
⚠️
Vanishing Gradient Problem: In deep unrolled RNNs, gradients can shrink to ~0 over long sequences: ∂h₁₀₀/∂h₁ is a product of ~100 Jacobian factors, ≈ (∂hₜ/∂hₜ₋₁)¹⁰⁰ → 0 if each |∂hₜ/∂hₜ₋₁| < 1. LSTMs and GRUs solve this with gating.
+

GRU (Gated Recurrent Unit)

+
+ zₜ = σ(Wz·[hₜ₋₁, xₜ]) — update gate
+ rₜ = σ(Wr·[hₜ₋₁, xₜ]) — reset gate
+ h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) — candidate
+ hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ +
+
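The GRU equations above are packaged in `nn.GRU`; a minimal shape sketch (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

x = torch.randn(32, 50, 128)   # (batch, time, features)
out, h_n = gru(x)

print(out.shape)   # (32, 50, 256) — hₜ for every timestep
print(h_n.shape)   # (1, 32, 256) — final hidden state per sequence

# For a single-layer unidirectional GRU, the last output IS the final state
assert torch.allclose(out[:, -1], h_n[0])
```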
+
+

LSTM Architecture

+
+
LSTM Gates
+ fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — forget gate
+ iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — input gate
+ C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) — candidate
+ Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ — cell state
+ oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — output gate
+ hₜ = oₜ⊙tanh(Cₜ) +
+
💡
Cell State = Highway: The cell state Cₜ runs along the top of the LSTM with only minor linear interactions. Gradients can flow through it almost unchanged over hundreds of steps — solving vanishing gradients.
+
+
+ +
+
Python · Bidirectional LSTM
+
import torch
+import torch.nn as nn
+
+class BiLSTMClassifier(nn.Module):
+    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
+        super().__init__()
+        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
+        self.lstm = nn.LSTM(
+            embed_dim, hidden_dim,
+            num_layers=num_layers,
+            batch_first=True,
+            bidirectional=True,   # forward + backward
+            dropout=0.3
+        )
+        self.classifier = nn.Sequential(
+            nn.Linear(hidden_dim * 2, hidden_dim),  # *2 for bidir
+            nn.ReLU(),
+            nn.Dropout(0.3),
+            nn.Linear(hidden_dim, num_classes)
+        )
+
+    def forward(self, x, lengths):
+        emb = self.embedding(x)           # (B, T, E)
+        # Pack for variable-length sequences
+        packed = nn.utils.rnn.pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
+        out, (hn, _) = self.lstm(packed)
+        # Concat last forward + backward hidden states
+        last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (B, H*2)
+        return self.classifier(last_hidden)
+
+
+ + +
+

Transformers

+

The architecture that redefined AI — self-attention, positional encoding, and the models built on top.

+ +
+
+

Self-Attention

+
+
Scaled Dot-Product Attention
+ Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

+ Q = XW^Q, K = XW^K, V = XW^V
+ dₖ = key dimension (the √dₖ scaling keeps softmax inputs moderate, preventing tiny gradients) +
+
+
Multi-Head Attention
+ MHA(Q,K,V) = Concat(head₁,...,headₕ)Wᴼ
+ headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

+ Each head learns different relationship types +
+
🔑
Why Attention Works: Unlike RNNs, attention computes relationships between ALL pairs of tokens in O(n²) — but fully in parallel. Long-range dependencies cost the same as short-range ones.
+
+
+

Encoder-Decoder Structure

+
    +
  • 1
    Input Embedding + PE
    Token IDs → embeddings. Add sinusoidal or learnable positional encoding to inject sequence order.
  • +
  • 2
    Encoder Block
    Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Repeated N times.
  • +
  • 3
    Decoder Block
    Masked Self-Attention → Cross-Attention (attends to encoder) → FFN. Generates one token at a time.
  • +
  • 4
    Output Projection
    Linear + Softmax over vocabulary. At inference: greedy / beam search / nucleus sampling.
  • +
+
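The sinusoidal positional encoding from step 1 can be sketched as follows (shapes are illustrative):

```python
import torch
import math

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1).float()           # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))          # (D/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=128, d_model=512)
print(pe.shape)   # (128, 512)
# Usage: x = token_embedding(ids) + pe[:seq_len]
```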
+
+ +

Popular Variants

+
+ + + + + + + + + + +
ModelTypeParamsKey UseInnovation
BERTEncoder-only110M–340MClassification, NER, QAMasked language modelling (MLM)
GPT-4Decoder-only~1.8T (reported)Text generation, chatRLHF + MoE scaling
T5Encoder-Decoder11BSummarisation, translationText-to-text framing
ViTEncoder-only86M–632MImage classificationPatch embeddings replace CNN
Llama 3Decoder-only8B–70BOpen-source LLMGQA, RoPE, SwiGLU
WhisperEncoder-Decoder39M–1.5BSpeech recognitionMultitask audio transformer
+
+ +
+
Python · Self-Attention from Scratch
+
import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model, num_heads):
+        super().__init__()
+        assert d_model % num_heads == 0
+        self.d_k = d_model // num_heads
+        self.h   = num_heads
+        self.Wq  = nn.Linear(d_model, d_model)
+        self.Wk  = nn.Linear(d_model, d_model)
+        self.Wv  = nn.Linear(d_model, d_model)
+        self.Wo  = nn.Linear(d_model, d_model)
+
+    def forward(self, q, k, v, mask=None):
+        B, T, D = q.shape
+        Q = self.Wq(q).view(B, T, self.h, self.d_k).transpose(1,2)
+        K = self.Wk(k).view(B, -1, self.h, self.d_k).transpose(1,2)
+        V = self.Wv(v).view(B, -1, self.h, self.d_k).transpose(1,2)
+
+        # Scaled dot-product attention
+        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
+        if mask is not None:
+            scores = scores.masked_fill(mask == 0, -1e9)
+        attn = F.softmax(scores, dim=-1)
+        out  = (attn @ V).transpose(1,2).reshape(B, T, D)
+        return self.Wo(out), attn
+
+
+ + +
+

GANs & Generative Models

+

Adversarial training, VAEs, and diffusion — teaching machines to create.

+ +
+
+

GAN Framework

+
+
Minimax Objective
+ min_G max_D V(D,G) =
+ E[log D(x)] + E[log(1 − D(G(z)))]

+ D(x) → 1 for real, 0 for fake
+ G(z) → fool D into D(G(z)) → 1 +
+
⚠️
Training Instability: GANs suffer from mode collapse (G generates only a few modes) and vanishing gradients when D is too strong. Solutions: WGAN-GP, spectral norm, minibatch discrimination, progressive growing.
+
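The minimax objective translates into alternating D and G updates. A minimal sketch with toy linear networks standing in for real generator and discriminator (the architectures and data here are placeholders):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # toy generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # toy discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 2)    # stand-in for a real-data batch
z = torch.randn(32, 16)      # latent noise

# ── D step: push D(x)→1 for real, D(G(z))→0 for fake ──
opt_D.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_D.step()

# ── G step: non-saturating loss, label fakes as real to fool D ──
opt_G.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_G.step()
```

Detaching `G(z)` in the D step stops discriminator gradients from flowing into the generator.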

GAN Variants

+
+ + + + + + + + + +
VariantInnovation
DCGANConv layers, batch norm — stable training
WGAN-GPWasserstein loss + gradient penalty
StyleGAN 3Alias-free generation, style mixing
CycleGANUnpaired image translation
Pix2PixPaired image-to-image translation
+
+
+
+

VAE vs. GAN vs. Diffusion

+
+
🧮
+
VAE (Variational Autoencoder)
+
Encodes input to latent distribution N(μ,σ²). Maximises ELBO = reconstruction − KL(q‖p). Smooth latent space. Blurry outputs.
+
+
+
⚔️
+
GAN
+
Generator vs. discriminator adversarial game. Sharp, photorealistic outputs. Hard to train, mode collapse risk.
+
+
+
❄️
+
Diffusion Models
+
Gradually add Gaussian noise to data, train a U-Net to predict and reverse the noise. State-of-the-art quality. Slower inference (many steps). Stable Diffusion, DALL-E 3, Imagen.
+
+
+
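The VAE objective above (reconstruction − KL) has a simple closed form for a Gaussian encoder against a standard-normal prior — a minimal sketch:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term (negative log-likelihood under Gaussian noise)
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL(N(μ,σ²) ‖ N(0,1)) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimising this maximises the ELBO

# Reparameterisation trick: z = μ + σ·ε keeps sampling differentiable
mu, logvar = torch.zeros(32, 8), torch.zeros(32, 8)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
print(z.shape)   # (32, 8)
```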
+ +
+
Python · DCGAN Generator
+
import torch.nn as nn
+
+class DCGANGenerator(nn.Module):
+    def __init__(self, latent_dim=100, channels=3):
+        super().__init__()
+        def block(in_c, out_c, stride=2, padding=1):
+            return [
+                nn.ConvTranspose2d(in_c, out_c, 4, stride, padding, bias=False),
+                nn.BatchNorm2d(out_c),
+                nn.ReLU(True)
+            ]
+        self.net = nn.Sequential(
+            # 1×1 → 4×4
+            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
+            nn.BatchNorm2d(512), nn.ReLU(True),
+            *block(512, 256),  # 8×8
+            *block(256, 128),  # 16×16
+            *block(128, 64),   # 32×32
+            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
+            nn.Tanh()          # 64×64, range [-1,1]
+        )
+    def forward(self, z):
+        return self.net(z.view(-1, z.shape[1], 1, 1))
+
+
+ + +
+

Training Deep Learning Models

+

Optimisers, regularisation, hyperparameter tuning, and tricks that separate good models from great ones.

+ +
+
+

Optimisers

+
+
+ +
vₜ = β·vₜ₋₁ + ∇L; θ ← θ − η·vₜ. Momentum β≈0.9 dampens oscillations and accelerates convergence. Still best for CNNs with careful LR scheduling. Nesterov variant: look-ahead gradient.
+
+
+ +
mₜ = β₁mₜ₋₁ + (1-β₁)g; vₜ = β₂vₜ₋₁ + (1-β₂)g². θ ← θ − η·m̂ₜ/(√v̂ₜ + ε). Defaults: β₁=0.9, β₂=0.999, ε=1e-8, η=3e-4. Robust default for most tasks.
+
+
+ +
Decouples weight decay from gradient update. θ ← θ − η·(m̂/√v̂ + λθ). Preferred over Adam for transformers and LLMs. Use weight_decay=0.01-0.1.
+
+
+ +
Cosine Annealing: η oscillates from ηₘₐₓ to ηₘᵢₙ. Warmup + Cosine (transformers): linearly ramp LR for first N steps then cosine decay. OneCycleLR: super-convergence with very high max LR. ReduceLROnPlateau: adaptive decay on metric stagnation.
+
+
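The warmup + cosine pattern described above can be sketched with PyTorch's built-in schedulers (the step counts and learning rates here are illustrative):

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Linear warmup for 100 steps, then cosine decay over the remaining 900
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=100)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=900, eta_min=1e-6)
sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[100])

lrs = []
for step in range(1000):
    opt.step()     # (normally preceded by loss.backward())
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])

print(lrs[0], max(lrs), lrs[-1])   # ramps up, peaks near 3e-4, decays toward 1e-6
```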
+
+
+

Regularisation Techniques

+
Dropout (p=0.5)
82%
+
Batch Normalisation
91%
+
Weight Decay (L2)
78%
+
Data Augmentation
88%
+
Early Stopping
74%
+
Label Smoothing
70%
+

Effectiveness score (higher = more commonly beneficial across task types)

+ +

Batch vs. Layer Normalisation

+
+ + + + + + + + +
TypeNormalises OverBest For
BatchNormBatch dimensionCNNs, large batches
LayerNormFeature dimensionTransformers, NLP, RNNs
GroupNormGroups of channelsSmall batch sizes
RMSNormFeature dim (simpler)Modern LLMs (Llama)
+
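LayerNorm and RMSNorm differ only in the statistics used: RMSNorm drops the mean-centring and normalises by the root-mean-square alone. A minimal sketch of the Llama-style variant:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain
        self.eps = eps

    def forward(self, x):
        # Root-mean-square over the feature dim — no mean subtraction, no bias
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(32, 128, 512)       # (batch, seq, features)
out = RMSNorm(512)(x)
print(out.shape)                    # same shape, unit RMS per position
```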
+
+
+ +
+
Python · Mixed Precision + Gradient Scaling
+
import torch
+from torch.cuda.amp import autocast, GradScaler
+
+scaler = GradScaler()   # handles FP16 loss scaling
+
+def train_step(model, batch, optimizer, criterion):
+    X, y = batch
+    optimizer.zero_grad()
+
+    with autocast():                  # FP16 forward pass
+        logits = model(X)
+        loss   = criterion(logits, y)
+
+    scaler.scale(loss).backward()    # scaled gradients
+    scaler.unscale_(optimizer)         # unscale before clip
+    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+    scaler.step(optimizer)             # update weights
+    scaler.update()                    # adjust scale factor
+    return loss.item()
+
+# LR warmup + cosine decay (Transformers best practice)
+import math
+
+def get_lr(step, d_model, warmup_steps):
+    if step == 0: return 0.0
+    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
+    return d_model ** -0.5 * scale   # original transformer formula
+
+
+ + +
+

Deployment & MLOps

+

Getting models from experiment to production — export, serve, containerise, and monitor.

+ +
+
+

Export Pipeline

+
    +
  • 1
    Train & Validate
    Achieve target metrics. Save checkpoint with torch.save() or Hugging Face safetensors.
  • +
  • 2
    Export to ONNX
    torch.onnx.export() converts the model to a framework-agnostic graph for cross-platform inference.
  • +
  • 3
    Optimise with TensorRT
    trtexec or ONNX-TensorRT converts ONNX to a TensorRT engine. 3–10× faster on NVIDIA GPUs.
  • +
  • 4
    Serve via FastAPI
    Wrap inference in a REST endpoint. Use ONNX Runtime for lightweight CPU/GPU serving.
  • +
  • 5
    Containerise
    Docker image with model weights + FastAPI. Push to ECR / ACR / GCR and deploy to Kubernetes.
  • +
  • 6
    Monitor with MLflow
    Track experiments, model versions, metrics drift. Set up alerts for data/concept drift.
  • +
+
+
+

Optimisation Techniques

+
+
✂️
Quantisation (INT8/FP16)
Reduce precision of weights/activations. 2–4× memory reduction, 2–3× speedup with minimal accuracy loss. Post-training quantisation (PTQ) or QAT.
+
🪄
Pruning
Remove low-magnitude weights (unstructured) or entire filters/heads (structured). 40–80% parameter reduction with retraining.
+
🎓
Knowledge Distillation
Train small student to mimic large teacher's soft probability outputs. DistilBERT = 40% smaller, 60% faster, 97% of BERT's performance.
+
+
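The distillation objective described above is commonly written as a temperature-softened KL term blended with the hard-label loss. A minimal sketch, assuming teacher and student logits are already available (the T and alpha values are typical choices, not fixed):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL between temperature-softened distributions, scaled by T²
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    # Hard targets: standard cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(32, 10, requires_grad=True)   # student logits
t = torch.randn(32, 10)                       # teacher logits (no grad)
y = torch.randint(0, 10, (32,))
loss = distillation_loss(s, t, y)
loss.backward()                               # gradients reach the student only
print(loss.item())
```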
+
+ +
+
Python · ONNX Export + FastAPI Server
+
import torch
+import onnxruntime as ort
+from fastapi import FastAPI
+from pydantic import BaseModel
+import numpy as np
+
+# ─── Export to ONNX ─────────────────────────────────────────
+model.eval()
+dummy_input = torch.randn(1, 3, 224, 224)
+torch.onnx.export(
+    model, dummy_input, "model.onnx",
+    opset_version=17,
+    input_names=["image"], output_names=["logits"],
+    dynamic_axes={"image": {0: "batch"}}   # variable batch size
+)
+
+# ─── ONNX Runtime Inference ─────────────────────────────────
+sess_opts = ort.SessionOptions()
+sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+session = ort.InferenceSession("model.onnx", sess_options=sess_opts,
+                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
+
+# ─── FastAPI endpoint ────────────────────────────────────────
+app = FastAPI(title="Deep Learning Model API")
+
+class PredictRequest(BaseModel):
+    image: list[list[list[list[float]]]]  # NCHW float array
+
+@app.post("/predict")
+async def predict(req: PredictRequest):
+    x = np.array(req.image, dtype=np.float32)
+    logits = session.run(["logits"], {"image": x})[0]
+    logits = logits - logits.max(-1, keepdims=True)  # stabilise softmax
+    probs  = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
+    top_k  = np.argsort(probs[0])[::-1][:5]
+    return {"top5_classes": top_k.tolist(), "probs": probs[0][top_k].tolist()}
+
+
+ + +
+

Code Lab

+

Complete, production-quality code examples for the most common deep learning tasks.

+ +
+ + + + +
+ +
+
+
Python · Complete MNIST Training
+
import torch
+import torch.nn as nn
+import torch.optim as optim
+from torchvision import datasets, transforms
+from torch.utils.data import DataLoader
+
+# ─── Data ────────────────────────────────────────────────────
+transform = transforms.Compose([
+    transforms.ToTensor(),
+    transforms.Normalize((0.1307,), (0.3081,))
+])
+train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
+test_ds  = datasets.MNIST("./data", train=False, transform=transform)
+train_dl = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
+test_dl  = DataLoader(test_ds,  batch_size=256)
+
+# ─── Model ────────────────────────────────────────────────────
+class ConvNet(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.features = nn.Sequential(
+            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
+            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
+            nn.MaxPool2d(2),   # 28→14
+            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
+            nn.AdaptiveAvgPool2d((4, 4))
+        )
+        self.classifier = nn.Sequential(
+            nn.Flatten(),
+            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
+            nn.Linear(256, 10)
+        )
+    def forward(self, x): return self.classifier(self.features(x))
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = ConvNet().to(device)
+opt = optim.AdamW(model.parameters(), lr=1e-3)
+sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, steps_per_epoch=len(train_dl), epochs=10)
+crit = nn.CrossEntropyLoss()
+
+# ─── Train ────────────────────────────────────────────────────
+for epoch in range(10):
+    model.train()
+    for X, y in train_dl:
+        X, y = X.to(device), y.to(device)
+        opt.zero_grad()
+        loss = crit(model(X), y)
+        loss.backward()
+        opt.step(); sched.step()
+    model.eval()
+    with torch.no_grad():  # no gradient tracking during evaluation
+        correct = sum((model(X.to(device)).argmax(1) == y.to(device)).sum().item() for X, y in test_dl)
+    print(f"Epoch {epoch+1}: acc={correct/len(test_ds)*100:.2f}%")
+
+
+ +
+
+
Python · Transfer Learning (EfficientNet)
+
import torch
+import torchvision.models as models
+from torchvision import transforms, datasets
+from torch.utils.data import DataLoader
+import torch.nn as nn
+
+# ─── Augmentation pipeline ────────────────────────────────────
+train_tf = transforms.Compose([
+    transforms.RandomResizedCrop(224),
+    transforms.RandomHorizontalFlip(),
+    transforms.ColorJitter(0.2, 0.2, 0.2),
+    transforms.ToTensor(),
+    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+])
+
+# ─── Load pretrained EfficientNet-B2 ─────────────────────────
+backbone = models.efficientnet_b2(weights="IMAGENET1K_V1")
+num_ftrs  = backbone.classifier[1].in_features
+backbone.classifier = nn.Sequential(
+    nn.Dropout(0.4),
+    nn.Linear(num_ftrs, num_classes)
+)
+
+# Phase 1: train only head (frozen backbone)
+for p in backbone.features.parameters(): p.requires_grad = False
+opt1 = torch.optim.AdamW(backbone.classifier.parameters(), lr=3e-3)
+
+# Phase 2: unfreeze and fine-tune all layers
+for p in backbone.parameters(): p.requires_grad = True
+opt2 = torch.optim.AdamW(backbone.parameters(), lr=3e-5)  # low LR!
+
+
+ +
+
+
Python · BERT Fine-tuning (Hugging Face)
+
from transformers import AutoTokenizer, AutoModelForSequenceClassification
+from transformers import TrainingArguments, Trainer
+from datasets import load_dataset
+import numpy as np
+from sklearn.metrics import accuracy_score, f1_score
+
+# ─── Load model & tokeniser ───────────────────────────────────
+model_name = "bert-base-uncased"
+tokeniser  = AutoTokenizer.from_pretrained(model_name)
+model      = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
+
+# ─── Tokenise dataset ─────────────────────────────────────────
+dataset = load_dataset("imdb")
+def tokenise(batch):
+    return tokeniser(batch["text"], truncation=True, max_length=512, padding="max_length")
+dataset = dataset.map(tokenise, batched=True)
+
+# ─── Training ─────────────────────────────────────────────────
+args = TrainingArguments(
+    output_dir="./bert-imdb",
+    num_train_epochs=3,
+    per_device_train_batch_size=16,
+    learning_rate=2e-5,            # low LR for fine-tuning
+    warmup_ratio=0.06,
+    weight_decay=0.01,
+    evaluation_strategy="epoch",
+    fp16=True,                      # mixed precision
+    logging_steps=100,
+)
+
+def compute_metrics(eval_pred):
+    logits, labels = eval_pred
+    preds = np.argmax(logits, axis=-1)
+    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}
+
+trainer = Trainer(model=model, args=args,
+                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
+                  compute_metrics=compute_metrics)
+trainer.train()
+
+
+ +
+
+
Python · Custom Dataset + Augmentation
+
from torch.utils.data import Dataset, DataLoader
+from torchvision import transforms
+from PIL import Image
+import pandas as pd, os
+
+class ImageDataset(Dataset):
+    def __init__(self, csv_path, img_dir, transform=None):
+        self.df        = pd.read_csv(csv_path)   # columns: filename, label
+        self.img_dir   = img_dir
+        self.transform = transform
+
+    def __len__(self): return len(self.df)
+
+    def __getitem__(self, idx):
+        row   = self.df.iloc[idx]
+        img   = Image.open(os.path.join(self.img_dir, row.filename)).convert("RGB")
+        label = row.label
+        if self.transform: img = self.transform(img)
+        return img, label
+
+# Heavy augmentation for training
+train_transform = transforms.Compose([
+    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
+    transforms.RandomHorizontalFlip(),
+    transforms.RandomRotation(15),
+    transforms.ColorJitter(brightness=0.3, contrast=0.3),
+    transforms.RandomGrayscale(p=0.1),
+    transforms.ToTensor(),
+    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
+])
+
+ds     = ImageDataset("train.csv", "./images", transform=train_transform)
+loader = DataLoader(ds, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)
+
+
+
+ +
+ + +
+ + + +