Complete Deep Learning Curriculum

Master Deep Learning
from Neurons to Production

A comprehensive, hands-on reference covering neural network theory, architectures, training techniques, and real-world deployment — from first principles to state-of-the-art models.


Learning Path

1. Foundations
Linear algebra, calculus, probability — the math powering every model
2. Neural Networks
Perceptrons → MLPs, activation functions, forward/backprop
3. CNNs
Convolutions, pooling, ResNet, EfficientNet for vision tasks
4. RNNs & LSTMs
Sequence modeling, vanishing gradients, gated architectures
5. Transformers
Attention mechanism, BERT, GPT, ViT — modern AI backbone
6. GANs & Diffusion
Generative models, adversarial training, image synthesis
7. Training Mastery
Optimizers, regularisation, hyperparameter tuning, mixed precision
8. Production
ONNX, TensorRT, FastAPI, Docker, MLflow, monitoring

What You'll Learn

🧮
Mathematical Foundations
Tensors, matrix ops, chain rule, probability distributions — the core maths behind every DL algorithm.
⚙️
Architecture Design
When to choose CNN vs. RNN vs. Transformer. Design intuitions with real trade-off tables.
🔬
Training Techniques
Adam, batch norm, dropout, learning-rate scheduling, gradient clipping and more.
🚀
Production Deployment
Export models to ONNX, serve with TorchServe/FastAPI, containerise with Docker, monitor with MLflow.

Architecture Comparison

| Architecture | Best For | Key Innovation | Parameters | Year |
|---|---|---|---|---|
| MLP | Tabular data, classification | Universal approximator | Thousands | 1986 |
| CNN | Images, video, audio spectrograms | Weight sharing, local connectivity | Millions | 1998 |
| LSTM | Time series, NLP sequences | Gated memory cells | Millions | 1997 |
| Transformer | NLP, vision, multimodal | Self-attention, parallelisation | Billions | 2017 |
| GAN | Image synthesis, data augmentation | Adversarial training | Millions–billions | 2014 |
| Diffusion | Image/video/audio generation | Denoising score matching | Billions | 2020 |

Mathematical Foundations

The core maths every deep learning practitioner must understand — from tensors to gradients.

Tensors

Tensors are the fundamental data structure in deep learning — generalisations of scalars, vectors, and matrices to arbitrary dimensions (ranks).

| Rank | Name | Example Shape | DL Use |
|---|---|---|---|
| 0 | Scalar | () | Loss value, learning rate |
| 1 | Vector | (512,) | Embedding, bias |
| 2 | Matrix | (64, 512) | Weight matrix, batch |
| 3 | 3D Tensor | (32, 128, 512) | Batch of sequences |
| 4 | 4D Tensor | (32, 3, 224, 224) | Batch of images (NCHW) |

Essential Operations

Matrix Multiplication
C[i,j] = Σ_k A[i,k] · B[k,j]
C = A @ B → shape: (m,n) @ (n,p) = (m,p)
Dot Product / Inner Product
a · b = Σᵢ aᵢbᵢ = |a||b|cos(θ)
Hadamard (Element-wise)
(A ⊙ B)[i,j] = A[i,j] · B[i,j]
Broadcast Rule
Dims aligned right; size-1 dims expand to match
Python · PyTorch
import torch

# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from list
zeros = torch.zeros(3, 4)                    # shape (3,4)
rand  = torch.randn(32, 512)                # normal dist

# Fundamental ops
W = torch.randn(512, 256)
b = torch.zeros(256)
out = rand @ W + b   # (32,512)@(512,256) + (256,) broadcasts → (32,256)

# Reshape, transpose, squeeze
t = torch.arange(24).reshape(2, 3, 4)
t_T = t.transpose(1, 2)           # (2,4,3)
flat = t.flatten(1)               # (2,12)

# GPU transfer
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)

The Chain Rule — Heart of Backprop

Chain Rule
dL/dw = (dL/dy) · (dy/dw)

For composition f(g(x)):
df/dx = (df/dg) · (dg/dx)
Gradient Descent Update
θ ← θ − η · ∇_θ L(θ)

where η = learning rate
∇_θ L = gradient of loss w.r.t. θ
💡
Key Insight: The gradient points in the direction of steepest ascent in loss space. We subtract it to descend toward lower loss.

Partial Derivatives in Layers

For a linear layer y = Wx + b and loss L:

Gradients of Linear Layer
∂L/∂W = (∂L/∂y) · xᵀ
∂L/∂b = ∂L/∂y
∂L/∂x = Wᵀ · (∂L/∂y)
Jacobian Matrix
J[i,j] = ∂yᵢ/∂xⱼ

For vector → vector functions
Shape: (dim_y × dim_x)
Python · Autograd
import torch

# Automatic differentiation with requires_grad
x = torch.tensor([2.0, 3.0], requires_grad=True)
W = torch.randn(2, 2, requires_grad=True)

# Forward pass — builds computation graph
y = x @ W            # (2,) @ (2,2) → (2,)
loss = y.sum()       # scalar loss

# Backward pass — computes gradients via chain rule
loss.backward()
print(x.grad)        # ∂loss/∂x
print(W.grad)        # ∂loss/∂W

# Manual gradient-descent step (no autograd tracking)
with torch.no_grad():
    W -= 0.01 * W.grad  # gradient descent step
    W.grad.zero_()       # must zero before next backward()
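Autograd's results can be sanity-checked against finite differences. A minimal NumPy sketch (toy sizes, not from the original): for L = Σᵢyᵢ with y = Wx, the analytic gradient ∂L/∂W = (∂L/∂y)·xᵀ should match a central-difference estimate.

```python
import numpy as np

# Verify the linear-layer gradient dL/dW = (dL/dy) · x^T numerically.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))

loss = lambda W: (W @ x).sum()        # L = sum(y), so dL/dy = 1 for every y_i

# Analytic gradient: outer product of dL/dy (all ones) with x
analytic = np.outer(np.ones(2), x)

# Central finite differences: perturb each W[i,j] by ±eps
eps = 1e-5
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same check (with a tighter budget — it costs two forward passes per parameter) is the standard way to debug a hand-written backward pass.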

Key Distributions in DL

| Distribution | Use in DL |
|---|---|
| Normal N(μ, σ²) | Weight init, noise injection, VAE latent |
| Bernoulli | Binary classification output, dropout |
| Categorical | Multi-class softmax output, token prediction |
| Uniform | Xavier init, random sampling |
| Dirichlet | Topic models, mixture models |
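Each distribution above maps to a one-liner in practice. A minimal NumPy sketch (sizes and probabilities are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(42)

w   = rng.normal(0.0, 0.02, size=(512, 256))        # Normal: e.g. N(0, 0.02²) weight init
d   = rng.binomial(1, 0.5, size=512)                # Bernoulli(0.5): a dropout mask
tok = rng.choice(5, p=[0.7, 0.1, 0.1, 0.05, 0.05])  # Categorical: sampling one of 5 tokens
u   = rng.uniform(-0.1, 0.1, size=256)              # Uniform: Xavier-style init range
mix = rng.dirichlet([1.0, 1.0, 1.0])                # Dirichlet: mixture weights, sums to 1

print(mix.sum())  # ≈ 1.0
```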

Loss Functions as Likelihoods

Cross-Entropy Loss (Classification)
L = −Σᵢ yᵢ · log(ŷᵢ)

= −log P(true class | input)
MSE Loss (Regression)
L = (1/n) Σᵢ (yᵢ − ŷᵢ)²

= MLE under Gaussian noise assumption
KL Divergence (VAE)
KL(P‖Q) = Σᵢ P(x) log[P(x)/Q(x)]
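The "loss = likelihood" identity is easy to check numerically: with a one-hot target, cross-entropy collapses to −log of the probability assigned to the true class. A tiny sketch (the 4-class example is invented for illustration):

```python
import numpy as np

# One-hot target: true class is index 2 of 4
y    = np.array([0.0, 0.0, 1.0, 0.0])
yhat = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution

ce  = -np.sum(y * np.log(yhat))          # L = −Σᵢ yᵢ·log ŷᵢ
nll = -np.log(yhat[2])                   # −log P(true class | input)
print(ce, nll)                           # identical: 0.5108...
```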

Information Theory

Shannon Entropy
H(X) = −Σᵢ P(xᵢ) · log₂P(xᵢ)
Measures uncertainty / information content
Mutual Information
I(X;Y) = H(X) − H(X|Y)
How much Y tells us about X
ℹ️
Why It Matters: Cross-entropy loss is just the negative log-likelihood; minimising it minimises the KL divergence between the predicted and true distributions — directly rooted in information theory.
🔑
Softmax Temperature: Dividing logits by temperature T before softmax controls sharpness: T→0 approaches argmax, T→∞ approaches uniform. Used in knowledge distillation and when sampling from LLMs.
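The temperature effect is visible with three logits. A minimal sketch (the logit values are invented for illustration):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_T(logits, T=1.0))     # moderate: ≈ [0.66, 0.24, 0.10]
print(softmax_T(logits, T=0.1))     # sharp → approaches one-hot argmax
print(softmax_T(logits, T=100.0))   # flat → approaches uniform (1/3 each)
```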

Neural Networks

From the biological neuron to deep multi-layer perceptrons — theory, math, and interactive visualisation.

Architecture

  • 1
    Input Layer
    Receives raw features. No computation — passes values forward. Each node = one feature.
  • 2
    Hidden Layers
    Each neuron computes z = Wx + b, then applies activation σ(z). Multiple hidden layers = "deep" network.
  • 3
    Output Layer
    Produces predictions. Activation depends on task: sigmoid (binary), softmax (multiclass), linear (regression).
  • 4
    Forward Pass
    Data flows input→output. Loss is computed comparing prediction to ground truth.
  • 5
    Backpropagation
    Gradients flow output→input via chain rule. Each weight updated: w ← w − η·∂L/∂w.

Activation Functions

Activation Functions Comparison

| Function | Formula | Range | Use Case | Drawback |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Binary output | Vanishing gradient |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (−1, 1) | Hidden layers (historical) | Vanishing gradient |
| ReLU | max(0, x) | [0, ∞) | Most hidden layers | Dying ReLU |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | Fixes dying ReLU | Extra hyperparameter |
| GELU | x·Φ(x) | ≈(−0.17, ∞) | Transformers (BERT, GPT) | More compute |
| Swish | x·sigmoid(x) | ≈(−0.28, ∞) | EfficientNet | More compute |
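The comparison is concrete once evaluated at a point. A minimal sketch using the exact formulas (GELU via the Gaussian CDF Φ, computed with `math.erf`):

```python
import numpy as np
from math import erf

sigmoid    = lambda x: 1 / (1 + np.exp(-x))
relu       = lambda x: np.maximum(0.0, x)
leaky_relu = lambda x, a=0.01: np.maximum(a * x, x)
gelu       = lambda x: x * 0.5 * (1 + erf(x / np.sqrt(2)))   # x·Φ(x), exact form

x = -1.0
print(sigmoid(x), np.tanh(x), relu(x), leaky_relu(x), gelu(x))
# 0.2689…  -0.7616…  0.0  -0.01  -0.1587…
```

Note how ReLU kills the negative input outright (the "dying ReLU" failure mode), while Leaky ReLU and GELU let a small signal through.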

Complete MLP Implementation

Python · PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ─── Define MLP ───────────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(dims) - 1):
            layers += [
                nn.Linear(dims[i], dims[i+1]),
                nn.BatchNorm1d(dims[i+1]),  # normalise activations
                nn.GELU(),                  # smooth non-linearity
                nn.Dropout(dropout),        # regularisation
            ]
        layers.append(nn.Linear(dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# ─── Training Loop ─────────────────────────────────────────
def train(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()        # clear previous gradients
        logits = model(X)            # forward pass
        loss = criterion(logits, y)  # compute loss
        loss.backward()             # backpropagation
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
        optimizer.step()            # update weights
        total_loss += loss.item()
    return total_loss / len(loader)

# ─── Instantiate and run ────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(input_dim=784, hidden_dims=[512, 256, 128], output_dim=10).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

Convolutional Neural Networks

Spatial pattern recognition through learned filters — the foundation of computer vision.

Core Concepts

2D Convolution
(I * K)[i,j] = Σₘ Σₙ I[i+m, j+n] · K[m,n]

Output size = ⌊(N + 2P − F)/S⌋ + 1
N=input, P=padding, F=filter, S=stride
Filters (Kernels): Small weight matrices (e.g. 3×3, 5×5) that slide over the input, computing dot products. Each filter learns to detect a specific pattern — edges, textures, shapes. Stacking multiple filters creates the channels (depth) of the feature map.
Pooling: Max pooling keeps the strongest activation per region; average pooling takes the mean. Both reduce spatial dimensions while retaining salient features. Modern CNNs use global average pooling before the classifier head.
Receptive Field: The region of input space a neuron "sees". Stacking 3×3 convolutions grows it: 1 layer → 3×3, 2 layers → 5×5, 3 layers → 7×7. Deep CNNs build large receptive fields from small kernels — efficient and expressive.
Feature Hierarchy: Layer 1 detects edges and gradients; layer 2, textures and corners; layer 3, object parts; layers 4–5, whole objects and semantic concepts. This hierarchical representation is why CNN features transfer well across domains.
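The output-size formula above is worth internalising; a minimal sketch applying it to a few familiar layer configurations (the examples are illustrative):

```python
def conv_out(n, f, p=0, s=1):
    """Output size = floor((N + 2P − F) / S) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_out(224, 3, p=1, s=1))   # 224 — 3×3 'same' padding keeps spatial size
print(conv_out(224, 7, p=3, s=2))   # 112 — a 7×7 stride-2 stem conv halves it
print(conv_out(14, 2, p=0, s=2))    # 7   — 2×2 max-pool with stride 2
```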

Architecture Milestones

LeNet-5 (1998)
First practical CNN — handwritten digit recognition. Established conv→pool→fc pattern.
AlexNet (2012)
Sparked the DL revolution. ReLU, dropout, GPU training. ImageNet top-5: 15.3% error.
VGG-16 (2014)
Deep, uniform 3×3 conv stacks. Simple and effective. Still popular for transfer learning.
ResNet (2015)
Residual connections solved vanishing gradients. Enabled 152-layer networks. Skip connections = game changer.
EfficientNet (2019)
Compound scaling of width, depth, and resolution. Matched SOTA accuracy with up to 8.4× fewer parameters than prior ConvNets.
ConvNeXt (2022)
Modernised ResNet design inspired by Transformers. Competitive with ViT on ImageNet.

ResNet Skip Connection

Residual Block
y = F(x, {Wᵢ}) + x

F(x) = Conv → BN → ReLU → Conv → BN
Output = F(x) + x (identity shortcut)
Gradient: ∂L/∂x = ∂L/∂y · (∂F/∂x + 1) — the identity term contributes a constant 1, so the gradient keeps flowing even when ∂F/∂x ≈ 0, preventing vanishing
Python · ResNet Block
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)
        # Shortcut if stride changes spatial dims
        self.shortcut = nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(channels)
        ) if stride != 1 else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)   # ← skip connection
        return self.relu(out)

# Transfer learning with pretrained ResNet
import torchvision.models as models
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(2048, num_classes)   # replace head; num_classes = your task's class count

# Freeze backbone, fine-tune head only
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.fc.parameters():
    p.requires_grad = True

RNNs & LSTMs

Modelling sequential dependencies — from simple recurrent nets to gated memory architectures.

Recurrent Networks

Vanilla RNN
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = Wᵧ·hₜ + bᵧ

hₜ = hidden state at time t
xₜ = input at time t
⚠️
Vanishing Gradient Problem: In deep unrolled RNNs, gradients can shrink to ~0 over long sequences: ∂h₁₀₀/∂h₁ ≈ (∂hₜ/∂hₜ₋₁)¹⁰⁰ → 0 if |∂hₜ/∂hₜ₋₁| < 1. LSTMs and GRUs solve this with gating.
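The hundredth-power behaviour is stark when computed directly — a minimal sketch treating the per-step derivative as a single scalar factor:

```python
# Product of 100 per-step derivatives ∂hₜ/∂hₜ₋₁ in an unrolled RNN
for factor in (0.9, 1.0, 1.1):
    grad = factor ** 100
    print(f"|dh_t/dh_t-1| = {factor}: dh_100/dh_1 ≈ {grad:.3e}")
# 0.9 → ≈ 2.7e-05 (vanishes); 1.0 → 1.0; 1.1 → ≈ 1.4e+04 (explodes)
```

Anything persistently below 1 vanishes; anything above 1 explodes — hence gating (below) and gradient clipping.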

GRU (Gated Recurrent Unit)

zₜ = σ(Wz·[hₜ₋₁, xₜ]) — update gate
rₜ = σ(Wr·[hₜ₋₁, xₜ]) — reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) — candidate
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
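The four GRU equations translate almost line-for-line into code. A minimal NumPy sketch of one time step (toy sizes, random weights, biases omitted as in the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step following the equations above; [·,·] is concatenation."""
    hx  = np.concatenate([h_prev, x])
    z   = sigmoid(Wz @ hx)                              # update gate
    r   = sigmoid(Wr @ hx)                              # reset gate
    h_c = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_c                   # blend old and new

rng = np.random.default_rng(0)
H, E = 4, 3                                   # hidden and input sizes (toy)
Wz, Wr, W = (rng.normal(size=(H, H + E)) for _ in range(3))
h = np.zeros(H)
for t in range(5):                            # unroll over a short sequence
    h = gru_step(h, rng.normal(size=E), Wz, Wr, W)
print(h.shape)    # (4,) — the hidden state keeps its size across time
```

Because hₜ is a convex combination of hₜ₋₁ and a tanh candidate, every entry stays in (−1, 1) — the gates, not saturation, decide what is kept.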

LSTM Architecture

LSTM Gates
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — forget gate
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) — candidate
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ — cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — output gate
hₜ = oₜ⊙tanh(Cₜ)
💡
Cell State = Highway: The cell state Cₜ runs along the top of the LSTM with only minor linear interactions. Gradients can flow through it almost unchanged over hundreds of steps — solving the vanishing-gradient problem.
Python · Bidirectional LSTM
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,   # forward + backward
            dropout=0.3
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),  # *2 for bidir
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths):
        emb = self.embedding(x)           # (B, T, E)
        # Pack for variable-length sequences
        packed = nn.utils.rnn.pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
        out, (hn, _) = self.lstm(packed)
        # Concat last forward + backward hidden states
        last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (B, H*2)
        return self.classifier(last_hidden)

Transformers

The architecture that redefined AI — self-attention, positional encoding, and the models built on top.

Self-Attention

Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

Q = X·Wq, K = X·Wk, V = X·Wv (learned projections)
dₖ = key dimension; dividing by √dₖ keeps softmax inputs small so its gradients don't vanish
Multi-Head Attention
MHA(Q,K,V) = Concat(head₁, ..., headₕ)·Wo
headᵢ = Attention(Q·Wqᵢ, K·Wkᵢ, V·Wvᵢ)

Each head learns different relationship types
🔑
Why Attention Works: Unlike RNNs, attention computes relationships between ALL pairs of tokens in O(n²) — but fully in parallel. Long-range dependencies cost the same as short-range ones.

Encoder-Decoder Structure

  • 1
    Input Embedding + PE
    Token IDs → embeddings. Add sinusoidal or learnable positional encoding to inject sequence order.
  • 2
    Encoder Block
    Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Repeated N times.
  • 3
    Decoder Block
    Masked Self-Attention → Cross-Attention (attends to encoder) → FFN. Generates one token at a time.
  • 4
    Output Projection
    Linear + Softmax over vocabulary. At inference: greedy / beam search / nucleus sampling.
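Step 1's sinusoidal positional encoding can be sketched in a few lines of NumPy — PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(pos/10000^(2i/d)) (d_model must be even here; sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos   = np.arange(max_len)[:, None]            # (max_len, 1)
    i     = np.arange(0, d_model, 2)[None, :]      # even dims  (1, d_model/2)
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even indices → sin
    pe[:, 1::2] = np.cos(angle)                    # odd indices  → cos
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
print(pe.shape)        # (128, 64)
print(pe[0, :4])       # position 0 → [0, 1, 0, 1]: sin(0), cos(0) alternating
```

Each dimension oscillates at a different frequency, so any position gets a unique, bounded fingerprint that extrapolates beyond the training length.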

Popular Variants

| Model | Type | Params | Key Use | Innovation |
|---|---|---|---|---|
| BERT | Encoder-only | 110M–340M | Classification, NER, QA | Masked language modelling (MLM) |
| GPT-4 | Decoder-only | ~1.8T (est.) | Text generation, chat | RLHF + MoE scaling |
| T5 | Encoder-decoder | 11B | Summarisation, translation | Text-to-text framing |
| ViT | Encoder-only | 86M–632M | Image classification | Patch embeddings replace CNN |
| Llama 3 | Decoder-only | 8B–70B | Open-source LLM | GQA, RoPE, SwiGLU |
| Whisper | Encoder-decoder | 39M–1.5B | Speech recognition | Multitask audio transformer |
Python · Self-Attention from Scratch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.h   = num_heads
        self.Wq  = nn.Linear(d_model, d_model)
        self.Wk  = nn.Linear(d_model, d_model)
        self.Wv  = nn.Linear(d_model, d_model)
        self.Wo  = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, T, D = q.shape
        Q = self.Wq(q).view(B, T, self.h, self.d_k).transpose(1,2)
        K = self.Wk(k).view(B, -1, self.h, self.d_k).transpose(1,2)
        V = self.Wv(v).view(B, -1, self.h, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out  = (attn @ V).transpose(1,2).reshape(B, T, D)
        return self.Wo(out), attn

GANs & Generative Models

Adversarial training, VAEs, and diffusion — teaching machines to create.

GAN Framework

Minimax Objective
min_G max_D V(D,G) =
E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]

D(x) → 1 for real, 0 for fake
G(z) → fool D into D(G(z)) → 1
⚠️
Training Instability: GANs suffer from mode collapse (G covers only a few modes) and vanishing gradients when D is too strong. Mitigations: WGAN-GP, spectral norm, minibatch discrimination, progressive growing.

GAN Variants

| Variant | Innovation |
|---|---|
| DCGAN | Conv layers, batch norm — stable training |
| WGAN-GP | Wasserstein loss + gradient penalty |
| StyleGAN 3 | Alias-free generation, style mixing |
| CycleGAN | Unpaired image translation |
| Pix2Pix | Paired image-to-image translation |

VAE vs. GAN vs. Diffusion

🧮
VAE (Variational Autoencoder)
Encodes input to latent distribution N(μ,σ²). Maximises ELBO = reconstruction − KL(q‖p). Smooth latent space. Blurry outputs.
⚔️
GAN
Generator vs. discriminator adversarial game. Sharp, photorealistic outputs. Hard to train, mode collapse risk.
❄️
Diffusion Models
Gradually add Gaussian noise to data, train a U-Net to predict and reverse the noise. State-of-the-art quality. Slower inference (many steps). Stable Diffusion, DALL-E 3, Imagen.
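The "gradually add Gaussian noise" forward process has a closed form: q(xₜ|x₀) = N(√ᾱₜ·x₀, (1−ᾱₜ)·I) with ᾱₜ = Πₛ(1−βₛ). A minimal NumPy sketch with a linear β schedule (T=1000 and the β range follow the DDPM paper; the sample shape is a toy stand-in):

```python
import numpy as np

T = 1000
betas      = np.linspace(1e-4, 0.02, T)    # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)       # ᾱₜ = Πₛ (1 − βₛ)

rng = np.random.default_rng(0)
x0  = rng.normal(size=(3, 32, 32))         # a "clean" sample (toy shape)

def q_sample(x0, t):
    """Jump straight to step t: xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

xt = q_sample(x0, t=999)
print(alphas_bar[0], alphas_bar[-1])  # ≈ 0.9999 → ≈ 0: signal fades to pure noise
```

The reverse (denoising) model is trained to predict ε from xₜ and t; sampling then runs this chain backwards.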
Python · DCGAN Generator
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        def block(in_c, out_c, stride=2, padding=1):
            return [
                nn.ConvTranspose2d(in_c, out_c, 4, stride, padding, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(True)
            ]
        self.net = nn.Sequential(
            # 1×1 → 4×4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),
            *block(512, 256),  # 8×8
            *block(256, 128),  # 16×16
            *block(128, 64),   # 32×32
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()          # 64×64, range [-1,1]
        )
    def forward(self, z):
        return self.net(z.view(-1, z.shape[1], 1, 1))

Training Deep Learning Models

Optimisers, regularisation, hyperparameter tuning, and tricks that separate good models from great ones.

Optimisers

SGD + Momentum: vₜ = β·vₜ₋₁ + ∇L; θ ← θ − η·vₜ. Momentum β ≈ 0.9 dampens oscillations and accelerates convergence. Still a strong choice for CNNs with careful LR scheduling. The Nesterov variant evaluates the gradient at the look-ahead point.
Adam: mₜ = β₁·mₜ₋₁ + (1−β₁)·g; vₜ = β₂·vₜ₋₁ + (1−β₂)·g²; θ ← θ − η·m̂ₜ/(√v̂ₜ + ε). Defaults: β₁=0.9, β₂=0.999, ε=1e-8, η=3e-4. A robust default for most tasks.
AdamW: decouples weight decay from the gradient update: θ ← θ − η·(m̂ₜ/(√v̂ₜ + ε) + λθ). Preferred over Adam for transformers and LLMs. Use weight_decay=0.01–0.1.
LR Schedules: cosine annealing sweeps η from ηₘₐₓ to ηₘᵢₙ; warmup + cosine (transformers) linearly ramps the LR for the first N steps, then decays; OneCycleLR achieves super-convergence with a very high max LR; ReduceLROnPlateau decays adaptively when a metric stagnates.
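The warmup + cosine schedule described above fits in a few lines. A minimal sketch (the step counts and peak LR are illustrative):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup for warmup_steps, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps            # linear ramp 0 → lr_max
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total, warmup, lr_max = 10_000, 500, 3e-4
print(warmup_cosine_lr(0, total, warmup, lr_max))       # 0.0
print(warmup_cosine_lr(500, total, warmup, lr_max))     # 0.0003 — peak at warmup end
print(warmup_cosine_lr(10_000, total, warmup, lr_max))  # 0.0 — fully decayed
```

PyTorch's `LambdaLR` accepts exactly this kind of function (as a multiplier of the base LR).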

Regularisation Techniques

| Technique | Effectiveness |
|---|---|
| Dropout (p=0.5) | 82% |
| Batch Normalisation | 91% |
| Weight Decay (L2) | 78% |
| Data Augmentation | 88% |
| Early Stopping | 74% |
| Label Smoothing | 70% |

Effectiveness score (higher = more commonly beneficial across task types)

Batch vs. Layer Normalisation

| Type | Normalises Over | Best For |
|---|---|---|
| BatchNorm | Batch dimension | CNNs, large batches |
| LayerNorm | Feature dimension | Transformers, NLP, RNNs |
| GroupNorm | Groups of channels | Small batch sizes |
| RMSNorm | Feature dim (simpler) | Modern LLMs (Llama) |
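The batch-vs-feature distinction above is just a choice of axis. A minimal NumPy sketch (toy shapes; learnable scale/shift omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, size=(8, 16))   # (batch, features)

def normalise(x, axis):
    mu  = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

bn = normalise(x, axis=0)    # BatchNorm: statistics per feature, over the batch
ln = normalise(x, axis=1)    # LayerNorm: statistics per sample, over features

print(np.allclose(bn.mean(axis=0), 0, atol=1e-6))   # True — zero mean per feature
print(np.allclose(ln.mean(axis=1), 0, atol=1e-6))   # True — zero mean per sample
```

Because LayerNorm's statistics never touch the batch dimension, it behaves identically at batch size 1 — one reason transformers prefer it.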
Python · Mixed Precision + Gradient Scaling
import torch
from torch.cuda.amp import autocast, GradScaler  # moved to torch.amp in recent PyTorch

scaler = GradScaler()   # handles FP16 loss scaling

def train_step(model, batch, optimizer, criterion):
    X, y = batch
    optimizer.zero_grad()

    with autocast():                  # FP16 forward pass
        logits = model(X)
        loss   = criterion(logits, y)

    scaler.scale(loss).backward()    # scaled gradients
    scaler.unscale_(optimizer)         # unscale before clip
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)             # update weights
    scaler.update()                    # adjust scale factor
    return loss.item()

# LR warmup + cosine decay (Transformers best practice)
import math

def get_lr(step, d_model, warmup_steps):
    if step == 0: return 0.0
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return d_model ** -0.5 * scale   # original transformer formula

Deployment & MLOps

Getting models from experiment to production — export, serve, containerise, and monitor.

Export Pipeline

  • 1
    Train & Validate
    Achieve target metrics. Save checkpoint with torch.save() or Hugging Face safetensors.
  • 2
    Export to ONNX
    torch.onnx.export() converts the model to a framework-agnostic graph for cross-platform inference.
  • 3
    Optimise with TensorRT
    trtexec or ONNX-TensorRT converts ONNX to a TensorRT engine. 3–10× faster on NVIDIA GPUs.
  • 4
    Serve via FastAPI
    Wrap inference in a REST endpoint. Use ONNX Runtime for lightweight CPU/GPU serving.
  • 5
    Containerise
    Docker image with model weights + FastAPI. Push to ECR / ACR / GCR and deploy to Kubernetes.
  • 6
    Monitor with MLflow
    Track experiments, model versions, metrics drift. Set up alerts for data/concept drift.

Optimisation Techniques

✂️
Quantisation (INT8/FP16)
Reduce precision of weights/activations. 2–4× memory reduction, 2–3× speedup with minimal accuracy loss. Post-training quantisation (PTQ) or QAT.
🪄
Pruning
Remove low-magnitude weights (unstructured) or entire filters/heads (structured). 40–80% parameter reduction with retraining.
🎓
Knowledge Distillation
Train small student to mimic large teacher's soft probability outputs. DistilBERT = 40% smaller, 60% faster, 97% of BERT's performance.
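The "soft probability" objective behind distillation can be sketched in NumPy: soften both logit sets with temperature T, then minimise T²·KL(teacher ‖ student). The T² factor and the toy logits are illustrative; in practice this term is combined with ordinary cross-entropy on the hard labels.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """T²·KL(teacher_T ‖ student_T): match the softened teacher distribution."""
    p = softmax(teacher_logits, T)           # soft targets
    q = softmax(student_logits, T)
    return T**2 * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([6.0, 1.0, -2.0])
print(distill_loss(teacher, teacher))          # 0.0 — identical logits, no loss
print(distill_loss(np.zeros(3), teacher) > 0)  # True — positive otherwise
```

Temperature T > 1 exposes the teacher's "dark knowledge" — the relative probabilities of wrong classes — which a hard label throws away.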
Python · ONNX Export + FastAPI Server
import torch
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

# ─── Export to ONNX ─────────────────────────────────────────
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}}   # variable batch size
)

# ─── ONNX Runtime Inference ─────────────────────────────────
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=sess_opts,
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ─── FastAPI endpoint ────────────────────────────────────────
app = FastAPI(title="Deep Learning Model API")

class PredictRequest(BaseModel):
    image: list[list[list[list[float]]]]  # NCHW float array

@app.post("/predict")
async def predict(req: PredictRequest):
    x = np.array(req.image, dtype=np.float32)
    logits = session.run(["logits"], {"image": x})[0]
    logits = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    probs  = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    top_k  = np.argsort(probs[0])[::-1][:5]
    return {"top5_classes": top_k.tolist(), "probs": probs[0][top_k].tolist()}

Code Lab

Complete, production-quality code examples for the most common deep learning tasks.

Python · Complete MNIST Training
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ─── Data ────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds  = datasets.MNIST("./data", train=False, transform=transform)
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
test_dl  = DataLoader(test_ds,  batch_size=256)

# ─── Model ────────────────────────────────────────────────────
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),   # 28→14
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10)
        )
    def forward(self, x): return self.classifier(self.features(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvNet().to(device)
opt = optim.AdamW(model.parameters(), lr=1e-3)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, steps_per_epoch=len(train_dl), epochs=10)
crit = nn.CrossEntropyLoss()

# ─── Train ────────────────────────────────────────────────────
for epoch in range(10):
    model.train()
    for X, y in train_dl:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss = crit(model(X), y)
        loss.backward()
        opt.step(); sched.step()
    model.eval()
    with torch.no_grad():   # disable autograd during evaluation
        correct = sum((model(X.to(device)).argmax(1) == y.to(device)).sum().item() for X, y in test_dl)
    print(f"Epoch {epoch+1}: acc={correct/len(test_ds)*100:.2f}%")
Python · Transfer Learning (EfficientNet)
import torch
import torchvision.models as models
from torchvision import transforms, datasets
from torch.utils.data import DataLoader
import torch.nn as nn

# ─── Augmentation pipeline ────────────────────────────────────
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ─── Load pretrained EfficientNet-B2 ─────────────────────────
backbone = models.efficientnet_b2(weights="IMAGENET1K_V1")
num_ftrs  = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(num_ftrs, num_classes)
)

# Phase 1: train only head (frozen backbone)
for p in backbone.features.parameters(): p.requires_grad = False
opt1 = torch.optim.AdamW(backbone.classifier.parameters(), lr=3e-3)

# Phase 2: unfreeze and fine-tune all layers
for p in backbone.parameters(): p.requires_grad = True
opt2 = torch.optim.AdamW(backbone.parameters(), lr=3e-5)  # low LR!
Python · BERT Fine-tuning (Hugging Face)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ─── Load model & tokeniser ───────────────────────────────────
model_name = "bert-base-uncased"
tokeniser  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# ─── Tokenise dataset ─────────────────────────────────────────
dataset = load_dataset("imdb")
def tokenise(batch):
    return tokeniser(batch["text"], truncation=True, max_length=512, padding="max_length")
dataset = dataset.map(tokenise, batched=True)

# ─── Training ─────────────────────────────────────────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,            # low LR for fine-tuning
    warmup_ratio=0.06,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    fp16=True,                      # mixed precision
    logging_steps=100,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics)
trainer.train()
Python · Custom Dataset + Augmentation
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd, os

class ImageDataset(Dataset):
    def __init__(self, csv_path, img_dir, transform=None):
        self.df        = pd.read_csv(csv_path)   # columns: filename, label
        self.img_dir   = img_dir
        self.transform = transform

    def __len__(self): return len(self.df)

    def __getitem__(self, idx):
        row   = self.df.iloc[idx]
        img   = Image.open(os.path.join(self.img_dir, row.filename)).convert("RGB")
        label = row.label
        if self.transform: img = self.transform(img)
        return img, label

# Heavy augmentation for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

ds     = ImageDataset("train.csv", "./images", transform=train_transform)
loader = DataLoader(ds, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)