Complete Deep Learning Curriculum

Master Deep Learning
from Neurons to Production

A comprehensive, hands-on reference covering neural network theory, architectures, training techniques, and real-world deployment — from first principles to state-of-the-art models.


Learning Path

1. Foundations
Linear algebra, calculus, probability — the math powering every model
2. Neural Networks
Perceptrons → MLPs, activation functions, forward/backprop
3. CNNs
Convolutions, pooling, ResNet, EfficientNet for vision tasks
4. RNNs & LSTMs
Sequence modeling, vanishing gradients, gated architectures
5. Transformers
Attention mechanism, BERT, GPT, ViT — modern AI backbone
6. GANs & Diffusion
Generative models, adversarial training, image synthesis
7. Training Mastery
Optimizers, regularisation, hyperparameter tuning, mixed precision
8. Production
ONNX, TensorRT, FastAPI, Docker, MLflow, monitoring

What You'll Learn

🧮
Mathematical Foundations
Tensors, matrix ops, chain rule, probability distributions — the core maths behind every DL algorithm.
⚙️
Architecture Design
When to choose CNN vs. RNN vs. Transformer. Design intuitions with real trade-off tables.
🔬
Training Techniques
Adam, batch norm, dropout, learning-rate scheduling, gradient clipping and more.
🚀
Production Deployment
Export models to ONNX, serve with TorchServe/FastAPI, containerise with Docker, monitor with MLflow.

Architecture Comparison

| Architecture | Best For | Key Innovation | Parameters | Year |
|---|---|---|---|---|
| MLP | Tabular data, classification | Universal approximator | Thousands | 1986 |
| CNN | Images, video, audio spectrograms | Weight sharing, local connectivity | Millions | 1998 |
| LSTM | Time series, NLP sequences | Gated memory cells | Millions | 1997 |
| Transformer | NLP, vision, multimodal | Self-attention, parallelisation | Billions | 2017 |
| GAN | Image synthesis, data augmentation | Adversarial training | Millions–billions | 2014 |
| Diffusion | Image/video/audio generation | Denoising score matching | Billions | 2020 |

Mathematical Foundations

The core maths every deep learning practitioner must understand — from tensors to gradients.

Tensors

Tensors are the fundamental data structure in deep learning — generalisations of scalars, vectors, and matrices to arbitrary dimensions (ranks).

| Rank | Name | Example Shape | DL Use |
|---|---|---|---|
| 0 | Scalar | () | Loss value, learning rate |
| 1 | Vector | (512,) | Embedding, bias |
| 2 | Matrix | (64, 512) | Weight matrix, batch |
| 3 | 3D Tensor | (32, 128, 512) | Batch of sequences |
| 4 | 4D Tensor | (32, 3, 224, 224) | Batch of images (NCHW) |

Essential Operations

Matrix Multiplication
C[i,j] = Σ_k A[i,k] · B[k,j]
C = A @ B → shape: (m,n) @ (n,p) = (m,p)
Dot Product / Inner Product
a · b = Σᵢ aᵢbᵢ = |a||b|cos(θ)
Hadamard (Element-wise)
(A ⊙ B)[i,j] = A[i,j] · B[i,j]
Broadcast Rule
Dims aligned right; size-1 dims expand to match
Python · PyTorch
import torch

# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from list
zeros = torch.zeros(3, 4)                    # shape (3,4)
rand  = torch.randn(32, 512)                # normal dist

# Fundamental ops
W = torch.randn(512, 256)
b = torch.zeros(256)
out = rand @ W + b   # (32,512)@(512,256) + (256,) broadcasts → (32,256)

# Reshape, transpose, squeeze
t = torch.arange(24).reshape(2, 3, 4)
t_T = t.transpose(1, 2)           # (2,4,3)
flat = t.flatten(1)               # (2,12)

# GPU transfer
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)

The Chain Rule — Heart of Backprop

Chain Rule
dL/dw = (dL/dy) · (dy/dw)

For composition f(g(x)):
df/dx = (df/dg) · (dg/dx)
Gradient Descent Update
θ ← θ − η · ∇_θ L(θ)

where η = learning rate
∇_θ L = gradient of loss w.r.t. θ
💡
Key Insight: The gradient points in the direction of steepest ascent in loss space. We subtract it to descend toward lower loss.

Partial Derivatives in Layers

For a linear layer y = Wx + b and loss L:

Gradients of Linear Layer
∂L/∂W = (∂L/∂y) · xᵀ
∂L/∂b = ∂L/∂y
∂L/∂x = Wᵀ · (∂L/∂y)
Jacobian Matrix
J[i,j] = ∂yᵢ/∂xⱼ

For vector → vector functions
Shape: (dim_y × dim_x)
Python · Autograd
import torch

# Automatic differentiation with requires_grad
x = torch.tensor([2.0, 3.0], requires_grad=True)
W = torch.randn(2, 2, requires_grad=True)

# Forward pass — builds computation graph
y = x @ W            # (2,) @ (2,2) → (2,)
loss = y.sum()       # scalar loss

# Backward pass — computes gradients via chain rule
loss.backward()
print(x.grad)        # ∂loss/∂x
print(W.grad)        # ∂loss/∂W

# Manual gradient-descent step (no autograd tracking)
with torch.no_grad():
    W -= 0.01 * W.grad  # gradient descent step
    W.grad.zero_()       # must zero before next backward()
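Autograd's results can be sanity-checked against finite differences. A minimal NumPy sketch (toy sizes, not from the original): for L = Σᵢyᵢ with y = Wx, the analytic gradient ∂L/∂W = (∂L/∂y)·xᵀ should match a central-difference estimate.

```python
import numpy as np

# Verify the linear-layer gradient dL/dW = (dL/dy) · x^T numerically.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))

loss = lambda W: (W @ x).sum()        # L = sum(y), so dL/dy = 1 for every y_i

# Analytic gradient: outer product of dL/dy (all ones) with x
analytic = np.outer(np.ones(2), x)

# Central finite differences: perturb each W[i,j] by ±eps
eps = 1e-5
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same check (with a tighter budget — it costs two forward passes per parameter) is the standard way to debug a hand-written backward pass.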

Key Distributions in DL

| Distribution | Use in DL |
|---|---|
| Normal N(μ, σ²) | Weight init, noise injection, VAE latent |
| Bernoulli | Binary classification output, dropout |
| Categorical | Multi-class softmax output, token prediction |
| Uniform | Xavier init, random sampling |
| Dirichlet | Topic models, mixture models |
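Each distribution above maps to a one-liner in practice. A minimal NumPy sketch (sizes and probabilities are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(42)

w   = rng.normal(0.0, 0.02, size=(512, 256))        # Normal: e.g. N(0, 0.02²) weight init
d   = rng.binomial(1, 0.5, size=512)                # Bernoulli(0.5): a dropout mask
tok = rng.choice(5, p=[0.7, 0.1, 0.1, 0.05, 0.05])  # Categorical: sampling one of 5 tokens
u   = rng.uniform(-0.1, 0.1, size=256)              # Uniform: Xavier-style init range
mix = rng.dirichlet([1.0, 1.0, 1.0])                # Dirichlet: mixture weights, sums to 1

print(mix.sum())  # ≈ 1.0
```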

Loss Functions as Likelihoods

Cross-Entropy Loss (Classification)
L = −Σᵢ yᵢ · log(ŷᵢ)

= −log P(true class | input)
MSE Loss (Regression)
L = (1/n) Σᵢ (yᵢ − ŷᵢ)²

= MLE under Gaussian noise assumption
KL Divergence (VAE)
KL(P‖Q) = Σᵢ P(x) log[P(x)/Q(x)]
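The "loss = likelihood" identity is easy to check numerically: with a one-hot target, cross-entropy collapses to −log of the probability assigned to the true class. A tiny sketch (the 4-class example is invented for illustration):

```python
import numpy as np

# One-hot target: true class is index 2 of 4
y    = np.array([0.0, 0.0, 1.0, 0.0])
yhat = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution

ce  = -np.sum(y * np.log(yhat))          # L = −Σᵢ yᵢ·log ŷᵢ
nll = -np.log(yhat[2])                   # −log P(true class | input)
print(ce, nll)                           # identical: 0.5108...
```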

Information Theory

Shannon Entropy
H(X) = −Σᵢ P(xᵢ) · log₂P(xᵢ)
Measures uncertainty / information content
Mutual Information
I(X;Y) = H(X) − H(X|Y)
How much Y tells us about X
ℹ️
Why It Matters: Cross-entropy loss is just the negative log-likelihood; minimising it minimises the KL divergence between the predicted and true distributions — directly rooted in information theory.
🔑
Softmax Temperature: Dividing logits by temperature T before softmax controls sharpness: T→0 approaches argmax, T→∞ approaches uniform. Used in knowledge distillation and when sampling from LLMs.
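The temperature effect is visible with three logits. A minimal sketch (the logit values are invented for illustration):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_T(logits, T=1.0))     # moderate: ≈ [0.66, 0.24, 0.10]
print(softmax_T(logits, T=0.1))     # sharp → approaches one-hot argmax
print(softmax_T(logits, T=100.0))   # flat → approaches uniform (1/3 each)
```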

Neural Networks

From the biological neuron to deep multi-layer perceptrons — theory, math, and interactive visualisation.

Architecture

  • 1
    Input Layer
    Receives raw features. No computation — passes values forward. Each node = one feature.
  • 2
    Hidden Layers
    Each neuron computes z = Wx + b, then applies activation σ(z). Multiple hidden layers = "deep" network.
  • 3
    Output Layer
    Produces predictions. Activation depends on task: sigmoid (binary), softmax (multiclass), linear (regression).
  • 4
    Forward Pass
    Data flows input→output. Loss is computed comparing prediction to ground truth.
  • 5
    Backpropagation
    Gradients flow output→input via chain rule. Each weight updated: w ← w − η·∂L/∂w.

Activation Functions

Activation Functions Comparison

| Function | Formula | Range | Use Case | Drawback |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Binary output | Vanishing gradient |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (−1, 1) | Hidden layers (historical) | Vanishing gradient |
| ReLU | max(0, x) | [0, ∞) | Most hidden layers | Dying ReLU |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | Fixes dying ReLU | Extra hyperparameter |
| GELU | x·Φ(x) | ≈(−0.17, ∞) | Transformers (BERT, GPT) | More compute |
| Swish | x·sigmoid(x) | ≈(−0.28, ∞) | EfficientNet | More compute |
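The comparison is concrete once evaluated at a point. A minimal sketch using the exact formulas (GELU via the Gaussian CDF Φ, computed with `math.erf`):

```python
import numpy as np
from math import erf

sigmoid    = lambda x: 1 / (1 + np.exp(-x))
relu       = lambda x: np.maximum(0.0, x)
leaky_relu = lambda x, a=0.01: np.maximum(a * x, x)
gelu       = lambda x: x * 0.5 * (1 + erf(x / np.sqrt(2)))   # x·Φ(x), exact form

x = -1.0
print(sigmoid(x), np.tanh(x), relu(x), leaky_relu(x), gelu(x))
# 0.2689…  -0.7616…  0.0  -0.01  -0.1587…
```

Note how ReLU kills the negative input outright (the "dying ReLU" failure mode), while Leaky ReLU and GELU let a small signal through.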

Complete MLP Implementation

Python · PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ─── Define MLP ───────────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(dims) - 1):
            layers += [
                nn.Linear(dims[i], dims[i+1]),
                nn.BatchNorm1d(dims[i+1]),  # normalise activations
                nn.GELU(),                  # smooth non-linearity
                nn.Dropout(dropout),        # regularisation
            ]
        layers.append(nn.Linear(dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# ─── Training Loop ─────────────────────────────────────────
def train(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()        # clear previous gradients
        logits = model(X)            # forward pass
        loss = criterion(logits, y)  # compute loss
        loss.backward()             # backpropagation
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
        optimizer.step()            # update weights
        total_loss += loss.item()
    return total_loss / len(loader)

# ─── Instantiate and run ────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(input_dim=784, hidden_dims=[512, 256, 128], output_dim=10).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

Convolutional Neural Networks

Spatial pattern recognition through learned filters — the foundation of computer vision.

Core Concepts

2D Convolution
(I * K)[i,j] = Σₘ Σₙ I[i+m, j+n] · K[m,n]

Output size = ⌊(N + 2P − F)/S⌋ + 1
N=input, P=padding, F=filter, S=stride
Filters (Kernels): Small weight matrices (e.g. 3×3, 5×5) that slide over the input, computing dot products. Each filter learns to detect a specific pattern — edges, textures, shapes. Stacking multiple filters creates the channels (depth) of the feature map.
Pooling: Max pooling keeps the strongest activation per region; average pooling takes the mean. Both reduce spatial dimensions while retaining salient features. Modern CNNs use global average pooling before the classifier head.
Receptive Field: The region of input space a neuron "sees". Stacking 3×3 convolutions grows it: 1 layer → 3×3, 2 layers → 5×5, 3 layers → 7×7. Deep CNNs build large receptive fields from small kernels — efficient and expressive.
Feature Hierarchy: Layer 1 detects edges and gradients; layer 2, textures and corners; layer 3, object parts; layers 4–5, whole objects and semantic concepts. This hierarchical representation is why CNN features transfer well across domains.
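The output-size formula above is worth internalising; a minimal sketch applying it to a few familiar layer configurations (the examples are illustrative):

```python
def conv_out(n, f, p=0, s=1):
    """Output size = floor((N + 2P − F) / S) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_out(224, 3, p=1, s=1))   # 224 — 3×3 'same' padding keeps spatial size
print(conv_out(224, 7, p=3, s=2))   # 112 — a 7×7 stride-2 stem conv halves it
print(conv_out(14, 2, p=0, s=2))    # 7   — 2×2 max-pool with stride 2
```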

Architecture Milestones

LeNet-5 (1998)
First practical CNN — handwritten digit recognition. Established conv→pool→fc pattern.
AlexNet (2012)
Sparked the DL revolution. ReLU, dropout, GPU training. ImageNet top-5: 15.3% error.
VGG-16 (2014)
Deep, uniform 3×3 conv stacks. Simple and effective. Still popular for transfer learning.
ResNet (2015)
Residual connections solved vanishing gradients. Enabled 152-layer networks. Skip connections = game changer.
EfficientNet (2019)
Compound scaling of width, depth, and resolution. Matched SOTA accuracy with up to 8.4× fewer parameters than prior ConvNets.
ConvNeXt (2022)
Modernised ResNet design inspired by Transformers. Competitive with ViT on ImageNet.

ResNet Skip Connection

Residual Block
y = F(x, {Wᵢ}) + x

F(x) = Conv → BN → ReLU → Conv → BN
Output = F(x) + x (identity shortcut)
Gradient: ∂L/∂x = ∂L/∂y · (∂F/∂x + 1) — the identity term contributes a constant 1, so the gradient keeps flowing even when ∂F/∂x ≈ 0, preventing vanishing
Python · ResNet Block
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)
        self.relu  = nn.ReLU(inplace=True)
        # Shortcut if stride changes spatial dims
        self.shortcut = nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(channels)
        ) if stride != 1 else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)   # ← skip connection
        return self.relu(out)

# Transfer learning with pretrained ResNet
import torchvision.models as models
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(2048, num_classes)   # replace head; num_classes = your task's class count

# Freeze backbone, fine-tune head only
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.fc.parameters():
    p.requires_grad = True

RNNs & LSTMs

Modelling sequential dependencies — from simple recurrent nets to gated memory architectures.

Recurrent Networks

Vanilla RNN
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = Wᵧ·hₜ + bᵧ

hₜ = hidden state at time t
xₜ = input at time t
⚠️
Vanishing Gradient Problem: In deep unrolled RNNs, gradients can shrink to ~0 over long sequences: ∂h₁₀₀/∂h₁ ≈ (∂hₜ/∂hₜ₋₁)¹⁰⁰ → 0 if |∂hₜ/∂hₜ₋₁| < 1. LSTMs and GRUs solve this with gating.
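The hundredth-power behaviour is stark when computed directly — a minimal sketch treating the per-step derivative as a single scalar factor:

```python
# Product of 100 per-step derivatives ∂hₜ/∂hₜ₋₁ in an unrolled RNN
for factor in (0.9, 1.0, 1.1):
    grad = factor ** 100
    print(f"|dh_t/dh_t-1| = {factor}: dh_100/dh_1 ≈ {grad:.3e}")
# 0.9 → ≈ 2.7e-05 (vanishes); 1.0 → 1.0; 1.1 → ≈ 1.4e+04 (explodes)
```

Anything persistently below 1 vanishes; anything above 1 explodes — hence gating (below) and gradient clipping.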

GRU (Gated Recurrent Unit)

zₜ = σ(Wz·[hₜ₋₁, xₜ]) — update gate
rₜ = σ(Wr·[hₜ₋₁, xₜ]) — reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) — candidate
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
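The four GRU equations translate almost line-for-line into code. A minimal NumPy sketch of one time step (toy sizes, random weights, biases omitted as in the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step following the equations above; [·,·] is concatenation."""
    hx  = np.concatenate([h_prev, x])
    z   = sigmoid(Wz @ hx)                              # update gate
    r   = sigmoid(Wr @ hx)                              # reset gate
    h_c = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_c                   # blend old and new

rng = np.random.default_rng(0)
H, E = 4, 3                                   # hidden and input sizes (toy)
Wz, Wr, W = (rng.normal(size=(H, H + E)) for _ in range(3))
h = np.zeros(H)
for t in range(5):                            # unroll over a short sequence
    h = gru_step(h, rng.normal(size=E), Wz, Wr, W)
print(h.shape)    # (4,) — the hidden state keeps its size across time
```

Because hₜ is a convex combination of hₜ₋₁ and a tanh candidate, every entry stays in (−1, 1) — the gates, not saturation, decide what is kept.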

LSTM Architecture

LSTM Gates
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — forget gate
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) — candidate
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ — cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — output gate
hₜ = oₜ⊙tanh(Cₜ)
💡
Cell State = Highway: The cell state Cₜ runs along the top of the LSTM with only minor linear interactions. Gradients can flow through it almost unchanged over hundreds of steps — solving the vanishing-gradient problem.
Python · Bidirectional LSTM
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,   # forward + backward
            dropout=0.3
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),  # *2 for bidir
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths):
        emb = self.embedding(x)           # (B, T, E)
        # Pack for variable-length sequences
        packed = nn.utils.rnn.pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
        out, (hn, _) = self.lstm(packed)
        # Concat last forward + backward hidden states
        last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (B, H*2)
        return self.classifier(last_hidden)

Transformers

The architecture that redefined AI — self-attention, positional encoding, and the models built on top.

Self-Attention

Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

Q = X·Wq, K = X·Wk, V = X·Wv (learned projections)
dₖ = key dimension; dividing by √dₖ keeps softmax inputs small so its gradients don't vanish
Multi-Head Attention
MHA(Q,K,V) = Concat(head₁, ..., headₕ)·Wo
headᵢ = Attention(Q·Wqᵢ, K·Wkᵢ, V·Wvᵢ)

Each head learns different relationship types
🔑
Why Attention Works: Unlike RNNs, attention computes relationships between ALL pairs of tokens in O(n²) — but fully in parallel. Long-range dependencies cost the same as short-range ones.

Encoder-Decoder Structure

  • 1
    Input Embedding + PE
    Token IDs → embeddings. Add sinusoidal or learnable positional encoding to inject sequence order.
  • 2
    Encoder Block
    Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Repeated N times.
  • 3
    Decoder Block
    Masked Self-Attention → Cross-Attention (attends to encoder) → FFN. Generates one token at a time.
  • 4
    Output Projection
    Linear + Softmax over vocabulary. At inference: greedy / beam search / nucleus sampling.
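Step 1's sinusoidal positional encoding can be sketched in a few lines of NumPy — PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(pos/10000^(2i/d)) (d_model must be even here; sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos   = np.arange(max_len)[:, None]            # (max_len, 1)
    i     = np.arange(0, d_model, 2)[None, :]      # even dims  (1, d_model/2)
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even indices → sin
    pe[:, 1::2] = np.cos(angle)                    # odd indices  → cos
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
print(pe.shape)        # (128, 64)
print(pe[0, :4])       # position 0 → [0, 1, 0, 1]: sin(0), cos(0) alternating
```

Each dimension oscillates at a different frequency, so any position gets a unique, bounded fingerprint that extrapolates beyond the training length.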

Popular Variants

| Model | Type | Params | Key Use | Innovation |
|---|---|---|---|---|
| BERT | Encoder-only | 110M–340M | Classification, NER, QA | Masked language modelling (MLM) |
| GPT-4 | Decoder-only | ~1.8T (est.) | Text generation, chat | RLHF + MoE scaling |
| T5 | Encoder-decoder | 11B | Summarisation, translation | Text-to-text framing |
| ViT | Encoder-only | 86M–632M | Image classification | Patch embeddings replace CNN |
| Llama 3 | Decoder-only | 8B–70B | Open-source LLM | GQA, RoPE, SwiGLU |
| Whisper | Encoder-decoder | 39M–1.5B | Speech recognition | Multitask audio transformer |
Python · Self-Attention from Scratch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.h   = num_heads
        self.Wq  = nn.Linear(d_model, d_model)
        self.Wk  = nn.Linear(d_model, d_model)
        self.Wv  = nn.Linear(d_model, d_model)
        self.Wo  = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, T, D = q.shape
        Q = self.Wq(q).view(B, T, self.h, self.d_k).transpose(1,2)
        K = self.Wk(k).view(B, -1, self.h, self.d_k).transpose(1,2)
        V = self.Wv(v).view(B, -1, self.h, self.d_k).transpose(1,2)

        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out  = (attn @ V).transpose(1,2).reshape(B, T, D)
        return self.Wo(out), attn

GANs & Generative Models

Adversarial training, VAEs, and diffusion — teaching machines to create.

GAN Framework

Minimax Objective
min_G max_D V(D,G) =
E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]

D(x) → 1 for real, 0 for fake
G(z) → fool D into D(G(z)) → 1
⚠️
Training Instability: GANs suffer from mode collapse (G covers only a few modes) and vanishing gradients when D is too strong. Mitigations: WGAN-GP, spectral norm, minibatch discrimination, progressive growing.

GAN Variants

| Variant | Innovation |
|---|---|
| DCGAN | Conv layers, batch norm — stable training |
| WGAN-GP | Wasserstein loss + gradient penalty |
| StyleGAN 3 | Alias-free generation, style mixing |
| CycleGAN | Unpaired image translation |
| Pix2Pix | Paired image-to-image translation |

VAE vs. GAN vs. Diffusion

🧮
VAE (Variational Autoencoder)
Encodes input to latent distribution N(μ,σ²). Maximises ELBO = reconstruction − KL(q‖p). Smooth latent space. Blurry outputs.
⚔️
GAN
Generator vs. discriminator adversarial game. Sharp, photorealistic outputs. Hard to train, mode collapse risk.
❄️
Diffusion Models
Gradually add Gaussian noise to data, train a U-Net to predict and reverse the noise. State-of-the-art quality. Slower inference (many steps). Stable Diffusion, DALL-E 3, Imagen.
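The "gradually add Gaussian noise" forward process has a closed form: q(xₜ|x₀) = N(√ᾱₜ·x₀, (1−ᾱₜ)·I) with ᾱₜ = Πₛ(1−βₛ). A minimal NumPy sketch with a linear β schedule (T=1000 and the β range follow the DDPM paper; the sample shape is a toy stand-in):

```python
import numpy as np

T = 1000
betas      = np.linspace(1e-4, 0.02, T)    # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)       # ᾱₜ = Πₛ (1 − βₛ)

rng = np.random.default_rng(0)
x0  = rng.normal(size=(3, 32, 32))         # a "clean" sample (toy shape)

def q_sample(x0, t):
    """Jump straight to step t: xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

xt = q_sample(x0, t=999)
print(alphas_bar[0], alphas_bar[-1])  # ≈ 0.9999 → ≈ 0: signal fades to pure noise
```

The reverse (denoising) model is trained to predict ε from xₜ and t; sampling then runs this chain backwards.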
Python · DCGAN Generator
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        def block(in_c, out_c, stride=2, padding=1):
            return [
                nn.ConvTranspose2d(in_c, out_c, 4, stride, padding, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(True)
            ]
        self.net = nn.Sequential(
            # 1×1 → 4×4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),
            *block(512, 256),  # 8×8
            *block(256, 128),  # 16×16
            *block(128, 64),   # 32×32
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()          # 64×64, range [-1,1]
        )
    def forward(self, z):
        return self.net(z.view(-1, z.shape[1], 1, 1))

Training Deep Learning Models

Optimisers, regularisation, hyperparameter tuning, and tricks that separate good models from great ones.

Optimisers

SGD + Momentum: vₜ = β·vₜ₋₁ + ∇L; θ ← θ − η·vₜ. Momentum β ≈ 0.9 dampens oscillations and accelerates convergence. Still a strong choice for CNNs with careful LR scheduling. The Nesterov variant evaluates the gradient at the look-ahead point.
Adam: mₜ = β₁·mₜ₋₁ + (1−β₁)·g; vₜ = β₂·vₜ₋₁ + (1−β₂)·g²; θ ← θ − η·m̂ₜ/(√v̂ₜ + ε). Defaults: β₁=0.9, β₂=0.999, ε=1e-8, η=3e-4. A robust default for most tasks.
AdamW: decouples weight decay from the gradient update: θ ← θ − η·(m̂ₜ/(√v̂ₜ + ε) + λθ). Preferred over Adam for transformers and LLMs. Use weight_decay=0.01–0.1.
LR Schedules: cosine annealing sweeps η from ηₘₐₓ to ηₘᵢₙ; warmup + cosine (transformers) linearly ramps the LR for the first N steps, then decays; OneCycleLR achieves super-convergence with a very high max LR; ReduceLROnPlateau decays adaptively when a metric stagnates.
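The warmup + cosine schedule described above fits in a few lines. A minimal sketch (the step counts and peak LR are illustrative):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup for warmup_steps, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps            # linear ramp 0 → lr_max
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total, warmup, lr_max = 10_000, 500, 3e-4
print(warmup_cosine_lr(0, total, warmup, lr_max))       # 0.0
print(warmup_cosine_lr(500, total, warmup, lr_max))     # 0.0003 — peak at warmup end
print(warmup_cosine_lr(10_000, total, warmup, lr_max))  # 0.0 — fully decayed
```

PyTorch's `LambdaLR` accepts exactly this kind of function (as a multiplier of the base LR).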

Regularisation Techniques

| Technique | Effectiveness |
|---|---|
| Dropout (p=0.5) | 82% |
| Batch Normalisation | 91% |
| Weight Decay (L2) | 78% |
| Data Augmentation | 88% |
| Early Stopping | 74% |
| Label Smoothing | 70% |

Effectiveness score (higher = more commonly beneficial across task types)

Batch vs. Layer Normalisation

| Type | Normalises Over | Best For |
|---|---|---|
| BatchNorm | Batch dimension | CNNs, large batches |
| LayerNorm | Feature dimension | Transformers, NLP, RNNs |
| GroupNorm | Groups of channels | Small batch sizes |
| RMSNorm | Feature dim (simpler) | Modern LLMs (Llama) |
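The batch-vs-feature distinction above is just a choice of axis. A minimal NumPy sketch (toy shapes; learnable scale/shift omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, size=(8, 16))   # (batch, features)

def normalise(x, axis):
    mu  = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

bn = normalise(x, axis=0)    # BatchNorm: statistics per feature, over the batch
ln = normalise(x, axis=1)    # LayerNorm: statistics per sample, over features

print(np.allclose(bn.mean(axis=0), 0, atol=1e-6))   # True — zero mean per feature
print(np.allclose(ln.mean(axis=1), 0, atol=1e-6))   # True — zero mean per sample
```

Because LayerNorm's statistics never touch the batch dimension, it behaves identically at batch size 1 — one reason transformers prefer it.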
Python · Mixed Precision + Gradient Scaling
import torch
from torch.cuda.amp import autocast, GradScaler  # moved to torch.amp in recent PyTorch

scaler = GradScaler()   # handles FP16 loss scaling

def train_step(model, batch, optimizer, criterion):
    X, y = batch
    optimizer.zero_grad()

    with autocast():                  # FP16 forward pass
        logits = model(X)
        loss   = criterion(logits, y)

    scaler.scale(loss).backward()    # scaled gradients
    scaler.unscale_(optimizer)         # unscale before clip
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)             # update weights
    scaler.update()                    # adjust scale factor
    return loss.item()

# LR warmup + cosine decay (Transformers best practice)
import math

def get_lr(step, d_model, warmup_steps):
    if step == 0: return 0.0
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return d_model ** -0.5 * scale   # original transformer formula

Deployment & MLOps

Getting models from experiment to production — export, serve, containerise, and monitor.

Export Pipeline

  • 1
    Train & Validate
    Achieve target metrics. Save checkpoint with torch.save() or Hugging Face safetensors.
  • 2
    Export to ONNX
    torch.onnx.export() converts the model to a framework-agnostic graph for cross-platform inference.
  • 3
    Optimise with TensorRT
    trtexec or ONNX-TensorRT converts ONNX to a TensorRT engine. 3–10× faster on NVIDIA GPUs.
  • 4
    Serve via FastAPI
    Wrap inference in a REST endpoint. Use ONNX Runtime for lightweight CPU/GPU serving.
  • 5
    Containerise
    Docker image with model weights + FastAPI. Push to ECR / ACR / GCR and deploy to Kubernetes.
  • 6
    Monitor with MLflow
    Track experiments, model versions, metrics drift. Set up alerts for data/concept drift.

Optimisation Techniques

✂️
Quantisation (INT8/FP16)
Reduce precision of weights/activations. 2–4× memory reduction, 2–3× speedup with minimal accuracy loss. Post-training quantisation (PTQ) or QAT.
🪄
Pruning
Remove low-magnitude weights (unstructured) or entire filters/heads (structured). 40–80% parameter reduction with retraining.
🎓
Knowledge Distillation
Train small student to mimic large teacher's soft probability outputs. DistilBERT = 40% smaller, 60% faster, 97% of BERT's performance.
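The "soft probability" objective behind distillation can be sketched in NumPy: soften both logit sets with temperature T, then minimise T²·KL(teacher ‖ student). The T² factor and the toy logits are illustrative; in practice this term is combined with ordinary cross-entropy on the hard labels.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """T²·KL(teacher_T ‖ student_T): match the softened teacher distribution."""
    p = softmax(teacher_logits, T)           # soft targets
    q = softmax(student_logits, T)
    return T**2 * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([6.0, 1.0, -2.0])
print(distill_loss(teacher, teacher))          # 0.0 — identical logits, no loss
print(distill_loss(np.zeros(3), teacher) > 0)  # True — positive otherwise
```

Temperature T > 1 exposes the teacher's "dark knowledge" — the relative probabilities of wrong classes — which a hard label throws away.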
Python · ONNX Export + FastAPI Server
import torch
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

# ─── Export to ONNX ─────────────────────────────────────────
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}}   # variable batch size
)

# ─── ONNX Runtime Inference ─────────────────────────────────
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=sess_opts,
                                providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ─── FastAPI endpoint ────────────────────────────────────────
app = FastAPI(title="Deep Learning Model API")

class PredictRequest(BaseModel):
    image: list[list[list[list[float]]]]  # NCHW float array

@app.post("/predict")
async def predict(req: PredictRequest):
    x = np.array(req.image, dtype=np.float32)
    logits = session.run(["logits"], {"image": x})[0]
    logits = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    probs  = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    top_k  = np.argsort(probs[0])[::-1][:5]
    return {"top5_classes": top_k.tolist(), "probs": probs[0][top_k].tolist()}

Code Lab

Complete, production-quality code examples for the most common deep learning tasks.

Python · Complete MNIST Training
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ─── Data ────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds  = datasets.MNIST("./data", train=False, transform=transform)
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
test_dl  = DataLoader(test_ds,  batch_size=256)

# ─── Model ────────────────────────────────────────────────────
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),   # 28→14
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10)
        )
    def forward(self, x): return self.classifier(self.features(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvNet().to(device)
opt = optim.AdamW(model.parameters(), lr=1e-3)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, steps_per_epoch=len(train_dl), epochs=10)
crit = nn.CrossEntropyLoss()

# ─── Train ────────────────────────────────────────────────────
for epoch in range(10):
    model.train()
    for X, y in train_dl:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss = crit(model(X), y)
        loss.backward()
        opt.step(); sched.step()
    model.eval()
    with torch.no_grad():   # disable autograd during evaluation
        correct = sum((model(X.to(device)).argmax(1) == y.to(device)).sum().item() for X, y in test_dl)
    print(f"Epoch {epoch+1}: acc={correct/len(test_ds)*100:.2f}%")
Python · Transfer Learning (EfficientNet)
import torch
import torchvision.models as models
from torchvision import transforms, datasets
from torch.utils.data import DataLoader
import torch.nn as nn

# ─── Augmentation pipeline ────────────────────────────────────
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ─── Load pretrained EfficientNet-B2 ─────────────────────────
backbone = models.efficientnet_b2(weights="IMAGENET1K_V1")
num_ftrs  = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(num_ftrs, num_classes)
)

# Phase 1: train only head (frozen backbone)
for p in backbone.features.parameters(): p.requires_grad = False
opt1 = torch.optim.AdamW(backbone.classifier.parameters(), lr=3e-3)

# Phase 2: unfreeze and fine-tune all layers
for p in backbone.parameters(): p.requires_grad = True
opt2 = torch.optim.AdamW(backbone.parameters(), lr=3e-5)  # low LR!
Python · BERT Fine-tuning (Hugging Face)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ─── Load model & tokeniser ───────────────────────────────────
model_name = "bert-base-uncased"
tokeniser  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# ─── Tokenise dataset ─────────────────────────────────────────
dataset = load_dataset("imdb")
def tokenise(batch):
    return tokeniser(batch["text"], truncation=True, max_length=512, padding="max_length")
dataset = dataset.map(tokenise, batched=True)

# ─── Training ─────────────────────────────────────────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,            # low LR for fine-tuning
    warmup_ratio=0.06,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    fp16=True,                      # mixed precision
    logging_steps=100,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics)
trainer.train()
Python · Custom Dataset + Augmentation
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd, os

class ImageDataset(Dataset):
    def __init__(self, csv_path, img_dir, transform=None):
        self.df        = pd.read_csv(csv_path)   # columns: filename, label
        self.img_dir   = img_dir
        self.transform = transform

    def __len__(self): return len(self.df)

    def __getitem__(self, idx):
        row   = self.df.iloc[idx]
        img   = Image.open(os.path.join(self.img_dir, row.filename)).convert("RGB")
        label = row.label
        if self.transform: img = self.transform(img)
        return img, label

# Heavy augmentation for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

ds     = ImageDataset("train.csv", "./images", transform=train_transform)
loader = DataLoader(ds, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)