Master Deep Learning
from Neurons to Production
A comprehensive, hands-on reference covering neural network theory, architectures, training techniques, and real-world deployment — from first principles to state-of-the-art models.
Architecture Comparison
| Architecture | Best For | Key Innovation | Parameters | Year |
|---|---|---|---|---|
| MLP | Tabular data, classification | Universal approximator | Thousands | 1986 |
| CNN | Image, video, audio spectrograms | Weight sharing, local connectivity | Millions | 1998 |
| LSTM | Time series, NLP sequences | Gated memory cells | Millions | 1997 |
| Transformer | NLP, vision, multimodal | Self-attention, parallelisation | Billions | 2017 |
| GAN | Image synthesis, data augmentation | Adversarial training | Millions–billions | 2014 |
| Diffusion | Image/video/audio generation | Denoising score matching | Billions | 2020 |
Mathematical Foundations
The core maths every deep learning practitioner must understand — from tensors to gradients.
Tensors
Tensors are the fundamental data structure in deep learning — generalisations of scalars, vectors, and matrices to arbitrary dimensions (ranks).
| Rank | Name | Example Shape | DL Use |
|---|---|---|---|
| 0 | Scalar | () | Loss value, learning rate |
| 1 | Vector | (512,) | Embedding, bias |
| 2 | Matrix | (64, 512) | Weight matrix, batch |
| 3 | 3D Tensor | (32, 128, 512) | Batch of sequences |
| 4 | 4D Tensor | (32, 3, 224, 224) | Batch of images (NCHW) |
Essential Operations
C = A @ B → shape: (m,n) @ (n,p) = (m,p)
import torch
# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]) # from list
zeros = torch.zeros(3, 4) # shape (3,4)
rand = torch.randn(32, 512) # normal dist
# Fundamental ops
W = torch.randn(512, 256)
b = torch.zeros(256)
out = rand @ W + b # (32,512) @ (512,256) + (256,) → (32,256); bias broadcasts over the batch
# Reshape, transpose, squeeze
t = torch.arange(24).reshape(2, 3, 4)
t_T = t.transpose(1, 2) # (2,4,3)
flat = t.flatten(1) # (2,12)
# GPU transfer
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
The Chain Rule — Heart of Backprop
For composition f(g(x)):
df/dx = (df/dg) · (dg/dx)
Gradient Descent Update
θ ← θ − η·∇_θ L
where η = learning rate, ∇_θ L = gradient of loss w.r.t. θ
Partial Derivatives in Layers
For a linear layer y = Wx + b and loss L:
∂L/∂W = (∂L/∂y) · xᵀ
∂L/∂b = ∂L/∂y
∂L/∂x = Wᵀ · (∂L/∂y)
For vector → vector functions, the Jacobian ∂y/∂x has shape (dim_y × dim_x)
import torch
# Automatic differentiation with requires_grad
x = torch.tensor([2.0, 3.0], requires_grad=True)
W = torch.randn(2, 2, requires_grad=True)
# Forward pass — builds computation graph
y = x @ W # (2,) @ (2,2) → (2,)
loss = y.sum() # scalar loss
# Backward pass — computes gradients via chain rule
loss.backward()
print(x.grad) # ∂loss/∂x
print(W.grad) # ∂loss/∂W
# Manual gradient check
with torch.no_grad():
W -= 0.01 * W.grad # gradient descent step
W.grad.zero_() # must zero before next backward()
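The linear-layer gradient formulas above can be verified against autograd; a minimal numerical check (the shapes here are illustrative):

```python
import torch

# Check the linear-layer gradient formulas numerically against autograd
x = torch.randn(4, requires_grad=True)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)

y = W @ x + b          # linear layer
L = y.sum()            # scalar loss, so dL/dy is a vector of ones
L.backward()

dLdy = torch.ones(3)
assert torch.allclose(b.grad, dLdy)                            # ∂L/∂b = ∂L/∂y
assert torch.allclose(x.grad, W.T @ dLdy)                      # ∂L/∂x = Wᵀ·(∂L/∂y)
assert torch.allclose(W.grad, torch.outer(dLdy, x.detach()))   # ∂L/∂W = (∂L/∂y)·xᵀ
```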
Key Distributions in DL
| Distribution | Use in DL |
|---|---|
| Normal N(μ,σ²) | Weight init, noise injection, VAE latent |
| Bernoulli | Binary classification output, dropout |
| Categorical | Multi-class softmax output, token prediction |
| Uniform | Xavier init, random sampling |
| Dirichlet | Topic models, mixture models |
Loss Functions as Likelihoods
Cross-entropy = −log P(true class | input)
MSE = MLE under a Gaussian noise assumption
Information Theory
Entropy H(X) = −Σ p(x) log p(x) — measures uncertainty / information content
Mutual information I(X;Y) = H(X) − H(X|Y) — how much Y tells us about X
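Both identities are easy to check numerically; a small sketch (the logits and distribution are illustrative):

```python
import torch
import torch.nn.functional as F

# Cross-entropy really is −log P(true class | input), computed from logits
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
ce = F.cross_entropy(logits, target)
assert torch.allclose(ce, -F.log_softmax(logits, dim=-1)[0, 0])

# Entropy H(X) = −Σ p log p: maximal for a uniform distribution
p = torch.full((4,), 0.25)
H = -(p * p.log()).sum()   # log 4 ≈ 1.386 nats
```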
Neural Networks
From the biological neuron to deep multi-layer perceptrons — theory, math, and interactive visualisation.
Architecture
1. Input Layer: receives raw features. No computation; values pass straight forward. Each node = one feature.
2. Hidden Layers: each neuron computes z = Wx + b, then applies an activation σ(z). Multiple hidden layers = a "deep" network.
3. Output Layer: produces predictions. Activation depends on the task: sigmoid (binary), softmax (multiclass), linear (regression).
4. Forward Pass: data flows input → output. The loss compares the prediction to the ground truth.
5. Backpropagation: gradients flow output → input via the chain rule. Each weight is updated: w ← w − η·∂L/∂w.
Activation Functions
| Function | Formula | Range | Use Case | Drawback |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0,1) | Binary output | Vanishing gradient |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | Hidden layers (old) | Vanishing gradient |
| ReLU | max(0, x) | [0,∞) | Most hidden layers | Dying ReLU |
| Leaky ReLU | max(αx, x) | (-∞,∞) | Fixes dying ReLU | Extra hyperparameter |
| GELU | x·Φ(x) | ≈(-0.17,∞) | Transformers (BERT, GPT) | More compute |
| Swish | x·sigmoid(x) | (-∞,∞) | EfficientNet | More compute |
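A quick side-by-side of the three most common choices (note F.silu is PyTorch's name for Swish):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = F.relu(x)        # negatives clipped to exactly 0 ("dying ReLU" risk)
gelu = F.gelu(x)        # smooth; small negative values leak through
swish = F.silu(x)       # SiLU == Swish: x * sigmoid(x)

assert torch.equal(relu, torch.tensor([0.0, 0.0, 0.0, 0.5, 2.0]))
assert torch.allclose(swish, x * torch.sigmoid(x))
```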
Complete MLP Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# ─── Define MLP ───────────────────────────────────────────
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
super().__init__()
layers = []
dims = [input_dim] + hidden_dims
for i in range(len(dims) - 1):
layers += [
nn.Linear(dims[i], dims[i+1]),
nn.BatchNorm1d(dims[i+1]), # normalise activations
nn.GELU(), # smooth non-linearity
nn.Dropout(dropout), # regularisation
]
layers.append(nn.Linear(dims[-1], output_dim))
self.net = nn.Sequential(*layers)
def forward(self, x):
return self.net(x)
# ─── Training Loop ─────────────────────────────────────────
def train(model, loader, criterion, optimizer, device):
model.train()
total_loss = 0
for X, y in loader:
X, y = X.to(device), y.to(device)
optimizer.zero_grad() # clear previous gradients
logits = model(X) # forward pass
loss = criterion(logits, y) # compute loss
loss.backward() # backpropagation
nn.utils.clip_grad_norm_(model.parameters(), 1.0) # gradient clip
optimizer.step() # update weights
total_loss += loss.item()
return total_loss / len(loader)
# ─── Instantiate and run ────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP(input_dim=784, hidden_dims=[512, 256, 128], output_dim=10).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()
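The pieces above are defined but never run together; a minimal sketch of the wiring, with random tensors standing in for a real dataset and a small nn.Sequential standing in for the MLP class:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Stand-in model and data, shaped like the MLP above (784 → 10)
model = nn.Sequential(nn.Linear(784, 128), nn.GELU(), nn.Linear(128, 10))
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
X, y = torch.randn(512, 784), torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model.train()
epoch_losses = []
for epoch in range(3):
    total = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    # average batch loss per epoch
        total += loss.item()
    epoch_losses.append(total / len(loader))
print(epoch_losses)
```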
Convolutional Neural Networks
Spatial pattern recognition through learned filters — the foundation of computer vision.
Core Concepts
Output size = ⌊(N + 2P − F)/S⌋ + 1
N=input, P=padding, F=filter, S=stride
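The output-size formula can be verified against an actual convolution; a ResNet-style 7×7, stride-2, padding-3 stem is used here as an example:

```python
import torch
import torch.nn as nn

def conv_out(N, P, F, S):
    # ⌊(N + 2P − F)/S⌋ + 1
    return (N + 2 * P - F) // S + 1

conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
out = conv(torch.randn(1, 3, 224, 224))
assert conv_out(224, 3, 7, 2) == 112
assert out.shape == (1, 64, 112, 112)
```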
Architecture Milestones
ResNet Skip Connection
F(x) = Conv → BN → ReLU → Conv → BN
Output = F(x) + x (identity shortcut)
Gradient: ∂L/∂x = (∂L/∂y) · (∂F/∂x + I) — the identity term gives gradients an unattenuated path back, preventing vanishing
import torch.nn as nn
class ResidualBlock(nn.Module):
def __init__(self, channels, stride=1):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(channels)
self.relu = nn.ReLU(inplace=True)
# Shortcut if stride changes spatial dims
self.shortcut = nn.Sequential(
nn.Conv2d(channels, channels, 1, stride=stride, bias=False),
nn.BatchNorm2d(channels)
) if stride != 1 else nn.Identity()
def forward(self, x):
out = self.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
out += self.shortcut(x) # ← skip connection
return self.relu(out)
# Transfer learning with pretrained ResNet
import torchvision.models as models
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(2048, num_classes) # replace head
# Freeze backbone, fine-tune head only
for p in backbone.parameters():
p.requires_grad = False
for p in backbone.fc.parameters():
p.requires_grad = True
RNNs & LSTMs
Modelling sequential dependencies — from simple recurrent nets to gated memory architectures.
Recurrent Networks
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = Wᵧ·hₜ + bᵧ
hₜ = hidden state at time t
xₜ = input at time t
GRU (Gated Recurrent Unit)
zₜ = σ(Wz·[hₜ₋₁, xₜ]) — update gate
rₜ = σ(Wr·[hₜ₋₁, xₜ]) — reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) — candidate
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
LSTM Architecture
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — forget gate
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — input gate
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) — candidate
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ — cell state
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — output gate
hₜ = oₜ⊙tanh(Cₜ)
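PyTorch fuses the four gate computations above into stacked weight matrices, which shows up directly in the parameter shapes (gate order i, f, g = candidate, o):

```python
import torch.nn as nn

# weight_ih_l0 stacks the input-to-hidden weights of all four gates
cell = nn.LSTM(input_size=16, hidden_size=32)
assert cell.weight_ih_l0.shape == (4 * 32, 16)   # 4 gates × hidden, input
assert cell.weight_hh_l0.shape == (4 * 32, 32)   # 4 gates × hidden, hidden
```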
import torch
import torch.nn as nn
class BiLSTMClassifier(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(
embed_dim, hidden_dim,
num_layers=num_layers,
batch_first=True,
bidirectional=True, # forward + backward
dropout=0.3
)
self.classifier = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim), # *2 for bidir
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_dim, num_classes)
)
def forward(self, x, lengths):
emb = self.embedding(x) # (B, T, E)
# Pack for variable-length sequences
packed = nn.utils.rnn.pack_padded_sequence(emb, lengths, batch_first=True, enforce_sorted=False)
out, (hn, _) = self.lstm(packed)
# Concat last forward + backward hidden states
last_hidden = torch.cat([hn[-2], hn[-1]], dim=1) # (B, H*2)
return self.classifier(last_hidden)
Transformers
The architecture that redefined AI — self-attention, positional encoding, and the models built on top.
Self-Attention
Q = XW^Q, K = XW^K, V = XW^V
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)·V
dₖ = key dimension (scaling by √dₖ prevents vanishingly small softmax gradients)
Multi-Head Attention
headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V); MultiHead = Concat(head₁, …, headₕ)·W^O
Each head learns different relationship types
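On toy shapes the attention formula reduces to a few lines, and PyTorch ≥ 2.0 ships the same computation as F.scaled_dot_product_attention:

```python
import math
import torch
import torch.nn.functional as F

# (batch=1, tokens=3, d_k=4), random Q/K/V for illustration
Q, K, V = (torch.randn(1, 3, 4) for _ in range(3))
scores = Q @ K.transpose(-2, -1) / math.sqrt(4)
attn = F.softmax(scores, dim=-1)          # each row sums to 1
out = attn @ V                            # (1, 3, 4)

assert torch.allclose(attn.sum(-1), torch.ones(1, 3))
out2 = F.scaled_dot_product_attention(Q, K, V)   # fused built-in
assert torch.allclose(out, out2, atol=1e-6)
```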
Encoder-Decoder Structure
1. Input Embedding + PE: token IDs → embeddings. Add sinusoidal or learnable positional encoding to inject sequence order.
2. Encoder Block: Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Repeated N times.
3. Decoder Block: Masked Self-Attention → Cross-Attention (attends to encoder) → FFN. Generates one token at a time.
4. Output Projection: Linear + softmax over the vocabulary. At inference: greedy / beam search / nucleus sampling.
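The sinusoidal encoding from step 1 can be sketched directly (max_len and d_model values here are illustrative):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d)), pe[pos, 2i+1] = cos(...)
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_pe(128, 512)
assert pe.shape == (128, 512)
# position 0: all sin terms are 0, all cos terms are 1
assert abs(pe[0].sum().item() - 256.0) < 1e-3
```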
Popular Variants
| Model | Type | Params | Key Use | Innovation |
|---|---|---|---|---|
| BERT | Encoder-only | 110M–340M | Classification, NER, QA | Masked language modelling (MLM) |
| GPT-4 | Decoder-only | ~1.8T (unconfirmed estimate) | Text generation, chat | RLHF + MoE scaling |
| T5 | Encoder-Decoder | 11B | Summarisation, translation | Text-to-text framing |
| ViT | Encoder-only | 86M–632M | Image classification | Patch embeddings replace CNN |
| Llama 3 | Decoder-only | 8B–70B | Open-source LLM | GQA, RoPE, SwiGLU |
| Whisper | Encoder-Decoder | 39M–1.5B | Speech recognition | Multitask audio transformer |
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
assert d_model % num_heads == 0
self.d_k = d_model // num_heads
self.h = num_heads
self.Wq = nn.Linear(d_model, d_model)
self.Wk = nn.Linear(d_model, d_model)
self.Wv = nn.Linear(d_model, d_model)
self.Wo = nn.Linear(d_model, d_model)
def forward(self, q, k, v, mask=None):
B, T, D = q.shape
Q = self.Wq(q).view(B, T, self.h, self.d_k).transpose(1,2)
K = self.Wk(k).view(B, -1, self.h, self.d_k).transpose(1,2)
V = self.Wv(v).view(B, -1, self.h, self.d_k).transpose(1,2)
# Scaled dot-product attention
scores = (Q @ K.transpose(-2,-1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
attn = F.softmax(scores, dim=-1)
out = (attn @ V).transpose(1,2).reshape(B, T, D)
return self.Wo(out), attn
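As a sanity check, PyTorch's built-in nn.MultiheadAttention implements the same computation and confirms the expected shapes:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)
out, attn = mha(x, x, x)          # self-attention: q = k = v
assert out.shape == (2, 10, 512)
assert attn.shape == (2, 10, 10)  # weights averaged over heads by default
```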
GANs & Generative Models
Adversarial training, VAEs, and diffusion — teaching machines to create.
GAN Framework
min_G max_D V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))]
D(x) → 1 for real, 0 for fake
G tries to fool D so that D(G(z)) → 1
GAN Variants
| Variant | Innovation |
|---|---|
| DCGAN | Conv layers, batch norm — stable training |
| WGAN-GP | Wasserstein loss + gradient penalty |
| StyleGAN 3 | Alias-free generation, style mixing |
| CycleGAN | Unpaired image translation |
| Pix2Pix | Paired image-to-image translation |
VAE vs. GAN vs. Diffusion
VAEs maximise a likelihood bound (stable training, blurrier samples); GANs play an adversarial game (sharp samples, unstable training); diffusion models learn to invert a gradual noising process (sharp and diverse, but many sampling steps).
import torch.nn as nn
class DCGANGenerator(nn.Module):
def __init__(self, latent_dim=100, channels=3):
super().__init__()
def block(in_c, out_c, stride=2, padding=1):
return [
nn.ConvTranspose2d(in_c, out_c, 4, stride, padding, bias=False),
nn.BatchNorm2d(out_c),
nn.ReLU(True)
]
self.net = nn.Sequential(
# 1×1 → 4×4
nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
nn.BatchNorm2d(512), nn.ReLU(True),
*block(512, 256), # 8×8
*block(256, 128), # 16×16
*block(128, 64), # 32×32
nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
nn.Tanh() # 64×64, range [-1,1]
)
def forward(self, z):
return self.net(z.view(-1, z.shape[1], 1, 1))
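The generator needs an adversary; a minimal matching discriminator sketch for 64×64 inputs (layer widths mirror the generator, LeakyReLU following DCGAN convention):

```python
import torch
import torch.nn as nn

class DCGANDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        def block(in_c, out_c):
            # halve spatial dims each step
            return [nn.Conv2d(in_c, out_c, 4, 2, 1, bias=False),
                    nn.BatchNorm2d(out_c),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),  # 32×32
            nn.LeakyReLU(0.2, inplace=True),
            *block(64, 128),   # 16×16
            *block(128, 256),  # 8×8
            *block(256, 512),  # 4×4
            nn.Conv2d(512, 1, 4, 1, 0, bias=False)         # 1×1 real/fake logit
        )
    def forward(self, x):
        return self.net(x).view(-1)

d = DCGANDiscriminator()
assert d(torch.randn(8, 3, 64, 64)).shape == (8,)
```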
Training Deep Learning Models
Optimisers, regularisation, hyperparameter tuning, and tricks that separate good models from great ones.
Optimisers
Regularisation Techniques
Batch vs. Layer Normalisation
| Type | Normalises Over | Best For |
|---|---|---|
| BatchNorm | Batch dimension | CNNs, large batches |
| LayerNorm | Feature dimension | Transformers, NLP, RNNs |
| GroupNorm | Groups of channels | Small batch sizes |
| RMSNorm | Feature dim (simpler) | Modern LLMs (Llama) |
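The batch-vs-feature distinction shows up directly in which axis ends up zero-mean:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)           # (batch, features)
bn = nn.BatchNorm1d(64)(x)        # normalises each feature across the batch
ln = nn.LayerNorm(64)(x)          # normalises each sample across its features

assert torch.allclose(bn.mean(0), torch.zeros(64), atol=1e-4)
assert torch.allclose(ln.mean(1), torch.zeros(32), atol=1e-4)
```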
import torch
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler() # handles FP16 loss scaling
def train_step(model, batch, optimizer, criterion):
X, y = batch
optimizer.zero_grad()
with autocast(): # FP16 forward pass
logits = model(X)
loss = criterion(logits, y)
scaler.scale(loss).backward() # scaled gradients
scaler.unscale_(optimizer) # unscale before clip
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer) # update weights
scaler.update() # adjust scale factor
return loss.item()
# LR warmup + cosine decay (Transformers best practice)
import math
def get_lr(step, d_model, warmup_steps):
if step == 0: return 0.0
scale = min(step ** -0.5, step * warmup_steps ** -1.5)
return d_model ** -0.5 * scale # original transformer formula
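Plugging in typical values (d_model = 512 and 4000 warmup steps, both illustrative), the schedule peaks exactly at the warmup boundary; the function is restated so the snippet is self-contained:

```python
# Warmup + inverse-sqrt decay: rises linearly to step == warmup_steps,
# then decays as step^-0.5 (original Transformer schedule)
def get_lr(step, d_model, warmup_steps):
    if step == 0:
        return 0.0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lrs = [get_lr(s, 512, 4000) for s in range(1, 20001)]
assert max(lrs) == get_lr(4000, 512, 4000)   # peak at the warmup boundary
```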
Deployment & MLOps
Getting models from experiment to production — export, serve, containerise, and monitor.
Export Pipeline
1. Train & Validate: achieve target metrics. Save a checkpoint with torch.save() or Hugging Face safetensors.
2. Export to ONNX: torch.onnx.export() converts the model to a framework-agnostic graph for cross-platform inference.
3. Optimise with TensorRT: trtexec or ONNX-TensorRT converts the ONNX graph to a TensorRT engine, typically 3–10× faster on NVIDIA GPUs.
4. Serve via FastAPI: wrap inference in a REST endpoint. Use ONNX Runtime for lightweight CPU/GPU serving.
5. Containerise: build a Docker image with model weights + FastAPI. Push to ECR / ACR / GCR and deploy to Kubernetes.
6. Monitor with MLflow: track experiments, model versions, and metric drift. Set up alerts for data/concept drift.
Optimisation Techniques
import torch
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
# ─── Export to ONNX ─────────────────────────────────────────
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
model, dummy_input, "model.onnx",
opset_version=17,
input_names=["image"], output_names=["logits"],
dynamic_axes={"image": {0: "batch"}} # variable batch size
)
# ─── ONNX Runtime Inference ─────────────────────────────────
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options=sess_opts,
providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# ─── FastAPI endpoint ────────────────────────────────────────
app = FastAPI(title="Deep Learning Model API")
class PredictRequest(BaseModel):
image: list[list[list[list[float]]]] # NCHW float array
@app.post("/predict")
async def predict(req: PredictRequest):
x = np.array(req.image, dtype=np.float32)
logits = session.run(["logits"], {"image": x})[0]
e = np.exp(logits - logits.max(-1, keepdims=True)) # subtract max for numerical stability
probs = e / e.sum(-1, keepdims=True)
top_k = np.argsort(probs[0])[::-1][:5]
return {"top5_classes": top_k.tolist(), "probs": probs[0][top_k].tolist()}
Code Lab
Complete, production-quality code examples for the most common deep learning tasks.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# ─── Data ────────────────────────────────────────────────────
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST("./data", train=False, transform=transform)
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4)
test_dl = DataLoader(test_ds, batch_size=256)
# ─── Model ────────────────────────────────────────────────────
class ConvNet(nn.Module):
def __init__(self):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(2), # 28→14
nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
nn.AdaptiveAvgPool2d((4, 4))
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(256, 10)
)
def forward(self, x): return self.classifier(self.features(x))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvNet().to(device)
opt = optim.AdamW(model.parameters(), lr=1e-3)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-2, steps_per_epoch=len(train_dl), epochs=10)
crit = nn.CrossEntropyLoss()
# ─── Train ────────────────────────────────────────────────────
for epoch in range(10):
model.train()
for X, y in train_dl:
X, y = X.to(device), y.to(device)
opt.zero_grad()
loss = crit(model(X), y)
loss.backward()
opt.step(); sched.step()
model.eval()
with torch.no_grad(): # disable autograd for evaluation
    correct = sum((model(X.to(device)).argmax(1) == y.to(device)).sum().item() for X, y in test_dl)
print(f"Epoch {epoch+1}: acc={correct/len(test_ds)*100:.2f}%")
import torch
import torchvision.models as models
from torchvision import transforms, datasets
from torch.utils.data import DataLoader
import torch.nn as nn
# ─── Augmentation pipeline ────────────────────────────────────
train_tf = transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(0.2, 0.2, 0.2),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
# ─── Load pretrained EfficientNet-B2 ─────────────────────────
backbone = models.efficientnet_b2(weights="IMAGENET1K_V1")
num_ftrs = backbone.classifier[1].in_features
backbone.classifier = nn.Sequential(
nn.Dropout(0.4),
nn.Linear(num_ftrs, num_classes)
)
# Phase 1: train only head (frozen backbone)
for p in backbone.features.parameters(): p.requires_grad = False
opt1 = torch.optim.AdamW(backbone.classifier.parameters(), lr=3e-3)
# Phase 2: unfreeze and fine-tune all layers
for p in backbone.parameters(): p.requires_grad = True
opt2 = torch.optim.AdamW(backbone.parameters(), lr=3e-5) # low LR!
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
# ─── Load model & tokeniser ───────────────────────────────────
model_name = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# ─── Tokenise dataset ─────────────────────────────────────────
dataset = load_dataset("imdb")
def tokenise(batch):
return tokeniser(batch["text"], truncation=True, max_length=512, padding="max_length")
dataset = dataset.map(tokenise, batched=True)
# ─── Training ─────────────────────────────────────────────────
args = TrainingArguments(
output_dir="./bert-imdb",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=2e-5, # low LR for fine-tuning
warmup_ratio=0.06,
weight_decay=0.01,
evaluation_strategy="epoch",
fp16=True, # mixed precision
logging_steps=100,
)
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}
trainer = Trainer(model=model, args=args,
train_dataset=dataset["train"], eval_dataset=dataset["test"],
compute_metrics=compute_metrics)
trainer.train()
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd, os
class ImageDataset(Dataset):
def __init__(self, csv_path, img_dir, transform=None):
self.df = pd.read_csv(csv_path) # columns: filename, label
self.img_dir = img_dir
self.transform = transform
def __len__(self): return len(self.df)
def __getitem__(self, idx):
row = self.df.iloc[idx]
img = Image.open(os.path.join(self.img_dir, row.filename)).convert("RGB")
label = row.label
if self.transform: img = self.transform(img)
return img, label
# Heavy augmentation for training
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.3, contrast=0.3),
transforms.RandomGrayscale(p=0.1),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
ds = ImageDataset("train.csv", "./images", transform=train_transform)
loader = DataLoader(ds, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)