Loss Functions via the Method of Fluxions
Cross-Entropy, MSE, and Friends: What Your Network Actually Minimizes
Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026
Abstract
Loss functions are typically presented as formulas to memorize. We reformulate common losses using fluxions, revealing their geometric meaning: cross-entropy measures "surprise flow," MSE measures "squared distance flow," and focal loss amplifies flow from hard examples. The backward pass becomes intuitive: each loss simply tells us "how much the output should wiggle to reduce error."
1. What Is a Loss?
1.1 The Setup
Network output: ŷ (prediction)
Ground truth: y (target)
Loss: L(ŷ, y) (how wrong we are)
1.2 Fluxion View
The loss L is a scalar. We need L̇ŷ - "how does loss wiggle when prediction wiggles?"
This gradient is the SIGNAL that flows backward through the network.
2. Mean Squared Error (MSE)
2.1 Definition
L = (1/n) Σᵢ (ŷᵢ - yᵢ)²
2.2 Fluxion Backward
L̇ŷᵢ = (2/n) · (ŷᵢ - yᵢ)
English: "Gradient is proportional to error."
- Overpredict by 0.1 → gradient pushes down by 0.2/n
- Underpredict by 0.5 → gradient pushes up by 1.0/n
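This gradient can be checked with a minimal NumPy sketch (function name and sample values are illustrative):

```python
import numpy as np

def mse_loss_and_grad(y_hat, y):
    """MSE loss and its gradient w.r.t. the prediction."""
    n = y_hat.size
    loss = np.mean((y_hat - y) ** 2)
    grad = (2.0 / n) * (y_hat - y)  # L-dot_yhat = (2/n)(yhat - y)
    return loss, grad

y_hat = np.array([1.1, 0.5])
y = np.array([1.0, 1.0])
loss, grad = mse_loss_and_grad(y_hat, y)
# Overprediction of 0.1 -> positive gradient (pushes down);
# underprediction of 0.5 -> negative gradient (pushes up).
```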
2.3 Geometric Interpretation
MSE gradient points directly from prediction toward target.
target
↓
y ←←←← ŷ
gradient
Larger error = larger gradient = faster correction.
2.4 Problem
Outliers dominate. A single sample with error 10 contributes 100 to the loss, so its gradient drowns out the gradients from normal samples.
3. Mean Absolute Error (MAE / L1)
3.1 Definition
L = (1/n) Σᵢ |ŷᵢ - yᵢ|
3.2 Fluxion Backward
L̇ŷᵢ = (1/n) · sign(ŷᵢ - yᵢ)
English: "Gradient is ±1/n regardless of error magnitude."
3.3 Comparison with MSE
| Error | MSE Gradient | MAE Gradient |
|---|---|---|
| 0.1 | 0.2/n | 1/n |
| 1.0 | 2.0/n | 1/n |
| 10.0 | 20.0/n | 1/n |
MAE is robust to outliers - constant gradient regardless of error size.
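The constant-magnitude gradient is one line of NumPy (a sketch; the helper name is illustrative):

```python
import numpy as np

def mae_grad(y_hat, y):
    """Per-sample MAE gradient: only the sign of the error matters."""
    n = y_hat.size
    return np.sign(y_hat - y) / n

errors = np.array([0.1, 1.0, 10.0])   # targets taken as 0
print(mae_grad(errors, np.zeros(3)))  # all entries are 1/3, regardless of error size
```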
3.4 Problem
The gradient is discontinuous at ŷ = y: it stays at magnitude 1/n no matter how small the error gets, so updates don't shrink near the target and can oscillate around it.
4. Huber Loss (Smooth L1)
4.1 The Best of Both
L = { 0.5·(ŷ-y)² if |ŷ-y| < δ
{ δ·|ŷ-y| - 0.5·δ² otherwise
4.2 Fluxion Backward
L̇ŷ = { (ŷ-y) if |ŷ-y| < δ (MSE region)
{ δ·sign(ŷ-y) otherwise (MAE region)
English:
- Small errors: MSE behavior (proportional gradient)
- Large errors: MAE behavior (capped gradient)
4.3 Why It Works
- Near target: smooth, quadratic convergence
- Far from target: robust, outlier-resistant
- δ controls the transition (typically δ=1)
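The piecewise gradient above can be sketched directly (per-sample, matching the formula in 4.2; names are illustrative):

```python
import numpy as np

def huber_grad(y_hat, y, delta=1.0):
    """Huber gradient: proportional inside |error| < delta, capped at +/-delta outside."""
    err = y_hat - y
    return np.where(np.abs(err) < delta, err, delta * np.sign(err))

err = np.array([0.1, 0.5, 5.0, -20.0])
print(huber_grad(err, np.zeros(4)))  # [0.1, 0.5, 1.0, -1.0]
```

Small errors pass through unchanged (MSE region); large errors are clipped to ±δ (MAE region).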
5. Cross-Entropy (Classification)
5.1 Binary Cross-Entropy
L = -[y·log(p) + (1-y)·log(1-p)]
Where p = sigmoid(ŷ) = probability of class 1
5.2 Fluxion Backward (through sigmoid)
The magic of cross-entropy + sigmoid:
L̇ŷ = p - y
That's it. Gradient = prediction - target.
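The p − y identity is easy to verify numerically with a central finite difference (a pure-Python sketch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy as a function of the raw logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y, h = 0.7, 1.0, 1e-6
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y  # the claimed gradient: p - y
# numeric and analytic agree to high precision
```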
5.3 Why This Is Beautiful
| Truth (y) | Prediction (p) | Gradient (p-y) |
|---|---|---|
| 1 | 0.9 | -0.1 (push up slightly) |
| 1 | 0.1 | -0.9 (push up hard!) |
| 0 | 0.9 | +0.9 (push down hard!) |
| 0 | 0.1 | +0.1 (push down slightly) |
- Confident AND wrong → huge gradient
- Confident AND right → tiny gradient
- Uncertain → medium gradient
5.4 Information Theory View
Cross-entropy = "average surprise"
-log(p) = surprise at seeing outcome with probability p
If p=0.99 and event happens: -log(0.99) ≈ 0.01 (not surprised)
If p=0.01 and event happens: -log(0.01) ≈ 4.6 (very surprised!)
Minimizing cross-entropy = minimizing average surprise = learning to predict well.
6. Categorical Cross-Entropy (Multi-Class)
6.1 Setup
Output: logits z = [z₁, z₂, ..., zₖ] (raw scores)
Softmax: p = softmax(z) (probabilities)
Target: y = one-hot vector (e.g., [0,0,1,0])
L = -Σᵢ yᵢ·log(pᵢ) = -log(p_target)
6.2 Fluxion Backward
Through softmax + cross-entropy:
L̇ᶻᵢ = pᵢ - yᵢ
Same beautiful form! Gradient = prediction - target (per class).
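The same finite-difference check works for the multi-class case (a NumPy sketch; logits and target are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])  # one-hot target: class 1

p = softmax(z)
grad = p - y  # claimed gradient: p_i - y_i per logit

# Finite-difference check on the target logit
def ce(z):
    return -np.log(softmax(z)[1])

h = 1e-6
num = (ce(z + h * np.eye(3)[1]) - ce(z - h * np.eye(3)[1])) / (2 * h)
# num agrees with grad[1]
```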
6.3 Numerical Stability: LogSoftmax
Naive computation:
p = exp(z) / sum(exp(z)) # Can overflow!
L = -log(p[target])
Stable computation:
log_p = z - logsumexp(z) # LogSoftmax
L = -log_p[target]
PyTorch provides F.cross_entropy(logits, targets) which fuses this.
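The stabilization trick is just a max-shift inside logsumexp; a minimal NumPy sketch:

```python
import numpy as np

def logsumexp(z):
    """Stable log(sum(exp(z))): subtract the max so exp never overflows."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

z = np.array([1000.0, 1001.0, 999.0])  # naive exp(z) would overflow to inf
log_p = z - logsumexp(z)               # stable LogSoftmax
# np.exp(log_p) is a valid probability distribution (sums to 1), no overflow
```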
7. Focal Loss (Hard Example Mining)
7.1 The Problem with Cross-Entropy
Easy examples (high confidence, correct) still contribute gradient. In imbalanced datasets, easy examples dominate training.
7.2 Focal Loss Definition
L = -αₜ · (1-pₜ)ᵞ · log(pₜ)
Where pₜ = probability of TRUE class
α = class weight
γ = focusing parameter (typically 2)
7.3 Fluxion Analysis
The (1-pₜ)ᵞ term modulates gradient:
| pₜ (confidence) | (1-pₜ)² | Effect |
|---|---|---|
| 0.9 (easy) | 0.01 | Gradient reduced 100x |
| 0.5 (medium) | 0.25 | Gradient reduced 4x |
| 0.1 (hard) | 0.81 | Nearly full gradient |
7.4 Fluxion Backward
L̇ᵖₜ = -αₜ · [(1-pₜ)ᵞ / pₜ - γ·(1-pₜ)ᵞ⁻¹ · log(pₜ)]
Hard examples (low pₜ) get amplified flow. Easy examples get suppressed flow.
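The down-weighting of easy examples is visible by comparing plain CE with the focal-scaled loss (a sketch with α=1; values are illustrative):

```python
import numpy as np

def focal_loss(p_t, alpha=1.0, gamma=2.0):
    """Focal loss: CE scaled by the modulating factor (1 - p_t)^gamma."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

for p_t in [0.9, 0.5, 0.1]:
    ce = -np.log(p_t)
    print(f"p_t={p_t}: CE={ce:.3f}, focal={focal_loss(p_t):.3f}")
# The (1 - p_t)^2 factor shrinks the easy example's (p_t=0.9) loss ~100x,
# while the hard example (p_t=0.1) keeps ~81% of its CE loss.
```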
7.5 Use Case
Object detection (RetinaNet) - vast majority of proposals are "background" (easy negatives). Focal loss prevents easy negatives from dominating.
8. KL Divergence
8.1 Definition
KL(P || Q) = Σᵢ pᵢ · log(pᵢ/qᵢ)
= Σᵢ pᵢ · log(pᵢ) - Σᵢ pᵢ · log(qᵢ)
= -H(P) + H(P,Q)
= Cross-entropy - Entropy
8.2 Fluxion Backward (w.r.t. Q)
L̇qᵢ = -pᵢ / qᵢ
English: "Gradient is large where P has mass but Q doesn't."
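A small NumPy sketch makes the asymmetry concrete (distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) for discrete distributions."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])

grad_q = -p / q  # L-dot_q_i = -p_i / q_i
# Largest magnitude at index 0, where P has mass 0.7 but Q only 0.1
```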
8.3 Use in ML
- VAE: KL between latent distribution and prior
- Distillation: KL between student and teacher outputs
- Regularization: KL toward some reference distribution
9. Contrastive Losses
9.1 InfoNCE / NT-Xent
L = -log(exp(sim(z,z⁺)/τ) / Σⱼ exp(sim(z,zⱼ)/τ))
Where z⁺ = positive sample
zⱼ = all samples (including negatives)
τ = temperature
9.2 Fluxion View
This is just cross-entropy over similarity scores!
logits = similarities / τ
target = index of positive sample
L = CrossEntropy(logits, target)
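A self-contained NumPy sketch of this reduction (function name, dimensions, and data are illustrative):

```python
import numpy as np

def info_nce(z, candidates, pos_idx, tau=0.1):
    """InfoNCE as cross-entropy over cosine similarities / tau."""
    z = z / np.linalg.norm(z)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (c @ z) / tau          # similarity scores, scaled by temperature
    logits = logits - logits.max()  # stable softmax
    log_p = logits - np.log(np.exp(logits).sum())
    return -log_p[pos_idx]          # CE with the positive's index as target

rng = np.random.default_rng(0)
z = rng.normal(size=8)
cands = np.vstack([z + 0.01 * rng.normal(size=8),  # near-duplicate positive
                   rng.normal(size=(4, 8))])       # random negatives
loss = info_nce(z, cands, pos_idx=0)
```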
9.3 Temperature τ
τ → 0: Sharp distribution, only closest match matters
τ → ∞: Flat distribution, all matches contribute equally
Temperature controls "how picky" the contrastive objective is.
10. Regression vs Classification Summary
10.1 Regression Losses
| Loss | L̇ŷ (per sample, omitting 1/n) | Best For |
|---|---|---|
| MSE | 2(ŷ-y) | Normal errors |
| MAE | sign(ŷ-y) | Outlier robustness |
| Huber | clipped | Both |
10.2 Classification Losses
| Loss | L̇ᶻ | Best For |
|---|---|---|
| Cross-Entropy | p - y | Balanced classes |
| Focal | weighted (p-y) | Imbalanced classes |
| Label Smoothing CE | p - y_smooth | Calibration |
11. Label Smoothing
11.1 The Idea
Instead of hard targets [0, 0, 1, 0], use soft targets:
y_smooth = (1-ε)·y_hard + ε/K
Where ε = smoothing factor (e.g., 0.1)
K = number of classes
Hard target [0, 0, 1, 0] → Soft [0.025, 0.025, 0.925, 0.025]
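The smoothing formula in one NumPy helper (a sketch; the function name is illustrative):

```python
import numpy as np

def smooth_labels(y_hard, eps=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = y_hard.size
    return (1 - eps) * y_hard + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))  # [0.025, 0.025, 0.925, 0.025]
```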
11.2 Fluxion Effect
L̇ᶻᵢ = pᵢ - y_smoothᵢ
Now gradient never goes fully to zero for wrong classes. Network can't be "infinitely confident."
11.3 Why It Helps
- Prevents overconfidence
- Better calibration
- Regularization effect
12. The Unified View
12.1 All Losses Are Error Signals
L = f(ŷ, y) # Some function of prediction and target
L̇ŷ = ∂f/∂ŷ # Error signal that flows backward
The backward pass doesn't care about the loss formula. It only needs L̇ŷ - the direction to push predictions.
12.2 Designing Losses
Want to emphasize hard examples? → Amplify their L̇ŷ (focal loss)
Want robustness to outliers? → Cap L̇ŷ magnitude (Huber)
Want calibrated probabilities? → Smooth targets (label smoothing)
The fluxion view makes loss design intuitive: "What gradient do I want for each (prediction, target) pair?"
13. Implementation Notes
13.1 Numerical Stability
Always use fused implementations:
# BAD (can overflow/underflow):
p = softmax(logits)
loss = -log(p[target])
# GOOD (numerically stable):
loss = F.cross_entropy(logits, target) # Fused LogSoftmax + NLLLoss
13.2 Reduction
# Per-sample losses
losses = F.cross_entropy(logits, targets, reduction='none')
# Mean (default)
loss = losses.mean()
# Scaled mean (for gradient accumulation: each micro-batch
# contributes 1/accumulation_steps of the effective-batch gradient)
loss = losses.mean() / accumulation_steps
References
- Shannon (1948). "A Mathematical Theory of Communication."
- Lin et al. (2017). "Focal Loss for Dense Object Detection."
- Szegedy et al. (2016). "Rethinking the Inception Architecture for Computer Vision." (Label smoothing)
- Huber (1964). "Robust Estimation of a Location Parameter."
Correspondence: scott@opentransformers.online