
Loss Functions via the Method of Fluxions

Cross-Entropy, MSE, and Friends: What Your Network Actually Minimizes

Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026


Abstract

Loss functions are typically presented as formulas to memorize. We reformulate common losses using fluxions, revealing their geometric meaning: cross-entropy measures "surprise flow," MSE measures "squared distance flow," and focal loss amplifies flow from hard examples. The backward pass becomes intuitive: each loss simply tells us "how much the output should wiggle to reduce error."


1. What Is a Loss?

1.1 The Setup

Network output: ŷ (prediction)
Ground truth: y (target)
Loss: L(ŷ, y) (how wrong we are)

1.2 Fluxion View

The loss L is a scalar. What we need is its fluxion with respect to the prediction, L̇ŷ = ∂L/∂ŷ: "how does the loss wiggle when the prediction wiggles?"

This gradient is the SIGNAL that flows backward through the network.


2. Mean Squared Error (MSE)

2.1 Definition

L = (1/n) Σᵢ (ŷᵢ - yᵢ)²

2.2 Fluxion Backward

L̇ŷᵢ = (2/n) · (ŷᵢ - yᵢ)

English: "Gradient is proportional to error."

  • Overpredict by 0.1 → gradient +0.2/n; descent pushes the prediction down
  • Underpredict by 0.5 → gradient -1.0/n; descent pushes the prediction up
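
These numbers can be checked with a tiny sketch (pure Python; `mse_grad` is an illustrative helper name, not a library function):

```python
# Fluxion of MSE: L-dot w.r.t. y_hat_i = (2/n) * (y_hat_i - y_i).
# mse_grad is an illustrative name, not part of any library.
def mse_grad(y_hat, y):
    n = len(y_hat)
    return [(2.0 / n) * (p - t) for p, t in zip(y_hat, y)]

# n=2: overpredict the first target by 0.1, underpredict the second by 0.5
grads = mse_grad([0.6, 1.0], [0.5, 1.5])
# grads[0] is about +0.1 (descent pushes prediction down)
# grads[1] is about -0.5 (descent pushes prediction up)
```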

2.3 Geometric Interpretation

MSE gradient points directly from prediction toward target.

     target
       ↓
   y ←←←← ŷ
     gradient

Larger error = larger gradient = faster correction.

2.4 Problem

Outliers dominate. One sample with error=10 contributes 100 to loss. Gradient from outliers drowns out normal samples.


3. Mean Absolute Error (MAE / L1)

3.1 Definition

L = (1/n) Σᵢ |ŷᵢ - yᵢ|

3.2 Fluxion Backward

L̇ŷᵢ = (1/n) · sign(ŷᵢ - yᵢ)

English: "Gradient is ±1/n regardless of error magnitude."

3.3 Comparison with MSE

Error   MSE Gradient   MAE Gradient
0.1     0.2/n          1/n
1.0     2.0/n          1/n
10.0    20.0/n         1/n

MAE is robust to outliers - constant gradient regardless of error size.

3.4 Problem

The gradient is discontinuous at ŷ = y: it stays at ±1/n instead of shrinking as the error shrinks, so updates can oscillate around the target rather than settle.


4. Huber Loss (Smooth L1)

4.1 The Best of Both

L = { 0.5·(ŷ-y)²        if |ŷ-y| < δ
    { δ·|ŷ-y| - 0.5·δ²  otherwise

4.2 Fluxion Backward

L̇ŷ = { (ŷ-y)           if |ŷ-y| < δ    (MSE region)
      { δ·sign(ŷ-y)     otherwise       (MAE region)

English:

  • Small errors: MSE behavior (proportional gradient)
  • Large errors: MAE behavior (capped gradient)

4.3 Why It Works

  • Near target: smooth, quadratic convergence
  • Far from target: robust, outlier-resistant
  • δ controls the transition (typically δ=1)
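
The piecewise definition and its fluxion translate directly into code (a sketch; `huber` and `huber_grad` are illustrative names, not a library API):

```python
# Huber loss and its fluxion, matching the piecewise definition above.
# huber / huber_grad are illustrative names, not a library API.
def huber(err, delta=1.0):
    if abs(err) < delta:
        return 0.5 * err * err                      # MSE region
    return delta * abs(err) - 0.5 * delta * delta   # MAE region

def huber_grad(err, delta=1.0):
    if abs(err) < delta:
        return err                                  # proportional, like MSE
    return delta * (1.0 if err > 0 else -1.0)       # capped, like MAE

# Small error: proportional gradient; large error: gradient capped at delta
small = huber_grad(0.3)    # 0.3
large = huber_grad(10.0)   # 1.0, no matter how large the error
```

Note that the two branches meet smoothly at |err| = δ, which is exactly why the -0.5·δ² offset appears in the MAE region.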

5. Cross-Entropy (Classification)

5.1 Binary Cross-Entropy

L = -[y·log(p) + (1-y)·log(1-p)]

Where p = sigmoid(ŷ) = probability of class 1

5.2 Fluxion Backward (through sigmoid)

The magic of cross-entropy + sigmoid:

L̇ŷ = p - y

That's it. Gradient = prediction - target.
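
The identity is easy to verify numerically by differentiating BCE∘sigmoid with a central finite difference (pure-Python sketch; names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    # Binary cross-entropy applied to sigmoid(z)
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y = 0.8, 1.0
analytic = sigmoid(z) - y                          # the claimed p - y
h = 1e-6
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)
# analytic and numeric agree closely
```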

5.3 Why This Is Beautiful

Truth (y)   Prediction (p)   Gradient (p-y)
1           0.9              -0.1 (push up slightly)
1           0.1              -0.9 (push up hard!)
0           0.9              +0.9 (push down hard!)
0           0.1              +0.1 (push down slightly)

  • Confident AND wrong → huge gradient
  • Confident AND right → tiny gradient
  • Uncertain → medium gradient

5.4 Information Theory View

Cross-entropy = "average surprise"

-log(p) = surprise at seeing outcome with probability p

If p=0.99 and event happens: -log(0.99) ≈ 0.01 (not surprised)
If p=0.01 and event happens: -log(0.01) ≈ 4.6 (very surprised!)

Minimizing cross-entropy = minimizing average surprise = learning to predict well.


6. Categorical Cross-Entropy (Multi-Class)

6.1 Setup

Output: logits z = [z₁, z₂, ..., zₖ] (raw scores)
Softmax: p = softmax(z)             (probabilities)
Target: y = one-hot vector          (e.g., [0,0,1,0])

L = -Σᵢ yᵢ·log(pᵢ) = -log(p_target)

6.2 Fluxion Backward

Through softmax + cross-entropy:

L̇ᶻᵢ = pᵢ - yᵢ

Same beautiful form! Gradient = prediction - target (per class).
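
The same finite-difference check works through softmax (pure-Python sketch; helper names are illustrative):

```python
import math

def softmax(z):
    m = max(z)                         # shift by the max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce(z, target):
    return -math.log(softmax(z)[target])

z, target = [2.0, 1.0, 0.1], 0
p = softmax(z)
# Analytic fluxion per logit: p_i - y_i
analytic = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

# Numerical gradient for the non-target logit z[1]
h = 1e-6
z_plus  = [z[0], z[1] + h, z[2]]
z_minus = [z[0], z[1] - h, z[2]]
numeric = (ce(z_plus, target) - ce(z_minus, target)) / (2 * h)
# numeric matches analytic[1] = p[1]
```

A nice sanity check: the per-class gradients sum to zero, since Σpᵢ = 1 and Σyᵢ = 1.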

6.3 Numerical Stability: LogSoftmax

Naive computation:

p = exp(z) / sum(exp(z))    # Can overflow!
L = -log(p[target])

Stable computation:

log_p = z - logsumexp(z)    # LogSoftmax
L = -log_p[target]

PyTorch provides F.cross_entropy(logits, targets) which fuses this.
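
The shift trick behind logsumexp can be spelled out in a few lines (pure-Python sketch; PyTorch's fused kernel does the equivalent internally):

```python
import math

def log_softmax(z):
    m = max(z)                                            # subtract the max first
    lse = m + math.log(sum(math.exp(v - m) for v in z))   # stable logsumexp
    return [v - lse for v in z]

z = [1000.0, 0.0]        # naive math.exp(1000.0) overflows a float
log_p = log_softmax(z)   # fine: log_p[0] is ~0.0, log_p[1] is ~-1000.0
loss = -log_p[0]         # cross-entropy if class 0 is the target
```

Subtracting the max makes every exponent ≤ 0, so nothing can overflow; underflow of tiny terms to 0.0 inside the sum is harmless.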


7. Focal Loss (Hard Example Mining)

7.1 The Problem with Cross-Entropy

Easy examples (high confidence, correct) still contribute gradient. In imbalanced datasets, easy examples dominate training.

7.2 Focal Loss Definition

L = -αₜ · (1-pₜ)ᵞ · log(pₜ)

Where pₜ = probability of TRUE class
      α = class weight
      γ = focusing parameter (typically 2)

7.3 Fluxion Analysis

The (1-pₜ)ᵞ term modulates gradient:

pₜ (confidence)   (1-pₜ)²   Effect
0.9 (easy)        0.01      Gradient reduced 100x
0.5 (medium)      0.25      Gradient reduced 4x
0.1 (hard)        0.81      Nearly full gradient

7.4 Fluxion Backward

L̇ᵖₜ = -αₜ · [(1-pₜ)ᵞ / pₜ - γ·(1-pₜ)ᵞ⁻¹ · log(pₜ)]

Hard examples (low pₜ) get amplified flow. Easy examples get suppressed flow.

7.5 Use Case

Object detection (RetinaNet) - vast majority of proposals are "background" (easy negatives). Focal loss prevents easy negatives from dominating.
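
The (1-pₜ)ᵞ modulation is easy to see numerically (pure-Python sketch of the binary case from Lin et al.; `ce` and `focal` are illustrative names, and the α class weight is omitted):

```python
import math

def ce(p_t):
    # Plain cross-entropy on the true-class probability
    return -math.log(p_t)

def focal(p_t, gamma=2.0):
    # Focal modulation: down-weight confident (easy) examples
    return (1 - p_t) ** gamma * ce(p_t)

# Easy example (p_t = 0.9): loss scaled by (1-0.9)^2 = 0.01, i.e. ~100x smaller.
# Hard example (p_t = 0.1): scaled by (1-0.1)^2 = 0.81, nearly untouched.
easy_ratio = focal(0.9) / ce(0.9)
hard_ratio = focal(0.1) / ce(0.1)
```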


8. KL Divergence

8.1 Definition

KL(P || Q) = Σᵢ pᵢ · log(pᵢ/qᵢ)
           = Σᵢ pᵢ · log(pᵢ) - Σᵢ pᵢ · log(qᵢ)
           = -H(P) + H(P,Q)
           = Cross-entropy - Entropy

8.2 Fluxion Backward (w.r.t. Q)

L̇qᵢ = -pᵢ / qᵢ

English: "Gradient is large where P has mass but Q doesn't."

8.3 Use in ML

  • VAE: KL between latent distribution and prior
  • Distillation: KL between student and teacher outputs
  • Regularization: KL toward some reference distribution
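
The definition and its fluxion fit in a few lines (pure-Python sketch; assumes P and Q are plain probability vectors with no zero entries in Q):

```python
import math

def kl(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]   # "truth"
q = [0.1, 0.3, 0.6]   # model
# Fluxion w.r.t. q_i: -p_i / q_i
grads = [-pi / qi for pi, qi in zip(p, q)]
# grads[0] = -7.0: the strongest pull is exactly where P has mass but Q doesn't
```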

9. Contrastive Losses

9.1 InfoNCE / NT-Xent

L = -log(exp(sim(z,z⁺)/τ) / Σⱼ exp(sim(z,zⱼ)/τ))

Where z⁺ = positive sample
      zⱼ = all samples (including negatives)
      τ = temperature

9.2 Fluxion View

This is just cross-entropy over similarity scores!

logits = similarities / τ
target = index of positive sample
L = CrossEntropy(logits, target)

9.3 Temperature τ

τ → 0: Sharp distribution, only closest match matters
τ → ∞: Flat distribution, all matches contribute equally

Temperature controls "how picky" the contrastive objective is.
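
Both the "InfoNCE is cross-entropy" view and the temperature effect can be checked in a toy setting (pure-Python sketch; similarity values are made up for illustration):

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], via stable logsumexp
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]

sims = [0.9, 0.2, -0.5]   # sim(z, z+) first, then negatives
tau_sharp = 0.1
tau_flat = 1000.0

loss_sharp = cross_entropy([s / tau_sharp for s in sims], target=0)
loss_flat = cross_entropy([s / tau_flat for s in sims], target=0)
# Small tau: logits spread out, positive dominates, loss near 0.
# Huge tau: logits collapse together, loss approaches log(K) = log(3).
```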


10. Regression vs Classification Summary

10.1 Regression Losses

Loss    L̇ŷ          Best For
MSE     2(ŷ-y)      Normal errors
MAE     sign(ŷ-y)   Outlier robustness
Huber   clipped     Both

10.2 Classification Losses

Loss                 L̇ᶻ              Best For
Cross-Entropy        p - y           Balanced classes
Focal                weighted (p-y)  Imbalanced classes
Label Smoothing CE   p - y_smooth    Calibration

11. Label Smoothing

11.1 The Idea

Instead of hard targets [0, 0, 1, 0], use soft targets:

y_smooth = (1-ε)·y_hard + ε/K

Where ε = smoothing factor (e.g., 0.1)
      K = number of classes

Hard target [0, 0, 1, 0] → Soft [0.025, 0.025, 0.925, 0.025]
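
The transformation above is one line of code (`smooth_targets` is an illustrative helper, not a library call; PyTorch exposes the same idea via the `label_smoothing` argument of `F.cross_entropy`):

```python
# Soft targets: y_smooth = (1 - eps) * y_hard + eps / K
# smooth_targets is an illustrative name, not a library function.
def smooth_targets(hard, eps=0.1):
    K = len(hard)
    return [(1 - eps) * h + eps / K for h in hard]

y_smooth = smooth_targets([0, 0, 1, 0], eps=0.1)
# Roughly [0.025, 0.025, 0.925, 0.025], matching the example above;
# the entries still sum to 1, so y_smooth is a valid distribution.
```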

11.2 Fluxion Effect

L̇ᶻᵢ = pᵢ - y_smoothᵢ

Now gradient never goes fully to zero for wrong classes. Network can't be "infinitely confident."

11.3 Why It Helps

  • Prevents overconfidence
  • Better calibration
  • Regularization effect

12. The Unified View

12.1 All Losses Are Error Signals

L = f(ŷ, y)           # Some function of prediction and target
L̇ŷ = ∂f/∂ŷ           # Error signal that flows backward

The backward pass doesn't care about the loss formula. It only needs L̇ŷ - the direction to push predictions.

12.2 Designing Losses

  • Want to emphasize hard examples? → Amplify their L̇ŷ (focal loss)
  • Want robustness to outliers? → Cap L̇ŷ magnitude (Huber)
  • Want calibrated probabilities? → Smooth targets (label smoothing)

The fluxion view makes loss design intuitive: "What gradient do I want for each (prediction, target) pair?"


13. Implementation Notes

13.1 Numerical Stability

Always use fused implementations:

# BAD (can overflow/underflow):
p = softmax(logits)
loss = -log(p[target])

# GOOD (numerically stable):
loss = F.cross_entropy(logits, target)  # Fused LogSoftmax + NLLLoss

13.2 Reduction

# Per-sample losses
losses = F.cross_entropy(logits, targets, reduction='none')

# Mean (default)
loss = losses.mean()

# Scaled for gradient accumulation (each micro-batch contributes 1/accumulation_steps)
loss = losses.mean() / accumulation_steps

References

  1. Shannon (1948). "A Mathematical Theory of Communication."
  2. Lin et al. (2017). "Focal Loss for Dense Object Detection."
  3. Szegedy et al. (2016). "Rethinking the Inception Architecture for Computer Vision." (Label smoothing)
  4. Huber (1964). "Robust Estimation of a Location Parameter."

Correspondence: scott@opentransformers.online