
Gradient Descent and Optimizers via the Method of Fluxions

From SGD to AdamW: A Newtonian Perspective

Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026


Abstract

Neural network optimizers are typically presented as update rules with cryptic Greek letters (β₁, β₂, ε) and little intuition for why they work. We reformulate gradient descent, momentum, RMSprop, Adam, and AdamW using Newton's method of fluxions. In this framework, optimization becomes physical: weights flow through parameter space, momentum is literal velocity, and adaptive learning rates emerge from measuring flow variance. This perspective reveals why certain hyperparameter choices work and suggests principled modifications.


1. The Optimization Problem

1.1 What We Want

Find weights W that minimize loss L(W).

1.2 The Fluxion Framing

Imagine weights as particles flowing through parameter space. The loss function L(W) defines a landscape—hills and valleys. We want weights to flow downhill to the lowest valley.

Key quantities:

| Symbol | Meaning |
|--------|---------|
| W | Position in weight space |
| Ẇ | Velocity (how weights flow) |
| Ẅ | Acceleration (how velocity changes) |
| L̇ᵂ | Gradient (which direction is uphill) |
| g | Shorthand for L̇ᵂ (the gradient) |

2. Vanilla Gradient Descent

2.1 The Update Rule

Leibniz (opaque):

W_{t+1} = W_t - η · ∂L/∂W

Fluxion (physical):

Ẇ = -η · g

"Weights flow opposite to gradient, scaled by learning rate"
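This flow can be stepped directly. A minimal sketch (illustrative, not from any library) on the toy landscape L(W) = W², whose gradient is g = 2W:

```python
# Vanilla gradient descent as a flow: Ẇ = -η·g on L(W) = W^2.
eta = 0.1            # learning rate (flow speed)
W = 5.0              # initial position in weight space

for _ in range(100):
    g = 2 * W        # gradient of L(W) = W^2 points uphill
    W = W - eta * g  # flow opposite to the gradient

# W has flowed down to the minimum at W = 0
```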

2.2 Physical Interpretation

Imagine a ball on a hill:

  • g = L̇ᵂ points uphill (steepest ascent)
  • -g points downhill
  • η controls flow speed

The ball has no mass, no inertia—it teleports in the downhill direction each step.

2.3 Problems

  1. Ravine oscillation: Narrow valleys cause zig-zagging
  2. Flat region stalling: Tiny gradient = tiny movement
  3. Uniform speed: Same η for all parameters, regardless of curvature

3. Momentum: Adding Inertia

3.1 The Idea

Give the ball mass. Let it build up speed.

3.2 Fluxion Formulation

Introduce velocity v as a separate state:

v̇ = β · v + g          # Velocity accumulates gradient (with decay)
Ẇ = -η · v             # Position flows with velocity

Physical interpretation:

  • β = friction coefficient (0.9 = low friction, velocity persists)
  • v accumulates gradient over time
  • Ball builds momentum rolling downhill
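The two coupled flows can be stepped explicitly. A toy sketch (illustrative values, reusing the L(W) = W² landscape):

```python
eta, beta = 0.1, 0.9
W, v = 5.0, 0.0      # position and velocity; the ball starts at rest

for _ in range(400):
    g = 2 * W        # gradient of the toy landscape L(W) = W^2
    v = beta * v + g # v̇: velocity accumulates gradient, decaying by beta
    W = W - eta * v  # Ẇ: position flows with velocity

# the ball has rolled down to the minimum at W = 0
```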

3.3 Why It Helps

Ravine problem solved:

  • Side-to-side gradients cancel out in v
  • Down-the-valley gradients accumulate
  • Ball rolls straight down valley floor

Flat regions:

  • Momentum carries ball through plateaus
  • Previous velocity persists even when current gradient is small

3.4 The β Parameter

β = 0.0: No momentum, vanilla GD
β = 0.9: Standard choice, 10-step effective memory
β = 0.99: Heavy ball, 100-step memory

Effective memory ≈ 1/(1-β) steps
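The 1/(1-β) figure is just the sum of the geometric decay weights. A quick standalone check:

```python
beta = 0.9
# a gradient from k steps ago survives in v with weight beta^k
weights = [beta**k for k in range(1000)]
effective_memory = sum(weights)   # geometric series -> 1/(1-beta) = 10
```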


4. Nesterov Momentum: Look Before You Leap

4.1 The Problem with Standard Momentum

Ball computes gradient at current position, then moves. But it's GOING to move with velocity v anyway. Why not compute gradient at where we're GOING to be?

4.2 Fluxion Formulation

W_ahead = W - η·β·v           # Where momentum will carry us (since Ẇ = -η·v)
g_ahead = L̇ᵂ(W_ahead)        # Gradient at future position
v̇ = β · v + g_ahead          # Update velocity with lookahead gradient
Ẇ = -η · v
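Stepping the lookahead explicitly (a toy sketch on L(W) = W²; because the position flows as Ẇ = -η·v, the point momentum will carry us to is W - η·β·v):

```python
eta, beta = 0.1, 0.9
W, v = 5.0, 0.0

def grad(w):
    return 2 * w                     # toy landscape L(W) = W^2

for _ in range(400):
    W_ahead = W - eta * beta * v     # predict where momentum will carry us
    v = beta * v + grad(W_ahead)     # velocity update uses the lookahead slope
    W = W - eta * v
```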

4.3 Physical Interpretation

"Look downhill from where you'll land, not where you stand."

The ball predicts its next position, evaluates the slope THERE, then adjusts.

4.4 Why It Helps

  • Anticipates overshooting
  • Dampens oscillations faster
  • Converges slightly faster in practice

5. AdaGrad: Adaptive Learning Rates

5.1 The Problem

Some parameters get huge gradients, others tiny. Uniform η is wrong for both.

5.2 The Idea

Track cumulative squared gradient per parameter. Scale learning rate inversely.

5.3 Fluxion Formulation

ṡ = s + g²                    # Accumulate squared gradient (elementwise)
Ẇ = -η · g / (√s + ε)         # Scale by inverse sqrt of accumulator

5.4 Physical Interpretation

s measures "how much this parameter has been pushed historically."

  • High s → parameter was pushed a lot → reduce sensitivity
  • Low s → parameter barely moved → increase sensitivity

5.5 Problem

s only grows. Learning rate only shrinks. Eventually ALL learning rates → 0. Training stalls.
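The stall is easy to see numerically. A sketch (illustrative) feeding a persistent unit gradient into the accumulator:

```python
eta, eps = 0.1, 1e-8
s = 0.0
step_sizes = []

for t in range(10000):
    g = 1.0                       # the gradient never gets smaller...
    s = s + g**2                  # ...but the accumulator only ever grows
    step_sizes.append(eta * g / (s**0.5 + eps))

# the effective step decays like eta / sqrt(t): training slows to a crawl
```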


6. RMSprop: Exponential Moving Average Fix

6.1 The Fix

Don't accumulate forever. Use exponential moving average.

6.2 Fluxion Formulation

ṡ = β · s + (1-β) · g²        # EMA of squared gradient
Ẇ = -η · g / (√s + ε)         # Adaptive scaling

6.3 Physical Interpretation

s now measures "recent gradient variance."

  • High recent variance → parameter is noisy → take smaller steps
  • Low recent variance → parameter is stable → take larger steps

6.4 The β Parameter (typically 0.99)

β = 0.99: ~100 step memory for variance estimate
β = 0.9:  ~10 step memory (more reactive)
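Contrast with AdaGrad: under the same persistent unit gradient, the EMA accumulator stays bounded, so the step size does not vanish. A sketch:

```python
eta, beta, eps = 0.01, 0.99, 1e-8
s = 0.0

for t in range(10000):
    g = 1.0
    s = beta * s + (1 - beta) * g**2   # EMA converges to g^2 = 1, not infinity

step = eta * g / (s**0.5 + eps)        # stays ~eta instead of decaying to 0
```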

7. Adam: Best of Both Worlds

7.1 The Combination

Adam = Momentum + RMSprop

Track BOTH:

  • First moment (mean gradient) → momentum
  • Second moment (gradient variance) → adaptive rate

7.2 Fluxion Formulation

# First moment: momentum
ṁ = β₁ · m + (1-β₁) · g

# Second moment: variance
v̇ = β₂ · v + (1-β₂) · g²

# Bias correction (important at start!)
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

# Update
Ẇ = -η · m̂ / (√v̂ + ε)
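Put together as one loop, the flows look like this (an illustrative scalar sketch on L(W) = W², not a library implementation):

```python
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
W, m, v = 5.0, 0.0, 0.0

for t in range(1, 2001):                 # t starts at 1 for bias correction
    g = 2 * W                            # gradient of L(W) = W^2
    m = b1 * m + (1 - b1) * g            # first moment: smoothed direction
    v = b2 * v + (1 - b2) * g**2         # second moment: smoothed variance
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    W = W - eta * m_hat / (v_hat**0.5 + eps)
```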

7.3 Physical Interpretation

  • m = smoothed direction (where to flow)
  • v = smoothed magnitude variance (how carefully to flow)

"Flow in the average recent direction, at speed inversely proportional to recent bumpiness."

7.4 Bias Correction: Why?

At t=0, m=0 and v=0. First update: m = (1-β₁)·g ≈ 0.1·g (biased low!)

Division by (1-β₁ᵗ) corrects:

  • t=1: divide by 0.1 → correct scale
  • t=∞: divide by 1.0 → no correction needed
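Concretely, with β₁ = 0.9 the very first moment estimate is rescaled back to the raw gradient exactly:

```python
b1 = 0.9
g = 4.0                   # first gradient ever seen
m = (1 - b1) * g          # m = 0.4: biased low by a factor of 10
m_hat = m / (1 - b1**1)   # divide by 0.1 -> recovers the true scale
```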

7.5 Standard Hyperparameters

β₁ = 0.9    # Momentum coefficient (~10 step memory)
β₂ = 0.999  # Variance coefficient (~1000 step memory)  
ε = 1e-8    # Numerical stability (prevents division by zero)
η = 0.001   # Base learning rate

8. AdamW: Weight Decay Done Right

8.1 The Problem with L2 Regularization

Original Adam with L2 regularization:

g_reg = g + λ·W              # Add weight penalty to gradient
ṁ = β₁·m + (1-β₁)·g_reg     # Momentum includes penalty

Problem: The adaptive scaling also scales the weight decay! Large weights with small gradients get LESS decay, not more.

8.2 AdamW: Decoupled Weight Decay

# Moments on RAW gradient (no weight penalty)
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²

# Bias correction
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)

# Update with SEPARATE weight decay
Ẇ = -η · (m̂/(√v̂+ε) + λ·W)
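The difference shows up directly in the size of the decay applied. A scalar sketch with assumed state values (v̂ = 100 models a noisy parameter; m̂ = 0 for clarity):

```python
lam, eta, eps = 0.1, 0.001, 1e-8
W = 2.0
m_hat, v_hat = 0.0, 100.0    # assumed moment estimates for a noisy parameter

# Adam + L2: the penalty rides inside the gradient, so it gets
# divided by sqrt(v_hat) along with everything else
adam_l2_decay = eta * (m_hat + lam * W) / (v_hat**0.5 + eps)

# AdamW: the penalty is applied outside the adaptive scaling
adamw_decay = eta * (m_hat / (v_hat**0.5 + eps) + lam * W)

# the noisy parameter gets 10x less decay under Adam+L2 -- backwards!
```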

8.3 Physical Interpretation

Two separate forces on each weight:

  1. Gradient force: Push toward lower loss
  2. Decay force: Pull toward zero (regularization)

AdamW keeps these forces separate. Original Adam mixed them, causing the decay force to be scaled by the adaptive rate.

8.4 Why It Matters

AdamW consistently outperforms Adam+L2 on language models. The "W" stands for "decoupled Weight decay."


9. Complete Algorithm Comparison

9.1 In Fluxion Notation

SGD:

Ẇ = -η·g

SGD + Momentum:

v̇ = β·v + g
Ẇ = -η·v

RMSprop:

ṡ = β·s + (1-β)·g²
Ẇ = -η·g/(√s+ε)

Adam:

ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)

AdamW:

ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·(m̂/(√v̂+ε) + λ·W)

9.2 State Required

| Optimizer | States per parameter |
|-----------|----------------------|
| SGD | 0 |
| Momentum | 1 (velocity) |
| RMSprop | 1 (variance) |
| Adam | 2 (momentum + variance) |
| AdamW | 2 (same as Adam) |

Adam's optimizer state alone takes twice as much memory as the weights themselves! For large models, this matters.


10. Learning Rate Schedules

10.1 The Problem

Fixed η is suboptimal:

  • Early training: large steps okay, landscape is far from optimum
  • Late training: need precision, should take smaller steps

10.2 Common Schedules in Fluxion Terms

Constant:

η̇ = 0     (η never changes)

Linear decay:

η̇ = -η₀/T    (linear decrease to 0 over T steps)

Cosine decay:

η(t) = η_min + (η₀-η_min)·(1+cos(πt/T))/2

Warmup:

t < T_warm:  η(t) = η₀·t/T_warm     (ramp up)
t ≥ T_warm:  normal schedule        (then decay)
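The warmup-then-cosine combination can be written as one function (a sketch; the name and signature are ours, not from any library):

```python
import math

def lr_warmup_cosine(t, T, T_warm, eta0, eta_min=0.0):
    """Linear warmup for T_warm steps, then cosine decay to eta_min at step T."""
    if t < T_warm:
        return eta0 * t / T_warm                      # ramp up from 0
    progress = (t - T_warm) / (T - T_warm)            # 0 -> 1 after warmup
    return eta_min + (eta0 - eta_min) * (1 + math.cos(math.pi * progress)) / 2
```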

10.3 Why Warmup?

At initialization:

  • Weights are random
  • Gradients are huge and noisy
  • Adam's variance estimate (v) is zero

Large initial steps can destabilize training. Warmup lets variance estimates stabilize before taking big steps.


11. Gradient Clipping

11.1 The Problem

Occasionally, gradients explode (‖g‖ → ∞). One bad step can ruin training.

11.2 Fluxion Formulation

if ‖g‖ > max_norm:
    g ← g · (max_norm / ‖g‖)     # Rescale to max_norm
    
# Then proceed with normal optimizer

11.3 Physical Interpretation

"Cap the maximum force that can act on any weight."

No matter how steep the local slope, the ball can only accelerate so fast.
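A runnable version of the clipping rule, operating on a flat list of gradient components (a sketch; real frameworks clip the global norm across all parameter tensors at once):

```python
import math

def clip_by_norm(g, max_norm):
    """If ‖g‖ exceeds max_norm, rescale g so its norm equals max_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm > max_norm:
        return [x * (max_norm / norm) for x in g]
    return g
```

For example, clipping [3.0, 4.0] (norm 5) at max_norm 1.0 rescales it to norm 1 while preserving its direction.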


12. Implementation: Fused vs Unfused

12.1 The Computational Point

Mathematically equivalent formulations can have VERY different performance.

Unfused Adam (naive):

m = beta1 * m + (1-beta1) * g           # Read m, g, write m
v = beta2 * v + (1-beta2) * g**2        # Read v, g, write v  
m_hat = m / (1 - beta1**t)              # Read m, write m_hat
v_hat = v / (1 - beta2**t)              # Read v, write v_hat
W = W - lr * m_hat / (sqrt(v_hat) + eps) # Read W,m_hat,v_hat, write W

5 separate kernel launches, multiple memory round-trips.

Fused Adam:

# Single kernel: read g,m,v,W once, write m,v,W once
fused_adam_kernel(g, m, v, W, beta1, beta2, lr, eps, t)

1 kernel, 1 memory round-trip.

12.2 The Fluxion Insight

When written as flows:

ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)

These are clearly THREE coupled ODEs that should be integrated together. The flow notation suggests fusion naturally.

Leibniz notation hides this by writing separate update equations.


13. Second-Order Methods (Brief)

13.1 Newton's Method (the optimization one, not fluxions)

Use curvature (second derivative) information:

Ẇ = -H⁻¹·g

Where H is the Hessian: the matrix of second derivatives of L with respect to W.

13.2 Fluxion Interpretation

First-order (gradient descent): "Flow downhill" Second-order (Newton): "Flow toward the minimum, accounting for curvature"

If the landscape is a bowl, Newton's method jumps straight to the bottom in one step. Gradient descent spirals down gradually.
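The one-step claim is easy to verify on a scalar bowl. For the illustrative landscape L(W) = 3·(W-2)², the gradient is g = 6·(W-2) and the curvature is H = 6:

```python
W = 10.0
g = 6 * (W - 2)    # gradient of L(W) = 3*(W-2)^2
H = 6.0            # second derivative (scalar Hessian)
W = W - g / H      # Newton step: Ẇ = -H⁻¹·g

# W lands exactly on the minimum at W = 2 in a single step
```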

13.3 Why Not Used?

Computing H⁻¹ is O(n²) storage, O(n³) compute for n parameters. For n = 1 billion, this is impossible.

Approximations exist (L-BFGS, K-FAC) but Adam usually wins in practice.


14. Summary: Optimizer Selection

14.1 Quick Guide

| Situation | Optimizer |
|-----------|-----------|
| Simple convex problem | SGD + momentum |
| Deep networks, general | Adam |
| Language models | AdamW |
| Memory constrained | SGD + momentum |
| Fine-tuning | Adam/AdamW with lower LR |

14.2 The Unified View

All optimizers are just different ways of computing Ẇ from g:

Ẇ = f(g, history, W)
  • SGD: Ẇ = -η·g (no history)
  • Momentum: Ẇ = -η·EMA(g) (first moment history)
  • Adam: Ẇ = -η·EMA(g)/√EMA(g²) (first and second moment)
  • AdamW: Ẇ = -η·(EMA(g)/√EMA(g²) + λ·W) (plus decay force)

15. Conclusion

Optimizers become physical when viewed through fluxions:

  • Weights are particles with position W
  • Gradients are forces pushing uphill
  • Momentum is literal velocity
  • Adaptive rates measure local bumpiness
  • Weight decay is a restoring force toward origin

This isn't just pedagogy—the flow formulation naturally suggests:

  1. Fused implementations (coupled ODEs)
  2. Continuous-time analysis (neural ODEs)
  3. Novel optimizers (what other forces could we add?)

The math is equivalent, but the intuition is transformative.


References

  1. Ruder, S. (2016). "An overview of gradient descent optimization algorithms."
  2. Kingma & Ba (2014). "Adam: A Method for Stochastic Optimization."
  3. Loshchilov & Hutter (2017). "Decoupled Weight Decay Regularization." (AdamW)
  4. Newton, I. (1736). The Method of Fluxions.

Appendix: PyTorch Implementation

import torch


class AdamWFluxion:
    """AdamW in fluxion style - flows computed explicitly"""
    
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), 
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay
        self.t = 0
        
        # Flow states (m = momentum flow, v = variance flow)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
    
    def step(self):
        self.t += 1
        
        for i, W in enumerate(self.params):
            if W.grad is None:
                continue
                
            g = W.grad  # Gradient = L̇ᵂ
            
            # Momentum flow: ṁ = β₁·m + (1-β₁)·g
            self.m[i] = self.beta1 * self.m[i] + (1-self.beta1) * g
            
            # Variance flow: v̇ = β₂·v + (1-β₂)·g²
            self.v[i] = self.beta2 * self.v[i] + (1-self.beta2) * g**2
            
            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            
            # Weight flow: Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
            W_dot = -self.lr * (m_hat / (v_hat.sqrt() + self.eps)
                                + self.wd * W.data)
            
            # Apply flow
            W.data += W_dot

Correspondence: scott@opentransformers.online