Gradient Descent and Optimizers via the Method of Fluxions
From SGD to AdamW: A Newtonian Perspective
Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026
Abstract
Neural network optimizers are typically presented as update rules with cryptic Greek letters (β₁, β₂, ε) and little intuition for why they work. We reformulate gradient descent, momentum, RMSprop, Adam, and AdamW using Newton's method of fluxions. In this framework, optimization becomes physical: weights flow through parameter space, momentum is literal velocity, and adaptive learning rates emerge from measuring flow variance. This perspective reveals why certain hyperparameter choices work and suggests principled modifications.
1. The Optimization Problem
1.1 What We Want
Find weights W that minimize loss L(W).
1.2 The Fluxion Framing
Imagine weights as particles flowing through parameter space. The loss function L(W) defines a landscape—hills and valleys. We want weights to flow downhill to the lowest valley.
Key quantities:
| Symbol | Meaning |
|---|---|
| W | Position in weight space |
| Ẇ | Velocity (how weights flow) |
| Ẅ | Acceleration (how velocity changes) |
| L̇ᵂ | Gradient (which direction is uphill) |
| g | Shorthand for L̇ᵂ (the gradient) |
2. Vanilla Gradient Descent
2.1 The Update Rule
Leibniz (opaque):
W_{t+1} = W_t - η · ∂L/∂W
Fluxion (physical):
Ẇ = -η · g
"Weights flow opposite to gradient, scaled by learning rate"
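The flow is one line of code. A minimal sketch in plain Python, using the toy loss L(W) = W² (an illustrative choice, not from the text), whose gradient is g = 2W:

```python
def sgd_step(W, g, eta):
    # W_dot = -eta * g: flow opposite the gradient, scaled by learning rate
    return W - eta * g

W = 5.0            # start uphill
eta = 0.1
for _ in range(50):
    g = 2 * W      # gradient of the toy loss L(W) = W^2
    W = sgd_step(W, g, eta)
# W has flowed down to near the minimum at W = 0
```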
2.2 Physical Interpretation
Imagine a ball on a hill:
- g = L̇ᵂ points uphill (steepest ascent)
- -g points downhill
- η controls flow speed
The ball has no mass, no inertia—it teleports in the downhill direction each step.
2.3 Problems
- Ravine oscillation: Narrow valleys cause zig-zagging
- Flat region stalling: Tiny gradient = tiny movement
- Uniform speed: Same η for all parameters, regardless of curvature
3. Momentum: Adding Inertia
3.1 The Idea
Give the ball mass. Let it build up speed.
3.2 Fluxion Formulation
Introduce velocity v as a separate state:
v̇ = β · v + g # Velocity accumulates gradient (with decay)
Ẇ = -η · v # Position flows with velocity
Physical interpretation:
- β = friction coefficient (0.9 = low friction, velocity persists)
- v accumulates gradient over time
- Ball builds momentum rolling downhill
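The two flows above become two lines of state update. A sketch in plain Python on the same toy loss L(W) = W² used for illustration:

```python
def momentum_step(W, v, g, eta=0.01, beta=0.9):
    v = beta * v + g       # v_dot = beta*v + g: velocity accumulates gradient
    W = W - eta * v        # W_dot = -eta*v: position flows with velocity
    return W, v

W, v = 5.0, 0.0
for _ in range(300):
    g = 2 * W              # gradient of the toy loss L(W) = W^2
    W, v = momentum_step(W, v, g)
# the ball has rolled down to near the minimum at W = 0
```

Note the effective step size: with β = 0.9, velocity accumulates roughly 1/(1-β) = 10 recent gradients, so η = 0.01 here behaves like an effective rate of ~0.1.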
3.3 Why It Helps
Ravine problem solved:
- Side-to-side gradients cancel out in v
- Down-the-valley gradients accumulate
- Ball rolls straight down valley floor
Flat regions:
- Momentum carries ball through plateaus
- Previous velocity persists even when current gradient is small
3.4 The β Parameter
β = 0.0: No momentum, vanilla GD
β = 0.9: Standard choice, 10-step effective memory
β = 0.99: Heavy ball, 100-step memory
Effective memory ≈ 1/(1-β) steps
4. Nesterov Momentum: Look Before You Leap
4.1 The Problem with Standard Momentum
Ball computes gradient at current position, then moves. But it's GOING to move with velocity v anyway. Why not compute gradient at where we're GOING to be?
4.2 Fluxion Formulation
W_ahead = W - η · β · v # Where momentum will take us (position flows as Ẇ = -η·v)
g_ahead = L̇ᵂ(W_ahead) # Gradient at future position
v̇ = β · v + g_ahead # Update velocity with lookahead gradient
Ẇ = -η · v
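A sketch in plain Python. Because the position flows as Ẇ = -η·v, the momentum step will carry the ball to W - η·β·v, so the lookahead gradient is evaluated there (same illustrative toy loss L(W) = W² as before):

```python
def grad(W):
    return 2 * W                       # gradient of the toy loss L(W) = W^2

def nesterov_step(W, v, eta=0.01, beta=0.9):
    W_ahead = W - eta * beta * v       # where momentum will take us
    g_ahead = grad(W_ahead)            # slope at the predicted landing spot
    v = beta * v + g_ahead             # velocity updated with lookahead gradient
    W = W - eta * v
    return W, v

W, v = 5.0, 0.0
for _ in range(300):
    W, v = nesterov_step(W, v)
# converges to near W = 0, typically with less ringing than plain momentum
```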
4.3 Physical Interpretation
"Look downhill from where you'll land, not where you stand."
The ball predicts its next position, evaluates the slope THERE, then adjusts.
4.4 Why It Helps
- Anticipates overshooting
- Dampens oscillations faster
- Converges slightly faster in practice
5. AdaGrad: Adaptive Learning Rates
5.1 The Problem
Some parameters get huge gradients, others tiny. Uniform η is wrong for both.
5.2 The Idea
Track cumulative squared gradient per parameter. Scale learning rate inversely.
5.3 Fluxion Formulation
ṡ = s + g² # Accumulate squared gradient (elementwise)
Ẇ = -η · g / (√s + ε) # Scale by inverse sqrt of accumulator
5.4 Physical Interpretation
s measures "how much this parameter has been pushed historically."
- High s → parameter was pushed a lot → reduce sensitivity
- Low s → parameter barely moved → increase sensitivity
5.5 Problem
s only grows. Learning rate only shrinks. Eventually ALL learning rates → 0. Training stalls.
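The stall is easy to see numerically. With a constant gradient (chosen purely for illustration), the effective step η/(√s + ε) shrinks every iteration:

```python
def adagrad_step(W, s, g, eta=0.5, eps=1e-8):
    s = s + g**2                       # s only ever grows
    W = W - eta * g / (s**0.5 + eps)   # so this step only ever shrinks
    return W, s

W, s = 0.0, 0.0
steps = []
for t in range(1, 101):
    g = 1.0                            # constant gradient, for illustration
    W_new, s = adagrad_step(W, s, g)
    steps.append(abs(W_new - W))       # effective step size eta/sqrt(t)
    W = W_new
# steps[0] ≈ 0.5, steps[99] ≈ 0.05: the learning rate decayed 10x in 100 steps
```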
6. RMSprop: Exponential Moving Average Fix
6.1 The Fix
Don't accumulate forever. Use exponential moving average.
6.2 Fluxion Formulation
ṡ = β · s + (1-β) · g² # EMA of squared gradient
Ẇ = -η · g / (√s + ε) # Adaptive scaling
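A sketch in plain Python on the illustrative toy loss L(W) = W². Note β = 0.9 is used here (rather than the typical 0.99) so this short demo reacts quickly; the normalized step g/√s is roughly unit-sized, so each update moves about η:

```python
def rmsprop_step(W, s, g, eta=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g**2   # EMA of squared gradient
    W = W - eta * g / (s**0.5 + eps)   # normalized step, roughly eta in size
    return W, s

W, s = 5.0, 0.0
for _ in range(2000):
    g = 2 * W                          # gradient of the toy loss L(W) = W^2
    W, s = rmsprop_step(W, s, g)
# W ends up hovering near the minimum, within roughly eta of W = 0
```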
6.3 Physical Interpretation
s now measures "recent gradient variance."
- High recent variance → parameter is noisy → take smaller steps
- Low recent variance → parameter is stable → take larger steps
6.4 The β Parameter (typically 0.99)
β = 0.99: ~100 step memory for variance estimate
β = 0.9: ~10 step memory (more reactive)
7. Adam: Best of Both Worlds
7.1 The Combination
Adam = Momentum + RMSprop
Track BOTH:
- First moment (mean gradient) → momentum
- Second moment (gradient variance) → adaptive rate
7.2 Fluxion Formulation
# First moment: momentum
ṁ = β₁ · m + (1-β₁) · g
# Second moment: variance
v̇ = β₂ · v + (1-β₂) · g²
# Bias correction (important at start!)
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
# Update
Ẇ = -η · m̂ / (√v̂ + ε)
7.3 Physical Interpretation
m = smoothed direction (where to flow)
v = smoothed magnitude variance (how carefully to flow)
"Flow in the average recent direction, at speed inversely proportional to recent bumpiness."
7.4 Bias Correction: Why?
At t=0, m=0 and v=0. First update: m = (1-β₁)·g ≈ 0.1·g (biased low!)
Division by (1-β₁ᵗ) corrects:
- t=1: divide by 0.1 → correct scale
- t=∞: divide by 1.0 → no correction needed
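The numbers are easy to check by hand. Assuming a constant gradient g = 1 purely for illustration:

```python
beta1 = 0.9
g = 1.0
m = 0.0

m = beta1 * m + (1 - beta1) * g   # first step: m = 0.1, biased low
m_hat = m / (1 - beta1**1)        # divide by (1 - 0.9) = 0.1: m_hat = 1.0
# m_hat now matches the true gradient scale from the very first step
```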
7.5 Standard Hyperparameters
β₁ = 0.9 # Momentum coefficient (~10 step memory)
β₂ = 0.999 # Variance coefficient (~1000 step memory)
ε = 1e-8 # Numerical stability (prevents division by zero)
η = 0.001 # Base learning rate
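Putting the pieces together, a full Adam step in plain Python on the illustrative toy loss L(W) = W². The learning rate is raised to 0.1 (not the standard 0.001) so this tiny demo converges in a few hundred steps:

```python
def adam_step(W, m, v, g, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment: momentum
    v = b2 * v + (1 - b2) * g**2       # second moment: variance
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    W = W - lr * m_hat / (v_hat**0.5 + eps)
    return W, m, v

W, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    g = 2 * W                          # gradient of the toy loss L(W) = W^2
    W, m, v = adam_step(W, m, v, g, t)
# W has flowed to a small neighborhood of the minimum at W = 0
```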
8. AdamW: Weight Decay Done Right
8.1 The Problem with L2 Regularization
Original Adam with L2 regularization:
g_reg = g + λ·W # Add weight penalty to gradient
ṁ = β₁·m + (1-β₁)·g_reg # Momentum includes penalty
Problem: The adaptive scaling also scales the weight decay! Large weights with small gradients get LESS decay, not more.
8.2 AdamW: Decoupled Weight Decay
# Moments on RAW gradient (no weight penalty)
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
# Bias correction
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)
# Update with SEPARATE weight decay
Ẇ = -η · (m̂/(√v̂+ε) + λ·W)
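A minimal sketch of one AdamW step in plain Python. With g = 1, W = 1, t = 1, the bias-corrected adaptive term is ≈ 1 and the decoupled decay term is λ·W = 0.01, so the two forces are visible separately in the final line:

```python
def adamw_step(W, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g          # moments on the RAW gradient
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    # decay applied to W directly, NOT scaled by 1/sqrt(v_hat)
    W = W - lr * (m_hat / (v_hat**0.5 + eps) + wd * W)
    return W, m, v

W, m, v = adamw_step(W=1.0, m=0.0, v=0.0, g=1.0, t=1)
# first step: W ≈ 1 - lr*(1 + 0.01) ≈ 0.99899
```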
8.3 Physical Interpretation
Two separate forces on each weight:
- Gradient force: Push toward lower loss
- Decay force: Pull toward zero (regularization)
AdamW keeps these forces separate. Original Adam mixed them, causing the decay force to be scaled by the adaptive rate.
8.4 Why It Matters
AdamW consistently outperforms Adam+L2 on language models. The "W" stands for "decoupled Weight decay."
9. Complete Algorithm Comparison
9.1 In Fluxion Notation
SGD:
Ẇ = -η·g
SGD + Momentum:
v̇ = β·v + g
Ẇ = -η·v
RMSprop:
ṡ = β·s + (1-β)·g²
Ẇ = -η·g/(√s+ε)
Adam:
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
AdamW:
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
9.2 State Required
| Optimizer | States per parameter |
|---|---|
| SGD | 0 |
| Momentum | 1 (velocity) |
| RMSprop | 1 (variance) |
| Adam | 2 (momentum + variance) |
| AdamW | 2 (same as Adam) |
Adam's optimizer state alone takes 2x the memory of the model weights themselves! For large models, this matters.
10. Learning Rate Schedules
10.1 The Problem
Fixed η is suboptimal:
- Early training: large steps okay, landscape is far from optimum
- Late training: need precision, should take smaller steps
10.2 Common Schedules in Fluxion Terms
Constant:
η̇ = 0 (η never changes)
Linear decay:
η̇ = -η₀/T (linear decrease to 0 over T steps)
Cosine decay:
η(t) = η_min + (η₀-η_min)·(1+cos(πt/T))/2
Warmup:
t < T_warm: η(t) = η₀·t/T_warm (ramp up)
t ≥ T_warm: normal schedule (then decay)
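The formulas above, combined into one function in plain Python. Pairing warmup with cosine decay is a common choice, shown here as one possibility; the names and defaults are illustrative:

```python
import math

def lr_at(t, T, eta0=1e-3, eta_min=0.0, T_warm=100):
    if t < T_warm:                       # linear ramp up: eta0 * t / T_warm
        return eta0 * t / T_warm
    frac = (t - T_warm) / (T - T_warm)   # progress through the decay phase
    return eta_min + (eta0 - eta_min) * (1 + math.cos(math.pi * frac)) / 2

# lr_at(0, 1000) == 0.0, lr_at(100, 1000) == 1e-3, lr_at(1000, 1000) == 0.0
```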
10.3 Why Warmup?
At initialization:
- Weights are random
- Gradients are huge and noisy
- Adam's variance estimate (v) is zero
Large initial steps can destabilize training. Warmup lets variance estimates stabilize before taking big steps.
11. Gradient Clipping
11.1 The Problem
Occasionally, gradients explode (‖g‖ → ∞). One bad step can ruin training.
11.2 Fluxion Formulation
if ‖g‖ > max_norm:
    g ← g · (max_norm / ‖g‖) # Rescale to max_norm
# Then proceed with normal optimizer
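The rescaling step in plain Python, with a list of floats standing in for the flattened gradient vector (an illustrative representation):

```python
def clip_by_norm(g, max_norm):
    # g: list of floats standing in for the flattened gradient vector
    norm = sum(x * x for x in g) ** 0.5
    if norm > max_norm:
        g = [x * (max_norm / norm) for x in g]   # same direction, capped norm
    return g

g = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm was 5.0
# g is now [0.6, 0.8]: direction preserved, norm capped at 1.0
```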
11.3 Physical Interpretation
"Cap the maximum force that can act on any weight."
No matter how steep the local slope, the ball can only accelerate so fast.
12. Implementation: Fused vs Unfused
12.1 The Computational Point
Mathematically equivalent formulations can have VERY different performance.
Unfused Adam (naive):
m = beta1 * m + (1-beta1) * g # Read m, g, write m
v = beta2 * v + (1-beta2) * g**2 # Read v, g, write v
m_hat = m / (1 - beta1**t) # Read m, write m_hat
v_hat = v / (1 - beta2**t) # Read v, write v_hat
W = W - lr * m_hat / (sqrt(v_hat) + eps) # Read W,m_hat,v_hat, write W
5 separate kernel launches, multiple memory round-trips.
Fused Adam:
# Single kernel: read g,m,v,W once, write m,v,W once
fused_adam_kernel(g, m, v, W, beta1, beta2, lr, eps, t)
1 kernel, 1 memory round-trip.
12.2 The Fluxion Insight
When written as flows:
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
These are clearly THREE coupled ODEs that should be integrated together. The flow notation suggests fusion naturally.
Leibniz notation hides this by writing separate update equations.
13. Second-Order Methods (Brief)
13.1 Newton's Method (the optimization one, not fluxions)
Use curvature (second derivative) information:
Ẇ = -H⁻¹·g
Where H is the Hessian: the matrix of second derivatives of L with respect to W (in fluxion terms, L̈ᵂ)
13.2 Fluxion Interpretation
First-order (gradient descent): "Flow downhill"
Second-order (Newton): "Flow toward the minimum, accounting for curvature"
If the landscape is a quadratic bowl, Newton's method jumps straight to the bottom in one step. Gradient descent spirals down gradually.
13.3 Why Not Used?
Computing H⁻¹ is O(n²) storage, O(n³) compute for n parameters. For n = 1 billion, this is impossible.
Approximations exist (L-BFGS, K-FAC) but Adam usually wins in practice.
14. Summary: Optimizer Selection
14.1 Quick Guide
| Situation | Optimizer |
|---|---|
| Simple convex problem | SGD + momentum |
| Deep networks, general | Adam |
| Language models | AdamW |
| Memory constrained | SGD + momentum |
| Fine-tuning | Lower LR Adam/AdamW |
14.2 The Unified View
All optimizers are just different ways of computing Ẇ from g:
Ẇ = f(g, history, W)
- SGD: Ẇ = -η·g (no history)
- Momentum: Ẇ = -η·EMA(g) (first moment history)
- Adam: Ẇ = -η·EMA(g)/√EMA(g²) (first and second moment)
- AdamW: Ẇ = -η·(EMA(g)/√EMA(g²) + λ·W) (plus decay force)
15. Conclusion
Optimizers become physical when viewed through fluxions:
- Weights are particles with position W
- Gradients are forces pushing uphill
- Momentum is literal velocity
- Adaptive rates measure local bumpiness
- Weight decay is a restoring force toward origin
This isn't just pedagogy—the flow formulation naturally suggests:
- Fused implementations (coupled ODEs)
- Continuous-time analysis (neural ODEs)
- Novel optimizers (what other forces could we add?)
The math is equivalent, but the intuition is transformative.
References
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms."
- Kingma & Ba (2014). "Adam: A Method for Stochastic Optimization."
- Loshchilov & Hutter (2017). "Decoupled Weight Decay Regularization." (AdamW)
- Newton, I. (1736). The Method of Fluxions.
Appendix: PyTorch Implementation
import torch

class AdamWFluxion:
    """AdamW in fluxion style - flows computed explicitly"""
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay
        self.t = 0
        # Flow states (m = momentum flow, v = variance flow)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        self.t += 1
        for i, W in enumerate(self.params):
            if W.grad is None:
                continue
            g = W.grad  # Gradient = L̇ᵂ
            # Momentum flow: ṁ = β₁·m + (1-β₁)·g
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            # Variance flow: v̇ = β₂·v + (1-β₂)·g²
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            # Weight flow: Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
            W_dot = -self.lr * (m_hat / (v_hat.sqrt() + self.eps)
                                + self.wd * W)
            # Apply flow (in place, under no_grad)
            W += W_dot
Correspondence: scott@opentransformers.online