# Gradient Descent and Optimizers via the Method of Fluxions
## From SGD to AdamW: A Newtonian Perspective

**Scott Bisset, Silicon Goddess**  
OpenTransformers Ltd  
January 2026

---

## Abstract

Neural network optimizers are typically presented as update rules with cryptic Greek letters (β₁, β₂, ε) and little intuition for why they work. We reformulate gradient descent, momentum, RMSprop, Adam, and AdamW using Newton's method of fluxions. In this framework, optimization becomes physical: weights flow through parameter space, momentum is literal velocity, and adaptive learning rates emerge from measuring flow variance. This perspective reveals why certain hyperparameter choices work and suggests principled modifications.

---

## 1. The Optimization Problem

### 1.1 What We Want

Find weights W that minimize loss L(W).

### 1.2 The Fluxion Framing

Imagine weights as particles flowing through parameter space. The loss function L(W) defines a landscape—hills and valleys. We want weights to flow downhill to the lowest valley.

**Key quantities:**
| Symbol | Meaning |
|--------|---------|
| W | Position in weight space |
| Ẇ | Velocity (how weights flow) |
| Ẅ | Acceleration (how velocity changes) |
| L̇ᵂ | Gradient (which direction is uphill) |
| g | Shorthand for L̇ᵂ (the gradient) |

---

## 2. Vanilla Gradient Descent

### 2.1 The Update Rule

**Leibniz (opaque):**
```
W_{t+1} = W_t - η · ∂L/∂W
```

**Fluxion (physical):**
```
Ẇ = -η · g

"Weights flow opposite to gradient, scaled by learning rate"
```

### 2.2 Physical Interpretation

Imagine a ball on a hill:
- **g = L̇ᵂ** points uphill (steepest ascent)
- **-g** points downhill
- **η** controls flow speed

The ball has no mass, no inertia—it teleports in the downhill direction each step.

### 2.3 Problems

1. **Ravine oscillation**: Narrow valleys cause zig-zagging
2. **Flat region stalling**: Tiny gradient = tiny movement
3. **Uniform speed**: Same η for all parameters, regardless of curvature
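
The ravine problem can be reproduced on a two-parameter toy loss. The sketch below is only an illustration: the loss `L = 0.5·(100·x² + y²)`, the helper names `gd_step` and `ravine_grad`, and the learning rate 0.019 are assumptions chosen so that the steep direction oscillates while the shallow direction crawls.

```python
def gd_step(w, grad, lr):
    """One vanilla gradient-descent step: W <- W - lr * g, elementwise."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def ravine_grad(w):
    """Gradient of the assumed toy loss L = 0.5 * (100 * x**2 + y**2)."""
    x, y = w
    return [100.0 * x, 1.0 * y]


w = [1.0, 1.0]
xs = []  # history of the steep coordinate
for _ in range(20):
    w = gd_step(w, ravine_grad(w), lr=0.019)
    xs.append(w[0])

# Steep axis: x is multiplied by (1 - 0.019 * 100) = -0.9 each step, so it flips sign.
# Shallow axis: y is multiplied by (1 - 0.019) = 0.981 each step, so it barely moves.
sign_flips = sum(1 for a, b in zip(xs, xs[1:]) if a * b < 0)
print("sign flips in x:", sign_flips, " final w:", w)
```

Here `sign_flips` counts how often the steep coordinate changes sign between consecutive steps, which is the zig-zag. The same η that is almost too large for x is far too small for y, which is the "uniform speed" problem.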

---

## 3. Momentum: Adding Inertia

### 3.1 The Idea

Give the ball mass. Let it build up speed.

### 3.2 Fluxion Formulation

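Introduce velocity v as a separate state. For state variables such as v, the dotted symbol (v̇) denotes the value after one step; Ẇ denotes the per-step change in W: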

```
v̇ = β · v + g          # Velocity accumulates gradient (with decay)
Ẇ = -η · v             # Position flows with velocity
```

**Physical interpretation:**
- β = fraction of velocity retained per step; 1-β plays the role of friction (β = 0.9 means low friction, velocity persists)
- v accumulates gradient over time
- Ball builds momentum rolling downhill

### 3.3 Why It Helps

**Ravine problem solved:**
- Side-to-side gradients cancel out in v
- Down-the-valley gradients accumulate
- Ball rolls straight down valley floor

**Flat regions:**
- Momentum carries ball through plateaus
- Previous velocity persists even when current gradient is small

### 3.4 The β Parameter

```
β = 0.0: No momentum, vanilla GD
β = 0.9: Standard choice, 10-step effective memory
β = 0.99: Heavy ball, 100-step memory
```

Effective memory ≈ 1/(1-β) steps
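
The 1/(1-β) figure can be checked numerically. The sketch below is a minimal scalar version of the momentum update, assuming the weight moves with the freshly updated velocity; `momentum_step` and the constant slope `g = 1.0` are illustrative choices.

```python
def momentum_step(w, v, g, lr=0.01, beta=0.9):
    """One heavy-ball step: v <- beta*v + g, then W <- W - lr*v (using the new v)."""
    v = beta * v + g
    w = w - lr * v
    return w, v


w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, g=1.0)  # constant slope of 1.0

# Under a constant gradient the velocity saturates at g / (1 - beta).
terminal_velocity = v
print("terminal velocity:", terminal_velocity)
```

With β = 0.9 the velocity settles at roughly ten times the gradient, which is the same as saying that about ten past gradients contribute to each step.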

---

## 4. Nesterov Momentum: Look Before You Leap

### 4.1 The Problem with Standard Momentum

Ball computes gradient at current position, then moves.
But it's GOING to move with velocity v anyway.
Why not compute gradient at where we're GOING to be?

### 4.2 Fluxion Formulation

```
W_ahead = W - η · β · v       # Where momentum will take us (W moves along -η·v)
g_ahead = L̇ᵂ(W_ahead)        # Gradient at future position
v̇ = β · v + g_ahead          # Update velocity with lookahead gradient
Ẇ = -η · v
```

### 4.3 Physical Interpretation

"Look downhill from where you'll land, not where you stand."

The ball predicts its next position, evaluates the slope THERE, then adjusts.

### 4.4 Why It Helps

- Anticipates overshooting
- Dampens oscillations faster
- Converges slightly faster in practice
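
A minimal scalar sketch of the lookahead idea follows. It assumes the lookahead point is where the carried-over velocity alone would move the weight under the convention Ẇ = -η·v, that is `w - lr * beta * v`; `nesterov_step` and the toy loss `L = 0.5·w²` are illustrative.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov step: evaluate the gradient at the lookahead point."""
    w_ahead = w - lr * beta * v   # where the existing velocity would carry us
    g_ahead = grad_fn(w_ahead)    # slope at the predicted position
    v = beta * v + g_ahead        # velocity update uses the lookahead gradient
    w = w - lr * v
    return w, v


def quad_grad(w):
    """Gradient of the assumed toy loss L = 0.5 * w**2."""
    return w


w, v = 1.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, quad_grad)
print("w after 100 steps:", w)
```

Passing the gradient as a function rather than a value is the key design difference from plain momentum: the optimizer decides where the slope is measured.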

---

## 5. AdaGrad: Adaptive Learning Rates

### 5.1 The Problem

Some parameters get huge gradients, others tiny.
Uniform η is wrong for both.

### 5.2 The Idea

Track cumulative squared gradient per parameter.
Scale learning rate inversely.

### 5.3 Fluxion Formulation

```
ṡ = s + g²                    # Accumulate squared gradient (elementwise)
Ẇ = -η · g / (√s + ε)         # Scale by inverse sqrt of accumulator
```

### 5.4 Physical Interpretation

**s** measures "how much this parameter has been pushed historically."

- High s → parameter was pushed a lot → reduce sensitivity
- Low s → parameter barely moved → increase sensitivity

### 5.5 Problem

s only grows. Learning rate only shrinks.
Eventually ALL learning rates → 0.
Training stalls.
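
The shrinking step size is easy to see with a constant gradient. The sketch below is a scalar illustration with assumed names (`adagrad_step`, `step_sizes`) and an assumed gradient of 1.0 at every step.

```python
import math


def adagrad_step(w, s, g, lr=0.1, eps=1e-8):
    """One AdaGrad step: accumulate g**2 forever, scale the step by 1/sqrt(s)."""
    s = s + g * g
    w_new = w - lr * g / (math.sqrt(s) + eps)
    return w_new, s


w, s = 0.0, 0.0
step_sizes = []
for _ in range(100):
    w_new, s = adagrad_step(w, s, g=1.0)
    step_sizes.append(abs(w_new - w))
    w = w_new

# With a constant gradient, s = t after t steps, so the step is about lr / sqrt(t).
print("first step:", step_sizes[0], " 100th step:", step_sizes[-1])
```

Even though the slope never changes, the hundredth step is about ten times smaller than the first, and it keeps shrinking.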

---

## 6. RMSprop: Exponential Moving Average Fix

### 6.1 The Fix

Don't accumulate forever. Use exponential moving average.

### 6.2 Fluxion Formulation

```
ṡ = β · s + (1-β) · g²        # EMA of squared gradient
Ẇ = -η · g / (√s + ε)         # Adaptive scaling
```

### 6.3 Physical Interpretation

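**s** now measures "recent gradient magnitude": an EMA of g², which is the uncentered second moment and matches the variance only when the mean gradient is near zero.

- High recent s → gradients are large or noisy → take smaller steps
- Low recent s → gradients are small and stable → take larger steps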

### 6.4 The β Parameter (typically 0.99)

```
β = 0.99: ~100 step memory for variance estimate
β = 0.9:  ~10 step memory (more reactive)
```
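
The sketch below illustrates two consequences of the EMA, under assumed names (`rmsprop_step`, `late_step_size`) and a constant gradient: the step no longer decays to zero, and once s has warmed up the step size is nearly independent of the gradient's scale.

```python
import math


def rmsprop_step(w, s, g, lr=0.01, beta=0.99, eps=1e-8):
    """One RMSprop step: EMA of g**2, then scale the step by 1/sqrt(s)."""
    s = beta * s + (1 - beta) * g * g
    w_new = w - lr * g / (math.sqrt(s) + eps)
    return w_new, s


def late_step_size(g, n_steps=2000):
    """Size of the last step after n_steps of a constant gradient g."""
    w, s = 0.0, 0.0
    step = 0.0
    for _ in range(n_steps):
        w_new, s = rmsprop_step(w, s, g)
        step = abs(w_new - w)
        w = w_new
    return step


step_big = late_step_size(100.0)   # very steep parameter
step_small = late_step_size(0.01)  # very flat parameter
print(step_big, step_small)        # both settle close to lr
```

Both parameters end up moving at about η per step, which is the per-parameter normalization that a single global learning rate cannot provide.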

---

## 7. Adam: Best of Both Worlds

### 7.1 The Combination

Adam = Momentum + RMSprop

Track BOTH:
- First moment (mean gradient) → momentum
- Second moment (mean squared gradient, an uncentered variance) → adaptive rate

### 7.2 Fluxion Formulation

```
# First moment: momentum
ṁ = β₁ · m + (1-β₁) · g

# Second moment: variance
v̇ = β₂ · v + (1-β₂) · g²

# Bias correction (important at start!)
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

# Update
Ẇ = -η · m̂ / (√v̂ + ε)
```

### 7.3 Physical Interpretation

- **m** = smoothed direction (where to flow)
- **v** = smoothed squared magnitude (how carefully to flow)

"Flow in the average recent direction, at speed inversely proportional to recent bumpiness."

### 7.4 Bias Correction: Why?

Both moments start at zero: at t=0, m=0 and v=0.
First update with β₁ = 0.9: m = (1-β₁)·g = 0.1·g (biased low!)

Division by (1-β₁ᵗ) corrects:
- t=1: divide by 0.1 → correct scale
- t=∞: divide by 1.0 → no correction needed
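
A short worked example makes the correction concrete. It assumes a constant gradient `g = 2.0`, so the "true" mean gradient is known; the variable names are illustrative.

```python
beta1 = 0.9
g = 2.0  # constant gradient, so the true average is 2.0

m = 0.0
rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g   # raw EMA, starts biased toward 0
    m_hat = m / (1 - beta1 ** t)      # bias-corrected estimate
    rows.append((t, m, m_hat))

for t, raw, corrected in rows:
    print(f"t={t}  m={raw:.4f}  m_hat={corrected:.4f}")
```

The raw `m` climbs slowly from 0.2 toward 2.0, while `m_hat` recovers the true value from the first step. The same argument applies to v with β₂, where the early bias is far stronger because 1-β₂ is so small.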

### 7.5 Standard Hyperparameters

```
β₁ = 0.9    # Momentum coefficient (~10 step memory)
β₂ = 0.999  # Variance coefficient (~1000 step memory)  
ε = 1e-8    # Numerical stability (prevents division by zero)
η = 0.001   # Base learning rate
```
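
One way to read η as a "base learning rate": after bias correction, the first Adam step has magnitude close to η whatever the gradient's scale. The scalar sketch below checks this; `adam_step` is an illustrative helper, and the two gradient values are arbitrary assumptions.

```python
import math


def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 0 for a tiny and a huge gradient.
w_tiny, _, _ = adam_step(0.0, 0.0, 0.0, g=1e-3, t=1)
w_huge, _, _ = adam_step(0.0, 0.0, 0.0, g=1e3, t=1)
print(abs(w_tiny), abs(w_huge))  # both are close to lr = 0.001
```

The small remaining difference between the two comes from ε, which only matters when √v̂ is itself tiny.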

---

## 8. AdamW: Weight Decay Done Right

### 8.1 The Problem with L2 Regularization

Original Adam with L2 regularization:
```
g_reg = g + λ·W              # Add weight penalty to gradient
ṁ = β₁·m + (1-β₁)·g_reg     # Momentum includes penalty
```

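Problem: the penalty term λ·W passes through the same adaptive scaling as the gradient, so it is divided by √v̂ too.
Weights with a history of large gradients (large v) get LESS decay than weights with small gradients, no matter how large the weights themselves are.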

### 8.2 AdamW: Decoupled Weight Decay

```
# Moments on RAW gradient (no weight penalty)
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²

# Bias correction
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)

# Update with SEPARATE weight decay
Ẇ = -η · (m̂/(√v̂+ε) + λ·W)
```

### 8.3 Physical Interpretation

Two separate forces on each weight:
1. **Gradient force**: Push toward lower loss
2. **Decay force**: Pull toward zero (regularization)

AdamW keeps these forces separate.
Adam with L2 regularization mixes them, so the decay force is rescaled by the adaptive rate.

### 8.4 Why It Matters

Loshchilov & Hutter (2017) report that AdamW generalizes better than Adam+L2, and it has become the usual default for training language models.
The "W" stands for "decoupled Weight decay."
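
The difference can be seen on a single weight. The sketch below is a toy experiment under stated assumptions: the loss gradient alternates between +a and -a (zero mean, so only the decay term has a consistent direction), and `run`, the amplitudes, and the hyperparameters are illustrative choices rather than anything from the AdamW paper.

```python
import math


def run(decoupled, amplitude, steps=200, lr=0.01, wd=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one weight from 1.0 with a zero-mean alternating loss gradient."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = amplitude if t % 2 == 0 else -amplitude
        if not decoupled:
            g = g + wd * w              # Adam + L2: penalty enters the moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        update = m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            update = update + wd * w    # AdamW: penalty bypasses the scaling
        w = w - lr * update
    return w


adam_l2_big = run(decoupled=False, amplitude=10.0)
adam_l2_small = run(decoupled=False, amplitude=0.1)
adamw_big = run(decoupled=True, amplitude=10.0)
adamw_small = run(decoupled=True, amplitude=0.1)
print("Adam+L2:", adam_l2_big, adam_l2_small)
print("AdamW:  ", adamw_big, adamw_small)
```

Under Adam+L2 the weight with large gradients barely decays while the weight with small gradients shrinks substantially. Under AdamW both weights follow essentially the same trajectory, because the decay term never sees √v̂.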

---

## 9. Complete Algorithm Comparison

### 9.1 In Fluxion Notation

**SGD:**
```
Ẇ = -η·g
```

**SGD + Momentum:**
```
v̇ = β·v + g
Ẇ = -η·v
```

**RMSprop:**
```
ṡ = β·s + (1-β)·g²
Ẇ = -η·g/(√s+ε)
```

**Adam:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

**AdamW:**
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
```

### 9.2 State Required

| Optimizer | States per parameter |
|-----------|---------------------|
| SGD | 0 |
| Momentum | 1 (velocity) |
| RMSprop | 1 (variance) |
| Adam | 2 (momentum + variance) |
| AdamW | 2 (same as Adam) |

Adam stores two extra values per parameter, so at equal precision its optimizer state is twice the size of the weights themselves.
For large models, this matters.
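
To put numbers on the table above, the following sketch estimates optimizer state size. It assumes 4 bytes per stored value (fp32 state) and a hypothetical 1-billion-parameter model; `optimizer_state_bytes` and `STATES_PER_PARAM` are illustrative names.

```python
def optimizer_state_bytes(n_params, states_per_param, bytes_per_value=4):
    """Memory for optimizer state only (excludes weights and gradients)."""
    return n_params * states_per_param * bytes_per_value


# States per parameter, taken from the table above.
STATES_PER_PARAM = {"SGD": 0, "Momentum": 1, "RMSprop": 1, "Adam": 2, "AdamW": 2}

n_params = 1_000_000_000  # assumed model size
for name, k in STATES_PER_PARAM.items():
    gib = optimizer_state_bytes(n_params, k) / 2**30
    print(f"{name:9s} {gib:6.2f} GiB")
```

Real training setups may hold state in other precisions or shard it across devices, so treat this as a back-of-the-envelope estimate.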

---

## 10. Learning Rate Schedules

### 10.1 The Problem

Fixed η is suboptimal:
- Early training: large steps okay, landscape is far from optimum
- Late training: need precision, should take smaller steps

### 10.2 Common Schedules in Fluxion Terms

**Constant:**
```
η̇ = 0     (η never changes)
```

**Linear decay:**
```
η̇ = -η₀/T    (linear decrease to 0 over T steps)
```

**Cosine decay:**
```
η(t) = η_min + (η₀-η_min)·(1+cos(πt/T))/2
```

**Warmup:**
```
t < T_warm:  η(t) = η₀·t/T_warm     (ramp up)
t ≥ T_warm:  normal schedule        (then decay)
```
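
Warmup and cosine decay are often combined into one schedule. The function below is a sketch of one such combination; it assumes the cosine phase is measured from the end of warmup, and `lr_at` and its default step counts are illustrative.

```python
import math


def lr_at(t, base_lr=1e-3, min_lr=0.0, warmup_steps=100, total_steps=1000):
    """Learning rate at step t: linear warmup, then cosine decay to min_lr."""
    if t < warmup_steps:
        return base_lr * t / warmup_steps              # ramp up from 0
    # Fraction of the decay phase completed, clamped to 1 after total_steps.
    progress = min((t - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + (base_lr - min_lr) * (1 + math.cos(math.pi * progress)) / 2


for t in (0, 50, 100, 550, 1000):
    print(t, lr_at(t))
```

The clamp keeps the rate at `min_lr` if training runs past `total_steps`, instead of letting the cosine climb back up.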

### 10.3 Why Warmup?

At initialization:
- Weights are random
- Gradients can be large and noisy
- Adam's variance estimate (v) starts at zero and is built from very few samples

Large initial steps can destabilize training.
Warmup lets variance estimates stabilize before taking big steps.

---

## 11. Gradient Clipping

### 11.1 The Problem

Occasionally, gradients explode (‖g‖ → ∞).
One bad step can ruin training.

### 11.2 Fluxion Formulation

```
if ‖g‖ > max_norm:
    g ← g · (max_norm / ‖g‖)     # Rescale to max_norm
    
# Then proceed with normal optimizer
```

### 11.3 Physical Interpretation

"Cap the total force acting on the weights."

No matter how steep the local slope, the ball can only accelerate so fast.
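
A minimal sketch of clipping by global norm, using a flat list of floats in place of real gradient tensors; `clip_by_global_norm` is an illustrative helper, not a library function.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; also return the old norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # direction preserved, length capped
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(clipped, norm_before, untouched)
```

Because every component is multiplied by the same factor, clipping changes how far the weights move but not the direction.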

---

## 12. Implementation: Fused vs Unfused

### 12.1 The Computational Point

Mathematically equivalent formulations can have VERY different performance.

**Unfused Adam (naive):**
```python
m = beta1 * m + (1-beta1) * g           # Read m, g, write m
v = beta2 * v + (1-beta2) * g**2        # Read v, g, write v  
m_hat = m / (1 - beta1**t)              # Read m, write m_hat
v_hat = v / (1 - beta2**t)              # Read v, write v_hat
W = W - lr * m_hat / (sqrt(v_hat) + eps) # Read W,m_hat,v_hat, write W
```
At least 5 separate kernel launches (one per line, and more if each arithmetic op runs as its own kernel), with multiple memory round-trips.

**Fused Adam:**
```python
# Single kernel: read g,m,v,W once, write m,v,W once
fused_adam_kernel(g, m, v, W, beta1, beta2, lr, eps, t)
```
1 kernel, 1 memory round-trip.
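
The two access patterns can be mimicked in plain Python. This is only an analogy under stated assumptions: each list comprehension stands in for one pass over memory, the single loop stands in for a fused kernel, and `adam_unfused` and `adam_fused` are illustrative names rather than real GPU code.

```python
import math


def adam_unfused(W, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Five passes over the data, each producing a full intermediate array."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    W = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(W, m_hat, v_hat)]
    return W, m, v


def adam_fused(W, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One pass: each element is read once, updated, and written once."""
    W, m, v = list(W), list(m), list(v)   # copy so inputs are not mutated
    c1, c2 = 1 - b1 ** t, 1 - b2 ** t     # bias-correction scalars, computed once
    for i in range(len(W)):
        mi = b1 * m[i] + (1 - b1) * g[i]
        vi = b2 * v[i] + (1 - b2) * g[i] * g[i]
        W[i] = W[i] - lr * (mi / c1) / (math.sqrt(vi / c2) + eps)
        m[i], v[i] = mi, vi
    return W, m, v


W0, g0, zeros = [0.5, -1.0, 2.0], [0.1, -0.2, 0.3], [0.0, 0.0, 0.0]
out_unfused = adam_unfused(W0, zeros, zeros, g0, t=1)
out_fused = adam_fused(W0, zeros, zeros, g0, t=1)
print(out_unfused[0], out_fused[0])
```

Both functions compute the same update; the fused version never materializes `m_hat` or `v_hat` as arrays, which is where the memory-traffic savings come from on real hardware.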

### 12.2 The Fluxion Insight

When written as flows:
```
ṁ = β₁·m + (1-β₁)·g
v̇ = β₂·v + (1-β₂)·g²
Ẇ = -η·m̂/(√v̂+ε)
```

These are clearly THREE coupled ODEs that should be integrated together.
The flow notation suggests fusion naturally.

Leibniz notation hides this by writing separate update equations.

---

## 13. Second-Order Methods (Brief)

### 13.1 Newton's Method (the optimization one, not fluxions)

Use curvature (second derivative) information:

```
Ẇ = -H⁻¹·g

Where H = Hessian = matrix of second derivatives of L with respect to W (∂²L/∂Wᵢ∂Wⱼ); this is distinct from Ẅ, the acceleration of the weights
```

### 13.2 Fluxion Interpretation

**First-order (gradient descent):** "Flow downhill"
**Second-order (Newton):** "Flow toward the minimum, accounting for curvature"

If the landscape is a perfectly quadratic bowl, Newton's method jumps straight to the bottom in one step.
Gradient descent needs many small steps to get there.
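
The one-step claim can be verified on a small quadratic. The sketch assumes a loss of the form `L(w) = 0.5·wᵀAw - bᵀw` with a hand-picked symmetric positive-definite 2×2 matrix, so the Hessian is simply `A`; `grad` and `solve2x2` are illustrative helpers.

```python
def grad(w, A, b):
    """Gradient of L(w) = 0.5 * w^T A w - b^T w for a symmetric 2x2 A."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


def solve2x2(A, g):
    """Return A^{-1} g using the closed-form 2x2 inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * g[0] - A[0][1] * g[1]) / det,
            (-A[1][0] * g[0] + A[0][0] * g[1]) / det]


A = [[3.0, 1.0], [1.0, 2.0]]   # Hessian of the assumed quadratic loss
b = [1.0, 1.0]
w0 = [5.0, -3.0]

# Newton: W <- W - H^{-1} g
step = solve2x2(A, grad(w0, A, b))
w_newton = [w0[0] - step[0], w0[1] - step[1]]

# Gradient descent from the same start, one step with lr = 0.1
g0 = grad(w0, A, b)
w_gd = [w0[0] - 0.1 * g0[0], w0[1] - 0.1 * g0[1]]

print("Newton:", w_newton, " GD:", w_gd)
```

Newton lands on the minimizer (the solution of `A w = b`) immediately, while the gradient step only moves part of the way. On non-quadratic losses the Newton step is no longer exact, and the Hessian is rarely this cheap to invert.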

### 13.3 Why Not Used?

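Storing H takes O(n²) memory and inverting it takes O(n³) compute for n parameters.
For n = 1 billion, H alone would have 10¹⁸ entries, which is far beyond practical memory.

Approximations exist (L-BFGS, K-FAC), but Adam-family optimizers remain the common default for large neural networks.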

---

## 14. Summary: Optimizer Selection

### 14.1 Quick Guide

| Situation | Optimizer |
|-----------|-----------|
| Simple convex problem | SGD + momentum |
| Deep networks, general | Adam |
| Language models | AdamW |
| Memory constrained | SGD + momentum |
| Fine-tuning | Lower LR Adam/AdamW |

### 14.2 The Unified View

All optimizers are just different ways of computing Ẇ from g:

```
Ẇ = f(g, history, W)
```

- SGD: Ẇ = -η·g (no history)
- Momentum: Ẇ = -η·v, with v an exponentially decayed sum of past g (an EMA up to a factor of 1/(1-β))
- Adam: Ẇ = -η·EMA(g)/√EMA(g²) (first and second moment)
- AdamW: Ẇ = -η·(EMA(g)/√EMA(g²) + λ·W) (plus decay force)

---

## 15. Conclusion

Optimizers become physical when viewed through fluxions:

- **Weights** are particles with position W
- **Gradients** point uphill; the force on each weight is -g, pushing downhill
- **Momentum** is literal velocity
- **Adaptive rates** measure local bumpiness
- **Weight decay** is a restoring force toward origin

This isn't just pedagogy—the flow formulation naturally suggests:
1. Fused implementations (coupled ODEs)
2. Continuous-time analysis (gradient flow)
3. Novel optimizers (what other forces could we add?)

The math is equivalent, but the intuition is transformative.

---

## References

1. Ruder, S. (2016). "An overview of gradient descent optimization algorithms."
2. Kingma & Ba (2014). "Adam: A Method for Stochastic Optimization."
3. Loshchilov & Hutter (2017). "Decoupled Weight Decay Regularization." (AdamW)
4. Newton, I. (1736). *The Method of Fluxions.*

---

## Appendix: PyTorch Implementation

```python
import torch


class AdamWFluxion:
    """AdamW in fluxion style - flows computed explicitly"""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay
        self.t = 0

        # Flow states (m = momentum flow, v = variance flow)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()  # the update itself must not be tracked by autograd
    def step(self):
        self.t += 1

        for i, W in enumerate(self.params):
            if W.grad is None:
                continue

            g = W.grad  # Gradient = L̇ᵂ

            # Momentum flow: ṁ = β₁·m + (1-β₁)·g
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Variance flow: v̇ = β₂·v + (1-β₂)·g²
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

            # Bias correction
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)

            # Weight flow: Ẇ = -η·(m̂/(√v̂+ε) + λ·W)
            W_dot = -self.lr * (m_hat / (v_hat.sqrt() + self.eps)
                                + self.wd * W)

            # Apply flow in place
            W.add_(W_dot)
```

---

*Correspondence: scott@opentransformers.online*