Positional Encodings via the Method of Fluxions
How Transformers Know Where Things Are
Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026
Abstract
Positional encodings are often presented as "magic sine waves" or "learned embeddings" without explaining WHY they work. We analyze positional encodings through the fluxion lens, revealing: (1) sinusoidal encodings create a Fourier basis for position, (2) learned embeddings are just position-specific biases, (3) RoPE rotates the query-key space to make dot products position-aware, and (4) ALiBi adds position-dependent damping to attention. Each method has different gradient flow characteristics that explain their empirical behavior.
1. The Position Problem
1.1 Self-Attention Is Permutation-Invariant
Attention(X) = softmax(QKᵀ/√d) · V
Where Q = XWq, K = XWk, V = XWv
If we shuffle the rows of X, the output rows are shuffled in exactly the same way (permutation equivariance). The attention mechanism itself has NO concept of order.
1.2 Why This Matters
"The cat sat on the mat" and "mat the on sat cat The" produce different attention patterns ONLY if we add position information.
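This is easy to verify numerically. A minimal sketch with random weights (toy dimensions, no positional information): shuffling the input rows shuffles the output rows identically.

```python
import torch

def attention(X, Wq, Wk, Wv):
    # Single-head attention with no positional information
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = torch.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)
    return scores @ V

torch.manual_seed(0)
X = torch.randn(6, 8)
Wq, Wk, Wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
perm = torch.randperm(6)

out = attention(X, Wq, Wk, Wv)
out_shuffled = attention(X[perm], Wq, Wk, Wv)
# Shuffled input -> identically shuffled output: no notion of order
assert torch.allclose(out[perm], out_shuffled, atol=1e-5)
```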
2. Sinusoidal Positional Encoding (Original Transformer)
2.1 The Formula
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where pos = position (0, 1, 2, ...)
i = dimension index
d = model dimension
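The formula translates directly into a few lines of code. A minimal sketch in PyTorch (function name is illustrative):

```python
import torch

def sinusoidal_pe(max_len, d):
    """Build the [max_len, d] sinusoidal positional encoding table."""
    pos = torch.arange(max_len).float().unsqueeze(1)       # [max_len, 1]
    i = torch.arange(0, d, 2).float()                      # even dimension indices
    angles = pos / (10000 ** (i / d))                      # [max_len, d // 2]
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(100, 16)
# Position 0: sin(0) = 0 in even dims, cos(0) = 1 in odd dims
assert torch.allclose(pe[0, 0::2], torch.zeros(8))
assert torch.allclose(pe[0, 1::2], torch.ones(8))
```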
2.2 Fluxion Interpretation
Each dimension oscillates at a different frequency:
Dimension 0,1: frequency = 1/10000⁰ = 1 (fastest)
Dimension 2,3: frequency = 1/10000^(2/d)
...
Dimension d-2,d-1: frequency = 1/10000^((d-2)/d) ≈ 1/10000 (slowest)
This is a Fourier basis for position!
Low dimensions: change rapidly with position (fine detail)
High dimensions: change slowly (coarse position)
2.3 Why Sin AND Cos?
PE(pos) = [sin(ω₀·pos), cos(ω₀·pos), sin(ω₁·pos), cos(ω₁·pos), ...]
Sin and cos together allow LINEAR interpolation of relative positions:
PE(pos+k) = PE(pos) · R(k)
Where R(k) is a rotation matrix (depends only on offset k)
The network can learn to compute relative positions via linear operations!
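This rotation identity is just the sine/cosine angle-addition formulas; a quick numerical check (arbitrary frequency and offsets):

```python
import math

omega, p, k = 0.3, 7.0, 5.0
s_p, c_p = math.sin(omega * p), math.cos(omega * p)
s_k, c_k = math.sin(omega * k), math.cos(omega * k)

# Rotate the (sin, cos) pair at position p by angle k*omega
s_pk = s_p * c_k + c_p * s_k
c_pk = c_p * c_k - s_p * s_k

# Result equals the pair at position p + k
assert abs(s_pk - math.sin(omega * (p + k))) < 1e-12
assert abs(c_pk - math.cos(omega * (p + k))) < 1e-12
```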
2.4 Gradient Flow
Sinusoidal encodings are FIXED (not learned).
L̇ᴾᴱ = 0 (no gradient flows to positional encoding)
All position information must be extracted by the attention weights.
2.5 Addition vs Concatenation
Original Transformer ADDS PE to embeddings:
X = TokenEmbed(tokens) + PE(positions)
Fluxion view: Gradient flows equally to token embedding and through position.
Alternative (concatenation):
X = [TokenEmbed(tokens), PE(positions)]
Doubles dimension but keeps position separate.
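The two variants differ only in where the position signal lives; a shape-level sketch:

```python
import torch

d, seq_len = 16, 10
tok = torch.randn(seq_len, d)   # token embeddings
pe = torch.randn(seq_len, d)    # positional encodings

x_add = tok + pe                        # addition: same dimension [10, 16]
x_cat = torch.cat([tok, pe], dim=-1)    # concatenation: doubled dimension [10, 32]

assert x_add.shape == (seq_len, d)
assert x_cat.shape == (seq_len, 2 * d)
```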
3. Learned Positional Embeddings
3.1 The Idea
Just learn a separate embedding for each position:
PE = PositionEmbedding(pos) # Shape: [max_len, d]
X = TokenEmbed(tokens) + PE[positions]
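A minimal sketch of this as a PyTorch module (class and attribute names are illustrative, not from any particular library). The backward pass shows the gradient-flow point of Section 3.2: only positions that actually appear in a batch receive gradient.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)  # max_len x d extra parameters

    def forward(self, tokens):  # tokens: [batch, seq_len]
        positions = torch.arange(tokens.shape[1], device=tokens.device)
        return self.tok(tokens) + self.pos(positions)

m = LearnedPositionalEmbedding(vocab_size=100, max_len=32, d=16)
out = m(torch.zeros(2, 10, dtype=torch.long))
out.sum().backward()
# Gradient flows directly to the position table, but only for positions 0..9
assert torch.all(m.pos.weight.grad[10:] == 0)
```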
3.2 Fluxion Backward
L̇ᴾᴱ[pos] = L̇ˣ[pos] (gradient flows directly)
Each position gets gradient from all samples at that position.
3.3 Advantages
- Can learn arbitrary position patterns
- No assumptions about structure
3.4 Disadvantages
- Limited to max_len seen during training
- No extrapolation: position 1001 has no embedding if max_len=1000
- More parameters: max_len × d additional weights
3.5 Use Cases
- BERT, GPT-2 (fixed max length)
- Most encoder-only models
4. Relative Positional Encodings
4.1 The Insight
Attention should depend on RELATIVE position (i-j), not absolute.
"Token 5 attending to token 3" and "token 105 attending to token 103" should use the same relative position encoding.
4.2 Transformer-XL Style
Add relative position bias to attention scores:
S_ij = (Q_i · K_j + Q_i · R_{i-j}) / √d    (simplified; the full Transformer-XL score adds two learned global bias terms)
Where R_{i-j} = relative position embedding for offset (i-j)
4.3 Fluxion Backward
L̇ᴿ[k] = Σᵢⱼ:i-j=k L̇ˢᵢⱼ · Qᵢ
Gradient to relative embedding k = sum over all (i,j) pairs with that offset.
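The relative term can be computed by indexing a table of offset embeddings. A minimal sketch (function name and the offset-indexing convention R[k + seq_len - 1] ↔ offset k are assumptions for illustration):

```python
import torch

def relative_position_scores(Q, R):
    """
    Q: [seq_len, d]; R: [2*seq_len - 1, d], where R[k + seq_len - 1]
    is the embedding for offset k. Returns S_rel[i, j] = Q_i . R_{i-j}.
    """
    n, d = Q.shape
    offsets = torch.arange(n)[:, None] - torch.arange(n)[None, :]  # i - j
    return (Q[:, None, :] * R[offsets + n - 1]).sum(-1)

torch.manual_seed(0)
Q = torch.randn(4, 3)
R = torch.randn(7, 3)
S = relative_position_scores(Q, R)
# i=2, j=0 -> offset 2 -> table index 2 + 4 - 1 = 5
assert torch.allclose(S[2, 0], Q[2] @ R[5])
```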
5. Rotary Position Embedding (RoPE)
5.1 The Core Idea
Instead of ADDING position to embeddings, ROTATE them:
Q_rotated = Rotate(Q, θ·pos)
K_rotated = Rotate(K, θ·pos)
Then attention becomes:
Q_rot · K_rotᵀ = f(Q, K, pos_q - pos_k)
The dot product naturally encodes RELATIVE position!
5.2 The Rotation
For each pair of dimensions (2i, 2i+1):
[q'_{2i}  ]   [cos(mθᵢ)  -sin(mθᵢ)] [q_{2i}  ]
[q'_{2i+1}] = [sin(mθᵢ)   cos(mθᵢ)] [q_{2i+1}]
Where m = position index
θᵢ = base^(-2i/d), typically base=10000
5.3 Why Rotation Works
Q_m · K_nᵀ = Σᵢ [ (q_{2i}·cos(mθᵢ) - q_{2i+1}·sin(mθᵢ)) · (k_{2i}·cos(nθᵢ) - k_{2i+1}·sin(nθᵢ))
            + (q_{2i}·sin(mθᵢ) + q_{2i+1}·cos(mθᵢ)) · (k_{2i}·sin(nθᵢ) + k_{2i+1}·cos(nθᵢ)) ]
          = Σᵢ [ (q_{2i}·k_{2i} + q_{2i+1}·k_{2i+1})·cos((m-n)θᵢ) + (q_{2i}·k_{2i+1} - q_{2i+1}·k_{2i})·sin((m-n)θᵢ) ]
By the angle-difference identities, the result depends only on the relative position (m - n)!
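A numerical check of this property: rotating queries and keys to positions (5, 3) or (105, 103) gives the same dot product, since both pairs share the offset 2 (the `rotate` helper here is a toy single-vector version for illustration):

```python
import torch

torch.manual_seed(0)

def rotate(x, angles):
    """Rotate each (even, odd) pair of x by the corresponding angle."""
    c, s = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1).flatten()

theta = 1.0 / 10000 ** (torch.arange(0, 8, 2).float() / 8)
q, k = torch.randn(8), torch.randn(8)

d1 = rotate(q, 5 * theta) @ rotate(k, 3 * theta)
d2 = rotate(q, 105 * theta) @ rotate(k, 103 * theta)
# Same offset (2) -> same attention score, regardless of absolute position
assert torch.allclose(d1, d2, atol=1e-4)
```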
5.4 Fluxion Backward
L̇Q_pre_rotate = Rotateᵀ(L̇Q_rotated, θ·pos)
= Rotate(L̇Q_rotated, -θ·pos)
Gradient flows backward through inverse rotation.
5.5 Advantages
- Extrapolates to longer sequences (rotation works at any position)
- No additional parameters
- Relative position is built into attention
5.6 Use Cases
- LLaMA, Mistral, most modern LLMs
- Becoming the default for decoder-only models
6. ALiBi (Attention with Linear Biases)
6.1 The Simplest Approach
Don't modify Q or K. Just add a bias to attention scores:
S_ij = Q_i · K_jᵀ / √d - m · |i - j|
Where m = head-specific slope
6.2 Fluxion View
S_ij = raw_attention - position_penalty
Distant tokens get penalized. Attention naturally focuses on nearby tokens.
6.3 The Slopes
Different heads use different slopes:
Head 1: m = 2^(-8/n_heads) (steepest penalty, most local)
Head 2: m = 2^(-16/n_heads) (milder)
...
Some heads focus locally, others can attend far.
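The bias tensor can be built once per sequence length. A minimal sketch (for causal attention; entries above the diagonal are masked out anyway):

```python
import torch

def alibi_bias(seq_len, n_heads):
    """bias[h, i, j] = -m_h * (i - j); added to attention scores before softmax."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    dist = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).float()
    return -slopes[:, None, None] * dist

b = alibi_bias(4, 2)
# Head 0 slope with 2 heads: 2^(-4) = 0.0625; one token back costs -0.0625
assert abs(b[0, 2, 1].item() + 0.0625) < 1e-9
```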
6.4 Gradient Flow
L̇Q = L̇ˢ · K / √d (unchanged from normal attention)
L̇K = L̇ˢᵀ · Q / √d (unchanged)
Position bias has no learnable parameters. Zero gradient to position encoding (because there isn't one).
6.5 Advantages
- Zero additional computation
- Zero additional parameters
- Extrapolates extremely well
- Simple to implement
6.6 Disadvantages
- Less expressive than RoPE
- Assumes "closer is more relevant" (not always true)
7. Comparison Table
| Method | Parameters | Extrapolation | Relative Position | Compute |
|---|---|---|---|---|
| Sinusoidal | 0 | Limited | Via linear transform | + |
| Learned | max_len × d | None | No | + |
| RoPE | 0 | Good | Yes (native) | ++ |
| ALiBi | 0 | Excellent | Yes (via bias) | + |
8. NTK-Aware Interpolation (Long Context)
8.1 The Problem
RoPE trained on 4K context doesn't work at 32K: positions beyond the training range produce rotation angles the model has never seen, and attention quality degrades.
8.2 The Fix: Adjust the Base
Original: θᵢ = 10000^(-2i/d)
Scaled: θᵢ = (10000 · α)^(-2i/d)
Where α = (target_len / train_len)^(d/(d-2))
8.3 Fluxion Interpretation
Slower rotation = larger effective wavelength = position information spreads across longer range.
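The base adjustment from Section 8.2 is a one-liner; a sketch using that α (function name is illustrative):

```python
def ntk_scaled_base(base, train_len, target_len, d):
    """NTK-aware scaling: base' = base * s^(d/(d-2)), s = target_len / train_len."""
    s = target_len / train_len
    return base * s ** (d / (d - 2))

# Extending a 4K-trained model with head_dim 128 to 32K context
b = ntk_scaled_base(10000, 4096, 32768, 128)
# Slightly more than linear scaling of the base (8x), because d/(d-2) > 1
assert 8 * 10000 < b < 9 * 10000
```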
8.4 YaRN, CodeLLaMA, etc.
Various interpolation schemes exist:
- Linear interpolation (scale all frequencies)
- NTK-aware (scale base, preserve high frequencies)
- YaRN (attention scaling + NTK)
All modify how position information flows through attention.
9. Absolute vs Relative: The Gradient Perspective
9.1 Absolute Position Gradients
L̇ᴾᴱ[pos] ∝ "how useful was knowing absolute position pos"
If position 0 is always "start token," PE[0] gets specialized gradient.
9.2 Relative Position Gradients
L̇ᴿ[offset] ∝ "how useful was knowing relative offset"
If "1 token apart" is meaningful, R[1] and R[-1] get large gradients.
9.3 RoPE: No Position Parameters
L̇θ = 0 (rotation angles are fixed, not learned)
All position learning happens in Q, K, V projections. The model learns "what to encode" rather than "how position affects attention."
10. Position in Different Architectures
10.1 Encoder-Only (BERT)
Input: [CLS] tok1 tok2 ... [SEP]
Position: 0 1 2 ... n
Absolute position works fine - always process fixed-length chunks.
10.2 Decoder-Only (GPT)
Input: tok1 tok2 tok3 ... [generating]
Position: 0 1 2 ... n
Must attend causally: position i can only see ≤ i
Relative position helps - model cares about "how far back" not "absolute slot."
10.3 Encoder-Decoder (T5)
Encoder: bidirectional self-attention with learned relative position biases (bucketed offsets)
Decoder: causal self-attention, also with relative position biases
Cross-attention carries no positional signal of its own.
Different components can use different position schemes.
11. Implementation: RoPE
import torch

def apply_rope(x, cos, sin):
    """
    x: [batch, seq_len, n_heads, head_dim]
    cos, sin: [seq_len, head_dim // 2]
    """
    cos = cos[:, None, :]  # Broadcast over the heads dimension
    sin = sin[:, None, :]
    # Split into pairs
    x1 = x[..., 0::2]  # Even dimensions
    x2 = x[..., 1::2]  # Odd dimensions
    # Rotate each (even, odd) pair, then re-interleave
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    return torch.stack([rot1, rot2], dim=-1).flatten(-2)

def precompute_rope(dim, max_len, base=10000):
    """Precompute per-position rotation angles (cos/sin tables)"""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_len).float()
    angles = positions[:, None] * inv_freq[None, :]  # [max_len, dim // 2]
    return angles.cos(), angles.sin()
11.1 Fluxion Backward (Manual)
def rope_backward(grad_output, cos, sin):
    """Backward through RoPE = rotation by the negated angle"""
    cos = cos[:, None, :]
    sin = sin[:, None, :]
    g1 = grad_output[..., 0::2]
    g2 = grad_output[..., 1::2]
    # Inverse rotation: transpose of a rotation matrix (negate sin)
    gx1 = g1 * cos + g2 * sin
    gx2 = -g1 * sin + g2 * cos
    return torch.stack([gx1, gx2], dim=-1).flatten(-2)
12. Summary
12.1 The Position Problem
Transformers need position information injected because self-attention is permutation-invariant.
12.2 Solutions
| Method | How | Gradient Flow |
|---|---|---|
| Sinusoidal | Add Fourier basis | None (fixed) |
| Learned | Add learned embeddings | To position params |
| RoPE | Rotate Q, K | Through Q, K projections |
| ALiBi | Bias attention scores | None (fixed bias) |
12.3 Modern Best Practice
- RoPE for most LLMs (good extrapolation, relative position)
- ALiBi for extreme length extrapolation
- Learned for fixed-length encoders
The fluxion view reveals: position encoding is about "where gradient needs to flow to learn position-aware representations."
References
- Vaswani et al. (2017). "Attention Is All You Need." (Sinusoidal)
- Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." (RoPE)
- Press et al. (2022). "Train Short, Test Long: Attention with Linear Biases." (ALiBi)
- Chen et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation."
Correspondence: scott@opentransformers.online