# Positional Encodings via the Method of Fluxions

## How Transformers Know Where Things Are

**Scott Bisset, Silicon Goddess**
OpenTransformers Ltd
January 2026

---

## Abstract

Positional encodings are often presented as "magic sine waves" or "learned embeddings" without explaining WHY they work. We analyze positional encodings through the fluxion lens, revealing: (1) sinusoidal encodings create a Fourier basis for position, (2) learned embeddings are just position-specific biases, (3) RoPE rotates the query-key space to make dot products position-aware, and (4) ALiBi adds position-dependent damping to attention. Each method has different gradient-flow characteristics that explain its empirical behavior.

---

## 1. The Position Problem

### 1.1 Self-Attention Is Permutation-Invariant

```
Attention(X) = softmax(QKᵀ/√d) · V

Where Q = XWq, K = XWk, V = XWv
```

If we shuffle the rows of X, we get the same output rows in shuffled order. The attention mechanism itself has NO concept of order.

### 1.2 Why This Matters

"The cat sat on the mat" and "mat the on sat cat The" produce different attention patterns ONLY if we add position information.

---

## 2. Sinusoidal Positional Encoding (Original Transformer)

### 2.1 The Formula

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where pos = position (0, 1, 2, ...)
      i   = dimension-pair index
      d   = model dimension
```

### 2.2 Fluxion Interpretation

Each dimension pair oscillates at a different frequency:

```
Dimensions 0,1:     frequency = 1/10000⁰ = 1                 (fastest)
Dimensions 2,3:     frequency = 1/10000^(2/d)
...
Dimensions d-2,d-1: frequency = 1/10000^((d-2)/d) ≈ 1/10000  (slowest)
```

**This is a Fourier basis for position!**

Low dimensions: change rapidly with position (fine detail)
High dimensions: change slowly (coarse position)

### 2.3 Why Sin AND Cos?

```
PE(pos) = [sin(ω₀·pos), cos(ω₀·pos), sin(ω₁·pos), cos(ω₁·pos), ...]
```

Sin and cos together let relative positions be reached by a LINEAR map:

```
PE(pos+k) = PE(pos) · R(k)

Where R(k) is a block-diagonal rotation matrix (depends only on offset k)
```

The network can learn to compute relative positions via linear operations!

### 2.4 Gradient Flow

Sinusoidal encodings are FIXED (not learned).

```
L̇ᴾᴱ = 0   (no gradient flows to the positional encoding)
```

All position information must be extracted by the attention weights.

### 2.5 Addition vs Concatenation

The original Transformer ADDS PE to embeddings:

```
X = TokenEmbed(tokens) + PE(positions)
```

**Fluxion view:** the upstream gradient L̇ˣ passes unchanged to the token embedding; the fixed PE absorbs none of it, so tokens must learn representations that coexist with the added position signal.

Alternative (concatenation):

```
X = [TokenEmbed(tokens), PE(positions)]
```

Doubles the dimension but keeps position separate.

---

## 3. Learned Positional Embeddings

### 3.1 The Idea

Just learn a separate embedding for each position:

```
PE = PositionEmbedding(pos)   # Shape: [max_len, d]
X = TokenEmbed(tokens) + PE[positions]
```

### 3.2 Fluxion Backward

```
L̇ᴾᴱ[pos] = L̇ˣ[pos]   (gradient flows directly)
```

Each position gets gradient from all samples at that position.

### 3.3 Advantages

- Can learn arbitrary position patterns
- No assumptions about structure

### 3.4 Disadvantages

- Limited to the max_len seen during training
- No extrapolation: position 1001 has no embedding if max_len = 1000
- More parameters: max_len × d additional weights

### 3.5 Use Cases

- BERT, GPT-2 (fixed max length)
- Most encoder-only models

---

## 4. Relative Positional Encodings

### 4.1 The Insight

Attention should depend on RELATIVE position (i-j), not absolute.

"Token 5 attending to token 3" and "token 105 attending to token 103" should use the same relative position encoding.
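The rotation identity in Section 2.3 can be checked numerically: each (sin, cos) pair at position pos + k is a fixed 2×2 rotation of the pair at pos, with an angle that depends only on k. A minimal numpy sketch (the helper name `pe` and the toy sizes are ours):

```python
import numpy as np

def pe(pos, d, base=10000.0):
    """Sinusoidal encoding of a single position: sin on even dims, cos on odd."""
    angles = pos / base ** (np.arange(0, d, 2) / d)
    out = np.empty(d)
    out[0::2], out[1::2] = np.sin(angles), np.cos(angles)
    return out

d, k, pos = 8, 3, 17
omega = 1.0 / 10000.0 ** (np.arange(0, d, 2) / d)   # per-pair frequencies

shifted = np.empty(d)
for j, w in enumerate(omega):
    # Rotating the (sin, cos) pair at `pos` by angle k*w lands on the pair at pos+k
    R = np.array([[np.cos(k * w), np.sin(k * w)],
                  [-np.sin(k * w), np.cos(k * w)]])
    shifted[2 * j:2 * j + 2] = R @ pe(pos, d)[2 * j:2 * j + 2]

print(np.allclose(shifted, pe(pos + k, d)))  # True: PE(pos+k) = PE(pos) · R(k)
```

Since R depends only on k, a single linear layer can shift "attention by offset k" regardless of absolute position, which is exactly the property relative encodings make explicit.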
### 4.2 Transformer-XL Style

Add a relative position term to the attention scores (a simplified form of the full Transformer-XL decomposition):

```
S_ij = (Q_i · K_j + Q_i · R_{i-j}) / √d

Where R_{i-j} = relative position embedding for offset (i-j)
```

### 4.3 Fluxion Backward

```
L̇ᴿ[k] = Σ_{i,j : i-j=k} L̇ˢᵢⱼ · Qᵢ
```

Gradient to relative embedding k = sum over all (i, j) pairs with that offset.

---

## 5. Rotary Position Embedding (RoPE)

### 5.1 The Core Idea

Instead of ADDING position to embeddings, ROTATE them:

```
Q_rotated = Rotate(Q, θ·pos)
K_rotated = Rotate(K, θ·pos)
```

Then attention becomes:

```
Q_rot · K_rotᵀ = f(Q, K, pos_q - pos_k)
```

The dot product naturally encodes RELATIVE position!

### 5.2 The Rotation

For each pair of dimensions (2i, 2i+1), at position m:

```
[q'_{2i}  ]   [cos(mθᵢ)  -sin(mθᵢ)] [q_{2i}  ]
[q'_{2i+1}] = [sin(mθᵢ)   cos(mθᵢ)] [q_{2i+1}]

Where m  = position index
      θᵢ = base^(-2i/d), typically base = 10000
```

### 5.3 Why Rotation Works

Expanding the dot product of a query rotated at position m and a key rotated at position n, pair by pair:

```
Q_m · K_n = Σᵢ (q_{2i}·k_{2i} + q_{2i+1}·k_{2i+1}) · cos((m-n)θᵢ)
           + (q_{2i}·k_{2i+1} - q_{2i+1}·k_{2i}) · sin((m-n)θᵢ)

          = f(q, k, m-n)   # Only depends on relative position!
```

### 5.4 Fluxion Backward

```
L̇Q_pre_rotate = Rotateᵀ(L̇Q_rotated, θ·pos) = Rotate(L̇Q_rotated, -θ·pos)
```

Gradient flows backward through the inverse rotation.

### 5.5 Advantages

- Extrapolates to longer sequences (rotation works at any position)
- No additional parameters
- Relative position is built into attention

### 5.6 Use Cases

- LLaMA, Mistral, most modern LLMs
- Becoming the default for decoder-only models

---

## 6. ALiBi (Attention with Linear Biases)

### 6.1 The Simplest Approach

Don't modify Q or K. Just add a bias to the attention scores:

```
S_ij = Q_i · K_jᵀ / √d - m · |i - j|

Where m = head-specific slope
```

### 6.2 Fluxion View

```
S_ij = raw_attention - position_penalty
```

**Distant tokens get penalized.** Attention naturally focuses on nearby tokens.

### 6.3 The Slopes

Different heads use different slopes:

```
Head 1: m = 2^(-8/n_heads)    (mild penalty)
Head 2: m = 2^(-16/n_heads)   (steeper)
...
```

Some heads focus locally, others can attend far.

### 6.4 Gradient Flow

```
L̇Q = L̇ˢ · K / √d    (unchanged from normal attention)
L̇K = L̇ˢᵀ · Q / √d   (unchanged)
```

The position bias has no learnable parameters. Zero gradient to the position encoding (because there isn't one).

### 6.5 Advantages

- Zero additional computation
- Zero additional parameters
- Extrapolates extremely well
- Simple to implement

### 6.6 Disadvantages

- Less expressive than RoPE
- Assumes "closer is more relevant" (not always true)

---

## 7. Comparison Table

| Method | Parameters | Extrapolation | Relative Position | Compute |
|--------|------------|---------------|-------------------|---------|
| Sinusoidal | 0 | Limited | Via linear transform | + |
| Learned | max_len × d | None | No | + |
| RoPE | 0 | Good | Yes (native) | ++ |
| ALiBi | 0 | Excellent | Yes (via bias) | + |

---

## 8. NTK-Aware Interpolation (Long Context)

### 8.1 The Problem

RoPE trained on 4K context doesn't work at 32K: at unseen positions the rotation angles exceed anything encountered during training, and attention patterns degrade.

### 8.2 The Fix: Adjust the Base

```
Original: θᵢ = 10000^(-2i/d)
Scaled:   θᵢ = (10000 · α)^(-2i/d)

Where α = (target_len / train_len)^(d/(d-2))
```

### 8.3 Fluxion Interpretation

Slower rotation = larger effective wavelength = position information spreads across a longer range.

### 8.4 YaRN, CodeLLaMA, etc.

Various interpolation schemes exist:

- Linear interpolation (scale all frequencies)
- NTK-aware (scale the base, preserve high frequencies)
- YaRN (attention scaling + NTK)

All modify how position information flows through attention.

---

## 9. Absolute vs Relative: The Gradient Perspective

### 9.1 Absolute Position Gradients

```
L̇ᴾᴱ[pos] ∝ "how useful was knowing absolute position pos"
```

If position 0 is always the "start token," PE[0] gets specialized gradient.

### 9.2 Relative Position Gradients

```
L̇ᴿ[offset] ∝ "how useful was knowing relative offset"
```

If "1 token apart" is meaningful, R[1] and R[-1] get large gradients.
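The relative-offset gradient rule above (Section 4.3) can be written out directly: every (i, j) pair with the same offset k = i - j accumulates its score gradient, scaled by Qᵢ, into the same row of the table's gradient. A minimal numpy sketch (the shapes and names are ours; offsets are shifted by n-1 to index a plain array):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
Q = rng.normal(size=(n, d))     # queries
dS = rng.normal(size=(n, n))    # upstream gradient on attention scores

# dR[k + (n-1)] accumulates the gradient for relative offset k = i - j
dR = np.zeros((2 * n - 1, d))
for i in range(n):
    for j in range(n):
        dR[(i - j) + (n - 1)] += dS[i, j] * Q[i]
```

Row n-1 (offset 0) collects all diagonal contributions Σᵢ dS[i,i]·Qᵢ; this many-to-one accumulation is why frequent offsets receive much larger gradient than rare ones.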
### 9.3 RoPE: No Position Parameters

```
L̇θ = 0   (rotation angles are fixed, not learned)
```

All position learning happens in the Q, K, V projections. The model learns "what to encode" rather than "how position affects attention."

---

## 10. Position in Different Architectures

### 10.1 Encoder-Only (BERT)

```
Input:    [CLS] tok1 tok2 ... [SEP]
Position:   0    1    2   ...   n
```

Absolute position works fine - these models always process fixed-length chunks.

### 10.2 Decoder-Only (GPT)

```
Input:    tok1 tok2 tok3 ... [generating]
Position:   0    1    2  ...      n

Must attend causally: position i can only see positions ≤ i
```

Relative position helps - the model cares about "how far back," not "which absolute slot."

### 10.3 Encoder-Decoder (T5)

```
Encoder:         bidirectional self-attention, relative position bias
Decoder:         causal self-attention, relative position bias
Cross-attention: typically no positional bias
```

Different components often use different position schemes.

---

## 11. Implementation: RoPE

```python
import torch


def precompute_rope(dim, max_len, base=10000.0):
    """Precompute cos/sin tables of shape [max_len, dim // 2]."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(max_len, dtype=torch.float32)
    angles = positions.unsqueeze(1) * inv_freq.unsqueeze(0)  # [max_len, dim // 2]
    return angles.cos(), angles.sin()


def apply_rope(x, cos, sin):
    """
    x:        [batch, seq_len, n_heads, head_dim]
    cos, sin: [seq_len, head_dim // 2]
    """
    x1 = x[..., 0::2]              # Even dimensions
    x2 = x[..., 1::2]              # Odd dimensions
    cos = cos[None, :, None, :]    # Broadcast over batch and heads
    sin = sin[None, :, None, :]

    # Rotate each (even, odd) pair, then re-interleave so the rotated
    # values stay at dimensions (2i, 2i+1)
    r1 = x1 * cos - x2 * sin
    r2 = x1 * sin + x2 * cos
    return torch.stack([r1, r2], dim=-1).flatten(-2)
```

### 11.1 Fluxion Backward (Manual)

```python
def rope_backward(grad_output, cos, sin):
    """Backward through RoPE = rotation by the negative angle."""
    g1 = grad_output[..., 0::2]
    g2 = grad_output[..., 1::2]
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]

    # Inverse rotation (flip the sign of sin)
    h1 = g1 * cos + g2 * sin     # Note: +sin (inverse)
    h2 = -g1 * sin + g2 * cos
    return torch.stack([h1, h2], dim=-1).flatten(-2)
```

---

## 12. Summary

### 12.1 The Position Problem

Transformers need position information injected because self-attention is permutation-invariant.
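The permutation-invariance claim from Section 1.1 is easy to verify numerically: with no position information, permuting the input rows just permutes the output rows. A self-contained numpy sketch (the function name `attention` and the toy sizes are ours):

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))   # row-wise softmax
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = attention(X, Wq, Wk, Wv)
out_shuffled = attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_shuffled))  # True: only the row order changed
```

Every method in this article breaks this symmetry in a different place: at the input (sinusoidal, learned), inside Q and K (RoPE), or directly on the scores S (ALiBi, relative biases).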
### 12.2 Solutions

| Method | How | Gradient Flow |
|--------|-----|---------------|
| Sinusoidal | Add Fourier basis | None (fixed) |
| Learned | Add learned embeddings | To position params |
| RoPE | Rotate Q, K | Through Q, K projections |
| ALiBi | Bias attention scores | None (fixed bias) |

### 12.3 Modern Best Practice

- **RoPE** for most LLMs (good extrapolation, relative position)
- **ALiBi** for extreme length extrapolation
- **Learned** for fixed-length encoders

The fluxion view reveals: position encoding is about "where gradient needs to flow to learn position-aware representations."

---

## References

1. Vaswani et al. (2017). "Attention Is All You Need." (Sinusoidal)
2. Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." (RoPE)
3. Press et al. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." (ALiBi)
4. Chen et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation."

---

*Correspondence: scott@opentransformers.online*