Positional Encodings via the Method of Fluxions
How Transformers Know Where Things Are
Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026
Abstract
Positional encodings are often presented as "magic sine waves" or "learned embeddings" without explaining WHY they work. We analyze positional encodings through the fluxion lens, revealing: (1) sinusoidal encodings create a Fourier basis for position, (2) learned embeddings are just position-specific biases, (3) RoPE rotates the query-key space to make dot products position-aware, and (4) ALiBi adds position-dependent damping to attention. Each method has different gradient flow characteristics that explain their empirical behavior.
1. The Position Problem
1.1 Self-Attention Is Permutation-Invariant
Attention(X) = softmax(QKᵀ/√d) · V
Where Q = XWq, K = XWk, V = XWv
If we shuffle the rows of X, the output rows are shuffled in exactly the same way (permutation equivariance). The attention mechanism itself has NO concept of order.
1.2 Why This Matters
"The cat sat on the mat" and "mat the on sat cat The" produce different attention patterns ONLY if we add position information.
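This is easy to verify numerically. A minimal sketch with random weights (toy dimensions, no positional information): shuffling the input rows shuffles the output rows identically.

```python
import torch

def attention(X, Wq, Wk, Wv):
    # Single-head attention with no positional information
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = torch.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)
    return scores @ V

torch.manual_seed(0)
X = torch.randn(6, 8)
Wq, Wk, Wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
perm = torch.randperm(6)

out = attention(X, Wq, Wk, Wv)
out_shuffled = attention(X[perm], Wq, Wk, Wv)
# Shuffled input -> identically shuffled output: no notion of order
assert torch.allclose(out[perm], out_shuffled, atol=1e-5)
```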
2. Sinusoidal Positional Encoding (Original Transformer)
2.1 The Formula
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where pos = position (0, 1, 2, ...)
i = dimension index
d = model dimension
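The formula translates directly into a few lines of code. A minimal sketch in PyTorch (function name is illustrative):

```python
import torch

def sinusoidal_pe(max_len, d):
    """Build the [max_len, d] sinusoidal positional encoding table."""
    pos = torch.arange(max_len).float().unsqueeze(1)       # [max_len, 1]
    i = torch.arange(0, d, 2).float()                      # even dimension indices
    angles = pos / (10000 ** (i / d))                      # [max_len, d // 2]
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(100, 16)
# Position 0: sin(0) = 0 in even dims, cos(0) = 1 in odd dims
assert torch.allclose(pe[0, 0::2], torch.zeros(8))
assert torch.allclose(pe[0, 1::2], torch.ones(8))
```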
2.2 Fluxion Interpretation
Each dimension oscillates at a different frequency:
Dimension 0,1: frequency = 1/10000⁰ = 1 (fastest)
Dimension 2,3: frequency = 1/10000^(2/d)
...
Dimension d-2,d-1: frequency = 1/10000^((d-2)/d) ≈ 1/10000 (slowest)
This is a Fourier basis for position!
Low dimensions: change rapidly with position (fine detail)
High dimensions: change slowly (coarse position)
2.3 Why Sin AND Cos?
PE(pos) = [sin(ω₀·pos), cos(ω₀·pos), sin(ω₁·pos), cos(ω₁·pos), ...]
Sin and cos together allow LINEAR interpolation of relative positions:
PE(pos+k) = PE(pos) · R(k)
Where R(k) is a rotation matrix (depends only on offset k)
The network can learn to compute relative positions via linear operations!
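This rotation identity is just the sine/cosine angle-addition formulas; a quick numerical check (arbitrary frequency and offsets):

```python
import math

omega, p, k = 0.3, 7.0, 5.0
s_p, c_p = math.sin(omega * p), math.cos(omega * p)
s_k, c_k = math.sin(omega * k), math.cos(omega * k)

# Rotate the (sin, cos) pair at position p by angle k*omega
s_pk = s_p * c_k + c_p * s_k
c_pk = c_p * c_k - s_p * s_k

# Result equals the pair at position p + k
assert abs(s_pk - math.sin(omega * (p + k))) < 1e-12
assert abs(c_pk - math.cos(omega * (p + k))) < 1e-12
```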
2.4 Gradient Flow
Sinusoidal encodings are FIXED (not learned).
L̇ᴾᴱ = 0 (no gradient flows to positional encoding)
All position information must be extracted by the attention weights.
2.5 Addition vs Concatenation
Original Transformer ADDS PE to embeddings:
X = TokenEmbed(tokens) + PE(positions)
Fluxion view: Gradient flows equally to token embedding and through position.
Alternative (concatenation):
X = [TokenEmbed(tokens), PE(positions)]
Doubles dimension but keeps position separate.
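The two variants differ only in where the position signal lives; a shape-level sketch:

```python
import torch

d, seq_len = 16, 10
tok = torch.randn(seq_len, d)   # token embeddings
pe = torch.randn(seq_len, d)    # positional encodings

x_add = tok + pe                        # addition: same dimension [10, 16]
x_cat = torch.cat([tok, pe], dim=-1)    # concatenation: doubled dimension [10, 32]

assert x_add.shape == (seq_len, d)
assert x_cat.shape == (seq_len, 2 * d)
```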
3. Learned Positional Embeddings
3.1 The Idea
Just learn a separate embedding for each position:
PE = PositionEmbedding(pos) # Shape: [max_len, d]
X = TokenEmbed(tokens) + PE[positions]
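A minimal sketch of this as a PyTorch module (class and attribute names are illustrative, not from any particular library). The backward pass shows the gradient-flow point of Section 3.2: only positions that actually appear in a batch receive gradient.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)  # max_len x d extra parameters

    def forward(self, tokens):  # tokens: [batch, seq_len]
        positions = torch.arange(tokens.shape[1], device=tokens.device)
        return self.tok(tokens) + self.pos(positions)

m = LearnedPositionalEmbedding(vocab_size=100, max_len=32, d=16)
out = m(torch.zeros(2, 10, dtype=torch.long))
out.sum().backward()
# Gradient flows directly to the position table, but only for positions 0..9
assert torch.all(m.pos.weight.grad[10:] == 0)
```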
3.2 Fluxion Backward
L̇ᴾᴱ[pos] = L̇ˣ[pos] (gradient flows directly)
Each position gets gradient from all samples at that position.
3.3 Advantages
- Can learn arbitrary position patterns
- No assumptions about structure
3.4 Disadvantages
- Limited to max_len seen during training
- No extrapolation: position 1001 has no embedding if max_len=1000
- More parameters: max_len × d additional weights
3.5 Use Cases
- BERT, GPT-2 (fixed max length)
- Most encoder-only models
4. Relative Positional Encodings
4.1 The Insight
Attention should depend on RELATIVE position (i-j), not absolute.
"Token 5 attending to token 3" and "token 105 attending to token 103" should use the same relative position encoding.
4.2 Transformer-XL Style
Add relative position bias to attention scores:
S_ij = (Q_i · K_j + Q_i · R_{i-j}) / √d    (simplified; the full Transformer-XL score adds two learned global bias terms)
Where R_{i-j} = relative position embedding for offset (i-j)
4.3 Fluxion Backward
L̇ᴿ[k] = Σᵢⱼ:i-j=k L̇ˢᵢⱼ · Qᵢ
Gradient to relative embedding k = sum over all (i,j) pairs with that offset.
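The relative term can be computed by indexing a table of offset embeddings. A minimal sketch (function name and the offset-indexing convention R[k + seq_len - 1] ↔ offset k are assumptions for illustration):

```python
import torch

def relative_position_scores(Q, R):
    """
    Q: [seq_len, d]; R: [2*seq_len - 1, d], where R[k + seq_len - 1]
    is the embedding for offset k. Returns S_rel[i, j] = Q_i . R_{i-j}.
    """
    n, d = Q.shape
    offsets = torch.arange(n)[:, None] - torch.arange(n)[None, :]  # i - j
    return (Q[:, None, :] * R[offsets + n - 1]).sum(-1)

torch.manual_seed(0)
Q = torch.randn(4, 3)
R = torch.randn(7, 3)
S = relative_position_scores(Q, R)
# i=2, j=0 -> offset 2 -> table index 2 + 4 - 1 = 5
assert torch.allclose(S[2, 0], Q[2] @ R[5])
```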
5. Rotary Position Embedding (RoPE)
5.1 The Core Idea
Instead of ADDING position to embeddings, ROTATE them:
Q_rotated = Rotate(Q, θ·pos)
K_rotated = Rotate(K, θ·pos)
Then attention becomes:
Q_rot · K_rotᵀ = f(Q, K, pos_q - pos_k)
The dot product naturally encodes RELATIVE position!
5.2 The Rotation
For each pair of dimensions (2i, 2i+1):
[q'_{2i}  ]   [cos(mθᵢ)  -sin(mθᵢ)] [q_{2i}  ]
[q'_{2i+1}] = [sin(mθᵢ)   cos(mθᵢ)] [q_{2i+1}]
Where m = position index
θᵢ = base^(-2i/d), typically base=10000
5.3 Why Rotation Works
Q_m · K_nᵀ = Σᵢ [ (q_{2i}·cos(mθᵢ) - q_{2i+1}·sin(mθᵢ)) · (k_{2i}·cos(nθᵢ) - k_{2i+1}·sin(nθᵢ))
            + (q_{2i}·sin(mθᵢ) + q_{2i+1}·cos(mθᵢ)) · (k_{2i}·sin(nθᵢ) + k_{2i+1}·cos(nθᵢ)) ]
          = Σᵢ [ (q_{2i}·k_{2i} + q_{2i+1}·k_{2i+1})·cos((m-n)θᵢ) + (q_{2i}·k_{2i+1} - q_{2i+1}·k_{2i})·sin((m-n)θᵢ) ]
By the angle-difference identities, the result depends only on the relative position (m - n)!
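A numerical check of this property: rotating queries and keys to positions (5, 3) or (105, 103) gives the same dot product, since both pairs share the offset 2 (the `rotate` helper here is a toy single-vector version for illustration):

```python
import torch

torch.manual_seed(0)

def rotate(x, angles):
    """Rotate each (even, odd) pair of x by the corresponding angle."""
    c, s = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1).flatten()

theta = 1.0 / 10000 ** (torch.arange(0, 8, 2).float() / 8)
q, k = torch.randn(8), torch.randn(8)

d1 = rotate(q, 5 * theta) @ rotate(k, 3 * theta)
d2 = rotate(q, 105 * theta) @ rotate(k, 103 * theta)
# Same offset (2) -> same attention score, regardless of absolute position
assert torch.allclose(d1, d2, atol=1e-4)
```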
5.4 Fluxion Backward
L̇Q_pre_rotate = Rotateᵀ(L̇Q_rotated, θ·pos)
= Rotate(L̇Q_rotated, -θ·pos)
Gradient flows backward through inverse rotation.
5.5 Advantages
- Extrapolates to longer sequences (rotation works at any position)
- No additional parameters
- Relative position is built into attention
5.6 Use Cases
- LLaMA, Mistral, most modern LLMs
- Becoming the default for decoder-only models
6. ALiBi (Attention with Linear Biases)
6.1 The Simplest Approach
Don't modify Q or K. Just add a bias to attention scores:
S_ij = Q_i · K_jᵀ / √d - m · |i - j|
Where m = head-specific slope
6.2 Fluxion View
S_ij = raw_attention - position_penalty
Distant tokens get penalized. Attention naturally focuses on nearby tokens.
6.3 The Slopes
Different heads use different slopes:
Head 1: m = 2^(-8/n_heads) (steepest penalty, most local)
Head 2: m = 2^(-16/n_heads) (milder)
...
Some heads focus locally, others can attend far.
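The bias tensor can be built once per sequence length. A minimal sketch (for causal attention; entries above the diagonal are masked out anyway):

```python
import torch

def alibi_bias(seq_len, n_heads):
    """bias[h, i, j] = -m_h * (i - j); added to attention scores before softmax."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    dist = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).float()
    return -slopes[:, None, None] * dist

b = alibi_bias(4, 2)
# Head 0 slope with 2 heads: 2^(-4) = 0.0625; one token back costs -0.0625
assert abs(b[0, 2, 1].item() + 0.0625) < 1e-9
```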
6.4 Gradient Flow
L̇Q = L̇ˢ · K / √d (unchanged from normal attention)
L̇K = L̇ˢᵀ · Q / √d (unchanged)
Position bias has no learnable parameters. Zero gradient to position encoding (because there isn't one).
6.5 Advantages
- Zero additional computation
- Zero additional parameters
- Extrapolates extremely well
- Simple to implement
6.6 Disadvantages
- Less expressive than RoPE
- Assumes "closer is more relevant" (not always true)
7. Comparison Table
| Method | Parameters | Extrapolation | Relative Position | Compute |
|---|---|---|---|---|
| Sinusoidal | 0 | Limited | Via linear transform | + |
| Learned | max_len × d | None | No | + |
| RoPE | 0 | Good | Yes (native) | ++ |
| ALiBi | 0 | Excellent | Yes (via bias) | + |
8. NTK-Aware Interpolation (Long Context)
8.1 The Problem
RoPE trained on 4K context doesn't work at 32K: positions beyond the training range produce rotation angles the model has never seen, and attention quality degrades.
8.2 The Fix: Adjust the Base
Original: θᵢ = 10000^(-2i/d)
Scaled: θᵢ = (10000 · α)^(-2i/d)
Where α = (target_len / train_len)^(d/(d-2))
8.3 Fluxion Interpretation
Slower rotation = larger effective wavelength = position information spreads across longer range.
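The base adjustment from Section 8.2 is a one-liner; a sketch using that α (function name is illustrative):

```python
def ntk_scaled_base(base, train_len, target_len, d):
    """NTK-aware scaling: base' = base * s^(d/(d-2)), s = target_len / train_len."""
    s = target_len / train_len
    return base * s ** (d / (d - 2))

# Extending a 4K-trained model with head_dim 128 to 32K context
b = ntk_scaled_base(10000, 4096, 32768, 128)
# Slightly more than linear scaling of the base (8x), because d/(d-2) > 1
assert 8 * 10000 < b < 9 * 10000
```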
8.4 YaRN, CodeLLaMA, etc.
Various interpolation schemes exist:
- Linear interpolation (scale all frequencies)
- NTK-aware (scale base, preserve high frequencies)
- YaRN (attention scaling + NTK)
All modify how position information flows through attention.
9. Absolute vs Relative: The Gradient Perspective
9.1 Absolute Position Gradients
L̇ᴾᴱ[pos] ∝ "how useful was knowing absolute position pos"
If position 0 is always "start token," PE[0] gets specialized gradient.
9.2 Relative Position Gradients
L̇ᴿ[offset] ∝ "how useful was knowing relative offset"
If "1 token apart" is meaningful, R[1] and R[-1] get large gradients.
9.3 RoPE: No Position Parameters
L̇θ = 0 (rotation angles are fixed, not learned)
All position learning happens in Q, K, V projections. The model learns "what to encode" rather than "how position affects attention."
10. Position in Different Architectures
10.1 Encoder-Only (BERT)
Input: [CLS] tok1 tok2 ... [SEP]
Position: 0 1 2 ... n
Absolute position works fine - always process fixed-length chunks.
10.2 Decoder-Only (GPT)
Input: tok1 tok2 tok3 ... [generating]
Position: 0 1 2 ... n
Must attend causally: position i can only see ≤ i
Relative position helps - model cares about "how far back" not "absolute slot."
10.3 Encoder-Decoder (T5)
Encoder: bidirectional self-attention with learned relative position biases (bucketed offsets)
Decoder: causal self-attention, also with relative position biases
Cross-attention carries no positional signal of its own.
Different components can use different position schemes.
11. Implementation: RoPE
import torch

def apply_rope(x, cos, sin):
    """
    x: [batch, seq_len, n_heads, head_dim]
    cos, sin: [seq_len, head_dim // 2]
    """
    cos = cos[:, None, :]  # Broadcast over the heads dimension
    sin = sin[:, None, :]
    # Split into pairs
    x1 = x[..., 0::2]  # Even dimensions
    x2 = x[..., 1::2]  # Odd dimensions
    # Rotate each (even, odd) pair, then re-interleave
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    return torch.stack([rot1, rot2], dim=-1).flatten(-2)

def precompute_rope(dim, max_len, base=10000):
    """Precompute per-position rotation angles (cos/sin tables)"""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_len).float()
    angles = positions[:, None] * inv_freq[None, :]  # [max_len, dim // 2]
    return angles.cos(), angles.sin()
11.1 Fluxion Backward (Manual)
def rope_backward(grad_output, cos, sin):
    """Backward through RoPE = rotation by the negated angle"""
    cos = cos[:, None, :]
    sin = sin[:, None, :]
    g1 = grad_output[..., 0::2]
    g2 = grad_output[..., 1::2]
    # Inverse rotation: transpose of a rotation matrix (negate sin)
    gx1 = g1 * cos + g2 * sin
    gx2 = -g1 * sin + g2 * cos
    return torch.stack([gx1, gx2], dim=-1).flatten(-2)
12. Summary
12.1 The Position Problem
Transformers need position information injected because self-attention is permutation-invariant.
12.2 Solutions
| Method | How | Gradient Flow |
|---|---|---|
| Sinusoidal | Add Fourier basis | None (fixed) |
| Learned | Add learned embeddings | To position params |
| RoPE | Rotate Q, K | Through Q, K projections |
| ALiBi | Bias attention scores | None (fixed bias) |
12.3 Modern Best Practice
- RoPE for most LLMs (good extrapolation, relative position)
- ALiBi for extreme length extrapolation
- Learned for fixed-length encoders
The fluxion view reveals: position encoding is about "where gradient needs to flow to learn position-aware representations."
References
- Vaswani et al. (2017). "Attention Is All You Need." (Sinusoidal)
- Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." (RoPE)
- Press et al. (2022). "Train Short, Test Long: Attention with Linear Biases." (ALiBi)
- Chen et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation."
Correspondence: scott@opentransformers.online