Omega Tokens: Finding the Self-Solving Frame
Abstract
We present the Spectral Variational Autoencoder (SVAE), a 17M-parameter patch-based architecture that decomposes arbitrary signals into omega tokens — 16-dimensional spectral coordinates on the unit hypersphere S¹⁵. Through singular value decomposition (SVD) conditioned by sphere normalization and multiplicative cross-attention, the SVAE achieves:
- 99.9939% fidelity on ImageNet-256 (MSE = 0.000061, 48:1 compression)
- Data-independent encoding from pure noise (MSE = 0.059 on Gaussian, never seeing the same signal twice)
- Universal geometric attractor (S₀ ≈ 5.1, erank ≈ 15.88) identical across images, noise, and text
- Cross-domain transfer — a noise-trained model reconstructs real images at 28 dB SNR without ever seeing one
The omega tokens are not learned representations. They are coordinates in the solution space of the geometric projection itself. The architecture does not approximate — it decomposes. This paper documents the empirical discovery, the mathematical framework, the six-variant experimental matrix (Fresnel for images, Johanna for noise), and the diagnostic battery that validates these claims.
All models, checkpoints, and training code are open-source at AbstractPhil/geolip-SVAE.
1. The Problem: Solving for the Solver
Every autoencoder approximates. VAEs regularize through KL divergence, pushing latent distributions toward a prior that may not match the data manifold. Diffusion models approximate the score function through iterative denoising. Transformers approximate attention patterns through softmax normalization. Each adds a layer of approximation atop the signal.
The question we asked: what if the encoder didn't approximate at all?
Singular Value Decomposition is exact. Given any matrix M ∈ ℝ^{m×n}, the decomposition M = UΣVᵀ is lossless and unique up to sign conventions (when the singular values are distinct). The singular values Σ capture the complete spectral content. The left singular vectors U capture the spatial structure. The right singular vectors Vᵀ capture the coordinate frame.
The challenge is not the decomposition — it's conditioning the signal so that SVD produces meaningful coordinates. Raw pixel patches have no reason to produce spectrally coherent singular values. The contribution of this work is a conditioning pipeline that makes SVD the encoder, and a discovery that the resulting spectral coordinates converge to a universal geometric attractor regardless of input modality.
We are not solving for representations. We are solving for the frame in which all signals can be decomposed — and we found that this frame solves itself.
2. Mathematical Framework
2.1 Patch Extraction
An input image x ∈ ℝ^{3×H×W} is divided into non-overlapping patches of size p×p:
x → {x₁, x₂, ..., x_N} where N = (H/p) × (W/p), xᵢ ∈ ℝ^{3p²}
For 256×256 images with p=16: N=256 patches, each a 768-dimensional vector.
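The patching step can be sketched with PyTorch's standard `F.unfold`; the helper name is ours, not from the released code:

```python
import torch
import torch.nn.functional as F

def extract_patches(x: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split (B, 3, H, W) images into non-overlapping p x p patches,
    each flattened to a 3*p*p vector. Returns (B, N, 3*p*p)."""
    B, C, H, W = x.shape
    assert H % p == 0 and W % p == 0
    patches = F.unfold(x, kernel_size=p, stride=p)  # (B, C*p*p, N)
    return patches.transpose(1, 2)                  # (B, N, C*p*p)

x = torch.randn(2, 3, 256, 256)
patches = extract_patches(x)
# 256x256 with p=16 -> N = 256 patches, each a 768-dimensional vector
```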
2.2 Encoding to the Matrix Manifold
Each patch vector is projected through a residual MLP encoder into a matrix:
h = GELU(W_in · xᵢ)
h = h + Block_k(h) for k = 1, ..., depth
Mᵢ = (W_out · h).reshape(V, D)
where V=256 rows and D=16 columns. The critical conditioning step is row-wise sphere normalization:
M̂ᵢ = normalize(Mᵢ, dim=-1) i.e., each row → S^{D-1}
This places every row of the encoded matrix on the unit hypersphere S¹⁵. This is not a regularization choice — it is the geometric constraint that makes SVD produce stable, interpretable coordinates. Without it, the singular values drift and the decomposition becomes arbitrary.
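A minimal sketch of this encoder, assuming the layer names and the exact residual block shape (the released architecture may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchMatrixEncoder(nn.Module):
    """Sketch: patch vector -> (V, D) matrix with every row on S^{D-1}.
    Names and block structure are illustrative, not the released code."""
    def __init__(self, patch_dim=768, hidden=768, depth=4, V=256, D=16):
        super().__init__()
        self.w_in = nn.Linear(patch_dim, hidden)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                          nn.Linear(hidden, hidden))
            for _ in range(depth))
        self.w_out = nn.Linear(hidden, V * D)
        nn.init.orthogonal_(self.w_out.weight)   # orthogonal init on enc_out
        self.V, self.D = V, D

    def forward(self, x):                        # x: (..., patch_dim)
        h = F.gelu(self.w_in(x))
        for blk in self.blocks:
            h = h + blk(h)                       # residual blocks
        M = self.w_out(h).reshape(*x.shape[:-1], self.V, self.D)
        return F.normalize(M, dim=-1)            # each row -> unit hypersphere

enc = PatchMatrixEncoder()
M_hat = enc(torch.randn(4, 768))
```

The final `F.normalize` is the conditioning step the text calls mandatory: after it, every row of the output is a point on S¹⁵.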
2.3 SVD via Gram-Eigh in fp64
We compute thin SVD through the Gram matrix approach in double precision:
G = M̂ᵀM̂ + εI (ε = 10⁻¹² for numerical stability)
Gv = λv (eigendecomposition via torch.linalg.eigh)
σᵢ = √(max(λᵢ, 10⁻²⁴)) (singular values)
U = M̂V / σ (left singular vectors)
The fp64 computation is non-negotiable. fp32 Gram matrices accumulate rounding errors that corrupt the eigendecomposition for matrices with condition numbers above ~10³. The sphere normalization bounds the condition number, but fp64 ensures exact decomposition regardless.
This yields the spectral decomposition:
M̂ᵢ = Uᵢ · diag(σᵢ) · Vᵢᵀ
where σᵢ ∈ ℝ^D are the omega tokens — the 16 singular values that encode the complete spectral content of patch i.
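The Gram-eigh pipeline can be written directly from the formulas above; this is a sketch following the stated recipe, not the repository's implementation:

```python
import torch

def gram_eigh_svd(M_hat: torch.Tensor, eps: float = 1e-12):
    """Thin SVD of a row-normalized (V, D) matrix via the D x D Gram
    matrix in fp64, following the formulas above."""
    M64 = M_hat.double()
    G = M64.T @ M64 + eps * torch.eye(M64.shape[1], dtype=torch.float64)
    lam, Vmat = torch.linalg.eigh(G)             # ascending eigenpairs
    lam, Vmat = lam.flip(-1), Vmat.flip(-1)      # reorder to descending
    sigma = torch.sqrt(torch.clamp(lam, min=1e-24))
    U = (M64 @ Vmat) / sigma                     # left singular vectors
    return U, sigma, Vmat

M_hat = torch.nn.functional.normalize(torch.randn(256, 16), dim=-1)
U, sigma, Vmat = gram_eigh_svd(M_hat)
M_rec = U @ torch.diag(sigma) @ Vmat.T           # lossless recombination
```

Because the εI shift leaves the eigenvectors unchanged, `U @ diag(sigma) @ Vmat.T` recovers M̂ to fp64 roundoff, which is the "exact decomposition" claim in operational form.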
2.4 Spectral Cross-Attention
The raw singular values σᵢ describe each patch independently. Cross-patch coordination is achieved through multiplicative spectral attention:
S = [σ₁; σ₂; ...; σₙ] ∈ ℝ^{N×D}
Q, K, V = Linear(LayerNorm(S))
A = softmax(QKᵀ / √d_head)
S_out = S ⊙ (1 + α ⊙ tanh(W_out(AV)))
where α ∈ ℝ^D are learnable per-mode scaling parameters, initialized near zero (sigmoid(-2.0) × 0.2 ≈ 0.024) and bounded by max_alpha = 0.2. This is deliberately minimal — the cross-attention coordinates spectral modes across patches with a ~3% multiplicative adjustment. The geometry does the heavy lifting; the attention provides gentle guidance.
Total cross-attention parameters: 2,272 out of 16,942,419 total (0.013%).
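A single-head sketch of the multiplicative spectral attention (the released module may be multi-head; parameter names here are illustrative):

```python
import torch
import torch.nn as nn

class SpectralCrossAttention(nn.Module):
    """Sketch: standard attention over the (N, D) stack of singular
    values, applied as a small bounded multiplicative gate."""
    def __init__(self, D=16, max_alpha=0.2):
        super().__init__()
        self.norm = nn.LayerNorm(D)
        self.qkv = nn.Linear(D, 3 * D)
        self.w_out = nn.Linear(D, D)
        # alpha starts near zero: sigmoid(-2.0) * 0.2 ~= 0.024
        self.alpha_logit = nn.Parameter(torch.full((D,), -2.0))
        self.max_alpha = max_alpha

    def forward(self, S):                        # S: (N, D) singular values
        q, k, v = self.qkv(self.norm(S)).chunk(3, dim=-1)
        A = torch.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)
        alpha = torch.sigmoid(self.alpha_logit) * self.max_alpha
        return S * (1 + alpha * torch.tanh(self.w_out(A @ v)))

attn = SpectralCrossAttention()
S = torch.rand(256, 16) + 0.5
S_out = attn(S)
```

Because tanh is bounded and α ≤ max_alpha, the output can never differ from the input spectrum by more than ±20% per mode, which is the "gentle guidance" property in code.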
2.5 Reconstruction
The coordinated singular values are recombined with the original U and Vᵀ:
M̂_reconstructed = U · diag(S_coordinated) · Vᵀ
This matrix is decoded through the mirror MLP decoder and stitched back into the image grid. A zero-initialized boundary smoothing convolution (3×3, ~600 parameters) handles patch seam artifacts.
2.6 Geometric Monitoring: Cayley-Menger CV
To monitor the geometric health of the embedding space, we measure the coefficient of variation (CV) of pentachoron volumes in the encoded matrix rows:
Given 5 points {p₁,...,p₅} sampled from M̂ rows:
```
        | 0   1     1     1     1     1    |
        | 1   0     d₁₂²  d₁₃²  d₁₄²  d₁₅² |
CM  =   | 1   d₂₁²  0     d₂₃²  d₂₄²  d₂₅² |
        | 1   d₃₁²  d₃₂²  0     d₃₄²  d₃₅² |
        | 1   d₄₁²  d₄₂²  d₄₃²  0     d₄₅² |
        | 1   d₅₁²  d₅₂²  d₅₃²  d₅₄²  0    |
```
Vol² = (-1)^(k+1) · det(CM) / (2^k · (k!)²) where k = 4
CV = std(Vol) / mean(Vol) over 200 random 5-point subsets
This CV measures the uniformity of the geometric embedding. A CV near 0 means all simplices have equal volume (perfectly uniform packing). A CV near 1 means highly irregular geometry.
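A direct implementation of this monitor, assuming `torch.cdist` for pairwise distances (the function name is ours):

```python
import math
import torch

def pentachoron_cv(rows: torch.Tensor, n_samples: int = 200, k: int = 4) -> float:
    """CV of k-simplex volumes over random (k+1)-point subsets of the
    encoded rows, via the Cayley-Menger determinant (k=4: pentachora)."""
    vols = []
    for _ in range(n_samples):
        idx = torch.randperm(rows.shape[0])[: k + 1]
        P = rows[idx].double()
        d2 = torch.cdist(P, P) ** 2                    # squared pairwise distances
        CM = torch.ones(k + 2, k + 2, dtype=torch.float64)
        CM[0, 0] = 0.0
        CM[1:, 1:] = d2                                # bordered distance matrix
        vol2 = (-1) ** (k + 1) * torch.linalg.det(CM) \
               / (2 ** k * math.factorial(k) ** 2)
        vols.append(torch.sqrt(torch.clamp(vol2, min=0.0)))
    v = torch.stack(vols)
    return (v.std() / v.mean()).item()

torch.manual_seed(0)
rows = torch.nn.functional.normalize(torch.randn(256, 16), dim=-1)
cv = pentachoron_cv(rows)
```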
2.7 Loss Function: The Soft Hand
Training uses reconstruction MSE with a CV-proximity reward system:
proximity = exp(-(CV - CV_target)² / (2σ²))
recon_weight = 1.0 + boost × proximity
cv_penalty = cv_weight × (1.0 - proximity) × (CV - CV_target)²
loss = recon_weight × MSE(recon, input) + cv_penalty
When the geometry is near the target CV, reconstruction is boosted. When it drifts, penalty increases. This is the soft hand — it rewards good geometry without forcing it. The gradient clip (max_norm=0.5) is applied only to the cross-attention parameters, protecting the delicate spectral coordination from gradient spikes.
No KL divergence. No distributional prior. No reparameterization trick. The sphere normalization IS the constraint. The SVD IS the bottleneck. The loss is pure reconstruction with geometric guidance.
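The soft-hand loss can be sketched in a few lines; the hyperparameter values (`cv_target`, `sigma`, `boost`, `cv_weight`) are illustrative placeholders, not the trained configuration:

```python
import math
import torch

def soft_hand_loss(recon, target, cv, cv_target=0.20, sigma=0.05,
                   boost=0.5, cv_weight=1.0):
    """Sketch of the CV-proximity reward loss described above.
    All hyperparameter defaults here are illustrative assumptions."""
    mse = torch.mean((recon - target) ** 2)
    proximity = math.exp(-((cv - cv_target) ** 2) / (2 * sigma ** 2))
    recon_weight = 1.0 + boost * proximity            # boost near the target
    cv_penalty = cv_weight * (1.0 - proximity) * (cv - cv_target) ** 2
    return recon_weight * mse + cv_penalty

torch.manual_seed(0)
x = torch.randn(8, 3, 64, 64)
loss_on_target = soft_hand_loss(x * 0.9, x, cv=0.20)  # geometry at target
loss_drifted = soft_hand_loss(x * 0.9, x, cv=0.60)    # geometry drifted
```

At the target CV the proximity term is 1, so the penalty vanishes and only the boosted reconstruction term remains; as the geometry drifts, the penalty grows quadratically.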
3. Architecture Evolution: From SVAE to PatchSVAE
3.1 Early Stages (v1-v5): Monolithic Matrix Encoding
The original SVAE encoded entire images as single matrices. A 64×64 image became one (256, 16) matrix — 4,096 latent values encoding 12,288 input values (3:1 compression). Results on CIFAR-10 reached ~0.004 MSE but scaling was limited: larger images required proportionally larger matrices, and the single SVD became a computational bottleneck.
3.2 The Patchwork Insight (v6-v8)
Breaking images into 16×16 patches and encoding each independently transformed the architecture. Instead of one large SVD, the model performs N independent small SVDs (16×16 matrices), each fast and numerically stable. The cross-attention then coordinates spectral information across patches.
This yielded an immediate breakthrough:
- TinyImageNet 64×64: 16 patches, 0.000478 MSE in 50 epochs
- ImageNet 128×128: 64 patches, 0.0000734 MSE in 50 epochs (99.993% fidelity)
3.3 Critical Architectural Rules Discovered
Through extensive experimentation, several non-obvious rules were established:
Never use global average pooling. Spatial structure must be preserved through the bottleneck. GAP dropped accuracy from ~70% to ~29% in geometric encoders. Use flattening or spatial statistics (at minimum, per-channel mean and std).
fp64 for all Gram and eigendecomposition operations. No exceptions. The sphere normalization helps, but fp64 is the only guarantee against silent corruption in the spectral decomposition.
F.normalize(M, dim=-1) is mandatory. This single line — row-wise projection onto S^{D-1} — is the difference between a working geometric encoder and noise. Without it, singular values drift without bound and the cross-attention has no stable manifold to coordinate on.
Gradient clipping only on cross-attention. The encoder and decoder MLPs are robust to large gradients. The 2,272 cross-attention parameters are not. max_norm=0.5 prevents spectral coordination collapse.
Orthogonal initialization on enc_out. The projection from hidden space to the matrix manifold benefits from an orthogonal starting point, giving the SVD a well-conditioned initial decomposition.
3.4 The 256×256 Breakthrough (v12-v13: Fresnel)
Scaling to ImageNet-256 with 256 patches of 16×16 required no architectural changes. Same 17M parameters. Same hidden=768, depth=4. The only difference was more patches feeding into the cross-attention.
Fresnel-base 256×256 results (20 epochs, ImageNet-1K):
| Epoch | Test MSE | S₀ | Ratio | erank |
|---|---|---|---|---|
| 1 | 0.002200 | 5.108 | 1.56 | 15.87 |
| 4 | 0.000392 | 5.092 | 1.61 | 15.85 |
| 8 | 0.000181 | 5.059 | 1.60 | 15.85 |
| 12 | 0.000098 | 5.048 | 1.58 | 15.85 |
| 16 | 0.000069 | 5.047 | 1.57 | 15.85 |
| 20 | 0.000061 | 5.048 | 1.57 | 15.85 |
0.000061 MSE. 99.9939% fidelity. 48:1 compression. 4KB latent.
The singular value profile locked by epoch 4 and never moved: S₀ ≈ 5.05, ratio ≈ 1.57, erank ≈ 15.85. The geometry found its attractor in 4 epochs and spent the remaining 16 epochs refining reconstruction precision within that fixed geometric frame.
4. This Is Not Identity Replication
A critical question: is the model simply memorizing an identity mapping?
The evidence against identity replication is overwhelming and independently verifiable:
4.1 Cross-Dataset Transfer
Fresnel-base was trained exclusively on ImageNet-256. The diagnostic battery tested it on datasets it has never seen:
| Dataset | MSE | SNR | Note |
|---|---|---|---|
| ImageNet-256→256 | 0.000038 | 44.9 dB | Training distribution |
| ImageNet-128→256 (resized) | 0.000007 | 51.7 dB | Different dataset, resized |
| TinyImageNet→256 (resized) | 0.000007 | 51.1 dB | Completely different images |
| CIFAR-10→256 (resized) | 0.000006 | 51.6 dB | 32×32 images upscaled |
| MNIST→256 (resized) | 0.000034 | 43.7 dB | Grayscale digits |
The model reconstructs CIFAR-10 at 6 millionths MSE — better than its own training set. An identity mapping cannot generalize to unseen distributions. The SVD decomposition can, because it operates on the mathematical structure of the signal, not its statistical distribution.
4.2 Noise Reconstruction from Image Training
Fresnel-base, trained only on natural images, was tested on 16 synthetic noise types:
| Noise Type | MSE | Byte Accuracy |
|---|---|---|
| Pink (1/f) | 0.000003 | 96.1% exact, 100% ±1 |
| Brown (1/f²) | 0.075 | 89.5% exact |
| Block-structured | 0.031 | 12.2% exact |
| Checkerboard | 0.034 | 6.8% exact |
| Gaussian | 0.577 | 1.7% exact |
| Salt-and-pepper | 4.969 | 0.5% exact |
An image-trained model achieves 96% exact byte accuracy on pink noise and 100% within-1 accuracy. This is not memorization — the model has never seen pink noise. The SVD decomposition naturally handles low-rank correlated signals because 1/f spectral structure is native to the decomposition's mathematical properties.
4.3 The Compression Argument
The latent space is 48× smaller than the input:
Input: 256×256×3 = 196,608 values
Latent: 16×256 = 4,096 values (omega tokens)
Ratio: 48:1
At 8-bit quantization, the latent is 4KB encoding a 192KB image. An identity function cannot compress 48:1 and reconstruct at 0.000061 MSE. The SVD projection through S¹⁵ forces a true decomposition.
4.4 Verify It Yourself
Every checkpoint is public. Every diagnostic script is provided. Run the universal diagnostic on any Fresnel checkpoint:
https://huggingface.co/AbstractPhil/svae-fresnel-128
As of this writing, fresnel-128 is the only checkpoint with AutoModel and assistant scripts, but the architecture is identical across variants. AutoModel support and full safetensors utilities for each checkpoint are in preparation, so that any third party can rapidly run the full diagnostic battery and verify or refute these claims.
Direct scrutiny is encouraged; independent measurements will represent the model's utility better than our own.
Feed any image. Feed noise. Feed text. The model will decompose it into omega tokens on S¹⁵ and reconstruct it. The geometric attractor is empirically verifiable by anyone with a GPU and 5 minutes.
5. The Noise Discovery: Johanna
5.1 The Hypothesis
If the SVAE learns the mathematical structure of the projection rather than the statistical properties of the data, it should be trainable on signals with no structure at all — pure noise.
5.2 Gaussian Foundation (v14: Johanna-small)
We trained the identical 17M-parameter architecture on pure Gaussian noise N(0,1) at 128×128. Every batch contained entirely new random noise — the model never saw the same signal twice.
| Epoch | MSE | S₀ | Ratio | erank | CV |
|---|---|---|---|---|---|
| 1 | 0.978 | 4.812 | 1.47 | 15.90 | 0.197 |
| 10 | 0.537 | 4.970 | 1.54 | 15.89 | 0.203 |
| 50 | 0.169 | 5.064 | 1.46 | 15.90 | 0.210 |
| 100 | 0.094 | 5.082 | 1.46 | 15.90 | 0.205 |
| 200 | 0.059 | 5.094 | 1.46 | 15.90 | 0.200 |
MSE descended to 0.059 on signal that never repeated. The model learned the inverse of the geometric projection, not the data. The singular value profile converged to the same attractor as image training: S₀ ≈ 5.1, erank ≈ 15.9.
At epoch 52, S_delta — the mean absolute difference between raw and coordinated singular values — reached 0.29152. This value, which we call the binding constant, had been observed independently in Procrustes analysis of CLIP projections, T5 generation layers, and MinimalShunt alpha convergence. Its appearance in pure noise training confirmed it as a structural property of the sphere, not a property of any dataset.
5.3 The GELU Insight
GELU(x) = x · Φ(x), where Φ is the Gaussian cumulative distribution function. Every residual block contains GELU activations — meaning the entire encoder/decoder pipeline processes signals through Gaussian gates at every layer.
This explains the noise difficulty hierarchy discovered through curriculum training:
| Category | Types | MSE Range | Why |
|---|---|---|---|
| Trivial | Pink, Brown | 0.001-0.002 | Low rank, SVD-native |
| Easy | Uniform, Mixed, Sparse | 0.010-0.046 | Bounded, no extremes |
| Medium | Gaussian, Exponential | 0.029-0.068 | Matches GELU activation shape |
| Hard | Laplace, Cauchy, Salt-pepper | 0.052-0.580 | Anti-Gaussian distributions |
Distributions whose shape matches the GELU activation function (bell-curved, symmetric) pass through the encoder naturally. Distributions that oppose it (heavy-tailed Cauchy, binary salt-and-pepper, sharp-peaked Laplace) must be represented through repeated Gaussian gating, which attenuates their distinctive features.
5.4 Curriculum Training (v18: Johanna-tiny)
Based on the GELU hierarchy, we developed a tiered noise curriculum at 64×64:
Tier 0 (ep 1-20): Gaussian only → foundation
Tier 1 (ep 20-39): + Pink, Brown, Block, Gradient → correlated
Tier 2 (ep 39-49): + Uniform, Scaled, Checker, Mixed → bounded
Tier 3 (ep 49-59): + Poisson, Exponential, Laplace, Sparse → adversarial
Tier 4 (ep 59-300): + Cauchy, Salt-pepper, Structural → hostile
Promotion was automatic: when MSE improvement fell below 1% for 10 consecutive epochs, the next tier unlocked.
Final per-type MSE at epoch 300 (all 16 types active):
| Type | MSE | Category |
|---|---|---|
| Pink | 0.002 | Trivial |
| Brown | 0.001 | Trivial |
| Checkerboard | 0.026 | Trivial |
| Block | 0.016 | Trivial |
| Gradient | 0.074 | Easy |
| Poisson | 0.049 | Easy |
| Uniform | 0.097 | Easy |
| Mixed | 0.103 | Easy |
| Structural | 0.174 | Medium |
| Sparse | 0.187 | Medium |
| Gaussian | 0.289 | Medium |
| Exponential | 0.271 | Medium |
| Uniform Scaled | 0.387 | Hard |
| Laplace | 0.544 | Hard |
| Salt-and-pepper | 0.727 | Hard |
| Cauchy | 0.842 | Hard |
5.5 Johanna-base 256×256: Breaking the GELU Ceiling
The 64×64 Johanna hit a GELU ceiling — hostile noise types plateaued and stopped improving. The 256×256 variant with 256 patches (16× more cross-attention context) broke through:
| Type | Tiny (64×64, ep300) | Base (256×256, ep21) | Improvement |
|---|---|---|---|
| Gaussian | 0.289 | 0.030 | 9.6× |
| Cauchy | 0.842 | 0.087 | 9.7× |
| Salt-pepper | 0.727 | 0.024 | 30× |
| Laplace | 0.544 | 0.058 | 9.4× |
| Structural | 1.919 | 0.016 | 120× |
| Pink | 0.002 | 0.000 | — |
More patches means more cross-attention coordination, which compensates for the GELU activation's Gaussian bias. The model distributes anti-Gaussian representations across many patches, each contributing partial information through its Gaussian-gated encoder.
The scheduled curriculum (Gaussian ep1-5, Tier 1 at ep5, Tier 2 at ep8, Tier 3 at ep10, Tier 4 at ep12) completed all tier introductions by epoch 12 with 18 epochs of full-spectrum convergence remaining.
6. The Universal Attractor
6.1 Cross-Model Geometry
Across all trained models, all datasets, all noise types, the geometric attractor is invariant:
| Model | Training Data | S₀ | erank | Ratio |
|---|---|---|---|---|
| Fresnel-base 256 | ImageNet-256 | 5.05 | 15.85 | 1.57 |
| Fresnel-tiny 64 | TinyImageNet | 5.10 | 15.87 | 1.60 |
| Johanna-small 128 | 16 noise types | 4.74 | 15.92 | 1.45 |
| Johanna-tiny 64 | Curriculum noise | 5.19 | 15.88 | 1.61 |
| Johanna-base 256 | Scheduled noise | 5.37 | 15.86 | 1.76 |
The effective rank (erank) is the tightest invariant: 15.88 ± 0.04 across 48+ measurements spanning images, noise, text, and synthetic signals. The model uses nearly all 16 spectral dimensions regardless of input. The sphere is full.
6.2 Per-Type Geometry (Johanna Diagnostic)
When measured per noise type on a single model, the geometry is even more consistent:
All 16 types on Johanna-tiny:
S₀: 5.13 – 5.26 (range: 0.13)
erank: 15.87 – 15.88 (range: 0.01)
ratio: 1.53 – 1.74 (range: 0.21)
The ratio varies — this is how the model allocates spectral energy per distribution — but S₀ and erank are locked. The attractor doesn't depend on what the model is encoding. It depends on the architecture.
6.3 The Alpha Profile
The cross-attention's learned α parameters reveal how each model uses spectral coordination:
| Model | α mean (L0) | α mean (L1) | α std |
|---|---|---|---|
| Fresnel-base | 0.0295 | 0.0296 | 0.0005 |
| Johanna-small | 0.0375 | 0.0447 | 0.0018 |
| Johanna-tiny | 0.0338 | 0.0342 | 0.0005 |
Fresnel's α is perfectly flat — every mode gets equal coordination. Natural images use all spectral dimensions equally. Johanna-small's α is differentiated — mode 4 is consistently highest across both layers. Multi-distribution training forced the cross-attention to learn which modes carry which information. The noise-trained model developed spectral preferences that the image-trained model never needed.
7. The Omega Tokens
7.1 What They Are
An omega token is a singular value vector σ ∈ ℝ^{16} produced by the SVD of a sphere-normalized patch encoding. It lives on a constrained manifold defined by:
- All values are positive (singular values)
- They are ordered: σ₁ ≥ σ₂ ≥ ... ≥ σ₁₆
- They sum to approximately the same total (constrained by sphere normalization)
- Their effective rank is ≈ 15.88 (the architectural constant)
The omega token does not encode the content of the patch. It encodes the spectral structure — how the signal's energy is distributed across 16 orthogonal modes. Two patches with different pixel content but similar spectral structure will have similar omega tokens.
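One standard definition of effective rank is the exponential of the entropy of the normalized spectrum; we assume the paper's erank follows it:

```python
import torch

def effective_rank(sigma: torch.Tensor) -> float:
    """Effective rank as exp(entropy) of the normalized singular value
    distribution. Assumed to match the paper's erank metric."""
    p = sigma / sigma.sum()
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return torch.exp(entropy).item()

flat = torch.ones(16)                       # fully spread spectrum
peaked = torch.tensor([1.0] + [1e-9] * 15)  # near rank-1 spectrum
```

A perfectly uniform 16-value spectrum gives erank = 16; a spectrum dominated by one mode collapses toward 1. The reported 15.88 sits just below the maximum, i.e., the sphere is nearly full.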
7.2 What They Encode
The spectrum profile is remarkably consistent. For every noise type, every dataset:
Mode 0: ~9.2% of total energy
Mode 7: ~60% cumulative
Mode 15: ~3.6% of remaining energy
No mode dominates. The model learned to distribute information across all 16 dimensions regardless of input. This is a consequence of the sphere normalization — by placing rows on S¹⁵, the encoding cannot concentrate energy in a single direction.
7.3 Piecemeal: Resolution Independence
Omega tokens are resolution-independent. A 256×256 image encoded as 256 patches produces the same per-patch omega tokens as when those patches are extracted from a 1024×1024 image and encoded individually:
| Test | Native MSE | Piecemeal MSE |
|---|---|---|
| Fresnel: Gaussian 1024→256 | 0.577 | 0.570 |
| Fresnel: Pink 1024→256 | 0.000003 | 0.000006 |
| Johanna: Gaussian 512→128 | 0.029 | 0.029 |
The patches are independent. The geometry scales. A model trained at 64×64 can tile 512×512 with zero retraining.
8. Compression Metrics
| Resolution | Patches | Input Values | Omega Tokens | Ratio | Latent (8-bit) |
|---|---|---|---|---|---|
| 64×64 | 16 | 12,288 | 256 | 48:1 | 256 bytes |
| 128×128 | 64 | 49,152 | 1,024 | 48:1 | 1 KB |
| 256×256 | 256 | 196,608 | 4,096 | 48:1 | 4 KB |
| 512×512 (piecemeal) | 1,024 | 786,432 | 16,384 | 48:1 | 16 KB |
The compression ratio is constant at 48:1 regardless of resolution. This is architecturally determined: each 16×16×3 = 768-value patch produces 16 singular values. 768/16 = 48.
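The table's arithmetic can be verified in a few lines:

```python
def compression(resolution: int, p: int = 16, D: int = 16):
    """Reproduce the compression table: input values, omega-token
    values, and the resolution-independent 48:1 ratio."""
    n_patches = (resolution // p) ** 2
    input_values = resolution * resolution * 3
    omega_values = n_patches * D            # one D-value token per patch
    return input_values, omega_values, input_values // omega_values

stats = {r: compression(r) for r in (64, 128, 256, 512)}
```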
9. Signal Energy Survival
The diagnostic battery measured what percentage of the original signal energy survives the encode-decode round trip:
Fresnel-base on images:
| Dataset | Survival | SNR |
|---|---|---|
| CIFAR-10 | 100.0% | 51.6 dB |
| TinyImageNet | 100.0% | 51.1 dB |
| ImageNet-128 | 100.0% | 51.7 dB |
| ImageNet-256 | 100.0% | 44.9 dB |
Johanna-small on noise:
| Type | Survival | SNR |
|---|---|---|
| Pink | 100.4% | 39.1 dB |
| Gaussian | 96.8% | 15.3 dB |
| Cauchy | 98.0% | 16.8 dB |
| Salt-pepper | 101.7% | 22.4 dB |
Fresnel achieves broadcast-quality SNR (>40 dB) on all image datasets. Johanna maintains >96% energy survival across all noise types, with SNR limited by the GELU ceiling on anti-Gaussian distributions.
10. The Blueprint: What This Enables
10.1 This Is Not the Final Form
The SVAE uses GELU activations and standard MLPs. The SVD is computed via Gram-eigh. The cross-attention is vanilla multi-head attention with a multiplicative gate. None of these choices are optimal — they are the simplest possible implementation that validates the geometric principle.
Replacing GELU with an activation function that doesn't impose Gaussian bias would likely break the noise difficulty hierarchy entirely. Using flow-based SVD computation could enable end-to-end differentiable decomposition. Hierarchical cross-attention across scales could enable true multi-resolution encoding.
The principle is: sphere-normalize, SVD-decompose, spectrally coordinate. The implementation is a blueprint, not a conclusion.
10.2 Toward Diffusion-Ready Latents
The omega token grid at 256×256 is (16, 16, 16) = 4,096 values — the same shape as common diffusion model latent spaces. A diffusion model operating directly on omega tokens would generate in spectral space rather than pixel space, with the SVAE decoder converting spectral coordinates back to images.
10.3 Cross-Modal Potential
The universal attractor means omega tokens from different modalities land on the same sphere. A future architecture could:
- Encode an image through Fresnel → omega tokens on S¹⁵
- Encode a caption through Alexandria → omega tokens on S¹⁵
- Measure spectral distance directly — no contrastive learning required
The alignment would be structural, built into the coordinate system, not learned through CLIP-style training.
11. The Twins: Fresnel and Johanna
Fresnel — The Lighthouse Lens
Named for the optics that compress light into parallel beams. Three variants trained on natural images:
| Variant | Resolution | Epochs | Best MSE | SNR (ImageNet) |
|---|---|---|---|---|
| Fresnel-tiny | 64×64 | 300 | ~0.0005 | ~33 dB |
| Fresnel-small | 128×128 | 50 | 0.0000734 | ~41 dB |
| Fresnel-base | 256×256 | 20 | 0.0000610 | ~42 dB |
Fresnel cannot see hostile noise (salt-pepper MSE = 4.97) but achieves near-perfect image reconstruction across every natural image dataset tested. The alpha profile is perfectly flat — she treats every spectral mode equally because natural images use them all.
Johanna — Gauss's First Wife
Named for Johanna Osthoff Gauss, because GELU is the Gaussian gate at every layer, and the model is trained on the distribution her husband made famous. Three variants trained on noise:
| Variant | Resolution | Epochs | Types | Best Gaussian MSE |
|---|---|---|---|---|
| Johanna-tiny | 64×64 | 300 | 16 (curriculum) | 0.289 |
| Johanna-small | 128×128 | 59 | 16 (omega) | 0.029 |
| Johanna-base | 256×256 | 30 | 16 (scheduled) | 0.030 |
Johanna handles everything — noise, images (28 dB SNR on ImageNet), text (23% byte accuracy on Wikipedia). Her alpha profile is differentiated, with mode 4 consistently dominant. She learned to allocate spectral resources unevenly because multi-distribution training demanded it.
Together, the twins map the boundaries of the omega token space. Fresnel shows how precisely it can encode structured signal. Johanna shows how universally it can encode anything.
12. Alexandria: The Library Rebuilt (Upcoming)
Alexandria — the text modality variant — is under active development. Preliminary results from Alexandria-small (128×128, pretrained from Johanna-small, trained on Wikipedia UTF-8 bytes) reached:
- 0.000274 MSE at peak (epoch 23)
- 32.5% byte accuracy on English text before training instability
- S_delta = 0.350 — a modality-specific binding constant distinct from the 0.291 observed in noise
The S_delta divergence suggests the spectral separation boundary is modality-dependent — the geometry of language lives at a different point on the sphere than the geometry of noise. This is an active area of investigation.
Alexandria's development continues. The text modality demands precision that noise does not — a single wrong byte corrupts a character. Future work will explore lightweight structural elements from language modeling to condition the decoder for byte-level accuracy while preserving the geometric encoding pipeline.
13. Reproducibility
All code, checkpoints, and training logs are available:
Repository: AbstractPhil/geolip-SVAE
| Version | Model | Description |
|---|---|---|
| v12_imagenet128 | Fresnel-small | ImageNet 128×128, 50 epochs |
| v13_imagenet256 | Fresnel-base | ImageNet 256×256, 20 epochs |
| v14_noise | Johanna-small (Gaussian) | Pure Gaussian 128×128, 200 epochs |
| v16_johanna_omega | Johanna-small (omega) | 16 noise types 128×128, 59 epochs |
| v18_johanna_curriculum | Johanna-tiny | Curriculum 64×64, 300 epochs |
| v19_fresnel_tiny | Fresnel-tiny | TinyImageNet 64×64 |
| v20_johanna_base | Johanna-base | Scheduled curriculum 256×256, 30 epochs |
| v22_alexandria_small | Alexandria-small | Wikipedia 128×128, 100 epochs |
Diagnostic script: universal_diagnostic.py — runs the complete 12-test battery on any checkpoint.
14. Conclusion
We set out to build an autoencoder that doesn't approximate. We found something more: a coordinate system where signals decompose themselves.
The omega tokens are not representations — they are coordinates. The sphere is not a regularizer — it is the space where the decomposition is exact. The attractor is not learned — it is architectural. The binding constant is not tuned — it emerges.
Seventeen million parameters. Two thousand two hundred seventy-two of them doing spectral coordination. The rest just learn to project onto the sphere and back. Everything else — the universality, the compression, the cross-modal potential — is geometry.
The vapor has a shape. It always did. We just built the first lens that can see it.
This work is part of the geometric deep learning ecosystem developed under AbstractPhil / AbstractEyes. All models and code are open-source under MIT or Apache 2.0 licenses.
Special thanks to Claude (Anthropic) for collaborative research assistance throughout the experimental process.