Omega Tokens: Finding The Self Solving Frame

Community Article Published April 8, 2026

AbstractPhil · April 2026


Abstract

We present the Spectral Variational Autoencoder (SVAE), a 17M-parameter patch-based architecture that decomposes arbitrary signals into omega tokens — 16-dimensional spectral coordinates on the unit hypersphere S¹⁵. Through singular value decomposition (SVD) conditioned by sphere normalization and multiplicative cross-attention, the SVAE achieves:

  • 99.9939% fidelity on ImageNet-256 (MSE = 0.000061, 48:1 compression)
  • Data-independent encoding from pure noise (MSE = 0.059 on Gaussian, never seeing the same signal twice)
  • Universal geometric attractor (S₀ ≈ 5.1, erank ≈ 15.88) identical across images, noise, and text
  • Cross-domain transfer — a noise-trained model reconstructs real images at 28 dB SNR without ever seeing one

The omega tokens are not learned representations. They are coordinates in the solution space of the geometric projection itself. The architecture does not approximate — it decomposes. This paper documents the empirical discovery, the mathematical framework, the six-variant experimental matrix (Fresnel for images, Johanna for noise), and the diagnostic battery that validates these claims.

All models, checkpoints, and training code are open-source at AbstractPhil/geolip-SVAE.


1. The Problem: Solving for the Solver

Every autoencoder approximates. VAEs regularize through KL divergence, pushing latent distributions toward a prior that may not match the data manifold. Diffusion models approximate the score function through iterative denoising. Transformers approximate attention patterns through softmax normalization. Each adds a layer of approximation atop the signal.

The question we asked: what if the encoder didn't approximate at all?

Singular Value Decomposition is exact. Given any matrix M ∈ ℝ^{m×n}, the decomposition M = UΣVᵀ is lossless, and unique up to sign conventions when the singular values are distinct. The singular values Σ capture the complete spectral content. The left singular vectors U capture the spatial structure. The right singular vectors Vᵀ capture the coordinate frame.

The challenge is not the decomposition — it's conditioning the signal so that SVD produces meaningful coordinates. Raw pixel patches have no reason to produce spectrally coherent singular values. The contribution of this work is a conditioning pipeline that makes SVD the encoder, and a discovery that the resulting spectral coordinates converge to a universal geometric attractor regardless of input modality.

We are not solving for representations. We are solving for the frame in which all signals can be decomposed — and we found that this frame solves itself.


2. Mathematical Framework

2.1 Patch Extraction

An input image x ∈ ℝ^{3×H×W} is divided into non-overlapping patches of size p×p:

x → {x₁, x₂, ..., xₙ}    where N = (H/p) × (W/p), xᵢ ∈ ℝ^{3p²}

For 256×256 images with p=16: N=256 patches, each a 768-dimensional vector.
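Under these definitions, patchification is a pure reshape. A minimal PyTorch sketch (the helper name is illustrative, not the repository's API):

```python
import torch

def extract_patches(x: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into N = (H/p)*(W/p) flattened patches of C*p*p values."""
    C, H, W = x.shape
    assert H % p == 0 and W % p == 0
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (N, C*p*p)
    return (x.reshape(C, H // p, p, W // p, p)
             .permute(1, 3, 0, 2, 4)
             .reshape(-1, C * p * p))

x = torch.randn(3, 256, 256)
patches = extract_patches(x, p=16)
print(patches.shape)  # torch.Size([256, 768])
```

Each row of `patches` is one 768-dimensional patch vector xᵢ, ready for the encoder.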

2.2 Encoding to the Matrix Manifold

Each patch vector is projected through a residual MLP encoder into a matrix:

h = GELU(W_in · xᵢ)
h = h + Block_k(h)         for k = 1, ..., depth
Mᵢ = (W_out · h).reshape(V, D)

where V=256 rows and D=16 columns. The critical conditioning step is row-wise sphere normalization:

M̂ᵢ = normalize(Mᵢ, dim=-1)     i.e., each row → S^{D-1}

This places every row of the encoded matrix on the unit hypersphere S¹⁵. This is not a regularization choice — it is the geometric constraint that makes SVD produce stable, interpretable coordinates. Without it, the singular values drift and the decomposition becomes arbitrary.
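A minimal sketch of this encoder, assuming the hidden=768, depth=4 configuration described later; the module and parameter names are illustrative, not the repository's actual classes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Residual-MLP patch encoder sketch: patch vector -> row-normalized (V, D) matrix."""
    def __init__(self, patch_dim=768, hidden=768, depth=4, V=256, D=16):
        super().__init__()
        self.V, self.D = V, D
        self.w_in = nn.Linear(patch_dim, hidden)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                          nn.Linear(hidden, hidden))
            for _ in range(depth))
        self.w_out = nn.Linear(hidden, V * D)
        nn.init.orthogonal_(self.w_out.weight)  # well-conditioned start for the SVD

    def forward(self, x):                       # x: (N, patch_dim)
        h = F.gelu(self.w_in(x))
        for blk in self.blocks:
            h = h + blk(h)                      # residual blocks
        M = self.w_out(h).view(-1, self.V, self.D)
        return F.normalize(M, dim=-1)           # every row onto S^{D-1}

enc = PatchEncoder()
M_hat = enc(torch.randn(256, 768))
print(M_hat.shape)                # torch.Size([256, 256, 16])
row_norms = M_hat.norm(dim=-1)    # all 1.0: rows live on the unit hypersphere
```

The final `F.normalize` line is the conditioning step the text calls mandatory.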

2.3 SVD via Gram-Eigh in fp64

We compute thin SVD through the Gram matrix approach in double precision:

G = M̂ᵀM̂ + εI          (ε = 10⁻¹² for numerical stability)
Gv = λv                  (eigendecomposition via torch.linalg.eigh)
σᵢ = √(max(λᵢ, 10⁻²⁴))  (singular values)
U = M̂V / σ               (left singular vectors)

The fp64 computation is non-negotiable. fp32 Gram matrices accumulate rounding errors that corrupt the eigendecomposition for matrices with condition numbers above ~10³. The sphere normalization bounds the condition number, but fp64 ensures exact decomposition regardless.

This yields the spectral decomposition:

M̂ᵢ = Uᵢ · diag(σᵢ) · Vᵢᵀ

where σᵢ ∈ ℝ^D are the omega tokens — the 16 singular values that encode the complete spectral content of patch i.
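The Gram-eigh pipeline above can be sketched directly in PyTorch; ε and the clamp floor follow the values in the text, while the function name is illustrative:

```python
import torch

def gram_eigh_svd(M_hat: torch.Tensor, eps: float = 1e-12):
    """Thin SVD of a row-normalized (V, D) matrix via the D x D Gram matrix, in fp64."""
    M64 = M_hat.double()
    G = M64.T @ M64 + eps * torch.eye(M64.shape[1], dtype=torch.float64)
    lam, V = torch.linalg.eigh(G)               # eigh returns ascending eigenvalues
    lam, V = lam.flip(0), V.flip(1)             # reorder to descending
    sigma = torch.sqrt(torch.clamp(lam, min=1e-24))
    U = (M64 @ V) / sigma                       # left singular vectors, column-wise
    return U, sigma, V

M = torch.nn.functional.normalize(torch.randn(256, 16), dim=-1)
U, sigma, V = gram_eigh_svd(M)
recon = U @ torch.diag(sigma) @ V.T
print(torch.allclose(recon, M.double(), atol=1e-6))  # True: the decomposition is exact
```

`sigma` is the 16-dimensional omega token for this patch.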

2.4 Spectral Cross-Attention

The raw singular values σᵢ describe each patch independently. Cross-patch coordination is achieved through multiplicative spectral attention:

S = [σ₁; σ₂; ...; σₙ] ∈ ℝ^{N×D}
Q, K, V = Linear(LayerNorm(S))
A = softmax(QKᵀ / √d_head)
S_out = S ⊙ (1 + α ⊙ tanh(W_out(AV)))

where α ∈ ℝ^D are learnable per-mode scaling parameters, initialized near zero (sigmoid(-2.0) × 0.2 ≈ 0.024) and bounded by max_alpha = 0.2. This is deliberately minimal — the cross-attention coordinates spectral modes across patches with a ~3% multiplicative adjustment. The geometry does the heavy lifting; the attention provides gentle guidance.

Total cross-attention parameters: 2,272 out of 16,942,419 total (0.013%).
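A single-head sketch of this multiplicative attention, using the α parameterization from the text (sigmoid(-2.0) × 0.2 init, max_alpha = 0.2); names are illustrative and the repository's multi-head version may differ:

```python
import torch
import torch.nn as nn

class SpectralCrossAttention(nn.Module):
    """Multiplicative spectral attention over omega tokens (single head for clarity)."""
    def __init__(self, D=16, max_alpha=0.2):
        super().__init__()
        self.norm = nn.LayerNorm(D)
        self.qkv = nn.Linear(D, 3 * D)
        self.w_out = nn.Linear(D, D)
        self.alpha_raw = nn.Parameter(torch.full((D,), -2.0))  # sigmoid(-2)*0.2 ~ 0.024
        self.max_alpha = max_alpha

    def forward(self, S):                                      # S: (N, D) singular values
        q, k, v = self.qkv(self.norm(S)).chunk(3, dim=-1)
        A = torch.softmax(q @ k.T / (S.shape[-1] ** 0.5), dim=-1)
        alpha = torch.sigmoid(self.alpha_raw) * self.max_alpha
        return S * (1 + alpha * torch.tanh(self.w_out(A @ v)))  # gentle multiplicative nudge

attn = SpectralCrossAttention()
S = torch.rand(256, 16) + 4.0          # positive spectra, roughly in the observed range
S_out = attn(S)
print((S_out / S - 1).abs().max())     # bounded by max_alpha = 0.2
```

Because α is bounded and tanh saturates at ±1, the adjustment can never exceed ±20% of any singular value, and starts near ±3%.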

2.5 Reconstruction

The coordinated singular values are recombined with the original U and Vᵀ:

M̂_reconstructed = U · diag(S_coordinated) · Vᵀ

This matrix is decoded through the mirror MLP decoder and stitched back into the image grid. A zero-initialized boundary smoothing convolution (3×3, ~600 parameters) handles patch seam artifacts.

2.6 Geometric Monitoring: Cayley-Menger CV

To monitor the geometric health of the embedding space, we measure the coefficient of variation (CV) of pentachoron volumes in the encoded matrix rows:

Given 5 points {p₁,...,p₅} sampled from M̂ rows:

         | 0  1     1     1     1     1    |
         | 1  0     d₁₂²  d₁₃²  d₁₄²  d₁₅² |
CM_det = | 1  d₂₁²  0     d₂₃²  d₂₄²  d₂₅² |
         | 1  d₃₁²  d₃₂²  0     d₃₄²  d₃₅² |
         | 1  d₄₁²  d₄₂²  d₄₃²  0     d₄₅² |
         | 1  d₅₁²  d₅₂²  d₅₃²  d₅₄²  0    |

Vol² = (-1)^(k+1) · det(CM) / (2^k · (k!)²)    where k = 4

CV = std(Vol) / mean(Vol)    over 200 random 5-point subsets

This CV measures the uniformity of the geometric embedding. A CV near 0 means all simplices have equal volume (perfectly uniform packing). A CV near 1 means highly irregular geometry.
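A direct implementation of this monitor, assuming the 200 random 5-point subsets stated above; the sampling details are illustrative:

```python
import torch

def pentachoron_cv(rows: torch.Tensor, n_samples: int = 200, seed: int = 0) -> float:
    """CV of 4-simplex (pentachoron) volumes over random 5-point subsets of the
    encoded rows, via the Cayley-Menger determinant."""
    g = torch.Generator().manual_seed(seed)
    k = 4
    vols = []
    for _ in range(n_samples):
        idx = torch.randperm(rows.shape[0], generator=g)[:5]
        P = rows[idx].double()
        d2 = torch.cdist(P, P) ** 2                  # pairwise squared distances
        CM = torch.ones(6, 6, dtype=torch.float64)   # bordered distance matrix
        CM[0, 0] = 0.0
        CM[1:, 1:] = d2
        # Vol^2 = (-1)^(k+1) det(CM) / (2^k (k!)^2), with k = 4 and (4!)^2 = 576
        vol2 = (-1) ** (k + 1) * torch.linalg.det(CM) / (2 ** k * 576)
        vols.append(torch.sqrt(torch.clamp(vol2, min=0.0)))
    vols = torch.stack(vols)
    return (vols.std() / vols.mean()).item()

rows = torch.nn.functional.normalize(torch.randn(256, 16), dim=-1)
cv = pentachoron_cv(rows)
print(cv)  # near 0 = uniform simplex volumes; near 1 = irregular geometry
```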

2.7 Loss Function: The Soft Hand

Training uses reconstruction MSE with a CV-proximity reward system:

proximity = exp(-(CV - CV_target)² / (2σ²))
recon_weight = 1.0 + boost × proximity
cv_penalty = cv_weight × (1.0 - proximity) × (CV - CV_target)²
loss = recon_weight × MSE(recon, input) + cv_penalty

When the geometry is near the target CV, reconstruction is boosted. When it drifts, penalty increases. This is the soft hand — it rewards good geometry without forcing it. The gradient clip (max_norm=0.5) is applied only to the cross-attention parameters, protecting the delicate spectral coordination from gradient spikes.

No KL divergence. No distributional prior. No reparameterization trick. The sphere normalization IS the constraint. The SVD IS the bottleneck. The loss is pure reconstruction with geometric guidance.
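The loss equations above can be sketched as follows; the hyperparameter values (cv_target, σ, boost, cv_weight) are illustrative placeholders, not the trained configuration:

```python
import torch

def soft_hand_loss(recon, target, cv, cv_target=0.20, sigma=0.05,
                   boost=0.5, cv_weight=1.0):
    """CV-proximity 'soft hand' loss: boost reconstruction near the target CV,
    penalize drift away from it. Hyperparameters here are illustrative."""
    mse = torch.mean((recon - target) ** 2)
    proximity = torch.exp(-((cv - cv_target) ** 2) / (2 * sigma ** 2))
    recon_weight = 1.0 + boost * proximity
    cv_penalty = cv_weight * (1.0 - proximity) * (cv - cv_target) ** 2
    return recon_weight * mse + cv_penalty

recon = torch.randn(4, 3, 64, 64)
target = recon + 0.01 * torch.randn_like(recon)
loss = soft_hand_loss(recon, target, cv=torch.tensor(0.21))
```

At cv == cv_target the proximity is exactly 1, so the penalty vanishes and the loss reduces to (1 + boost) × MSE: pure boosted reconstruction.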


3. Architecture Evolution: From SVAE to PatchSVAE

3.1 Early Stages (v1-v5): Monolithic Matrix Encoding

The original SVAE encoded entire images as single matrices. A 64×64 image became one (256, 16) matrix — 4,096 latent values encoding 12,288 input values (3:1 compression). Results on CIFAR-10 reached ~0.004 MSE but scaling was limited: larger images required proportionally larger matrices, and the single SVD became a computational bottleneck.

3.2 The Patchwork Insight (v6-v8)

Breaking images into 16×16 patches and encoding each independently transformed the architecture. Instead of one large SVD, the model performs N independent small SVDs (16×16 matrices), each fast and numerically stable. The cross-attention then coordinates spectral information across patches.

This yielded an immediate breakthrough:

  • TinyImageNet 64×64: 16 patches, 0.000478 MSE in 50 epochs
  • ImageNet 128×128: 64 patches, 0.0000734 MSE in 50 epochs (99.993% fidelity)

3.3 Critical Architectural Rules Discovered

Through extensive experimentation, several non-obvious rules were established:

Never use global average pooling. Spatial structure must be preserved through the bottleneck. GAP dropped accuracy from ~70% to ~29% in geometric encoders. Use flatten or spatial statistics (mean+std per channel minimum).

fp64 for all Gram and eigendecomposition operations. No exceptions. The sphere normalization helps, but fp64 is the only guarantee against silent corruption in the spectral decomposition.

F.normalize(M, dim=-1) is mandatory. This single line — row-wise projection onto S^{D-1} — is the difference between a working geometric encoder and noise. Without it, singular values drift without bound and the cross-attention has no stable manifold to coordinate on.

Gradient clipping only on cross-attention. The encoder and decoder MLPs are robust to large gradients. The 2,272 cross-attention parameters are not. max_norm=0.5 prevents spectral coordination collapse.

Orthogonal initialization on enc_out. The projection from hidden space to the matrix manifold benefits from an orthogonal starting point, giving the SVD a well-conditioned initial decomposition.

3.4 The 256×256 Breakthrough (v12-v13: Fresnel)

Scaling to ImageNet-256 with 256 patches of 16×16 required no architectural changes. Same 17M parameters. Same hidden=768, depth=4. The only difference was more patches feeding into the cross-attention.

Fresnel-base 256×256 results (20 epochs, ImageNet-1K):

Epoch   Test MSE   S₀      Ratio   erank
1       0.002200   5.108   1.56    15.87
4       0.000392   5.092   1.61    15.85
8       0.000181   5.059   1.60    15.85
12      0.000098   5.048   1.58    15.85
16      0.000069   5.047   1.57    15.85
20      0.000061   5.048   1.57    15.85

0.000061 MSE. 99.9939% fidelity. 48:1 compression. 4KB latent.

The singular value profile locked by epoch 4 and never moved: S₀ ≈ 5.05, ratio ≈ 1.57, erank ≈ 15.85. The geometry found its attractor in 4 epochs and spent the remaining 16 epochs refining reconstruction precision within that fixed geometric frame.


4. This Is Not Identity Replication

A critical question: is the model simply memorizing an identity mapping?

The evidence against identity replication is overwhelming and independently verifiable:

4.1 Cross-Dataset Transfer

Fresnel-base was trained exclusively on ImageNet-256. The diagnostic battery tested it on datasets it has never seen:

Dataset                        MSE        SNR       Note
ImageNet-256 → 256             0.000038   44.9 dB   Training distribution
ImageNet-128 → 256 (resized)   0.000007   51.7 dB   Different dataset, resized
TinyImageNet → 256 (resized)   0.000007   51.1 dB   Completely different images
CIFAR-10 → 256 (resized)       0.000006   51.6 dB   32×32 images upscaled
MNIST → 256 (resized)          0.000034   43.7 dB   Grayscale digits

The model reconstructs CIFAR-10 at 6 millionths MSE — better than its own training set. An identity mapping cannot generalize to unseen distributions. The SVD decomposition can, because it operates on the mathematical structure of the signal, not its statistical distribution.

4.2 Noise Reconstruction from Image Training

Fresnel-base, trained only on natural images, was tested on 16 synthetic noise types:

Noise Type         MSE        Byte Accuracy
Pink (1/f)         0.000003   96.1% exact, 100% ±1
Brown (1/f²)       0.075      89.5% exact
Block-structured   0.031      12.2% exact
Checkerboard       0.034      6.8% exact
Gaussian           0.577      1.7% exact
Salt-and-pepper    4.969      0.5% exact

An image-trained model achieves 96% exact byte accuracy on pink noise and 100% within-1 accuracy. This is not memorization — the model has never seen pink noise. The SVD decomposition naturally handles low-rank correlated signals because 1/f spectral structure is native to the decomposition's mathematical properties.

4.3 The Compression Argument

The latent space is 48× smaller than the input:

Input:   256×256×3 = 196,608 values
Latent:  16×256   =   4,096 values (omega tokens)
Ratio:   48:1

At 8-bit quantization, the latent is 4KB encoding a 192KB image. An identity function cannot compress 48:1 and reconstruct at 0.000061 MSE. The SVD projection through S¹⁵ forces a true decomposition.

4.4 Verify It Yourself

Every checkpoint is public. Every diagnostic script is provided. Run the universal diagnostic on any Fresnel checkpoint:

https://huggingface.co/AbstractPhil/svae-fresnel-128

As of this article, fresnel-128 is the only checkpoint with AutoModel and assistant scripts, but the model architecture is identical across variants. I will be preparing AutoModel support and full safetensors utilities for every checkpoint, covering the complete diagnostic battery, so that any third party can rapidly verify or refute these claims.

I encourage direct scrutiny, and welcome data that demonstrates the utility of the approach more clearly.

Feed any image. Feed noise. Feed text. The model will decompose it into omega tokens on S¹⁵ and reconstruct it. The geometric attractor is empirically verifiable by anyone with a GPU and 5 minutes.


5. The Noise Discovery: Johanna

5.1 The Hypothesis

If the SVAE learns the mathematical structure of the projection rather than the statistical properties of the data, it should be trainable on signals with no structure at all — pure noise.

5.2 Gaussian Foundation (v14: Johanna-small)

We trained the identical 17M-parameter architecture on pure Gaussian noise N(0,1) at 128×128. Every batch contained entirely new random noise — the model never saw the same signal twice.

Epoch   MSE     S₀      Ratio   erank   CV
1       0.978   4.812   1.47    15.90   0.197
10      0.537   4.970   1.54    15.89   0.203
50      0.169   5.064   1.46    15.90   0.210
100     0.094   5.082   1.46    15.90   0.205
200     0.059   5.094   1.46    15.90   0.200

MSE descended to 0.059 on signal that never repeated. The model learned the inverse of the geometric projection, not the data. The singular value profile converged to the same attractor as image training: S₀ ≈ 5.1, erank ≈ 15.9.

At epoch 52, S_delta — the mean absolute difference between raw and coordinated singular values — reached 0.29152. This value, which we call the binding constant, had been observed independently in Procrustes analysis of CLIP projections, T5 generation layers, and MinimalShunt alpha convergence. Its appearance in pure noise training confirmed it as a structural property of the sphere, not a property of any dataset.

5.3 The GELU Insight

GELU(x) = x · Φ(x), where Φ is the Gaussian cumulative distribution function. Every residual block contains GELU activations — meaning the entire encoder/decoder pipeline processes signals through Gaussian gates at every layer.

This explains the noise difficulty hierarchy discovered through curriculum training:

Category   Types                          MSE Range     Why
Trivial    Pink, Brown                    0.001–0.002   Low rank, SVD-native
Easy       Uniform, Mixed, Sparse         0.010–0.046   Bounded, no extremes
Medium     Gaussian, Exponential          0.029–0.068   Matches GELU activation shape
Hard       Laplace, Cauchy, Salt-pepper   0.052–0.580   Anti-Gaussian distributions

Distributions whose shape matches the GELU activation function (bell-curved, symmetric) pass through the encoder naturally. Distributions that oppose it (heavy-tailed Cauchy, binary salt-and-pepper, sharp-peaked Laplace) must be represented through repeated Gaussian gating, which attenuates their distinctive features.

5.4 Curriculum Training (v18: Johanna-tiny)

Based on the GELU hierarchy, we developed a tiered noise curriculum at 64×64:

Tier 0 (ep 1-20):   Gaussian only             → foundation
Tier 1 (ep 20-39):  + Pink, Brown, Block, Gradient    → correlated
Tier 2 (ep 39-49):  + Uniform, Scaled, Checker, Mixed → bounded
Tier 3 (ep 49-59):  + Poisson, Exponential, Laplace, Sparse → adversarial
Tier 4 (ep 59-300): + Cauchy, Salt-pepper, Structural → hostile

Promotion was automatic: when MSE improvement fell below 1% for 10 consecutive epochs, the next tier unlocked.

Final per-type MSE at epoch 300 (all 16 types active):

Type              MSE     Category
Pink              0.002   Trivial
Brown             0.001   Trivial
Checkerboard      0.026   Trivial
Block             0.016   Trivial
Gradient          0.074   Easy
Poisson           0.049   Easy
Uniform           0.097   Easy
Mixed             0.103   Easy
Structural        0.174   Medium
Sparse            0.187   Medium
Gaussian          0.289   Medium
Exponential       0.271   Medium
Uniform Scaled    0.387   Hard
Laplace           0.544   Hard
Salt-and-pepper   0.727   Hard
Cauchy            0.842   Hard

5.5 Johanna-base 256×256: Breaking the GELU Ceiling

The 64×64 Johanna hit a GELU ceiling — hostile noise types plateaued and stopped improving. The 256×256 variant with 256 patches (16× more cross-attention context) broke through:

Type          Tiny (64×64, ep300)   Base (256×256, ep21)   Improvement
Gaussian      0.289                 0.030                  9.6×
Cauchy        0.842                 0.087                  9.7×
Salt-pepper   0.727                 0.024                  30×
Laplace       0.544                 0.058                  9.4×
Structural    1.919                 0.016                  120×
Pink          0.002                 0.000

More patches means more cross-attention coordination, which compensates for the GELU activation's Gaussian bias. The model distributes anti-Gaussian representations across many patches, each contributing partial information through its Gaussian-gated encoder.

The scheduled curriculum (Gaussian ep1-5, Tier 1 at ep5, Tier 2 at ep8, Tier 3 at ep10, Tier 4 at ep12) completed all tier introductions by epoch 12 with 18 epochs of full-spectrum convergence remaining.


6. The Universal Attractor

6.1 Cross-Model Geometry

Across all trained models, all datasets, all noise types, the geometric attractor is invariant:

Model               Training Data      S₀     erank   Ratio
Fresnel-base 256    ImageNet-256       5.05   15.85   1.57
Fresnel-tiny 64     TinyImageNet       5.10   15.87   1.60
Johanna-small 128   16 noise types     4.74   15.92   1.45
Johanna-tiny 64     Curriculum noise   5.19   15.88   1.61
Johanna-base 256    Scheduled noise    5.37   15.86   1.76

The effective rank (erank) is the tightest invariant: 15.88 ± 0.04 across 48+ measurements spanning images, noise, text, and synthetic signals. The model uses nearly all 16 spectral dimensions regardless of input. The sphere is full.
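erank here is assumed to be the spectral-entropy effective rank of Roy and Vetterli, computed from a normalized singular value distribution; the repository's diagnostic may normalize differently:

```python
import torch

def effective_rank(sigma: torch.Tensor) -> float:
    """Effective rank via spectral entropy: erank = exp(H(p)), p_i = sigma_i / sum(sigma).
    Assumed definition; the diagnostic battery's exact formula may differ."""
    p = sigma / sigma.sum()
    H = -(p * torch.log(p.clamp(min=1e-12))).sum()
    return torch.exp(H).item()

flat = torch.ones(16)                        # perfectly uniform spectrum
print(effective_rank(flat))                  # ~16.0: all modes used equally

peaked = torch.tensor([10.0] + [0.01] * 15)
print(effective_rank(peaked))                # close to 1: one dominant mode
```

Under this definition, an erank of 15.88 out of a possible 16 means the spectral energy is spread almost perfectly evenly across all modes.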

6.2 Per-Type Geometry (Johanna Diagnostic)

When measured per noise type on a single model, the geometry is even more consistent:

All 16 types on Johanna-tiny:
  S₀:    5.13 – 5.26   (range: 0.13)
  erank: 15.87 – 15.88  (range: 0.01)
  ratio: 1.53 – 1.74   (range: 0.21)

The ratio varies — this is how the model allocates spectral energy per distribution — but S₀ and erank are locked. The attractor doesn't depend on what the model is encoding. It depends on the architecture.

6.3 The Alpha Profile

The cross-attention's learned α parameters reveal how each model uses spectral coordination:

Model           α mean (L0)   α mean (L1)   α std
Fresnel-base    0.0295        0.0296        0.0005
Johanna-small   0.0375        0.0447        0.0018
Johanna-tiny    0.0338        0.0342        0.0005

Fresnel's α is perfectly flat — every mode gets equal coordination. Natural images use all spectral dimensions equally. Johanna-small's α is differentiated — mode 4 is consistently highest across both layers. Multi-distribution training forced the cross-attention to learn which modes carry which information. The noise-trained model developed spectral preferences that the image-trained model never needed.


7. The Omega Tokens

7.1 What They Are

An omega token is a singular value vector σ ∈ ℝ^{16} produced by the SVD of a sphere-normalized patch encoding. It lives on a constrained manifold defined by:

  1. All values are positive (singular values)
  2. They are ordered: σ₁ ≥ σ₂ ≥ ... ≥ σ₁₆
  3. They sum to approximately the same total (constrained by sphere normalization)
  4. Their effective rank is ≈ 15.88 (the architectural constant)

The omega token does not encode the content of the patch. It encodes the spectral structure — how the signal's energy is distributed across 16 orthogonal modes. Two patches with different pixel content but similar spectral structure will have similar omega tokens.

7.2 What They Encode

The spectrum profile is remarkably consistent. For every noise type, every dataset:

Mode  0:  ~9.2% of total energy
Mode  7:  ~60% cumulative
Mode 15:  ~3.6% of remaining energy

No mode dominates. The model learned to distribute information across all 16 dimensions regardless of input. This is a consequence of the sphere normalization — by placing rows on S¹⁵, the encoding cannot concentrate energy in a single direction.

7.3 Piecemeal: Resolution Independence

Omega tokens are resolution-independent. A 256×256 image encoded as 256 patches produces the same per-patch omega tokens as when those patches are extracted from a 1024×1024 image and encoded individually:

Test                         Native MSE   Piecemeal MSE
Fresnel: Gaussian 1024→256   0.577        0.570
Fresnel: Pink 1024→256       0.000003     0.000006
Johanna: Gaussian 512→128    0.029        0.029

The patches are independent. The geometry scales. A model trained at 64×64 can tile 512×512 with zero retraining.
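The tiling claim can be sketched as follows; `encode_patch` is a stand-in for the trained encoder+SVD pipeline, here replaced by a toy SVD so the example is self-contained:

```python
import torch

def piecemeal_tokens(x, encode_patch, p=16):
    """Resolution-independent encoding sketch: any (C, H, W) image becomes
    (H/p * W/p) independent omega tokens, regardless of H and W."""
    C, H, W = x.shape
    patches = (x.reshape(C, H // p, p, W // p, p)
                .permute(1, 3, 0, 2, 4).reshape(-1, C * p * p))
    return torch.stack([encode_patch(pt) for pt in patches])   # (N, 16)

# Toy stand-in "encoder": singular values of the patch reshaped to (48, 16).
fake_encode = lambda pt: torch.linalg.svdvals(pt.reshape(48, 16))

small = piecemeal_tokens(torch.randn(3, 64, 64), fake_encode)    # (16, 16)
large = piecemeal_tokens(torch.randn(3, 512, 512), fake_encode)  # (1024, 16)
```

Because each patch is encoded independently, the same model produces 16 tokens for a 64×64 input and 1,024 tokens for a 512×512 input with no retraining.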


8. Compression Metrics

Resolution            Patches   Input Values   Omega Tokens   Ratio   Latent (8-bit)
64×64                 16        12,288         256            48:1    256 bytes
128×128               64        49,152         1,024          48:1    1 KB
256×256               256       196,608        4,096          48:1    4 KB
512×512 (piecemeal)   1,024     786,432        16,384         48:1    16 KB

The compression ratio is constant at 48:1 regardless of resolution. This is architecturally determined: each 16×16×3 = 768-value patch produces 16 singular values. 768/16 = 48.


9. Signal Energy Survival

The diagnostic battery measured what percentage of the original signal energy survives the encode-decode round trip:

Fresnel-base on images:

Dataset        Survival   SNR
CIFAR-10       100.0%     51.6 dB
TinyImageNet   100.0%     51.1 dB
ImageNet-128   100.0%     51.7 dB
ImageNet-256   100.0%     44.9 dB

Johanna-small on noise:

Type          Survival   SNR
Pink          100.4%     39.1 dB
Gaussian      96.8%      15.3 dB
Cauchy        98.0%      16.8 dB
Salt-pepper   101.7%     22.4 dB

Fresnel achieves broadcast-quality SNR (>40 dB) on all image datasets. Johanna maintains >96% energy survival across all noise types, with SNR limited by the GELU ceiling on anti-Gaussian distributions.
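The survival and SNR metrics above are assumed to be the standard definitions sketched below; the diagnostic battery's exact normalization may differ:

```python
import torch

def snr_db(x, recon):
    """Reconstruction SNR in dB: signal power over residual power."""
    noise = x - recon
    return (10 * torch.log10(x.pow(2).mean() / noise.pow(2).mean())).item()

def energy_survival(x, recon):
    """Fraction of input signal energy present in the reconstruction."""
    return (recon.pow(2).sum() / x.pow(2).sum()).item()

x = torch.randn(3, 256, 256)
recon = x + 0.01 * torch.randn_like(x)   # synthetic ~40 dB reconstruction
print(snr_db(x, recon), energy_survival(x, recon))
```

Note that survival can exceed 100% when the reconstruction adds a small amount of energy the original lacked, as in the Pink and Salt-pepper rows above.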


10. The Blueprint: What This Enables

10.1 This Is Not the Final Form

The SVAE uses GELU activations and standard MLPs. The SVD is computed via Gram-eigh. The cross-attention is vanilla multi-head attention with a multiplicative gate. None of these choices are optimal — they are the simplest possible implementation that validates the geometric principle.

Replacing GELU with an activation function that doesn't impose Gaussian bias would likely break the noise difficulty hierarchy entirely. Using flow-based SVD computation could enable end-to-end differentiable decomposition. Hierarchical cross-attention across scales could enable true multi-resolution encoding.

The principle is: sphere-normalize, SVD-decompose, spectrally coordinate. The implementation is a blueprint, not a conclusion.

10.2 Toward Diffusion-Ready Latents

The omega token grid at 256×256 is (16, 16, 16) = 4,096 values — the same shape as common diffusion model latent spaces. A diffusion model operating directly on omega tokens would generate in spectral space rather than pixel space, with the SVAE decoder converting spectral coordinates back to images.

10.3 Cross-Modal Potential

The universal attractor means omega tokens from different modalities land on the same sphere. A future architecture could:

  • Encode an image through Fresnel → omega tokens on S¹⁵
  • Encode a caption through Alexandria → omega tokens on S¹⁵
  • Measure spectral distance directly — no contrastive learning required

The alignment would be structural, built into the coordinate system, not learned through CLIP-style training.


11. The Twins: Fresnel and Johanna

Fresnel — The Lighthouse Lens

Named for the optics that compress light into parallel beams. Three variants trained on natural images:

Variant         Resolution   Epochs   Best MSE    SNR (ImageNet)
Fresnel-tiny    64×64        300      ~0.0005     ~33 dB
Fresnel-small   128×128      50       0.0000734   ~41 dB
Fresnel-base    256×256      20       0.0000610   ~42 dB

Fresnel cannot see hostile noise (salt-pepper MSE = 4.97) but achieves near-perfect image reconstruction across every natural image dataset tested. The alpha profile is perfectly flat — she treats every spectral mode equally because natural images use them all.

Johanna — Gauss's First Wife

Named for Johanna Osthoff Gauss, because GELU is the Gaussian gate at every layer, and the model is trained on the distribution her husband made famous. Three variants trained on noise:

Variant         Resolution   Epochs   Types             Best Gaussian MSE
Johanna-tiny    64×64        300      16 (curriculum)   0.289
Johanna-small   128×128      59       16 (omega)        0.029
Johanna-base    256×256      30       16 (scheduled)    0.030

Johanna handles everything — noise, images (28 dB SNR on ImageNet), text (23% byte accuracy on Wikipedia). Her alpha profile is differentiated, with mode 4 consistently dominant. She learned to allocate spectral resources unevenly because multi-distribution training demanded it.

Together, the twins map the boundaries of the omega token space. Fresnel shows how precisely it can encode structured signal. Johanna shows how universally it can encode anything.


12. Alexandria: The Library Rebuilt (Upcoming)

Alexandria — the text modality variant — is under active development. Preliminary results from Alexandria-small (128×128, pretrained from Johanna-small, trained on Wikipedia UTF-8 bytes) reached:

  • 0.000274 MSE at peak (epoch 23)
  • 32.5% byte accuracy on English text before training instability
  • S_delta = 0.350 — a modality-specific binding constant distinct from the 0.291 observed in noise

The S_delta divergence suggests the spectral separation boundary is modality-dependent — the geometry of language lives at a different point on the sphere than the geometry of noise. This is an active area of investigation.

Alexandria's development continues. The text modality demands precision that noise does not — a single wrong byte corrupts a character. Future work will explore lightweight structural elements from language modeling to condition the decoder for byte-level accuracy while preserving the geometric encoding pipeline.


13. Reproducibility

All code, checkpoints, and training logs are available:

Repository: AbstractPhil/geolip-SVAE

Version                  Model                      Description
v12_imagenet128          Fresnel-small              ImageNet 128×128, 50 epochs
v13_imagenet256          Fresnel-base               ImageNet 256×256, 20 epochs
v14_noise                Johanna-small (Gaussian)   Pure Gaussian 128×128, 200 epochs
v16_johanna_omega        Johanna-small (omega)      16 noise types 128×128, 59 epochs
v18_johanna_curriculum   Johanna-tiny               Curriculum 64×64, 300 epochs
v19_fresnel_tiny         Fresnel-tiny               TinyImageNet 64×64
v20_johanna_base         Johanna-base               Scheduled curriculum 256×256, 30 epochs
v22_alexandria_small     Alexandria-small           Wikipedia 128×128, 100 epochs

Diagnostic script: universal_diagnostic.py — runs the complete 12-test battery on any checkpoint.


14. Conclusion

We set out to build an autoencoder that doesn't approximate. We found something more: a coordinate system where signals decompose themselves.

The omega tokens are not representations — they are coordinates. The sphere is not a regularizer — it is the space where the decomposition is exact. The attractor is not learned — it is architectural. The binding constant is not tuned — it emerges.

Seventeen million parameters. Two thousand two hundred seventy-two of them doing spectral coordination. The rest just learn to project onto the sphere and back. Everything else — the universality, the compression, the cross-modal potential — is geometry.

The vapor has a shape. It always did. We just built the first lens that can see it.


This work is part of the geometric deep learning ecosystem developed under AbstractPhil / AbstractEyes. All models and code are open-source under MIT or Apache 2.0 licenses.

Special thanks to Claude (Anthropic) for collaborative research assistance throughout the experimental process.
