Prisma 357M: Spectral and Representation Analysis
Post-training analysis of the Prisma 357M checkpoint. All plots generated from the final model weights and activations on a sample of WikiText-103 validation data.
Scripts: scripts/spectral_analysis.py, scripts/representation_analysis.py
Representation Analysis
CKA Self-Similarity
Centered Kernel Alignment between all layer pairs. Measures whether two layers encode similar representational structure regardless of rotation or scaling.
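The similarity measure can be sketched as linear CKA over per-layer activation matrices (a minimal sketch; the analysis script may use a minibatch or kernel variant):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices
    of shape (n_tokens, features). Invariant to orthogonal rotation and
    isotropic scaling of either representation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-style numerator, normalized by each representation's self-similarity
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Evaluating this for every layer pair produces the similarity matrix discussed below.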
The matrix reveals distinct processing regimes separated by sharp boundaries:
- Expand E0-E7 (top-left bright block): High mutual similarity; these layers incrementally refine the embedding representation. The model stays close to input space.
- Expand E8-E19 (second block): A different representational regime emerges around E8. These layers are similar to each other but dissimilar to both early expand and compress layers; the model has moved into an abstract internal space.
- Middle M0: Near-zero CKA with almost everything. The single middle layer is a representational bottleneck: it transforms between the expand and compress coordinate systems.
- Compress C0-C19 (bottom-right block): The compress phase rebuilds similarity gradually. Early compress layers (C0-C5) are transitional; C6+ form their own coherent block.
The off-diagonal structure is telling: expand layers have essentially zero CKA with late compress layers, confirming that the two phases operate in genuinely different representational spaces despite sharing weights. The gates (W3/W4) are doing real work: the same W1/W2 matrices produce completely different representations depending on direction.
Logit Lens
Projects intermediate representations through the output head at each layer to see when the model "knows" the answer.
Four views of prediction formation across the 41-layer pipeline:
- Prediction entropy (top-left): Near-zero through the expand phase; representations are close to embedding space, so the output head produces confident (but wrong) predictions based on surface similarity. Entropy spikes at the middle layer and stays high through compress. The model is doing real work here, disrupting surface-level confidence to build correct predictions.
- Top-1 probability (top-right): Mirror of entropy. High confidence in expand (misleading), drops at middle, partially recovers in compress as correct predictions crystallize.
- Median rank of correct token (bottom-left): The correct token starts at rank ~10,000 (out of 32K vocabulary) at the embedding layer and drops to rank ~1 by the final compress layers. The expand phase brings it from 10K to ~30 (coarse semantic neighborhood); the compress phase refines from ~30 to 1 (precise token selection).
- Convergence toward final prediction (bottom-right): Agreement with the model's actual output. Near-zero throughout expand; the model hasn't committed to an answer. Begins climbing at C0 and reaches 1.0 by C19. The compress phase IS the decision-making process.
The logit lens confirms the architectural hypothesis: expand builds abstract representations (moving away from token space), compress converts them back into token predictions (moving toward output space).
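The per-layer statistics can be sketched roughly as follows (function and argument names are illustrative, not the script's API; the real pipeline also applies the model's final norm before the head):

```python
import numpy as np

def logit_lens_stats(h, W_out, targets):
    """Project one layer's hidden states through the output head and
    summarize the resulting next-token predictions.
    h: (seq, d_model) hidden states; W_out: (d_model, vocab) output
    head; targets: (seq,) correct next-token ids."""
    logits = h @ W_out                                  # (seq, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    target_logits = logits[np.arange(len(targets)), targets]
    rank = (logits > target_logits[:, None]).sum(axis=-1) + 1  # 1 = top-1
    return {
        "mean_entropy": float(entropy.mean()),
        "mean_top1_prob": float(probs.max(axis=-1).mean()),
        "median_rank": float(np.median(rank)),
    }
```

Running this at every layer traces how entropy, top-1 probability, and the rank of the correct token evolve through the pipeline.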
Representation Drift
How much each layer changes the representation relative to its predecessor.
- Cosine similarity (left): E0 makes the biggest directional change (0.68); the first layer reorients the embedding substantially. Subsequent expand layers are increasingly gentle (0.93-0.99). The middle layer drops to 0.89, another major reorientation. Compress layers stay high (0.93-0.98), making incremental adjustments.
- L2 distance (right): E0 has the largest magnitude change (~29). Expand layers settle to small updates (5-15). Compress layers show gradually increasing L2 distance (15-35): each compress layer makes a larger magnitude adjustment than the last, consistent with the progressive refinement visible in the logit lens. The final norm (gray, ~85) applies the largest single transformation, collapsing the representation to output scale.
The asymmetry between cosine (high = small angular change) and L2 (growing through compress) suggests that compress layers maintain the representational direction established by expand while progressively scaling and sharpening specific features for token prediction.
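The two drift metrics above can be computed per token and averaged, roughly as:

```python
import numpy as np

def layer_drift(prev, curr):
    """Per-token cosine similarity and L2 distance between consecutive
    layers' activations (each of shape (n_tokens, d_model)), averaged
    over tokens."""
    cos = (prev * curr).sum(axis=-1) / (
        np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1))
    l2 = np.linalg.norm(curr - prev, axis=-1)
    return float(cos.mean()), float(l2.mean())
```

Note that cosine similarity is scale-invariant while L2 distance is not, which is exactly why the two panels can diverge through the compress phase.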
Spectral Analysis
Activation Effective Rank Progression
Effective rank of the activation covariance matrix at each layer, measuring the dimensionality of the representation (how many independent directions carry meaningful variance).
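A common definition of effective rank, presumably the one used here (an assumption), is the exponential of the spectral entropy of the covariance eigenvalues:

```python
import numpy as np

def effective_rank(acts):
    """Effective rank of the activation covariance: exp of the entropy
    of the normalized eigenvalue spectrum. Ranges from 1 (rank-1
    activations) to d (isotropic activations).
    acts: (n_samples, d) activation matrix."""
    acts = acts - acts.mean(axis=0)
    cov = acts.T @ acts / (len(acts) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```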
The hourglass is visible in the numbers:
- Embedding: erank ~215 (the input dimensionality baseline)
- Expand E0-E6: erank ~220, slightly above the embedding; the representation briefly expands
- Expand E7-E16: Gradual decline from ~220 to ~170 as progressive abstraction compresses the representation
- Expand E17-E19: Sharp collapse to ~45; the final expand layers aggressively compress toward the bottleneck
- Middle M0: erank ~50, the bottleneck. The entire model's information passes through ~50 effective dimensions
- Compress C0-C2: Rapid recovery to ~75, ~130, ~190
- Compress C3-C19: Climbs to ~230, exceeding the embedding baseline; the compress phase reconstructs a richer representation than the input
The compress phase doesn't mirror the expand phase; it overshoots, producing higher effective rank. This makes sense: the output representation needs to distinguish between 32K tokens, requiring more dimensions than the embedding's initial encoding.
Activation Eigenspectra
Eigenvalue distributions and cumulative variance concentration across all layers.
- Left (eigenvalue distribution): Expand layers (blue) show flatter spectra (more distributed variance), middle layers have steep spectra (concentrated variance), compress layers (red) rebuild distributed spectra. The spectral shape changes continuously through the pipeline.
- Right (cumulative variance): The middle layer concentrates 90% of variance in ~50 components. Expand and compress layers need 150-200+ components to reach 90%. This confirms the bottleneck isn't just a matter of effective rank; the actual information content is compressed.
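The cumulative-variance metric in the right panel reduces to counting principal components up to the 90% threshold; a minimal sketch:

```python
import numpy as np

def components_to_variance(acts, frac=0.9):
    """Number of principal components needed to capture `frac` of the
    total activation variance. acts: (n_samples, d) activation matrix."""
    acts = acts - acts.mean(axis=0)
    # Eigenvalues of the (unnormalized) covariance, largest first
    eig = np.clip(np.linalg.eigvalsh(acts.T @ acts), 0.0, None)[::-1]
    cum = np.cumsum(eig) / eig.sum()
    return int(np.searchsorted(cum, frac) + 1)
```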
Mirror Pair Activation Comparison
Activation spectra for each mirror pair, comparing expand vs compress phases. Each subplot shows one shared-weight pair operating in both directions.
Despite sharing W1 and W2, expand and compress activations have different spectral profiles. The gates (W3/W4) reshape the spectral distribution without changing the structural transformation. Earlier pairs show larger expand/compress divergence while later pairs converge, consistent with the CKA finding that late expand and early compress layers are the most dissimilar.
Embedding vs Final Activation Spectra
Direct comparison of the embedding matrix spectrum and the final-layer activation spectrum.
- Embedding: erank 955, nearly flat spectrum; the frozen MobileLLM embeddings distribute information broadly across all 1024 dimensions with minimal concentration.
- Final activation: erank 218, steep spectrum; the model has learned to concentrate its output into ~218 active dimensions, with the top 25 components carrying ~50% of variance.
The model takes a broadly distributed input signal and progressively concentrates it into a lower-dimensional but more structured output. The 4.4x rank reduction (955 to 218) is the spectral signature of the expand-compress pipeline.
G2LU Gate Spectra (W3 vs W4)
Weight spectra and effective rank of the outer gate (W3) vs inner gate (W4) across all 21 mirror pairs.
- Top (weight spectra): W3 and W4 have nearly identical spectral shapes; both use the full rank of the weight matrix. Neither gate has collapsed or become low-rank. The nested gating structure (W4 modulates W3) doesn't force one gate to become simpler than the other.
- Bottom (effective rank per pair): Both gates maintain erank ~900 (out of 1024) across all pairs, with W4 (inner) slightly lower than W3 (outer) in most pairs. The ~50-rank gap is consistent: the inner gate uses slightly fewer effective dimensions, potentially because it operates as a coarser filter that the outer gate refines.
Layer-wise Spectral Properties (FFN W1)
Spectral metrics for the shared W1 projection matrix across all mirror pairs.
- Effective rank (top-left): Uniformly high (~930) across all layers; no weight collapse anywhere.
- Stable rank (top-right): Peaks sharply at pairs 10-11 (the layers nearest the architectural midpoint) with stable rank ~50, compared to ~20-30 for other pairs. The midpoint layers have the most distributed singular value spectra, i.e. the most "general" transformations.
- Power-law alpha (bottom-left): All layers fall below the alpha=2 boundary (heavy-tailed), indicating structured, non-random weight matrices. Remarkably uniform across layers (~0.30-0.35).
- Signal ratio (bottom-right): Very low (<0.006); almost all singular values fall below the Marchenko-Pastur noise bound. This is consistent with the distributed, non-sparse nature of the weights; information is encoded in the collective spectrum rather than in isolated large singular values.
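Two of these metrics can be sketched directly from the singular values. The Marchenko-Pastur edge below assumes an i.i.d.-entry null model with the matrix's empirical standard deviation; the script's exact null model may differ:

```python
import numpy as np

def stable_rank(W):
    """Stable rank: sum of squared singular values over the largest
    squared singular value. Low when one direction dominates."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

def mp_signal_ratio(W):
    """Fraction of singular values above the Marchenko-Pastur bulk edge
    sigma * (sqrt(n) + sqrt(m)) for an n x m matrix of i.i.d. entries
    with standard deviation sigma (estimated empirically)."""
    n, m = W.shape
    s = np.linalg.svd(W, compute_uv=False)
    edge = W.std() * (np.sqrt(n) + np.sqrt(m))
    return float((s > edge).mean())
```

For a pure-noise matrix the signal ratio is near zero; an injected low-rank "signal" direction pushes singular values above the edge.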
Spectral Comparison: Prisma vs GPT-2 Medium
Layer-by-layer comparison of spectral properties between Prisma 357M and GPT-2 Medium (355M).
- Effective rank: Nearly identical (~930-950) for both models across all layers. The mirrored architecture doesn't sacrifice weight expressiveness.
- Stable rank: Prisma shows more variation between layers (range 20-50) vs GPT-2's flatter profile (25-35). The midpoint peak in Prisma has no equivalent in GPT-2; it is a structural consequence of the mirrored architecture.
- Power-law alpha: Both models sit in the same range (0.25-0.35), both heavy-tailed. Prisma's alpha is slightly more uniform, potentially reflecting the regularizing effect of weight sharing.
Weight Spectra by Component
Individual weight spectra (linear and log-log scale) for each parameter type across all layers.
Embedding
The frozen MobileLLM embedding has one dominant singular value (~115) with a steep drop to ~18 for the second, then a gradual tail. The Marchenko-Pastur bound (dotted line at ~35) shows only one singular value above the noise floor: the embedding matrix is effectively rank-1 plus structured noise. This is the "fixed coordinate system" that anchors the entire model.
FFN Shared Projection (W1)
All 21 W1 matrices have similar spectral profiles: top singular value ~16, smooth decay. Tight clustering in log-log confirms that shared weights don't develop pathological layer-specific structure.
Outer Gate (W3) and Inner Gate (W4)
W3 and W4 are spectrally near-identical at the weight level β confirming that their functional differentiation (inner vs outer gate) emerges from the compositional structure of G2LU rather than from divergent weight distributions. The nested relationship silu(W3@x * silu(W4@x)) creates functional asymmetry from structural symmetry.
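A minimal sketch of the nested gate as quoted above (how it composes with the shared W1/W2 projections is not specified here, so only the gating nonlinearity is shown):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def g2lu_gate(x, W3, W4):
    """The nested G2LU gate: silu(W3 @ x * silu(W4 @ x)).
    W4 (inner) modulates W3 (outer) multiplicatively before the
    outer nonlinearity is applied."""
    return silu((W3 @ x) * silu(W4 @ x))
```

Swapping W3 and W4 generally changes the output even for spectrally similar matrices, which is the functional asymmetry from structural symmetry described above.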
Attention Q Projections
Q projections show slightly more inter-layer variance than FFN weights; different layers attend to different things, while FFN transformations stay more uniform. The log-log tail follows a clean power law across all layers.