AI & ML interests

A community of env builders.

Recent Activity

Teen-Different
posted an update 2 days ago
I wrote a note on something I've been experimenting with: EqPropMomentum

It's a new optimizer:
compute gradients with Equilibrium Propagation, then update parameters with classical momentum instead of plain gradient steps.

Why I cared:
predictive coding / EqProp-style methods are interesting because they move away from standard backprop assumptions, but they often feel slow, noisy, and hard to scale.

So this was my attempt at a small practical bridge: keep the energy-based flavor while improving optimization behavior.
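The core update is just classical (heavy-ball) momentum applied to whatever gradient EqProp produces. A minimal sketch, where the gradient value is a placeholder for the free/nudged-phase estimate a real EqProp network would supply:

```python
# Hedged sketch of the EqPropMomentum update rule. The gradient here is a
# stand-in; in practice it would come from the contrast between the free
# and nudged equilibrium states of an Equilibrium Propagation network.

def momentum_step(param, velocity, grad, lr=0.1, beta=0.9):
    """Classical momentum: v <- beta*v + g, then p <- p - lr*v."""
    velocity = beta * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Plain SGD ("naive steps") is the beta=0 special case.
p, v = 1.0, 0.0
p, v = momentum_step(p, v, grad=0.5)   # v = 0.5,  p = 0.95
p, v = momentum_step(p, v, grad=0.5)   # v = 0.95, p = 0.855
```

The velocity term smooths successive gradient estimates, which is exactly what you want when the per-step EqProp gradients are noisy.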

I put together the intuition, math, code, and experiments here: https://teendifferent.substack.com/p/the-revival-of-predictive-coding

Would love feedback from anyone working on predictive coding, biologically plausible learning, or energy-based training ✌️
Teen-Different
posted an update about 1 month ago
Adaptive Attention at Inference Time: Does It Actually Work?

A hypernetwork that rewires GPT's value heads on every forward pass. The answer: not a clean win, but not a failure either.

Blog post: https://teendifferent.substack.com/p/adaptive-attention-at-inference-time
Code: https://github.com/REDDITARUN/a-gpt
Weights: Teen-Different/adaptive-gpts


What This Is

Five small language model variants trained for 12k steps on a 300M token mixed corpus, answering one question: can the residual stream be used to slightly rewrite the model's own computation while it's running?

Instead of a fixed W_v for every context, a TinyHeadTransformer hypernetwork generates low-rank (LoRA-style) updates to the value projection of each attention head, conditioned on the current residual stream. Each token gets a dynamically adapted value transformation.


The Five Models

Base GPT: 28.9M params, 139 tok/s, val loss ~3.82
Matched GPT (+2 layers): 30.5M params, 204 tok/s, val loss ~3.80
Adaptive GPT: 30.5M params, 38.7 tok/s, val loss ~3.88–3.92
Diffusion GPT: 28.9M params, 110 tok/s, val loss ~5.0–5.2
Adaptive Diffusion GPT: 30.5M params, 40.4 tok/s, val loss ~5.0–5.2

Architecture: 4 layers, 4 heads, d_model=256, context=256, RoPE, GPT-2 tokenizer.


How the Hypernetwork Works

For each attention head, a TinyHeadTransformer encodes the head's residual stream slice, mean-pools it to a conditioning vector, then projects into low-rank factors A (d×r) and B (r×d) at rank=8. The dynamic value update follows LoRA conventions with alpha/r scaling. B is zero-initialized so the adaptive path starts inert and the model begins as a vanilla GPT, which is critical for training stability.
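The mechanism above can be sketched in NumPy. This is an illustrative sketch, not the released code: the two projection matrices standing in for the TinyHeadTransformer's output heads are hypothetical, and dimensions follow the post's config (rank=8, head conditioning on a d-dimensional slice).

```python
import numpy as np

d, r, alpha = 256, 8, 16                  # dims and LoRA scaling (alpha assumed)
rng = np.random.default_rng(0)

x = rng.standard_normal((10, d))          # residual-stream slice (tokens x d)
cond = x.mean(axis=0)                     # mean-pool to a conditioning vector

# Hypothetical hypernetwork output projections producing the low-rank factors.
W_to_A = rng.standard_normal((d, d * r)) * 0.02
W_to_B = np.zeros((d, r * d))             # B path zero-initialized: starts inert

A = (cond @ W_to_A).reshape(d, r)         # factor A (d x r)
B = (cond @ W_to_B).reshape(r, d)         # factor B (r x d), all zeros at init

W_v = rng.standard_normal((d, d)) * 0.02  # static value projection
W_v_dyn = W_v + (alpha / r) * (A @ B)     # LoRA-style scaled dynamic update

# At initialization B == 0, so the adaptive model matches a vanilla GPT.
assert np.allclose(W_v_dyn, W_v)
```

Zero-initializing the B projection means the delta A@B is exactly zero at step 0, so training starts from the plain-GPT loss surface and the adaptive path only activates as gradients flow into B.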

The diffusion variant uses bidirectional attention, RMSNorm, squared ReLU, and a learned timestep embedding.

ssffa

Update index.html
#1 opened about 1 month ago by SaiPranavSripathi
aarushgupta
updated a Space about 1 month ago
aarushgupta
published a Space about 1 month ago
aarushgupta
updated a Space about 1 month ago