Got the MSE to 0.16 but after pairing

#1
by ShubhamRasal - opened

The pairing is the easy part, but the remaining search space is still huge (48!)

A few things I am trying now:

  1. Strength of update (since some weights affect the output more than others)
  2. Computing each block's Jacobian magnitude and using it to order the blocks by gradient.

Not sure if this will work. Will update soon.

ShubhamRasal changed discussion title from Got the MSE to 1.6 but after pairing to Got the MSE to 0.16 but after pairing

Got to 0.01456313207745552 using simulated annealing. I had used ant colony optimisation to solve the travelling salesman problem a while ago, and I remembered that annealing is also a good heuristic for NP-hard problems.
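For anyone curious, the annealing loop is only a few lines. A minimal sketch, assuming you already have a `cost` function that evaluates a candidate block order (e.g. the MSE of the reassembled model); the schedule and parameter values here are illustrative, not the ones I actually tuned:

```python
import math
import random

def anneal(order, cost, steps=20000, t0=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over permutations, using random pairwise swaps."""
    rng = random.Random(seed)
    order = list(order)
    best = list(order)
    cur = best_cost = cost(order)
    for k in range(steps):
        t = t0 * (t_end / t0) ** (k / steps)     # geometric cooling schedule
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose a swap
        new = cost(order)
        # Accept improvements always; accept worse moves with prob e^(-delta/t)
        if new < cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if new < best_cost:
                best, best_cost = list(order), new
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return best, best_cost
```

The early high-temperature phase lets it escape local minima; by the end it behaves like a plain hill climb.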

rookie numbers :)

Really good work Shubham. There was a paper published on this, and you're really close: basically solved. You just need to do a hill climb from where you're at and you'd get the solution.

The insight from the paper was that sorting by delta-norm (ascending) also gets you in the right place for hill-climbing.
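A sketch of that delta-norm ordering. I'm taking "delta-norm" to mean how much each block changes its input on average (the norm of the residual update); the paper may define it on the weights instead, so treat this as an assumption, and `blocks`/`probe` are placeholder names:

```python
import numpy as np

def delta_norm_order(blocks, probe):
    """Sort block indices by the average norm of the residual update.

    blocks: list of callables, each mapping an activation batch to a batch.
    probe:  array of probe activations, shape (batch, dim).
    """
    deltas = []
    for f in blocks:
        out = f(probe)
        # How far the block moves its input, averaged over the batch
        deltas.append(np.linalg.norm(out - probe, axis=-1).mean())
    return np.argsort(deltas)  # ascending: smallest updates first
```

The resulting order is then the starting point for the hill climb.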

Hi, are there research papers that explain the principles behind why pairing works?

Yes! The short version: training forces structure onto the weights, and
that structure is detectable.

Each block in this network has two weight matrices: one that expands the
data (48 → 96 dimensions) and one that compresses it back (96 → 48). The
puzzle separates all these layers and shuffles them, so you have 48
"expand" matrices and 48 "compress" matrices and need to figure out which
pairs were originally trained together.

The trick is: when you multiply a correctly paired compress × expand
matrix, the result looks close to a scaled identity matrix. This isn't a
coincidence; it's a property called dynamic isometry that emerges from
training. For gradients to flow stably through a deep residual network (no
exploding, no vanishing), each block's combined transformation has to be
well-behaved, and that leaves a fingerprint in the weights.

So you just build a 48×48 score table (every compress matrix multiplied
by every expand matrix) and measure how "identity-like" each product is:
|trace(P)| / frobenius_norm(P). Correct pairs score high (1.76–3.23),
wrong pairs score near zero. The Hungarian algorithm then finds the
optimal one-to-one matching.
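The whole pairing step fits in a few lines. A sketch, assuming `compress` and `expand` are lists of NumPy arrays of shapes (48, 96) and (96, 48); those names are placeholders for however the puzzle's weights are loaded:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_blocks(compress, expand):
    """Match each compress matrix to its original expand partner."""
    n = len(compress)
    score = np.zeros((n, n))
    for i, c in enumerate(compress):
        for j, e in enumerate(expand):
            p = c @ e  # (48, 48) product
            # "Identity-likeness": |trace| / Frobenius norm, high for
            # products close to a scaled identity
            score[i, j] = abs(np.trace(p)) / np.linalg.norm(p)
    # Hungarian algorithm minimises cost, so negate to maximise score
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows, cols))
```

`scipy.optimize.linear_sum_assignment` is the standard Hungarian-algorithm implementation, so the matching itself is one call once the score table exists.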

Once you've paired the layers back into 48 complete blocks, you still need
to figure out what order they go in. That's where delta-norm sorting and
hill climbing come in.
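The hill-climbing part can be as simple as greedily accepting any pairwise swap that lowers the loss. A minimal sketch, where `cost` stands in for whatever you evaluate a candidate order with (e.g. MSE of the reassembled network):

```python
from itertools import combinations

def hill_climb(order, cost):
    """Greedy local search: try every pairwise swap until none improves."""
    order = list(order)
    cur = cost(order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            new = cost(order)
            if new < cur:
                cur, improved = new, True  # keep the swap
            else:
                order[i], order[j] = order[j], order[i]  # revert
    return order, cur
```

Starting this from the delta-norm ordering rather than a random permutation is what makes it tractable: you only need local moves to finish the job.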

If you want to read more:

  • Why residual networks have this property: He et al., "Deep Residual
    Learning" (2015) and "Identity Mappings in Deep Residual Networks" (2016)
  • Dynamic isometry theory: Saxe, McClelland & Ganguli, "Exact solutions to
    the nonlinear dynamics of learning in deep linear neural networks"
    (2014); Pennington et al., "Resurrecting the sigmoid in deep learning
    through dynamical isometry" (2017)

For sorting the block space and simplifying it further, I recommend sorting the out blocks by the variance of their weights, then looking at the L2 distance between their biases. That should let you reconstruct the correct order of the out blocks. If you have the correct pairing between the out and inp layers, you can construct the correct solution directly, without any machine-learning techniques!
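Both steps are cheap to try. A sketch under the assumption that you have per-block weight and bias arrays (`weights`, `biases` are placeholder names); the variance sort gives a coarse ranking, and a greedy nearest-neighbour chain over the biases recovers a sequential order:

```python
import numpy as np

def order_by_variance(weights):
    """Rank blocks by the variance of their weight entries (ascending)."""
    return np.argsort([w.var() for w in weights])

def chain_by_bias(biases, start=0):
    """Greedy chain: repeatedly step to the block whose bias is closest
    (in L2 distance) to the last block's bias."""
    remaining = set(range(len(biases)))
    order = [start]
    remaining.remove(start)
    while remaining:
        last = biases[order[-1]]
        nxt = min(remaining, key=lambda k: np.linalg.norm(biases[k] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Picking the right `start` block is the one remaining degree of freedom; with only 48 candidates you can simply evaluate all of them.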
