Got the MSE to 0.16 but after pairing

#1
by ShubhamRasal - opened

The pairing is the easy part, but the remaining search space is still huge (48!)

A few things I am trying now:

  1. Strength of update (since some weights affect the output more than others)
  2. Computing each block's Jacobian magnitude and using it to order the blocks by gradient.

Not sure if this will work. Will update soon.

ShubhamRasal changed discussion title from Got the MSE to 1.6 but after pairing to Got the MSE to 0.16 but after pairing

Got to 0.01456313207745552 using simulated annealing. I had used ant colony optimisation to solve the travelling salesman problem a while ago, and I remembered that annealing is also a good heuristic for NP-hard problems.
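For anyone curious, the annealing loop is only a few lines. A minimal sketch, assuming you already have a `cost` function that evaluates a candidate block order (e.g. the MSE of the reassembled model); the schedule and parameter values here are illustrative, not the ones I actually tuned:

```python
import math
import random

def anneal(order, cost, steps=20000, t0=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over permutations, using random pairwise swaps."""
    rng = random.Random(seed)
    order = list(order)
    best = list(order)
    cur = best_cost = cost(order)
    for k in range(steps):
        t = t0 * (t_end / t0) ** (k / steps)     # geometric cooling schedule
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose a swap
        new = cost(order)
        # Accept improvements always; accept worse moves with prob e^(-delta/t)
        if new < cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if new < best_cost:
                best, best_cost = list(order), new
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return best, best_cost
```

The early high-temperature phase lets it escape local minima; by the end it behaves like a plain hill climb.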

rookie numbers :)

Really good work Shubham. There was a paper published on this, and you're really close: basically solved. You just need to do a hill climb from where you're at and you'd get the solution.

The insight from the paper was that sorting by delta-norm (ascending) also gets you in the right place for hill-climbing.
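A sketch of that delta-norm ordering. I'm taking "delta-norm" to mean how much each block changes its input on average (the norm of the residual update); the paper may define it on the weights instead, so treat this as an assumption, and `blocks`/`probe` are placeholder names:

```python
import numpy as np

def delta_norm_order(blocks, probe):
    """Sort block indices by the average norm of the residual update.

    blocks: list of callables, each mapping an activation batch to a batch.
    probe:  array of probe activations, shape (batch, dim).
    """
    deltas = []
    for f in blocks:
        out = f(probe)
        # How far the block moves its input, averaged over the batch
        deltas.append(np.linalg.norm(out - probe, axis=-1).mean())
    return np.argsort(deltas)  # ascending: smallest updates first
```

The resulting order is then the starting point for the hill climb.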

Hi, are there research papers that explain the principles behind why pairing works?

Yes! The short version: training forces structure onto the weights, and
that structure is detectable.

Each block in this network has two weight matrices: one that expands the
data (48 → 96 dimensions) and one that compresses it back (96 → 48). The
puzzle separates all these layers and shuffles them, so you have 48
"expand" matrices and 48 "compress" matrices and need to figure out which
pairs were originally trained together.

The trick is: when you multiply a correctly paired compress × expand
matrix, the result looks close to a scaled identity matrix. This isn't a
coincidence; it's a property called dynamic isometry that emerges from
training. For gradients to flow stably through a deep residual network (no
exploding, no vanishing), each block's combined transformation has to be
well-behaved, and that leaves a fingerprint in the weights.

So you just build a 48×48 score table (every compress matrix multiplied
by every expand matrix) and measure how "identity-like" each product is:
|trace(P)| / frobenius_norm(P). Correct pairs score high (1.76–3.23),
wrong pairs score near zero. The Hungarian algorithm then finds the
optimal one-to-one matching.
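The whole pairing step fits in a few lines. A sketch, assuming `compress` and `expand` are lists of NumPy arrays of shapes (48, 96) and (96, 48); those names are placeholders for however the puzzle's weights are loaded:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_blocks(compress, expand):
    """Match each compress matrix to its original expand partner."""
    n = len(compress)
    score = np.zeros((n, n))
    for i, c in enumerate(compress):
        for j, e in enumerate(expand):
            p = c @ e  # (48, 48) product
            # "Identity-likeness": |trace| / Frobenius norm, high for
            # products close to a scaled identity
            score[i, j] = abs(np.trace(p)) / np.linalg.norm(p)
    # Hungarian algorithm minimises cost, so negate to maximise score
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows, cols))
```

`scipy.optimize.linear_sum_assignment` is the standard Hungarian-algorithm implementation, so the matching itself is one call once the score table exists.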

Once you've paired the layers back into 48 complete blocks, you still need
to figure out what order they go in. That's where delta-norm sorting and
hill climbing come in.
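The hill-climbing part can be as simple as greedily accepting any pairwise swap that lowers the loss. A minimal sketch, where `cost` stands in for whatever you evaluate a candidate order with (e.g. MSE of the reassembled network):

```python
from itertools import combinations

def hill_climb(order, cost):
    """Greedy local search: try every pairwise swap until none improves."""
    order = list(order)
    cur = cost(order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            new = cost(order)
            if new < cur:
                cur, improved = new, True  # keep the swap
            else:
                order[i], order[j] = order[j], order[i]  # revert
    return order, cur
```

Starting this from the delta-norm ordering rather than a random permutation is what makes it tractable: you only need local moves to finish the job.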

If you want to read more:

  • Why residual networks have this property: He et al., "Deep Residual
    Learning" (2015) and "Identity Mappings in Deep Residual Networks" (2016)
  • Dynamic isometry theory: Saxe, McClelland & Ganguli, "Exact solutions to
    the nonlinear dynamics of learning in deep linear neural networks"
    (2014); Pennington et al., "Resurrecting the sigmoid in deep learning
    through dynamical isometry" (2017)

For sorting the block space and simplifying it further, I recommend sorting the out blocks by the variance of their weights, then looking at the L2 distance between their biases. That should let you reconstruct the correct order of the out blocks. If you have the correct pairing between the out and inp layers, you can construct the correct solution directly, without any machine-learning techniques!
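Both steps are cheap to try. A sketch under the assumption that you have per-block weight and bias arrays (`weights`, `biases` are placeholder names); the variance sort gives a coarse ranking, and a greedy nearest-neighbour chain over the biases recovers a sequential order:

```python
import numpy as np

def order_by_variance(weights):
    """Rank blocks by the variance of their weight entries (ascending)."""
    return np.argsort([w.var() for w in weights])

def chain_by_bias(biases, start=0):
    """Greedy chain: repeatedly step to the block whose bias is closest
    (in L2 distance) to the last block's bias."""
    remaining = set(range(len(biases)))
    order = [start]
    remaining.remove(start)
    while remaining:
        last = biases[order[-1]]
        nxt = min(remaining, key=lambda k: np.linalg.norm(biases[k] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Picking the right `start` block is the one remaining degree of freedom; with only 48 candidates you can simply evaluate all of them.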
