new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

May 15

Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits

Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the 3.8times10^{25} FLOP training budget and \1.4M (90% CI: 412K-\2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry (α\neq β$). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations.

  • 5 authors
·
Mar 21

Searching For Anisotropic Gravitational-wave Backgrounds Using Pulsar Timing Arrays

We present the results of simulated injections testing the first Bayesian search-pipeline capable of investigating the angular-structure of a gravitational-wave (GW) background influencing pulsar signals. A stochastic background of GWs from the incoherent superposition of many inspiraling supermassive black hole binaries at nHz frequencies is likely to be the dominant GW signal detectable by pulsar timing arrays (PTAs). Even though one might expect a background composed of a high-redshift cosmological population of sources to be fairly isotropic, deviations from isotropy may be indicative of local GW hotspots or some form of continuous anisotropy in the angular-distribution of GW-power. A GWB induces time-of-arrival deviations in pulsar signals which are correlated between separated pulsars. In an isotropic background this cross-correlation follows a distinctive relationship, known as the Hellings and Downs curve, that depends only on the angular separation of the pulsars. If the background is anisotropic, the cross-correlation is different, but predictable, and also depends on the absolute position of the pulsars. By simulating datasets containing GWBs with various anisotropic configurations, we have explored the prospects for constraining anisotropy using near future data. We find that at moderate to high signal to noise ratio the assumption of isotropy is no longer an appropriate description of the simulated background. Furthermore, we can recover the nature of the injected anisotropy in a Bayesian parameter-estimation search, and propose a prior on the anisotropy search-space motivated by the physicality of the implied distribution of sources.

  • 2 authors
·
Jun 23, 2013

FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer

Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an O(1/T) rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.

  • 3 authors
·
Jan 29