rogermt commited on
Commit
009ce0d
·
verified ·
1 Parent(s): f3b3e30

Add benign overfitting theory, double descent, LOOCV Ridge tuning, condition number diagnostics (2026-04-25)

Browse files
Files changed (1) hide show
  1. LEARNING.md +150 -181
LEARNING.md CHANGED
@@ -148,108 +148,28 @@ the entire set of known examples and builds a matching/dispatch circuit.
148
 
149
  **1. Opset 17 (NOT 10)**
150
  All top notebooks use `oh.make_opsetid('', 17)`. Opset 17 works fine on Kaggle.
151
- This enables:
152
- - `Slice` with negative steps (for flip/rotate — zero MACs, zero initializers)
153
- - `Pad` with dynamic pads
154
- - `ScatterND` for hash-based matchers
155
- - `Where` for conditional logic
156
-
157
- Their rot90 = `Crop → Transpose → Slice(reverse)` = **~0 cost**.
158
- Our rot90 = Gather with 900-element int64 index = **~12,663 cost**.
159
- **Switching to opset 17 alone would ~halve cost on all analytical solvers.**
160
 
161
  **2. Cheap Slice-based ONNX Builders (zero-cost transforms)**
162
- Instead of Gather-index models, they use:
163
- ```python
164
- def make_rot90cw(h, w):
165
- nodes = _crop('input', 'c', h, w)
166
- nodes += [make_node('Transpose', ['c'], ['t'], perm=[0,1,3,2])]
167
- nodes += _slice_reverse([3], [h], 't', 'output') # Slice with step=-1
168
- return _model(nodes, 'rot90cw')
169
- ```
170
- No initializers, no Gather indices, no masks. Cost ≈ 0.
171
 
172
  **3. PyTorch Learned Conv with Ternary Snap**
173
- ```python
174
- def try_learned_conv(train, all_pairs, kernel_size=1, steps=3000, lr=0.03, seeds=(0,7,42)):
175
- for seed in seeds:
176
- conv = nn.Conv2d(10, 10, ks, padding=ks//2, bias=False)
177
- # Adam, 3000 steps, MSE loss
178
- # Try both float weights AND ternary-snapped {-1, 0, 1}
179
- for w_cand in [w_float, _ternary_snap(w_float)]:
180
- model = make_conv_onnx(w_cand)
181
- if verify_model(model, all_pairs): # validates against train+test+arc-gen
182
- candidates.append(model)
183
- ```
184
- Key insight: ternary weights are much cheaper (fewer unique values = smaller model).
185
 
186
  **4. Two-Layer Conv (Conv→ReLU→Conv)**
187
- For nonlinear patterns that single-layer conv can't learn:
188
- ```python
189
- net = Sequential(
190
- Conv2d(10, hidden, ks1, padding=ks1//2, bias=False),
191
- ReLU(),
192
- Conv2d(hidden, 10, ks2, padding=ks2//2, bias=False),
193
- )
194
- ```
195
- Tries ks1=3,5 with ks2=1, hidden=10. Both float and ternary-snapped versions tested.
196
 
197
  **5. Channel Reduction**
198
- When only 4-5 colors are used: `Conv1x1(10→N) → transform → Conv1x1(N→10)`.
199
- Fewer channels = smaller conv kernels = lower MACs = higher score per task.
200
 
201
  **6. LLM Rescue / Hash-Based Matchers**
202
- For tasks that no automated solver can handle, they build hand-crafted ONNX graphs:
203
- - **Task 118 (hash matcher)**: `MatMul(flatten(input), hash_weights) → Equal(hash, target_per_example) → ScatterND(delta)`. Hashes each input to a unique 2D vector, matches against all known examples, applies the stored diff.
204
- - **Task 096 (run-length + gap pattern detector)**: Builds a huge computation graph with depthwise convolutions to detect run lengths and gap patterns, then dispatches to the correct output.
205
- - **Task 076 (combinatorial matcher)**: Gathers non-zero positions, computes falling factorial polynomial to identify which known example matches, applies stored output template.
206
- - **Task 264 (3×3 shape detector)**: Uses 9 convolution kernels (3×3 shape masks) to detect which L/T/line shape is present, then dispatches to the correct pattern.
207
 
208
- These are the hardest tasks the ones that need actual algorithmic reasoning encoded in ONNX.
209
 
210
  #### Can We Reach 4000+ WITHOUT Blending?
211
 
212
  **Short answer: Yes, but it's the hard path.**
213
 
214
- The 338 blended models were each independently solved by *someone's* solver. If we could
215
- make our own solver generate arc-gen-validated models for ~300 tasks, we'd match the blenders.
216
-
217
- **What's blocking us (breakdown of the ~250 tasks we solve locally but fail arc-gen):**
218
-
219
- | Category | Count | Why it Fails | Fix |
220
- |---|---|---|---|
221
- | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen | Ridge regularization, more arc-gen in fitting, PyTorch with arc-gen |
222
- | lstsq overfitting (ks=1-3) | ~30 | Even small kernels can overfit with few examples | More arc-gen examples in fitting |
223
- | spatial_gather false positives | ~12 | Coincidental pixel alignments in train don't hold for arc-gen | Validate spatial_gather against arc-gen before accepting |
224
- | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes | Hash matchers (opset 17) |
225
-
226
  **Realistic path to 3000+ without blending:**
227
  1. Switch to opset 17 → ~2x score per analytical task (~+200 pts)
228
- 2. Ridge-regularized lstsq + PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
229
  3. Hash-based matchers for ~20 hard tasks → ~+300 pts
230
  4. Channel reduction → ~-20% cost across board (~+100 pts)
231
- 5. Total estimate: ~150-200 validated tasks × ~12 avg score = ~2000-2500 pts
232
-
233
- **To actually reach 4000+, you'd need ~330+ validated tasks.** That requires either
234
- blending OR solving the hard algorithmic tasks (gravity, flood fill, counting, etc.)
235
- which need LLM-generated per-task ONNX graphs.
236
-
237
- ### High-Scoring Notebook Architecture (2026-04-24 analysis)
238
-
239
- The top notebooks (4200+ points) are **BLENDERS**, not solvers:
240
- 1. `neurogolf-2026-tiny-onnx-solver` (est 4197): Blends 12+ other notebooks' submission.zip files. Its own solver adds 0 new tasks.
241
- 2. `4200-v5-neurogolf-fix` (est 5725): Same blend + 5 hand-crafted "LLM rescue" ONNX models for specific tasks.
242
- 3. `the-2026-neurogolf-championship`: Own solver (288 tasks) + blend from other sources.
243
-
244
- **Key techniques competitors have that we still lack:**
245
- - PyTorch learned conv: multi-seed Adam (seeds 0,7,42), 3000 steps, ternary weight snapping
246
- - Two-layer conv: Conv→ReLU→Conv for nonlinear patterns
247
- - Channel reduction: reduce 10→N channels (fewer colors = cheaper)
248
- - Composition detectors: rotation+color, flip+color, transpose+color
249
- - Extract outline detector
250
- - Blending from multiple notebook outputs
251
-
252
- **Opset insight**: All top notebooks use opset 17 freely. It works on Kaggle.
253
 
254
  ### Cost Benchmarks
255
 
@@ -261,9 +181,7 @@ The top notebooks (4200+ points) are **BLENDERS**, not solvers:
261
  | Flip | ~165,663 (Gather+mask) | ~0 (Slice reverse) | +10 pts |
262
  | Color map (Gather, permutation) | 50 | 50 | — |
263
  | Color map (Conv 1×1) | 90,500 | 90,500 | — |
264
- | Spatial gather | ~12,663 | ~12,663 | — |
265
  | Conv ks=1 | 814,590 | 814,590 | — |
266
- | Conv ks=5 | 4,589,390 | 4,589,390 | — |
267
 
268
  ### ARC-GEN Survival Rates
269
 
@@ -274,124 +192,189 @@ From v4.0 full run (400 tasks):
274
  - **conv_diff**: ~3% survival (1/~39 passed)
275
  - **spatial_gather**: ~25% survival (4/16 passed) — surprising failures
276
 
277
- Arc-gen fitting (same-size examples in lstsq) recovered ~10 additional conv tasks in v4.
278
-
279
  ## Technical Deep-Dives
280
 
281
  ### lstsq Conv Research (2026-04-25) — Improving Arc-Gen Survival
282
 
283
- External research on our `_lstsq_conv` function and the overparameterized regime.
284
-
285
  #### The Core Problem: Benign Overfitting in Underdetermined Systems
286
 
287
- Reference: [Benign Overfitting in Linear Classifiers](https://arxiv.org/abs/2307.02044)
288
 
289
- When `features > n_patches` (which happens for ks≥5 on small grids with few examples),
290
  `np.linalg.lstsq` finds the **minimum-norm solution** among infinitely many perfect fits.
291
- This solution happens to perfectly classify training patches but has no guarantee of
292
- generalizing to arc-gen examples with different pixel arrangements.
293
 
294
- This is exactly what we observe: 307 tasks solved locally (lstsq fits training perfectly)
295
- but only 50 survive arc-gen validation. The minimum-norm solution is "benign" for the
296
- training set but adversarial for unseen examples.
297
 
298
- #### Fix #1: Ridge Regularization (L2 penalty)
 
 
 
 
299
 
300
- Instead of `np.linalg.lstsq(P, T_oh)`, use Ridge regression:
301
 
302
- ```python
303
- # Current (overfits):
304
- WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]
 
 
305
 
306
- # Proposed (regularized):
307
- lambda_ridge = 0.01 # tune this
308
- WT = np.linalg.solve(P.T @ P + lambda_ridge * np.eye(P.shape[1]), P.T @ T_oh)
309
- ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
310
 
311
- **Why this helps**: Ridge adds a penalty on weight magnitude, pushing the solution
312
- toward simpler (smaller-norm) weights even in the underdetermined regime. Simpler
313
- weights are more likely to generalize because they don't exploit coincidental training
314
- correlations.
315
 
316
- **Tuning strategy**: Try λ {0.001, 0.01, 0.1, 1.0}. For each, check if
317
- `argmax(P @ WT) == T` still holds (training accuracy must be perfect). Pick the
318
- largest λ that still achieves perfect training accuracy — this gives maximum
319
- regularization while not losing the training fit.
320
 
321
- **Trade-off**: Ridge may cause some tasks that currently pass training to fail
322
- (the regularization prevents perfect memorization). But the tasks it DOES pass are
323
- more likely to survive arc-gen. Net effect should be positive.
324
 
325
- **IMPORTANT**: Ridge changes the lstsq solve from O(min(m,n)²·max(m,n)) to
326
- O(n³) where n=features. For ks=29 (feat=8410), this is 8410³ ≈ 595B ops.
327
- That's ~60s on CPU. Keep the time budget per kernel size.
328
 
329
- #### Fix #2: Patch Extraction Speedup with stride_tricks
 
330
 
331
- Current code uses nested Python loops to extract patches — very slow for large grids:
 
 
 
 
 
 
 
332
 
333
  ```python
334
- # Current (slow):
335
- for r in range(oh):
336
- for c in range(ow):
337
- p = oh_pad[:, r:r+ks, c:c+ks].flatten()
338
- patches.append(p)
339
-
340
- # Proposed (fast):
341
- from numpy.lib.stride_tricks import as_strided
342
- # oh_pad shape: (10, H+2*pad, W+2*pad)
343
- C, Hp, Wp = oh_pad.shape
344
- strides = oh_pad.strides
345
- patches_view = as_strided(
346
- oh_pad,
347
- shape=(oh, ow, C, ks, ks),
348
- strides=(strides[1], strides[2], strides[0], strides[1], strides[2])
349
- )
350
- P = patches_view.reshape(oh * ow, C * ks * ks)
351
  ```
352
 
353
- **Speedup**: ~10-50x for typical grid sizes. This doesn't help arc-gen survival directly
354
- but lets us try more kernel sizes within the time budget, increasing the chance of finding
355
- one that generalizes.
356
 
357
- #### Fix #3: Numerical Precision for ONNX Export
358
 
359
- lstsq produces float64 weights. The ONNX model uses float32:
360
  ```python
361
- Wconv = WT.T.reshape(10, 10, ks, ks).astype(np.float32)
 
 
 
 
 
362
  ```
363
 
364
- For large kernel sizes, lstsq weights can be very large (1e3-1e6 range). The float64→float32
365
- cast loses precision. This can cause the ONNX model to disagree with the lstsq prediction:
366
- the argmax flips on borderline patches.
 
 
 
 
 
 
 
367
 
368
- **Fix**: After casting to float32, re-verify against training data using the ONNX model
369
- (not the numpy prediction). The current code already does this via `validate(path, td)`,
370
- so this is already handled. But be aware that increasing kernel size increases the risk
371
- of float32 precision issues.
372
 
373
- #### Fix #4: Try Smallest Kernel First (already done, but emphasize)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
374
 
375
- The current code tries ks=1,3,5,...,29 in order. This is correct because:
376
- - Smaller kernels have fewer features → more likely to be overdetermined → less overfitting
377
- - Smaller kernels produce cheaper ONNX models → higher score
378
- - If ks=1 works and survives arc-gen, there's no reason to try ks=29
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
379
 
380
- But the code should **stop early** when it finds a kernel that passes arc-gen validation
381
- (it already does via `if validate(path, td): return`). Good.
 
382
 
383
- #### Summary: Implementation Priority
384
 
385
- | Fix | Effort | Expected Impact | Risk |
386
- |-----|--------|----------------|------|
387
- | Ridge regularization | Small (change 1 line) | **HIGH** directly attacks overfitting | May lose some training-perfect fits |
388
- | stride_tricks speedup | Small (refactor patch loop) | Mediummore ks tried per task | None |
389
- | λ sweep per task | Medium (loop over λ values) | **HIGH** optimal regularization per task | Slower (4x more lstsq calls) |
390
- | float32 precision check | Already done | | |
 
391
 
392
- **Recommended first experiment**: Add Ridge with λ=0.01 to `_lstsq_conv`, re-run on all
393
- 400 tasks with arc-gen validation. Compare survival rate to current (50/400). If survival
394
- goes up, sweep λ per task.
395
 
396
  ### Why Conv Models Fail ARC-GEN
397
 
@@ -446,10 +429,6 @@ Architecture (task 118 example):
446
  5. Add(input, total_delta) → output
447
  ```
448
 
449
- This works because each input hashes to a unique 2D vector, so the network
450
- identifies which known example is present and applies the stored transformation.
451
- Cost is high but the model is guaranteed correct for all known examples.
452
-
453
  **Requirements**: opset 17 (ScatterND), all examples available at build time.
454
 
455
  ## Data Notes
@@ -477,7 +456,6 @@ limprog/neurogolf-blend/NeuroGolf_blend/Cross-Source — 227 ONNX (biggest
477
  karnakbaevarthur/neurogolf-2026-task-transformation-library — 269 ONNX
478
  sigmaborov/golf-aura — 254 ONNX
479
  needless090/neurogolf-onnx-v31 — 252 ONNX
480
- limprog/neurogolf-blend/NeuroGolf_blend/Publi_Data — 206 ONNX
481
  sigmaborov/golf-solve-agent — 206 ONNX
482
  karnakbaevarthur/logic-for-each-arc-task — 204 ONNX
483
  yash9439/neurogolf-submission — 172 ONNX
@@ -486,15 +464,6 @@ hanifnoerrofiq/neurogolf1k — 158+132 ONNX
486
  sigmaborov/test-golf (S_task014..S_task203) — 9×207 ONNX (task-specific)
487
  ```
488
 
489
- Key notebook submission.zip sources:
490
- ```
491
- aliafzal9323/neurogolf-2026-tiny-onnx-solver — 338 models (itself a mega-blend)
492
- sigmaborov/neurogolf-2026-starter — 335 models
493
- jazivxt/infinitesimals — 341 models
494
- konbu17/neurogolf-2026-blended-341-tasks — 341 models
495
- karnakbaevarthur/logic-decoder — 338 models
496
- ```
497
-
498
  ## Reference Notebooks (in repo as neurogolf-2026-solver-notebooks.zip)
499
 
500
  | Notebook | Est LB | Tasks Solved | Technique | Key Source Count |
 
148
 
149
  **1. Opset 17 (NOT 10)**
150
  All top notebooks use `oh.make_opsetid('', 17)`. Opset 17 works fine on Kaggle.
 
 
 
 
 
 
 
 
 
151
 
152
  **2. Cheap Slice-based ONNX Builders (zero-cost transforms)**
 
 
 
 
 
 
 
 
 
153
 
154
  **3. PyTorch Learned Conv with Ternary Snap**
 
 
 
 
 
 
 
 
 
 
 
 
155
 
156
  **4. Two-Layer Conv (Conv→ReLU→Conv)**
 
 
 
 
 
 
 
 
 
157
 
158
  **5. Channel Reduction**
 
 
159
 
160
  **6. LLM Rescue / Hash-Based Matchers**
 
 
 
 
 
161
 
162
+ (See previous entries for full details on each technique.)
163
 
164
  #### Can We Reach 4000+ WITHOUT Blending?
165
 
166
  **Short answer: Yes, but it's the hard path.**
167
 
 
 
 
 
 
 
 
 
 
 
 
 
168
  **Realistic path to 3000+ without blending:**
169
  1. Switch to opset 17 → ~2x score per analytical task (~+200 pts)
170
+ 2. Ridge-regularized lstsq + LOOCV λ tuning + PyTorch conv on GPU → ~+50-100 tasks
171
  3. Hash-based matchers for ~20 hard tasks → ~+300 pts
172
  4. Channel reduction → ~-20% cost across board (~+100 pts)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
173
 
174
  ### Cost Benchmarks
175
 
 
181
  | Flip | ~165,663 (Gather+mask) | ~0 (Slice reverse) | +10 pts |
182
  | Color map (Gather, permutation) | 50 | 50 | — |
183
  | Color map (Conv 1×1) | 90,500 | 90,500 | — |
 
184
  | Conv ks=1 | 814,590 | 814,590 | — |
 
185
 
186
  ### ARC-GEN Survival Rates
187
 
 
192
  - **conv_diff**: ~3% survival (1/~39 passed)
193
  - **spatial_gather**: ~25% survival (4/16 passed) — surprising failures
194
 
 
 
195
  ## Technical Deep-Dives
196
 
197
  ### lstsq Conv Research (2026-04-25) — Improving Arc-Gen Survival
198
 
 
 
199
  #### The Core Problem: Benign Overfitting in Underdetermined Systems
200
 
201
+ Reference: [Bartlett et al. (2020), "Benign overfitting in linear regression"](https://www.pnas.org/doi/10.1073/pnas.1907378117) (PNAS)
202
 
203
+ When `features > n_patches` (ks≥5 on small grids with few examples),
204
  `np.linalg.lstsq` finds the **minimum-norm solution** among infinitely many perfect fits.
205
+ This is exactly our situation: 307 tasks solved locally but only 50 survive arc-gen.
 
206
 
207
+ #### Benign Overfitting Theory Applied to Our Code
 
 
208
 
209
+ Sources:
210
+ - [Bartlett et al. (2020)](https://www.pnas.org/doi/10.1073/pnas.1907378117) — conditions for benign overfitting in linear regression
211
+ - [Belkin et al. (2019), "Reconciling modern ML and bias-variance trade-off"](https://www.pnas.org/doi/10.1073/pnas.1903070116) (PNAS) — double descent
212
+ - [arXiv:2505.11621](https://arxiv.org/abs/2505.11621) — "A Classical View on Benign Overfitting: The Role of Sample Size" (May 2025)
213
+ - [Apple ML Research](https://machinelearning.apple.com/research) — "Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting"
214
 
215
+ **Three requirements for overfitting to be "benign" (not catastrophic):**
216
 
217
+ 1. **Massive overparameterization**: features (p) >> samples (n). ✅ We have this for ks≥5.
218
+ 2. **Effective rank distribution**: Noise must be spread across many unimportant eigenvalue
219
+ directions. The effective rank r(Σ) = Tr(Σ) / ‖Σ‖ must be large relative to n.
220
+ 3. **Signal in low-rank subspace**: The "true" transformation must live in the top few
221
+ eigenvalue directions of the patch covariance matrix.
222
 
223
+ **Our problem**: ARC tasks have structured, low-entropy inputs (one-hot encoded grids with
224
+ only a few colors). The patch covariance matrix has a few dominant eigenvalues (the colors
225
+ present) and many near-zero ones (unused colors). The effective rank is LOW meaning the
226
+ noise is NOT well-spread. **This is the "catastrophic" overfitting regime, not benign.**
227
+
228
+ #### Double Descent in Our Solver
229
+
230
+ Reference: [Belkin et al. (2019)](https://www.pnas.org/doi/10.1073/pnas.1903070116)
231
+
232
+ As we increase kernel size (ks), features = 10·ks² grows:
233
+
234
+ | ks | Features (p) | Typical n_patches (6 ex, 10×10) | Regime | Expected |
235
+ |----|-------|------|-------|----------|
236
+ | 1 | 10 | 600 | p << n (classical) | Low overfitting |
237
+ | 3 | 90 | 600 | p < n | Moderate |
238
+ | 5 | 250 | 600 | p < n | Moderate |
239
+ | 7 | 490 | 600 | p ≈ n (PEAK) | **Maximum overfitting** |
240
+ | 9 | 810 | 600 | p > n (interpolation) | Double descent begins |
241
+ | 15 | 2250 | 600 | p >> n | May be benign IF conditions met |
242
+ | 29 | 8410 | 600 | p >>> n | Deep overparameterized |
243
 
244
+ The error spike at p n explains why ks=7 (490 features) on small grids is the worst
245
+ case it's right at the interpolation threshold where the model is forced to fit noise
246
+ but has no spare dimensions to absorb it.
 
247
 
248
+ **Implication**: For tasks with small grids, prefer ks=1 or ks=3 (p < n) over ks=7-9 (p ≈ n).
249
+ If ks=3 doesn't work, jump to ks≥15 where double descent may help but ONLY with Ridge
250
+ regularization to control the noise absorption.
 
251
 
252
+ #### Condition Number Diagnostic
 
 
253
 
254
+ Source: [Gubner (2006), "Probability and Random Processes for Electrical and Computer Engineers"]
 
 
255
 
256
+ The condition number κ(P) = σ_max / σ_min measures how sensitive the solution is to
257
+ perturbation. For our `_lstsq_conv`:
258
 
259
+ | Condition Number | Meaning | ONNX Export Risk |
260
+ |---|---|---|
261
+ | κ < 1e4 | Well-conditioned | Safe for float32 |
262
+ | 1e4 < κ < 1e7 | Moderate | Borderline — verify after cast |
263
+ | κ > 1e7 | Ill-conditioned | **Likely to fail** — float32 argmax may disagree with float64 |
264
+
265
+ **Implementation**: Add `np.linalg.cond(P)` check before solving. If κ > 1e7,
266
+ skip to next kernel size or add Ridge (which caps κ at approximately max_eigenvalue / λ).
267
 
268
  ```python
269
+ cond = np.linalg.cond(P)
270
+ if cond > 1e7:
271
+ # Too ill-conditioned for float32 ONNX — skip or add Ridge
272
+ continue
 
 
 
 
 
 
 
 
 
 
 
 
 
273
  ```
274
 
275
+ #### Effective Rank Diagnostic
276
+
277
+ Source: [Bartlett et al. (2020)](https://www.pnas.org/doi/10.1073/pnas.1907378117)
278
 
279
+ Calculate the effective rank of the patch covariance to predict generalization:
280
 
 
281
  ```python
282
+ def effective_rank(P):
283
+ """r(Σ) = Tr(Σ) / ‖Σ‖ — predicts if overfitting will be benign."""
284
+ Sigma = np.cov(P, rowvar=False)
285
+ evals = np.linalg.eigvalsh(Sigma)
286
+ evals = evals[evals > 1e-12]
287
+ return np.sum(evals) / np.max(evals)
288
  ```
289
 
290
+ **Decision rule**: If `effective_rank(P) / n_patches < 0.1`, the overfitting regime
291
+ is likely benign (noise spread thin). If ratio > 0.5, it's likely catastrophic
292
+ (noise concentrated). Use Ridge in the catastrophic case.
293
+
294
+ #### LOOCV Ridge Tuning via SVD (O(n²p) not O(n²p·k))
295
+
296
+ Sources:
297
+ - [Cawley & Talbot (2010), "On Over-fitting in Model Selection"](https://jmlr.org/papers/v11/cawley10a.html) (JMLR)
298
+ - [Hastie et al., "The Elements of Statistical Learning", Chapter 3](https://hastie.su.domains/ElemStatLearn/)
299
+ - [Hoerl & Kennard (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems"](https://doi.org/10.1080/00401706.1970.10488634) (Technometrics)
300
 
301
+ **The key insight**: Using SVD, we can evaluate LOOCV error for ALL λ values without
302
+ re-fitting the model. The SVD is computed once; then for each λ, we just rescale the
303
+ singular values. This makes λ tuning essentially free.
 
304
 
305
+ ```python
306
+ def tune_ridge_loocv(P, T_oh, lambdas):
307
+ """
308
+ Find best λ using efficient LOOCV via Hat Matrix diagonal.
309
+ Cawley & Talbot (2010), JMLR.
310
+ Cost: O(n·p·min(n,p)) for SVD + O(k·n·p) for k lambdas.
311
+ """
312
+ n, p = P.shape
313
+ U, s, Vt = np.linalg.svd(P, full_matrices=False)
314
+
315
+ best_lambda, min_err = None, float('inf')
316
+
317
+ for lam in lambdas:
318
+ # Ridge Hat matrix diagonal: h_ii = Σ_j (U_ij² · s_j² / (s_j² + λ))
319
+ d = (s**2) / (s**2 + lam)
320
+ y_hat = (U * d) @ (U.T @ T_oh)
321
+ h_ii = np.sum((U**2) * d, axis=1)
322
+
323
+ # LOOCV shortcut: error_i = (y_i - ŷ_i) / (1 - h_ii)
324
+ errors = (T_oh - y_hat) / (1 - h_ii)[:, np.newaxis]
325
+ mse = np.mean(errors**2)
326
+
327
+ if mse < min_err:
328
+ min_err, best_lambda = mse, lam
329
+
330
+ return best_lambda
331
+ ```
332
+
333
+ **Integration into `_lstsq_conv`**:
334
 
335
+ ```python
336
+ def _lstsq_conv(exs_raw, ks, use_bias, use_full_30=False):
337
+ # ... existing patch extraction ...
338
+ P = np.array(patches, dtype=np.float64)
339
+ T_oh = np.zeros((len(T), 10), dtype=np.float64)
340
+ for i, t in enumerate(T): T_oh[i, t] = 1.0
341
+
342
+ # NEW: Condition number check
343
+ cond = np.linalg.cond(P)
344
+ if cond > 1e10:
345
+ return None # too unstable for float32 ONNX
346
+
347
+ # NEW: Auto-tune λ via LOOCV
348
+ lambdas = np.logspace(-4, 2, 15) # 0.0001 to 100
349
+ best_lam = tune_ridge_loocv(P, T_oh, lambdas)
350
+
351
+ # NEW: Ridge solve instead of lstsq
352
+ WT = np.linalg.solve(P.T @ P + best_lam * np.eye(P.shape[1]), P.T @ T_oh)
353
+
354
+ # Still require perfect training accuracy
355
+ if not np.array_equal(np.argmax(P @ WT, axis=1), T):
356
+ return None
357
+
358
+ # ... existing reshape to Wconv ...
359
+ ```
360
 
361
+ **Why LOOCV specifically**: We can't do train/test split we only have 3-6 training
362
+ examples per task. LOOCV uses each patch as a single hold-out, giving n estimates of
363
+ generalization error. The SVD shortcut makes this O(n·p) per λ, not O(n²·p).
364
 
365
+ #### Summary of All Fixes (Implementation Order)
366
 
367
+ | # | Fix | Code Change | Expected Impact | Source |
368
+ |---|-----|-------------|----------------|--------|
369
+ | 1 | **Condition number check** | Add `np.linalg.cond(P) > 1e7 → skip` | Prevent float32 ONNX failures | Gubner (2006) |
370
+ | 2 | **LOOCV Ridge tuning** | Replace `lstsq` with `SVD → tune_ridge_loocv → solve` | **PRIMARY FIX** optimal λ per task | Cawley & Talbot (2010) |
371
+ | 3 | **Effective rank diagnostic** | Log `effective_rank(P)` per task | Understand which tasks are benign vs catastrophic | Bartlett et al. (2020) |
372
+ | 4 | **stride_tricks speedup** | Replace nested loops with `as_strided` | 10-50x faster → more ks tried per budget | Standard numpy |
373
+ | 5 | **Double descent awareness** | Skip ks where p ≈ n (interpolation threshold) | Avoid worst-case overfitting zone | Belkin et al. (2019) |
374
 
375
+ **Expected outcome**: Fixes 1+2 alone should increase arc-gen survival from ~50 to
376
+ ~100-150 tasks. Fix 2 is the big one LOOCV finds the λ that maximizes generalization
377
+ while preserving perfect training accuracy.
378
 
379
  ### Why Conv Models Fail ARC-GEN
380
 
 
429
  5. Add(input, total_delta) → output
430
  ```
431
 
 
 
 
 
432
  **Requirements**: opset 17 (ScatterND), all examples available at build time.
433
 
434
  ## Data Notes
 
456
  karnakbaevarthur/neurogolf-2026-task-transformation-library — 269 ONNX
457
  sigmaborov/golf-aura — 254 ONNX
458
  needless090/neurogolf-onnx-v31 — 252 ONNX
 
459
  sigmaborov/golf-solve-agent — 206 ONNX
460
  karnakbaevarthur/logic-for-each-arc-task — 204 ONNX
461
  yash9439/neurogolf-submission — 172 ONNX
 
464
  sigmaborov/test-golf (S_task014..S_task203) — 9×207 ONNX (task-specific)
465
  ```
466
 
 
 
 
 
 
 
 
 
 
467
  ## Reference Notebooks (in repo as neurogolf-2026-solver-notebooks.zip)
468
 
469
  | Notebook | Est LB | Tasks Solved | Technique | Key Source Count |