rogermt committed · verified
Commit f3b3e30 · 1 Parent(s): a6398e3

Add lstsq conv research: Ridge regularization, stride_tricks, benign overfitting theory (2026-04-25)

Files changed (1): LEARNING.md +118 -3
LEARNING.md CHANGED
@@ -218,14 +218,14 @@ make our own solver generate arc-gen-validated models for ~300 tasks, we'd match

 | Category | Count | Why it Fails | Fix |
 |---|---|---|---|
- | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen | Train on arc-gen data (need GPU for PyTorch), or find smaller ks that generalizes |
+ | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen | Ridge regularization, more arc-gen in fitting, PyTorch with arc-gen |
 | lstsq overfitting (ks=1-3) | ~30 | Even small kernels can overfit with few examples | More arc-gen examples in fitting |
 | spatial_gather false positives | ~12 | Coincidental pixel alignments in train don't hold for arc-gen | Validate spatial_gather against arc-gen before accepting |
- | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes | Fundamentally unsolvable with static ONNX (need hash matchers) |
+ | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes | Hash matchers (opset 17) |

 **Realistic path to 3000+ without blending:**
 1. Switch to opset 17 → ~2x score per analytical task (~+200 pts)
- 2. PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
+ 2. Ridge-regularized lstsq + PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
 3. Hash-based matchers for ~20 hard tasks → ~+300 pts
 4. Channel reduction → ~-20% cost across the board (~+100 pts)
 5. Total estimate: ~150-200 validated tasks × ~12 avg score = ~2000-2500 pts

@@ -278,6 +278,121 @@ Arc-gen fitting (same-size examples in lstsq) recovered ~10 additional conv task

 ## Technical Deep-Dives

+ ### lstsq Conv Research (2026-04-25) — Improving Arc-Gen Survival
+
+ External research on our `_lstsq_conv` function and the overparameterized regime.
+
+ #### The Core Problem: Benign Overfitting in Underdetermined Systems
+
+ Reference: [Benign Overfitting in Linear Classifiers](https://arxiv.org/abs/2307.02044)
+
+ When `features > n_patches` (which happens for ks≥5 on small grids with few examples),
+ `np.linalg.lstsq` finds the **minimum-norm solution** among infinitely many perfect fits.
+ This solution perfectly classifies the training patches but has no guarantee of
+ generalizing to arc-gen examples with different pixel arrangements.
+
+ This is exactly what we observe: 307 tasks solved locally (lstsq fits training perfectly)
+ but only 50 survive arc-gen validation. The minimum-norm solution is "benign" for the
+ training set but adversarial for unseen examples.
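+
+ To make the failure mode concrete, here is a toy sketch (illustrative shapes only, not
+ repo code): with more features than patches, lstsq drives training error to zero even on
+ random labels, while accuracy on fresh patches stays at chance.
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ P = rng.normal(size=(40, 200))                # 40 patches, 200 features: underdetermined
+ T = rng.integers(0, 10, size=40)              # random labels, no real structure
+ T_oh = np.eye(10)[T]                          # one-hot targets, as in _lstsq_conv
+ WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]  # minimum-norm solution
+
+ train_acc = (np.argmax(P @ WT, axis=1) == T).mean()     # 1.0: memorized
+ P_new = rng.normal(size=(40, 200))                      # fresh "arc-gen-like" patches
+ test_acc = (np.argmax(P_new @ WT, axis=1) == T).mean()  # ~0.1: chance level
+ ```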
+
+ #### Fix #1: Ridge Regularization (L2 penalty)
+
+ Instead of `np.linalg.lstsq(P, T_oh)`, use Ridge regression:
+
+ ```python
+ # Current (overfits):
+ WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]
+
+ # Proposed (regularized): solve the normal equations with an L2 penalty
+ lambda_ridge = 0.01  # tune this
+ WT = np.linalg.solve(P.T @ P + lambda_ridge * np.eye(P.shape[1]), P.T @ T_oh)
+ ```
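+
+ An equivalent, numerically friendlier formulation (a sketch, not repo code) avoids
+ forming `P.T @ P`, whose condition number is the square of `P`'s: stack `sqrt(λ)·I`
+ under `P` and reuse the existing lstsq call.
+
+ ```python
+ # Ridge as ordinary least squares on an augmented system:
+ # min ||P w - t||^2 + lambda ||w||^2  ==  lstsq on [P; sqrt(lambda) I] vs [t; 0]
+ n_feat = P.shape[1]
+ P_aug = np.vstack([P, np.sqrt(lambda_ridge) * np.eye(n_feat)])
+ T_aug = np.vstack([T_oh, np.zeros((n_feat, T_oh.shape[1]))])
+ WT = np.linalg.lstsq(P_aug, T_aug, rcond=None)[0]
+ ```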
+
+ **Why this helps**: Ridge adds a penalty on weight magnitude, pushing the solution
+ toward simpler (smaller-norm) weights even in the underdetermined regime. Simpler
+ weights are more likely to generalize because they don't exploit coincidental training
+ correlations.
+
+ **Tuning strategy**: Try λ ∈ {0.001, 0.01, 0.1, 1.0}. For each, check that
+ `argmax(P @ WT) == T` still holds (training accuracy must be perfect). Pick the
+ largest λ that still achieves perfect training accuracy — this gives maximum
+ regularization without losing the training fit.
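+
+ A minimal sketch of that sweep, assuming `P`, `T_oh`, and the integer targets `T` from
+ `_lstsq_conv` are in scope:
+
+ ```python
+ WT_best = None
+ for lam in (1.0, 0.1, 0.01, 0.001):  # largest lambda first
+     WT = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ T_oh)
+     if np.array_equal(np.argmax(P @ WT, axis=1), T):  # training still perfect?
+         WT_best = WT
+         break  # keep the most regularized weights that still fit training exactly
+ ```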
+
+ **Trade-off**: Ridge may cause some tasks that currently pass training to fail
+ (the regularization prevents perfect memorization). But the tasks it DOES pass are
+ more likely to survive arc-gen. The net effect should be positive.
+
+ **IMPORTANT**: Ridge changes the lstsq solve from O(min(m,n)²·max(m,n)) to
+ O(n³) where n = features. For ks=29 (feat=8410), this is 8410³ ≈ 595B ops.
+ That's ~60s on CPU. Keep a per-kernel-size time budget.
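+
+ One possible mitigation (our suggestion, not something the repo does yet): the matrix
+ `P.T @ P + λI` is symmetric positive definite, so a Cholesky-based solver does the same
+ job at roughly half the flops of a general LU solve.
+
+ ```python
+ # Requires SciPy; assume_a="pos" selects the Cholesky (POSV) path in LAPACK.
+ from scipy.linalg import solve as sp_solve
+ A = P.T @ P + lambda_ridge * np.eye(P.shape[1])
+ WT = sp_solve(A, P.T @ T_oh, assume_a="pos")
+ ```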
+
+ #### Fix #2: Patch Extraction Speedup with stride_tricks
+
+ Current code uses nested Python loops to extract patches — very slow for large grids:
+
+ ```python
+ # Current (slow): one Python-level slice + copy per output pixel
+ patches = []
+ for r in range(oh):
+     for c in range(ow):
+         p = oh_pad[:, r:r+ks, c:c+ks].flatten()
+         patches.append(p)
+ P = np.array(patches)
+
+ # Proposed (fast): build all patches as a single strided view
+ from numpy.lib.stride_tricks import as_strided
+ # oh_pad shape: (C=10, H+2*pad, W+2*pad)
+ C, Hp, Wp = oh_pad.shape
+ strides = oh_pad.strides
+ # patches_view[r, c, ch, i, j] == oh_pad[ch, r+i, c+j]
+ patches_view = as_strided(
+     oh_pad,
+     shape=(oh, ow, C, ks, ks),
+     strides=(strides[1], strides[2], strides[0], strides[1], strides[2])
+ )
+ # reshape copies here (the view is non-contiguous) and matches the loop's row order
+ P = patches_view.reshape(oh * ow, C * ks * ks)
+ ```
+
+ **Speedup**: ~10-50x for typical grid sizes. This doesn't help arc-gen survival directly
+ but lets us try more kernel sizes within the time budget, increasing the chance of finding
+ one that generalizes.
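+
+ Since `as_strided` fails silently when the strides are wrong, a one-time equivalence
+ check against the loop version is cheap insurance (a sketch, using the names above):
+
+ ```python
+ P_loop = np.stack([oh_pad[:, r:r+ks, c:c+ks].ravel()
+                    for r in range(oh) for c in range(ow)])
+ assert np.array_equal(P, P_loop), "strided patch extraction disagrees with loop"
+ ```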
+
+ #### Fix #3: Numerical Precision for ONNX Export
+
+ lstsq produces float64 weights. The ONNX model uses float32:
+
+ ```python
+ Wconv = WT.T.reshape(10, 10, ks, ks).astype(np.float32)
+ ```
+
+ For large kernel sizes, lstsq weights can be very large (1e3-1e6 range). The float64→float32
+ cast loses precision. This can cause the ONNX model to disagree with the lstsq prediction:
+ the argmax flips on borderline patches.
+
+ **Fix**: After casting to float32, re-verify against training data using the ONNX model
+ (not the numpy prediction). The current code already does this via `validate(path, td)`,
+ so this is already handled. But be aware that increasing kernel size increases the risk
+ of float32 precision issues.
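+
+ A cheap pre-export guard (a sketch; the repo's `validate(path, td)` remains the real
+ check): compare the argmax under float64 and float32 weights before building the model.
+
+ ```python
+ pred64 = np.argmax(P @ WT, axis=1)
+ pred32 = np.argmax(P.astype(np.float32) @ WT.astype(np.float32), axis=1)
+ if not np.array_equal(pred64, pred32):
+     pass  # skip this kernel size: float32 already flips some training argmaxes
+ ```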
+
+ #### Fix #4: Try Smallest Kernel First (already done, but worth emphasizing)
+
+ The current code tries ks=1,3,5,...,29 in order. This is correct because:
+ - Smaller kernels have fewer features → more likely to be overdetermined → less overfitting
+ - Smaller kernels produce cheaper ONNX models → higher score
+ - If ks=1 works and survives arc-gen, there's no reason to try ks=29
+
+ But the code should **stop early** when it finds a kernel that passes arc-gen validation
+ (it already does via `if validate(path, td): return`). Good.
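+
+ For reference, the shape of that search loop (a sketch; `validate` is quoted from the
+ doc, while `fit_conv_weights` and `export_conv_onnx` are hypothetical stand-ins for the
+ repo's helpers):
+
+ ```python
+ for ks in range(1, 31, 2):            # smallest kernel first
+     WT = fit_conv_weights(td, ks)     # lstsq/ridge fit; None if training fit fails
+     if WT is None:
+         continue
+     path = export_conv_onnx(WT, ks)   # write the candidate ONNX model
+     if validate(path, td):            # arc-gen validation gate
+         return                        # stop early: cheapest surviving kernel wins
+ ```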
+
+ #### Summary: Implementation Priority
+
+ | Fix | Effort | Expected Impact | Risk |
+ |-----|--------|----------------|------|
+ | Ridge regularization | Small (change 1 line) | **HIGH** — directly attacks overfitting | May lose some training-perfect fits |
+ | stride_tricks speedup | Small (refactor patch loop) | Medium — more ks tried per task | None |
+ | λ sweep per task | Medium (loop over λ values) | **HIGH** — optimal regularization per task | Slower (4x more lstsq calls) |
+ | float32 precision check | Already done | — | — |
+
+ **Recommended first experiment**: Add Ridge with λ=0.01 to `_lstsq_conv`, re-run on all
+ 400 tasks with arc-gen validation, and compare the survival rate to the current 50/400.
+ If survival goes up, sweep λ per task.
+
396
  ### Why Conv Models Fail ARC-GEN
397
 
398
  Conv models fitted via lstsq on 6 train+test examples learn weights that perfectly separate those examples. But arc-gen has 250+ examples with: