Add lstsq conv research: Ridge regularization, stride_tricks, benign overfitting theory (2026-04-25)
LEARNING.md (CHANGED): +118 -3
@@ -218,14 +218,14 @@ make our own solver generate arc-gen-validated models for ~300 tasks, we'd match

| Category | Count | Why it Fails | Fix |
|---|---|---|---|
- | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen |
+ | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen | Ridge regularization, more arc-gen in fitting, PyTorch with arc-gen |
| lstsq overfitting (ks=1-3) | ~30 | Even small kernels can overfit with few examples | More arc-gen examples in fitting |
| spatial_gather false positives | ~12 | Coincidental pixel alignments in train don't hold for arc-gen | Validate spatial_gather against arc-gen before accepting |
- | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes |
+ | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes | Hash matchers (opset 17) |

**Realistic path to 3000+ without blending:**
1. Switch to opset 17 → ~2x score per analytical task (~+200 pts)
- 2. PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
+ 2. Ridge-regularized lstsq + PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
3. Hash-based matchers for ~20 hard tasks → ~+300 pts
4. Channel reduction → ~-20% cost across board (~+100 pts)
5. Total estimate: ~150-200 validated tasks × ~12 avg score = ~2000-2500 pts

@@ -278,6 +278,121 @@ Arc-gen fitting (same-size examples in lstsq) recovered ~10 additional conv task

## Technical Deep-Dives

### lstsq Conv Research (2026-04-25) — Improving Arc-Gen Survival

External research on our `_lstsq_conv` function and the overparameterized regime.

#### The Core Problem: Benign Overfitting in Underdetermined Systems

Reference: [Benign Overfitting in Linear Classifiers](https://arxiv.org/abs/2307.02044)

When `features > n_patches` (which happens for ks≥5 on small grids with few examples), `np.linalg.lstsq` finds the **minimum-norm solution** among infinitely many perfect fits. This solution happens to perfectly classify training patches but has no guarantee of generalizing to arc-gen examples with different pixel arrangements.

This is exactly what we observe: 307 tasks solved locally (lstsq fits training perfectly) but only 50 survive arc-gen validation. The minimum-norm solution is "benign" for the training set but adversarial for unseen examples.

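A minimal illustration of this regime (synthetic data, not our patch matrices): with far more features than rows, `lstsq` interpolates the training targets exactly, yet the same weights do much worse on fresh samples drawn the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 40, 200, 500   # underdetermined: features >> rows

X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
w_true = np.zeros(n_feat)
w_true[:5] = 1.0                          # only a few features matter
y_train = np.sign(X_train @ w_true)
y_test = np.sign(X_test @ w_true)

# Minimum-norm interpolation: training residual is exactly zero
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
print((np.sign(X_train @ w) == y_train).mean())  # 1.0: memorized
print((np.sign(X_test @ w) == y_test).mean())    # markedly lower
```
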
#### Fix #1: Ridge Regularization (L2 penalty)

Instead of `np.linalg.lstsq(P, T_oh)`, use Ridge regression:

```python
# Current (overfits):
WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]

# Proposed (regularized):
lambda_ridge = 0.01  # tune this
WT = np.linalg.solve(P.T @ P + lambda_ridge * np.eye(P.shape[1]), P.T @ T_oh)
```

**Why this helps**: Ridge adds a penalty on weight magnitude, pushing the solution toward simpler (smaller-norm) weights even in the underdetermined regime. Simpler weights are more likely to generalize because they don't exploit coincidental training correlations.

**Tuning strategy**: Try λ ∈ {0.001, 0.01, 0.1, 1.0}. For each, check whether `argmax(P @ WT) == T` still holds (training accuracy must be perfect). Pick the largest λ that still achieves perfect training accuracy — this gives maximum regularization without losing the training fit. A sketch of this sweep follows below.

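A sketch of that sweep, reusing `P`, `T_oh`, and the per-patch integer labels `T` from `_lstsq_conv` (names as quoted above; the helper itself is ours, not from the notes):

```python
import numpy as np

def ridge_sweep(P, T_oh, T, lambdas=(1.0, 0.1, 0.01, 0.001)):
    """Hypothetical helper: return (WT, lam) for the largest lambda whose
    ridge solution still classifies every training patch correctly."""
    gram = P.T @ P                    # form the normal equations once
    rhs = P.T @ T_oh
    eye = np.eye(P.shape[1])
    for lam in lambdas:               # most regularized first
        WT = np.linalg.solve(gram + lam * eye, rhs)
        if np.array_equal(np.argmax(P @ WT, axis=1), T):
            return WT, lam            # training fit still perfect
    return None                       # no lambda keeps a perfect fit
```
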
**Trade-off**: Ridge may cause some tasks that currently pass training to fail (the regularization prevents perfect memorization). But the tasks it DOES pass are more likely to survive arc-gen. The net effect should be positive.

**IMPORTANT**: Ridge changes the lstsq solve from O(min(m,n)²·max(m,n)) to O(n³), where n = features. For ks=29 (feat = 8410), this is 8410³ ≈ 595B ops. That's ~60s on CPU. Stay within the per-kernel-size time budget. When patches are fewer than features, the dual form sketched below avoids the n³ cost.

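In exactly the underdetermined regime (m = n_patches < n = features), the standard ridge identity `(PᵀP + λI)⁻¹ Pᵀ = Pᵀ (PPᵀ + λI)⁻¹` lets us solve an m×m system instead of an n×n one. This dual form is our addition, not from the research notes; a sketch:

```python
import numpy as np

def ridge_solve(P, T_oh, lam):
    """Sketch: ridge weights via whichever of the primal/dual systems
    is smaller. Both forms give identical weights for lam > 0."""
    m, n = P.shape
    if n <= m:
        # primal: n x n normal equations (features are the scarce dimension)
        return np.linalg.solve(P.T @ P + lam * np.eye(n), P.T @ T_oh)
    # dual: m x m system -- O(m^3) instead of O(n^3) when patches are scarce
    return P.T @ np.linalg.solve(P @ P.T + lam * np.eye(m), T_oh)
```
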
#### Fix #2: Patch Extraction Speedup with stride_tricks

Current code uses nested Python loops to extract patches — very slow for large grids:

```python
# Current (slow):
patches = []
for r in range(oh):
    for c in range(ow):
        p = oh_pad[:, r:r+ks, c:c+ks].flatten()
        patches.append(p)

# Proposed (fast):
from numpy.lib.stride_tricks import as_strided
# oh_pad shape: (10, H+2*pad, W+2*pad)
C, Hp, Wp = oh_pad.shape
strides = oh_pad.strides
patches_view = as_strided(
    oh_pad,
    shape=(oh, ow, C, ks, ks),
    strides=(strides[1], strides[2], strides[0], strides[1], strides[2])
)
P = patches_view.reshape(oh * ow, C * ks * ks)
```

**Speedup**: ~10-50x for typical grid sizes. This doesn't help arc-gen survival directly, but it lets us try more kernel sizes within the time budget, increasing the chance of finding one that generalizes. A safer way to build the same view is sketched below.

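Raw `as_strided` is easy to get wrong (bad strides silently read garbage). `numpy.lib.stride_tricks.sliding_window_view` (NumPy ≥ 1.20) builds the same view safely; a sketch reusing `oh_pad`, `ks`, and `C` from the block above:

```python
from numpy.lib.stride_tricks import sliding_window_view

# windows over the two spatial axes -> (C, oh, ow, ks, ks)
# given pad = ks//2 and odd ks
win = sliding_window_view(oh_pad, (ks, ks), axis=(1, 2))
# reorder to (oh, ow, C, ks, ks) to match the loop's channel-major flatten
P = win.transpose(1, 2, 0, 3, 4).reshape(-1, C * ks * ks)
```
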
#### Fix #3: Numerical Precision for ONNX Export

lstsq produces float64 weights. The ONNX model uses float32:

```python
Wconv = WT.T.reshape(10, 10, ks, ks).astype(np.float32)
```

For large kernel sizes, lstsq weights can be very large (in the 1e3-1e6 range). The float64→float32 cast loses precision, which can make the ONNX model disagree with the lstsq prediction: the argmax flips on borderline patches.

**Fix**: After casting to float32, re-verify against training data using the ONNX model (not the numpy prediction). The current code already does this via `validate(path, td)`, so this case is handled. But be aware that increasing the kernel size increases the risk of float32 precision issues. A cheap numpy-level pre-check is sketched below.

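A smoke test we could run before the ONNX round-trip (our addition, reusing `P`, `WT`, and the labels `T` as above): cast the weights and check whether any argmax decision flips.

```python
import numpy as np

pred64 = np.argmax(P @ WT, axis=1)
pred32 = np.argmax(P.astype(np.float32) @ WT.astype(np.float32), axis=1)
flips = int((pred64 != pred32).sum())
if flips:
    # borderline patches flipped -- expect the ONNX model to disagree too
    print(f"float32 cast flips {flips} training patches")
```
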
#### Fix #4: Try Smallest Kernel First (already done, but worth emphasizing)

The current code tries ks=1,3,5,...,29 in order. This is correct because:
- Smaller kernels have fewer features → more likely to be overdetermined → less overfitting
- Smaller kernels produce cheaper ONNX models → higher score
- If ks=1 works and survives arc-gen, there's no reason to try ks=29

But the code should **stop early** when it finds a kernel that passes arc-gen validation (it already does, via `if validate(path, td): return`). That control flow is sketched below.

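The loop shape, paraphrased (`fit_conv` is a hypothetical helper name; `validate` is the function quoted above):

```python
def solve_task(td, kernel_sizes=range(1, 30, 2)):
    """Try kernels smallest-first; return the first model that both
    fits training and survives validation -- cheapest survivor wins."""
    for ks in kernel_sizes:
        path = fit_conv(td, ks)       # lstsq fit + ONNX export, or None
        if path is not None and validate(path, td):
            return path
    return None
```
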
#### Summary: Implementation Priority

| Fix | Effort | Expected Impact | Risk |
|-----|--------|----------------|------|
| Ridge regularization | Small (change 1 line) | **HIGH** — directly attacks overfitting | May lose some training-perfect fits |
| stride_tricks speedup | Small (refactor patch loop) | Medium — more ks tried per task | None |
| λ sweep per task | Medium (loop over λ values) | **HIGH** — optimal regularization per task | Slower (4x more lstsq calls) |
| float32 precision check | Already done | — | — |

**Recommended first experiment**: Add Ridge with λ=0.01 to `_lstsq_conv`, re-run on all 400 tasks with arc-gen validation, and compare the survival rate to the current 50/400. If survival goes up, sweep λ per task.

### Why Conv Models Fail ARC-GEN

Conv models fitted via lstsq on 6 train+test examples learn weights that perfectly separate those examples. But arc-gen has 250+ examples with: