Add lstsq conv research: Ridge regularization, stride_tricks, benign overfitting theory (2026-04-25)
LEARNING.md (CHANGED): +118 -3
@@ -218,14 +218,14 @@ make our own solver generate arc-gen-validated models for ~300 tasks, we'd match

| Category | Count | Why it Fails | Fix |
|---|---|---|---|
- | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen |
+ | lstsq overfitting (ks≥5) | ~170 | Underdetermined lstsq memorizes train, fails arc-gen | Ridge regularization, more arc-gen in fitting, PyTorch with arc-gen |
| lstsq overfitting (ks=1-3) | ~30 | Even small kernels can overfit with few examples | More arc-gen examples in fitting |
| spatial_gather false positives | ~12 | Coincidental pixel alignments in train don't hold for arc-gen | Validate spatial_gather against arc-gen before accepting |
- | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes |
+ | Variable diff-shape | ~40 | No static ONNX for input-dependent output shapes | Hash matchers (opset 17) |

**Realistic path to 3000+ without blending:**
1. Switch to opset 17 → ~2x score per analytical task (~+200 pts)
- 2. PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
+ 2. Ridge-regularized lstsq + PyTorch learned conv on GPU with arc-gen fitting → ~+50-100 tasks
3. Hash-based matchers for ~20 hard tasks → ~+300 pts
4. Channel reduction → ~-20% cost across board (~+100 pts)
5. Total estimate: ~150-200 validated tasks × ~12 avg score = ~2000-2500 pts

@@ -278,6 +278,121 @@ Arc-gen fitting (same-size examples in lstsq) recovered ~10 additional conv task

## Technical Deep-Dives

### lstsq Conv Research (2026-04-25) — Improving Arc-Gen Survival

External research on our `_lstsq_conv` function and the overparameterized regime.

#### The Core Problem: Benign Overfitting in Underdetermined Systems

Reference: [Benign Overfitting in Linear Classifiers](https://arxiv.org/abs/2307.02044)

When `features > n_patches` (which happens for ks≥5 on small grids with few examples), `np.linalg.lstsq` finds the **minimum-norm solution** among infinitely many perfect fits. This solution happens to perfectly classify training patches but has no guarantee of generalizing to arc-gen examples with different pixel arrangements.

This is exactly what we observe: 307 tasks solved locally (lstsq fits training perfectly) but only 50 survive arc-gen validation. The minimum-norm solution is "benign" for the training set but adversarial for unseen examples.

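A minimal illustration of this regime (synthetic data, not our patch matrices): with far more features than rows, `lstsq` interpolates the training targets exactly, yet the same weights do much worse on fresh samples drawn the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 40, 200, 500   # underdetermined: features >> rows

X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
w_true = np.zeros(n_feat)
w_true[:5] = 1.0                          # only a few features matter
y_train = np.sign(X_train @ w_true)
y_test = np.sign(X_test @ w_true)

# Minimum-norm interpolation: training residual is exactly zero
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
print((np.sign(X_train @ w) == y_train).mean())  # 1.0: memorized
print((np.sign(X_test @ w) == y_test).mean())    # markedly lower
```
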
#### Fix #1: Ridge Regularization (L2 penalty)

Instead of `np.linalg.lstsq(P, T_oh)`, use Ridge regression:

```python
# Current (overfits):
WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]

# Proposed (regularized):
lambda_ridge = 0.01  # tune this
WT = np.linalg.solve(P.T @ P + lambda_ridge * np.eye(P.shape[1]), P.T @ T_oh)
```

**Why this helps**: Ridge adds a penalty on weight magnitude, pushing the solution toward simpler (smaller-norm) weights even in the underdetermined regime. Simpler weights are more likely to generalize because they don't exploit coincidental training correlations.

**Tuning strategy**: Try λ ∈ {0.001, 0.01, 0.1, 1.0}. For each, check whether `argmax(P @ WT) == T` still holds (training accuracy must be perfect). Pick the largest λ that still achieves perfect training accuracy — this gives maximum regularization without losing the training fit. A sketch of this sweep follows below.

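A sketch of that sweep, reusing `P`, `T_oh`, and the per-patch integer labels `T` from `_lstsq_conv` (names as quoted above; the helper itself is ours, not from the notes):

```python
import numpy as np

def ridge_sweep(P, T_oh, T, lambdas=(1.0, 0.1, 0.01, 0.001)):
    """Hypothetical helper: return (WT, lam) for the largest lambda whose
    ridge solution still classifies every training patch correctly."""
    gram = P.T @ P                    # form the normal equations once
    rhs = P.T @ T_oh
    eye = np.eye(P.shape[1])
    for lam in lambdas:               # most regularized first
        WT = np.linalg.solve(gram + lam * eye, rhs)
        if np.array_equal(np.argmax(P @ WT, axis=1), T):
            return WT, lam            # training fit still perfect
    return None                       # no lambda keeps a perfect fit
```
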
**Trade-off**: Ridge may cause some tasks that currently pass training to fail (the regularization prevents perfect memorization). But the tasks it DOES pass are more likely to survive arc-gen. The net effect should be positive.

**IMPORTANT**: Ridge changes the lstsq solve from O(min(m,n)²·max(m,n)) to O(n³), where n = features. For ks=29 (feat = 8410), this is 8410³ ≈ 595B ops. That's ~60s on CPU. Stay within the per-kernel-size time budget. When patches are fewer than features, the dual form sketched below avoids the n³ cost.

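In exactly the underdetermined regime (m = n_patches < n = features), the standard ridge identity `(PᵀP + λI)⁻¹ Pᵀ = Pᵀ (PPᵀ + λI)⁻¹` lets us solve an m×m system instead of an n×n one. This dual form is our addition, not from the research notes; a sketch:

```python
import numpy as np

def ridge_solve(P, T_oh, lam):
    """Sketch: ridge weights via whichever of the primal/dual systems
    is smaller. Both forms give identical weights for lam > 0."""
    m, n = P.shape
    if n <= m:
        # primal: n x n normal equations (features are the scarce dimension)
        return np.linalg.solve(P.T @ P + lam * np.eye(n), P.T @ T_oh)
    # dual: m x m system -- O(m^3) instead of O(n^3) when patches are scarce
    return P.T @ np.linalg.solve(P @ P.T + lam * np.eye(m), T_oh)
```
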
#### Fix #2: Patch Extraction Speedup with stride_tricks

Current code uses nested Python loops to extract patches — very slow for large grids:

```python
# Current (slow):
patches = []
for r in range(oh):
    for c in range(ow):
        p = oh_pad[:, r:r+ks, c:c+ks].flatten()
        patches.append(p)

# Proposed (fast):
from numpy.lib.stride_tricks import as_strided
# oh_pad shape: (10, H+2*pad, W+2*pad)
C, Hp, Wp = oh_pad.shape
strides = oh_pad.strides
patches_view = as_strided(
    oh_pad,
    shape=(oh, ow, C, ks, ks),
    strides=(strides[1], strides[2], strides[0], strides[1], strides[2])
)
P = patches_view.reshape(oh * ow, C * ks * ks)
```

**Speedup**: ~10-50x for typical grid sizes. This doesn't help arc-gen survival directly, but it lets us try more kernel sizes within the time budget, increasing the chance of finding one that generalizes. A safer way to build the same view is sketched below.

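Raw `as_strided` is easy to get wrong (bad strides silently read garbage). `numpy.lib.stride_tricks.sliding_window_view` (NumPy ≥ 1.20) builds the same view safely; a sketch reusing `oh_pad`, `ks`, and `C` from the block above:

```python
from numpy.lib.stride_tricks import sliding_window_view

# windows over the two spatial axes -> (C, oh, ow, ks, ks)
# given pad = ks//2 and odd ks
win = sliding_window_view(oh_pad, (ks, ks), axis=(1, 2))
# reorder to (oh, ow, C, ks, ks) to match the loop's channel-major flatten
P = win.transpose(1, 2, 0, 3, 4).reshape(-1, C * ks * ks)
```
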
#### Fix #3: Numerical Precision for ONNX Export

lstsq produces float64 weights. The ONNX model uses float32:

```python
Wconv = WT.T.reshape(10, 10, ks, ks).astype(np.float32)
```

For large kernel sizes, lstsq weights can be very large (in the 1e3-1e6 range). The float64→float32 cast loses precision, which can make the ONNX model disagree with the lstsq prediction: the argmax flips on borderline patches.

**Fix**: After casting to float32, re-verify against training data using the ONNX model (not the numpy prediction). The current code already does this via `validate(path, td)`, so this case is handled. But be aware that increasing the kernel size increases the risk of float32 precision issues. A cheap numpy-level pre-check is sketched below.

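A smoke test we could run before the ONNX round-trip (our addition, reusing `P`, `WT`, and the labels `T` as above): cast the weights and check whether any argmax decision flips.

```python
import numpy as np

pred64 = np.argmax(P @ WT, axis=1)
pred32 = np.argmax(P.astype(np.float32) @ WT.astype(np.float32), axis=1)
flips = int((pred64 != pred32).sum())
if flips:
    # borderline patches flipped -- expect the ONNX model to disagree too
    print(f"float32 cast flips {flips} training patches")
```
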
#### Fix #4: Try Smallest Kernel First (already done, but worth emphasizing)

The current code tries ks=1,3,5,...,29 in order. This is correct because:
- Smaller kernels have fewer features → more likely to be overdetermined → less overfitting
- Smaller kernels produce cheaper ONNX models → higher score
- If ks=1 works and survives arc-gen, there's no reason to try ks=29

But the code should **stop early** when it finds a kernel that passes arc-gen validation (it already does, via `if validate(path, td): return`). That control flow is sketched below.

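The loop shape, paraphrased (`fit_conv` is a hypothetical helper name; `validate` is the function quoted above):

```python
def solve_task(td, kernel_sizes=range(1, 30, 2)):
    """Try kernels smallest-first; return the first model that both
    fits training and survives validation -- cheapest survivor wins."""
    for ks in kernel_sizes:
        path = fit_conv(td, ks)       # lstsq fit + ONNX export, or None
        if path is not None and validate(path, td):
            return path
    return None
```
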
#### Summary: Implementation Priority

| Fix | Effort | Expected Impact | Risk |
|-----|--------|----------------|------|
| Ridge regularization | Small (change 1 line) | **HIGH** — directly attacks overfitting | May lose some training-perfect fits |
| stride_tricks speedup | Small (refactor patch loop) | Medium — more ks tried per task | None |
| λ sweep per task | Medium (loop over λ values) | **HIGH** — optimal regularization per task | Slower (4x more lstsq calls) |
| float32 precision check | Already done | — | — |

**Recommended first experiment**: Add Ridge with λ=0.01 to `_lstsq_conv`, re-run on all 400 tasks with arc-gen validation, and compare the survival rate to the current 50/400. If survival goes up, sweep λ per task.

### Why Conv Models Fail ARC-GEN

Conv models fitted via lstsq on 6 train+test examples learn weights that perfectly separate those examples. But arc-gen has 250+ examples with: