# NeuroGolf Solver — Learning & History

> This file accumulates everything learned across sessions. Read it to avoid repeating mistakes and to understand what techniques work. Newest entries first within each section.

## Version History

| Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
|---------|------|--------------------------|--------|-------------|
| **v5.1** | **2026-04-26** | **49** | **~603.6** | Exp 3: PCA/SVD tested on 400 tasks, 0 PCR solves. Refactored conv.py into composable primitives. PCR fallback added (deferred 2nd pass). No regressions. |
| v5.0 | 2026-04-26 | 49 | ~603.6 | Refactored to 16-file package, opset 17 (IR 8), Slice-based flip/rotate (0 MACs), tensor-based Pad & ReduceSum, lstsq crash fix |
| v4.3 | 2026-04-25 | 50 | ~670 | Updated TODO.md + SKILL.md + LEARNING.md with closed-loop methodology. NO code changes. |
| v4.2 | 2026-04-24 | 50 | ~670 | Added PyTorch learned conv (single+two-layer, multi-seed, ternary snap). Needs GPU. |
| v4.1 | 2026-04-24 | 50 | ~670 | Color map Gather for permutations (+15 pts) |
| v4.0 | 2026-04-24 | 50 | ~656 | ARC-GEN validation, new analytical solvers, s_flip fix, static profiler, submission.csv |
| v3 | 2026-04-24 | 307 (local) / ~40 (LB) | 501 | Added concat_enhanced, varshape_spatial_gather, conv_var_diff |
| v2 | prior | 294 (local) | unknown | Spatial_gather, variable-shape conv, diff-shape conv |
| v1 | prior | 128 | unknown | Conv solver only |

## Mistakes Log (DO NOT REPEAT)

### 2026-04-26: Agent put entire 1400-line codebase into a single file, repeatedly overwrote user's code

- **What**: When implementing v5 opset 17 changes, agent uploaded the entire solver as a single `neurogolf_solver.py` file — three times. Each upload overwrote the user's `run_tasks`, `main`, and W&B code that the agent couldn't read (the read tool truncates at ~1000 lines).
- **Result**: User's W&B logging code was deleted. User's `run_tasks` function was deleted. User had to point agent to a specific commit (3f3d372) to recover.
- **Root cause**: (1) Agent couldn't read the tail of the file due to tool truncation, so it rewrote the entire file from scratch instead of making surgical edits. (2) No Python best practice says "put all code in one file" — the opposite is true. (3) Agent prioritized "getting it done" over preserving existing working code.
- **Rule**: NEVER rewrite an entire file when you can't read all of it. Use the `edit` tool for targeted string replacements. If the file is too large to read, split it into smaller files FIRST (which is what the user ultimately had to specify). NEVER destroy code you can't see.

### 2026-04-26: lstsq SVD non-convergence crash on task 313

- **What**: `np.linalg.lstsq(P, T_oh, rcond=None)` raised `LinAlgError: SVD did not converge` during `solve_conv_variable` for task 313.
- **Result**: Entire solver crashed, no further tasks processed.
- **Root cause**: The `_lstsq_conv` function had no try/except around the lstsq call. `solve_conv_var_diff` already had one, but `_lstsq_conv` (used by `solve_conv_fixed` and `solve_conv_variable`) did not.
- **Fix**: Wrapped lstsq in `try/except (np.linalg.LinAlgError, ValueError): return None` at the remaining unguarded call sites (`_lstsq_conv` and the inline lstsq in `solve_conv_diffshape`).
- **Rule**: EVERY lstsq call must be guarded. SVD non-convergence is rare but real, especially for ill-conditioned patch matrices from unusual grid patterns.
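
A minimal sketch of the guard described above, assuming NumPy only. The helper name `safe_lstsq` is hypothetical and is not the project's `_lstsq_conv`; it only shows the try/except shape that lets a failed fit return `None` instead of crashing the run.

```python
import numpy as np


def safe_lstsq(P, T_oh):
    """Least-squares solve that returns None instead of raising.

    Hypothetical helper name; mirrors the try/except guard described above.
    """
    try:
        W, _residuals, _rank, _sv = np.linalg.lstsq(P, T_oh, rcond=None)
    except (np.linalg.LinAlgError, ValueError):
        return None  # caller treats None as "this solver did not fit"
    return W


# Well-posed system: recovers the weights.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T_oh = P @ np.array([[2.0], [3.0]])
W = safe_lstsq(P, T_oh)

# Mismatched row counts make lstsq raise LinAlgError; the guard returns None.
bad = safe_lstsq(np.ones((3, 2)), np.ones((4, 1)))
```

Returning `None` keeps the failure local to one solver at one kernel size, so the task loop can move on to the next candidate.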

### 2026-04-26: ReduceSum axes attribute invalid in opset 17

- **What**: Code used `ReduceSum(['data'], ['output'], axes=[1,2,3], keepdims=1)` which puts axes as a node attribute. In opset 13+, axes must be a tensor input, not an attribute.
- **Result**: Models would fail ONNX checker validation and potentially fail on Kaggle inference server.
- **Fix**: Created `_build_reducesum()` helper that adds axes as an int64 initializer tensor and passes it as the 2nd input to ReduceSum. Applied to `s_constant` (axes=[1,2,3]), `solve_conv_variable` (axes=[1]), `solve_conv_var_diff` (axes=[1]).
- **Rule**: When changing opset version, audit ALL operators for breaking API changes. Key changes up to opset 17: ReduceSum moved axes from attribute to tensor input at opset 13 (ReduceMean and ReduceMax keep axes as an attribute until opset 18). Pad moved pads from attribute to tensor input at opset 11. Slice has taken starts/ends/axes/steps as tensor inputs since opset 10.

### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
- **What**: Generated neurogolf_solver_v5.py with opset 17 Slice-based transforms, LOOCV Ridge tuning, stride_tricks, composition detectors, channel reduction wrapper — claimed all features were "working" in the docstring and README
- **Result**: Uploaded to repo, overwrote neurogolf_solver.py. Tested only 10 individual tasks manually. 3/10 FAILED arc-gen validation (tasks 4, 6, 241 conv models). NEVER ran full 400 with arc-gen validation. LOOCV Ridge theory in code was never tested against actual data. Estimated LB score is UNKNOWN — cannot claim improvement over v4's proven ~670.
- **Rule**: NEVER mark a feature as done until it is validated against full arc-gen data on a representative sample of tasks.

### 2026-04-25: Agent created version-named file (neurogolf_solver_v5.py) violating project convention
- **Rule**: No version numbers in filenames. Use git commits for version tracking.

### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen survival without evidence
- **Rule**: Theory from papers is NOT proof for our specific data. Run A/B experiments first.

### 2026-04-25: Agent misrepresented user's intent — BLENDING is NOT the user's strategy
- **Rule**: LEARNING.md must reflect the USER'S strategy. Competitive intelligence goes in "What Others Do" section only.

### 2026-04-25: Composition detectors, channel reduction wrapper — untested dead code
- **Rule**: Only add a solver if it demonstrably solves ≥1 task. Delete dead code.

### 2026-04-25: Agent delivered untested code and asked user to validate it
- **Rule**: VALIDATE FIRST, DELIVER SECOND.

### 2026-04-24: PyTorch 2-layer conv — fits training but doesn't generalize to arc-gen
### 2026-04-24: Arc-gen in lstsq fitting exposes overfitting
### 2026-04-24: CuPy/GPU for lstsq — DOES NOT HELP
### 2026-04-24: Channel Gather for non-permutation color maps — WRONG OUTPUT
### 2026-04-24: ARC-GEN not loaded — THE #1 SCORE KILLER (v3→v4 fix)
### 2026-04-24: s_flip used GatherElements — OPSET 11 BUG
### 2026-04-24: score_network fallback returned (0,0,0)
### 2026-04-24: Ignored EXCLUDED tasks

## Competitive Intelligence

### What Others Do (For Awareness Only — We Do NOT Blend)

#### Why top notebooks score 4000+ and we score ~670

Top notebooks are **BLENDERS** — they assemble pre-solved ONNX models from public sources.

**Our strategy**: Build our own solver. No blending. No public datasets.

#### The 6 Key Techniques They Have That We Lack

1. **Opset 17** — ✅ DONE in v5. Slice+Transpose for near-zero cost transforms.
2. **Channel Reduction Wrapper** — 🔲 Not yet. Conv1x1(10→N) → transform → Conv1x1(N→10).
3. **Composition Detectors** — 🔲 Not yet. Need to scan 400 tasks to find actual instances first.
4. **Best-of-N Model Selection** — 🔲 Not yet. Generate 20+ candidates, keep cheapest valid.
5. **ONNX Optimizer Pass** — 🔲 Not yet. onnxoptimizer.optimize() for dead-code elimination.
6. **LLM Rescue** — 🔲 Not yet. Per-task ONNX graphs for algorithmic tasks (gravity, outline, etc.)

## Deep Research Findings

### Exp 3: PCA/Truncated SVD Before lstsq — FULL RESULTS (2026-04-26)

**Implementation:** Refactored conv.py into composable primitives:
- `_build_patch_matrix(exs, ks, bias, full_30)` → P, T, T_oh
- `_solve_weights(P, T, T_oh)` → WT via raw lstsq
- `_solve_weights_pcr(P, T, T_oh, thresholds)` → WT via PCA regression
- `_extract_weights(WT, ks, bias)` → Wconv, B for ONNX

All 4 conv solvers use deferred 2-pass design:
- Pass 1: raw lstsq at all ks (identical behavior to baseline)
- Pass 2: PCR on ks values where lstsq fit train but failed arc-gen validation
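
The deferred 2-pass control flow can be sketched as follows. This is an illustration only: `two_pass_fit` and the three injected callables (`fit_raw`, `fit_pcr`, `passes_arcgen`) are hypothetical names, and it assumes a fitter returns `None` when it cannot fit the training examples.

```python
from typing import Callable, Iterable, Optional, Tuple

Weights = object  # placeholder type for whatever the fitter returns


def two_pass_fit(
    kernel_sizes: Iterable[int],
    fit_raw: Callable[[int], Optional[Weights]],
    fit_pcr: Callable[[int], Optional[Weights]],
    passes_arcgen: Callable[[Weights], bool],
) -> Optional[Tuple[int, Weights, str]]:
    """Return (ks, weights, method) for the first fit that survives arc-gen."""
    deferred = []  # ks values where raw lstsq fit training but failed arc-gen
    # Pass 1: raw lstsq at every ks, identical to the baseline behaviour.
    for ks in kernel_sizes:
        w = fit_raw(ks)
        if w is None:
            continue  # did not fit training at this ks
        if passes_arcgen(w):
            return ks, w, "lstsq"
        deferred.append(ks)
    # Pass 2: PCR only where lstsq fit training but failed arc-gen validation.
    for ks in deferred:
        w = fit_pcr(ks)
        if w is not None and passes_arcgen(w):
            return ks, w, "pcr"
    return None
```

Because pass 2 only runs after every raw fit has been tried, a task that the baseline already solves returns before PCR is ever invoked, which is why the design cannot regress baseline results.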

**PCR algorithm:**
```python
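import numpy as np

# P: patch matrix (n x p); T_oh: one-hot targets, both from _build_patch_matrix
U, s, Vt = np.linalg.svd(P, full_matrices=False)
cumvar = np.cumsum(s**2) / np.sum(s**2)  # cumulative explained variance
for thresh in [0.999, 0.99, 0.95]:
    k = int(np.searchsorted(cumvar, thresh)) + 1  # smallest k reaching thresh
    k = max(k, 5)                                 # keep at least 5 components
    P_red = U[:, :k] * s[:k]                      # project to top-k components
    w_red, *_ = np.linalg.lstsq(P_red, T_oh, rcond=None)
    w_full = Vt[:k].T @ w_red                     # map back to full p-dim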
```

**Diagnostic results on 25 solved conv tasks:**

| p/n regime | # Tasks | PCR train-fit? | Arc-gen impact |
|------------|---------|----------------|----------------|
| < 0.5 | 17 | Yes (0.99 thresh) | Already 100% — no improvement |
| 0.5-1.0 | 0 | N/A | N/A |
| > 1.0 | 8 | 4/8 fail at ALL thresholds | PCR removes signal-carrying dimensions |

Key observation: at p/n > 1.0, the "noise" dimensions PCA removes actually carry part of the training signal. Truncation causes train_fail — the model can't even fit training data after dimensionality reduction.

**Diagnostic results on 345 unsolved tasks (same-shape, ks≤9):**

- Only **10 tasks** have any ks where lstsq fits training
- PCR improves arc-gen accuracy on **4 tasks**, but none reach 100%:
  - Task 32: 87.5% → 94.9% (+7.4 pp)
  - Task 389: 87.2% → 95.7% (+8.5 pp)
  - Task 129: 59.6% → 63.0% (+3.4 pp)
  - Task 229: 57.0% → 60.0% (+3.0 pp)

**Full 400-task run:** 0 PCR solves, 0 regressions, 49/49 baseline tasks preserved.

**Why it failed:** Three distinct failure modes:
1. **p/n < 0.5 (17/25 solved tasks):** lstsq already generalizes perfectly. PCR is unnecessary overhead.
2. **p/n > 1.0 (8/25 solved tasks):** Signal requires ALL dimensions. PCA truncation destroys the training fit. The minimum-norm solution from lstsq distributes weight across ALL singular vectors, and removing any causes prediction errors.
3. **335/345 unsolved tasks:** No ks fits training at all. The task requires non-local operations (flood fill, mode counting, conditional logic) that conv can't represent regardless of regularization.
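
The second failure mode can be reproduced on synthetic data. The sketch below uses random Gaussian matrices rather than the real one-hot patch matrices, so it only illustrates the mechanism (truncation breaks an exact interpolating fit when p > n); the sizes `n=20`, `p=50`, `k=10` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # p/n = 2.5: more features than patches
P = rng.standard_normal((n, p))
T = rng.standard_normal((n, 1))

# Raw lstsq: minimum-norm solution, interpolates the training data exactly.
w_full, *_ = np.linalg.lstsq(P, T, rcond=None)
err_full = np.abs(P @ w_full - T).max()

# PCR with the top-k singular directions only.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
k = 10
w_red, *_ = np.linalg.lstsq(U[:, :k] * s[:k], T, rcond=None)
w_pcr = Vt[:k].T @ w_red
err_pcr = np.abs(P @ w_pcr - T).max()

print(f"raw lstsq max train error: {err_full:.2e}")
print(f"PCR (k={k}) max train error: {err_pcr:.2e}")
```

The truncated fit can only reproduce the part of `T` that lies in the span of the kept left singular vectors, so any target component in the discarded directions shows up directly as training error.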

**Conclusion:** The "overfitting hypothesis" from Nakkiran 2019 was correct in theory but inapplicable. The tasks where conv fails arc-gen fail because conv is architecturally wrong, not because of bad regularization. Regularization experiments (Ridge, PCA, skip-ks) are exhausted.

### lstsq Conv Research (2026-04-25)

**Key Finding: Our overfitting is CATASTROPHIC, not benign.**
- Bartlett et al. benign overfitting requires high effective rank of covariance. Our one-hot patches have LOW effective rank.
- Double descent peak at ks=5,7,9 (p ≈ n).
- Ridge predicted to fail; Lasso (ℓ₁) theoretically better for sparse signals.

### ONNX Opset 17 Migration Notes (2026-04-26)

**Breaking changes from opset 10:**

| Operator | Opset 10 | Opset 13+ (incl. 17) |
|----------|----------|----------------------|
| ReduceSum | axes as **attribute** | axes as **tensor input** |
| ReduceMean | axes as attribute | axes as attribute ✅ (unchanged until opset 18) |
| Pad | pads as **attribute** | pads as **tensor input** (since opset 11) |
| Slice | starts/ends/axes/steps as tensor inputs | unchanged ✅ (steps is the optional 5th input) |
| Conv | pads as attribute | pads as attribute ✅ (unchanged) |
| Transpose | perm as attribute | perm as attribute ✅ (unchanged) |
| Gather | unchanged | unchanged ✅ |

**IR version**: Opset 17 requires IR ≤ 8. We use IR=8.

**Slice(step=-1) for reversing:**
- `starts=[dim-1], ends=[INT64_MIN], axes=[ax], steps=[-1]` — reverses entire axis
- INT64_MIN as end sentinel (not -1, which means dim-1 in ONNX)
- Zero MACs, zero params, near-zero memory (just 4 int64 scalars)
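
The ONNX Slice specification follows NumPy slicing semantics, so the sentinel choice can be checked in plain NumPy. This is an analogy under that assumption, not an ONNX graph; the array `x` is an arbitrary example.

```python
import numpy as np

INT64_MIN = np.iinfo(np.int64).min

x = np.arange(5)
dim = x.shape[0]

# ends=-1 is interpreted as dim-1, so start == end and the slice is empty.
empty = x[dim - 1 : -1 : -1]

# A very negative end is clamped to "before index 0", so the whole axis is reversed.
reversed_x = x[dim - 1 : INT64_MIN : -1]

print(empty)       # empty array
print(reversed_x)  # elements of x in reverse order
```

In the ONNX graph the same four values (`starts`, `ends`, `axes`, `steps`) become int64 initializers passed as inputs to the Slice node.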

## What Has NOT Worked

| Technique | Result | Why |
|-----------|--------|-----|
| **PCA/Truncated SVD (Exp 3)** | **0/400 PCR solves** | **Signal in noise dims; unsolved tasks = architecture mismatch** |
| Ridge/LOOCV λ | Fails arc-gen | Catastrophic, not benign overfitting |
| Skip ks=5,7,9 (Exp 1) | Hurts 2 tasks | Some tasks genuinely need interpolation-regime ks |
| CuPy GPU lstsq | OOM + same speed | O(n³) SVD bottleneck |
| PyTorch 2-layer (no arc-gen) | 0/30 arc-gen pass | Memorizes training |
| Composition detectors | No tasks found | May not exist in dataset |
| Channel reduction wrapper | Never executed | Disabled due to Gather incompatibility |

## Technical Notes

### ARC-AGI Task Statistics
- 400 tasks total, 6 excluded: {21, 55, 80, 184, 202, 366}
- ~25 analytical tasks, ~25 conv tasks that survive arc-gen, ~350 unsolved

### Score Calculation
```python
import math
score = max(1.0, 25.0 - math.log(macs + memory_bytes + params))
```
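
Because the formula above is logarithmic in total cost, cutting cost tenfold adds about ln(10) ≈ 2.30 points per task. The sketch below wraps the formula in a function and uses it to pick the best valid candidate for a task. The names `neurogolf_score` and `cheapest_valid` and the candidate dict keys are hypothetical, and rejecting a non-positive cost is an assumption, since the source does not say how a zero cost is handled.

```python
import math


def neurogolf_score(macs: int, memory_bytes: int, params: int) -> float:
    """Score per the formula above; cost must be positive."""
    cost = macs + memory_bytes + params
    if cost <= 0:
        raise ValueError("cost must be positive")
    return max(1.0, 25.0 - math.log(cost))


def cheapest_valid(candidates):
    """Pick the highest-scoring candidate among those flagged valid.

    Each candidate is a dict with hypothetical keys:
    name, valid, macs, memory_bytes, params.
    """
    valid = [c for c in candidates if c["valid"]]
    if not valid:
        return None
    return max(valid, key=lambda c: neurogolf_score(c["macs"], c["memory_bytes"], c["params"]))
```

A candidate that fails validation is never selected, however cheap it is, which matches the "keep cheapest valid" idea.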

### Conv Solver SVD Spectrum Analysis (Exp 3 data, 2026-04-26)

Effective rank at 99% variance for solved conv tasks:
| Task | ks | n patches | p features | p/n | eff_rank_99 | arc-gen acc |
|------|----|-----------|-----------:|----:|------------:|------------:|
| 171 | 3 | 799 | 90 | 0.11 | 5 | 100% |
| 120 | 3 | 4103 | 90 | 0.02 | 22 | 100% |
| 305 | 9 | 3584 | 810 | 0.23 | 416 | 100% |
| 60 | 11 | 715 | 1210 | 1.69 | 245 | 98.5% |
| 136 | 15 | 1400 | 2250 | 1.61 | 237 | 99.6% |
| 322 | 5 | 126 | 250 | 1.98 | 100 | 97.0% |

Key pattern: tasks with p/n < 0.5 → 100% arc-gen. Tasks with p/n > 1.0 → 97-99.6% arc-gen. The 0.4-3% error is the interpolation-regime overfitting, but it still passes validation.
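
The p/n column can be recomputed from ks and the patch count. The sketch assumes p = 10 · ks² (10 color channels, no bias column), which matches every row in the table; the function names are illustrative.

```python
from typing import Tuple


def patch_features(ks: int, channels: int = 10) -> int:
    """Features per patch, assuming p = channels * ks**2 (no bias column)."""
    return channels * ks * ks


def regime(ks: int, n_patches: int) -> Tuple[float, str]:
    """Return (p/n, bucket) using the 0.5 and 1.0 cut-offs from the diagnostics."""
    ratio = patch_features(ks) / n_patches
    if ratio < 0.5:
        bucket = "< 0.5"
    elif ratio <= 1.0:
        bucket = "0.5-1.0"
    else:
        bucket = "> 1.0"
    return ratio, bucket


# Rows from the table above: (task, ks, n patches)
for task, ks, n in [(171, 3, 799), (305, 9, 3584), (60, 11, 715), (322, 5, 126)]:
    ratio, bucket = regime(ks, n)
    print(f"task {task}: p={patch_features(ks)}, p/n={ratio:.2f} (regime {bucket})")
```

Computing the ratio before fitting gives a quick indication of whether a given ks puts the task in the interpolation regime.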

### Lstsq Matrix Sizes (for reference)
| Grid | Examples | Patches (n) | ks=3 (p=90) | ks=7 (p=490) | ks=29 (p=8410) |
|------|----------|-------------|-------------|--------------|----------------|
| 7×7  | 4        | 196         | 196×90      | **196×490 (underdetermined)** | 196×8410 |
| 12×12| 4        | 576         | 576×90      | 576×490      | 576×8410 |
| 21×21| 16       | 7056        | 7056×90     | 7056×490     | **7056×8410** |

## Session Notes for Future Agents

**Before touching code:**
1. Read this file (LEARNING.md) — all the way through
2. Read SKILL.md — especially "Development Methodology" and "Submission Checklist"
3. Read TODO.md — check experiment log and research queue
4. Run the current solver on 20-50 tasks to establish baseline
5. Only then: design experiment, implement, validate, compare

**Code structure (v5.1):**
- The solver is a Python package at `neurogolf_solver/`
- Run with `python -m neurogolf_solver.main [args]`
- **conv.py** now has composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
- To add new fitting methods: implement `_solve_weights_XXX(P, T, T_oh)` returning WT or None
- Edit individual files surgically — NEVER rewrite the whole package
- The legacy `neurogolf_solver.py` at root is v4, kept for reference — do NOT edit it

**Before claiming a feature works:**
- Must pass arc-gen on ≥20 tasks (or full 400 if cheap)
- Must show >10% improvement in arc-gen survival rate OR total score
- Must include A/B comparison

**Before uploading code:**
- Must have run full 400-task arc-gen validation
- Must confirm total score ≥ previous best

**What to focus on next (post Exp 3):**
1. **Phase 3: New solver types** — hash matchers, pattern detectors, LLM rescue
2. **Phase 1a: Opset 17 analytical conversions** — reduce cost on existing 24 analytical tasks
3. **Phase 4: ONNX optimizer** — reduce cost on all 49 solved tasks
4. Lasso (Exp 5) is low priority — only 10 unsolved tasks even have lstsq fits, ceiling is very low