rogermt committed on
Commit ff5c300 · verified · 1 Parent(s): 72d0404

Update LEARNING.md for v5 refactor + new entries

Files changed (1): LEARNING.md +120 -311

LEARNING.md CHANGED
@@ -6,6 +6,7 @@
 
 | Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
 |---------|------|--------------------------|--------|-------------|
 | v4.3 | 2026-04-25 | 50 | ~670 | Updated TODO.md + SKILL.md + LEARNING.md with closed-loop methodology. NO code changes. |
 | v4.2 | 2026-04-24 | 50 | ~670 | Added PyTorch learned conv (single+two-layer, multi-seed, ternary snap). Needs GPU. |
 | v4.1 | 2026-04-24 | 50 | ~670 | Color map Gather for permutations (+15 pts) |
@@ -16,6 +17,28 @@
 
 ## Mistakes Log (DO NOT REPEAT)
 
 ### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
 - **What**: Generated neurogolf_solver_v5.py with opset 17 Slice-based transforms, LOOCV Ridge tuning, stride_tricks, composition detectors, channel reduction wrapper — claimed all features were "working" in the docstring and README
 - **Result**: Uploaded to repo, overwrote neurogolf_solver.py. Tested only 10 individual tasks manually. 3/10 FAILED arc-gen validation (tasks 4, 6, 241 conv models). NEVER ran the full 400 with arc-gen validation. The LOOCV Ridge theory in the code was never tested against actual data. Estimated LB score is UNKNOWN — cannot claim improvement over v4's proven ~670.
@@ -27,91 +50,53 @@
 - **What**: Created neurogolf_solver_v5.py instead of updating neurogolf_solver.py directly
 - **Result**: User had to explicitly request deletion of the version-named file. Repo had duplicate code. Confusion about which file is canonical.
 - **Root cause**: Did not check existing repo structure to understand naming conventions. SKILL.md says "Solver: neurogolf_solver.py".
- - **Rule**: No version numbers in filenames. Always update neurogolf_solver.py in place. Tag versions in git or use commit history.
 
 ### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen survival without evidence
- - **What**: Wrote 200+ lines of Ridge tuning code (_tune_ridge_loocv, condition number checks, effective rank diagnostics) based on Cawley & Talbot (2010) and Bartlett et al. (2020) theory. Claimed in docstring: "LOOCV Ridge tuning in _lstsq_conv with condition number check + SVD-based λ auto-tune"
- - **Result**: Code exists in the solver but ZERO evidence it actually helps. The lstsq problem is NOT benign overfitting — it's catastrophic overfitting, because ARC patch covariance has LOW effective rank (structured, low-entropy inputs with only a few active colors). Ridge cannot fix catastrophic overfitting in the interpolation threshold regime (p≈n). No A/B test performed.
- - **Root cause**: Applied theory from papers without understanding the empirical regime. ARC tasks have only a few active colors → patch covariance has few dominant eigenvalues → noise concentrates in low-rank directions → catastrophic, not benign, overfitting. The LEARNING.md "Benign Overfitting Theory" section explicitly states this, but the agent ignored it while writing the code.
- - **Rule**: Theory from papers is NOT proof for our specific data. Run A/B experiments: with vs without the feature on the same tasks, measuring arc-gen survival rate. Only keep features that show >10% improvement on a test set. If LEARNING.md says a regime is "catastrophic", do not write code that assumes "benign".
 
 ### 2026-04-25: Agent misrepresented user's intent in LEARNING.md — BLENDING is NOT the user's strategy
- - **What**: Added a mistakes log entry claiming "Agent ignored blending" and wrote "start with blending" as a rule. The user explicitly stated: "this will not be done ... i am writing my own models no blending ... this is major flaw in the competition loophole"
- - **Result**: LEARNING.md now contains a rule that contradicts the user's competitive philosophy. If a future agent reads this, they will be told to implement blending — the exact opposite of what the user wants. The LEARNING.md file itself became misleading.
- - **Root cause**: Agent confused "competitive intelligence" (what others do) with "user's strategy" (what we should do). The LEARNING.md Competitive Intelligence section is for awareness, not instruction. User wants to win on solver merit, not loopholes.
- - **Rule**: LEARNING.md must reflect the USER'S strategy, not the competition's meta. If the user says "no blending", that is the rule. Competitive intelligence goes in a separate "What others do" section, never in "Rules" or "Mistakes". Update LEARNING.md to separate "our approach" from "market intelligence".
-
- ### 2026-04-25: Agent's composition detectors (rotate+color, flip+color, transpose+color) are untested
- - **What**: Wrote s_composition_rotate_color, s_composition_flip_color, s_composition_transpose_color with complex ONNX graph chaining code (~150 lines)
- - **Result**: No known task that these solve. No test found on a 10-task sample. May never trigger on any real task. Convoluted code that increases solver complexity for zero proven gain.
- - **Root cause**: Added features from the TODO.md checklist without checking whether they solve actual tasks in the dataset.
- - **Rule**: Only add a solver if it demonstrably solves at least 1 task that no other solver handles. Test on the full 400 before keeping. Delete dead code.
-
- ### 2026-04-25: Agent's channel reduction wrapper is DISABLED in the code it wrote
- - **What**: Wrote _build_channel_reduced_model and _try_channel_reduction with extensive comments claiming "Channel reduction wrapper for tasks with <8 colors"
- - **Result**: The wrapper is bypassed — it returns the raw model unmodified. The code claims to add channel reduction but is a no-op. Wasted ~80 lines of complex ONNX graph manipulation that never executes.
- - **Root cause**: Knew channel reduction breaks Gather-based models (Reshape hardcodes [1,10,900]), but wrote the feature anyway and left it disabled with a comment instead of fixing or deleting it.
- - **Rule**: Do not write features and then disable them. Either make them work or delete them. Dead code is technical debt.
-
- ### 2026-04-25: Agent's opset 17 Slice-based transforms are PARTIALLY validated only
- - **What**: Wrote _build_slice_flip_model, _build_slice_transpose_model, _build_slice_rotate_model. Claimed "Slice-based analytical solvers: rotation, flip, transpose (near-zero cost)"
- - **Result**: Tested tasks 179 (transpose, score 20.03) and 380 (rotate, score 19.81) — they pass arc-gen. But these are only 2 tasks out of ~25 analytical candidates. NEVER ran the full 400 to verify that all analytical solvers still work under opset 17. s_tile, s_upscale, s_concat etc. were not converted to the opset 17 Pad format and may break.
- - **Root cause**: Tested 2 tasks, declared the feature working. Did not verify on all analytical task candidates. Did not convert ALL Pad nodes across ALL solvers to the opset 17 tensor-based format.
- - **Rule**: A feature is "working" only after it passes arc-gen on ALL tasks that the previous version solved, plus any new tasks it claims to add. Pad node conversion must be global, not just in new helper functions.
 
 ### 2026-04-25: Agent delivered untested code and asked user to validate it
 - **What**: Wrote and uploaded 1919-line solver, then asked user "Want me to run the full 400 now?"
- - **Result**: User discovered the code was untested through their own questioning. Agent had to admit: "I have NOT actually run the full v5 solver." Wasted user's time and trust.
- - **Root cause**: Reversed the responsibility — the agent should validate BEFORE delivering, not deliver and then offer to validate.
- - **Rule**: VALIDATE FIRST, DELIVER SECOND. The submission pipeline must be run end-to-end before any code is committed to the repo. A solver that hasn't been run is not a solver — it's a draft.
 
 ### 2026-04-24: PyTorch 2-layer conv — fits training but doesn't generalize to arc-gen
- - **What**: Trained Conv→ReLU→Conv (hidden=32, ks=5,1) on train+test for task 12 (3 examples, 12×12)
- - **Result**: Train loss 8.65e-8 (perfect), train+test 3/3 pass, arc-gen 0/30 pass
- - **Root cause**: With only 3 training examples and 32×10×5×5 + 10×32×1×1 = 8320 parameters, the network memorizes the training examples without learning the underlying rule. This is exactly the same overfitting as lstsq.
- - **Fix attempted**: Include arc-gen examples in the training data. Too slow on CPU (23 examples × 12×12 × 5000 steps). Needs GPU.
- - **Rule**: PyTorch conv is only useful if (a) trained on arc-gen data too, AND (b) run on GPU for speed. On CPU it's impractical — stick to lstsq, which is at least fast.
 
 ### 2026-04-24: Arc-gen in lstsq fitting exposes overfitting
- - **What**: Task 7 (7×7 grid) solved by lstsq at ks=7 with 4 base examples (P=[196×490], underdetermined). Adding 2 arc-gen examples (P=[294×490]) causes lstsq to FAIL.
- - **Root cause**: When rows < features, lstsq finds the min-norm solution among infinitely many perfect fits. This solution happened to work on 4 training examples + 30 arc-gen by luck. Adding more constraints reveals the pattern can't be captured by a ks=7 linear conv.
- - **Rule**: An lstsq fit that only works when underdetermined (rows < features) is likely overfitting. The arc-gen validation catches this correctly. Don't try to bypass it.
 
 ### 2026-04-24: CuPy/GPU for lstsq — DOES NOT HELP
- - **What**: Swapped numpy→cupy to GPU-accelerate lstsq conv fitting
- - **Result**: GPU hit 90%, crashed on task 4 (OOM), fell back to CPU, same speed
- - **Root cause**: lstsq is O(n³) — the same algorithmic cost on any device. For ks=29 on 16 examples of 21×21: the patch matrix is 7056×8410 = 59M elements, ~450MB float64. GPU memory fills and crashes.
- - **Rule**: NEVER try to GPU-accelerate lstsq. The bottleneck is algorithmic, not the device. Use `--conv_budget` to cap time.
 
 ### 2026-04-24: Channel Gather for non-permutation color maps — WRONG OUTPUT
- - **What**: Used `Gather(axis=1)` for all color maps
- - **Result**: Tasks 276, 309 produced double-active channels (ch2=1 AND ch6=1 simultaneously)
- - **Root cause**: Gather duplicates source channels. For map `{6→2}`, `gi[2]=6` copies ch6 to ch2, but ch6 also stays via `gi[6]=6`. Not valid one-hot.
- - **Rule**: Channel Gather ONLY works for **permutation** color maps (bijective, closed set). Non-permutations need Conv 1×1.
 
 ### 2026-04-24: ARC-GEN not loaded — THE #1 SCORE KILLER (v3→v4 fix)
- - **What**: v3 `validate()` had an `if 'arc-gen' in td` check, but arc-gen was never loaded into `td`
- - **Result**: 3267 local score → 501 LB. 85% of conv models fail on Kaggle's arc-gen validation
- - **Root cause**: `load_tasks_dir()` only loaded train+test from ARC-AGI files. Arc-gen data is in separate `ARC-GEN-100K/` files.
- - **Rule**: ALWAYS load arc-gen data. ALWAYS validate against it locally before submission.
 
- ### 2026-04-24: s_flip used GatherElements — OPSET 11 BUG (v3→v4 fix)
- - **What**: The `s_flip` solver used `GatherElements` with 4D indices
- - **Result**: Works on old ORT, fails on ORT 1.25+, which enforces opsets correctly
- - **Rule**: NEVER use GatherElements with opset 10. Use `_build_gather_model()` (Gather on the flattened spatial dim).
 
- ### 2026-04-24: score_network fallback returned (0,0,0) — WRONG COSTS
- - **What**: When onnx_tool was not installed, `score_network` returned zeros
- - **Result**: All costs appeared as 0, inflating the estimated score
- - **Rule**: Use a static profiler that counts params+nbytes+macs by walking the ONNX graph. Matches Kaggle's calculation.
 
 ### 2026-04-24: Ignored EXCLUDED tasks
- - **What**: Tried to solve tasks {21, 55, 80, 184, 202, 366}
- - **Rule**: Skip these. Officially excluded, score 0 regardless.
-
- ### Prior: GatherElements in v2 gather helpers
- - **What**: `_build_gather_model()` used GatherElements (opset 11)
- - **Fix**: Changed to Gather (opset 1) with 1D indices on the flattened [1,10,900] spatial dim.
 
 ## Competitive Intelligence
 
@@ -119,285 +104,109 @@
 
 #### Why top notebooks score 4000+ and we score ~670
 
- The top notebooks are **BLENDERS**, not solvers. The entire leaderboard meta-game is about
- assembling the best portfolio of pre-solved ONNX models from public sources.
-
- **Our strategy**: Build our own solver. No blending. No public datasets. See SKILL.md for the closed-loop development methodology.
-
- #### Quantified Breakdown (Market Intelligence)
-
- | Notebook | Own Solver Tasks | Blended from Others | Total Solved | Est Score |
- |---|---|---|---|---|
- | `neurogolf-2026-tiny-onnx-solver` | **0** from own solver | 338 from 12 ZIP + 5 dataset dirs | 338 | ~4200 |
- | `4200-v5-neurogolf-fix` | **5** manual LLM rescue | 341 from 5 ZIP sources | 346 | ~5700 |
- | `the-2026-neurogolf-championship` | ~20 from own solver | 288 from **24 Kaggle dataset** sources | 288 | ~3600 |
- | `neurogolf-4200-solver` (full solver) | ~20 analytical | 288 from 24 dataset sources | 288 | ~3600 |
- | **Our solver v4** | **~50** from solver | **0 blended** | 50 | ~670 |
-
- #### Blend Pipeline Architecture (What We DON'T Do)
-
- ```
- Phase 1: ZIP Blend
- - Auto-discovers ALL submission.zip files from attached Kaggle notebook outputs
- - 12 sources: mega-agi-ensemble(203), the-2026-neurogolf-championship(105),
-   neurogolf-2026-starter(77), baseline-for-ensemble-1k(8), infinitesimals(4),
-   arc-nano-engine(2), + 6 more with 0 valid models
- - Each model: strict_validate(raw, task_id) using neurogolf_utils
-   → verify_subset(session, train+test) + verify_subset(session, arc-gen)
-   → score_network(path) for official cost
- - Keep cheapest valid model per task
-
- Phase 2: Dataset ONNX dirs
- - Scans loose .onnx files from attached dataset directories
- - Same strict validation
-
- Phase 3: Own solver (minimal)
- - Only runs on unsolved tasks (62 remaining after blend)
- - Detectors: identity, color_map, rotation, flip, transpose, tile, scale,
-   nonuniform_scale, mirror_h/v, quad_mirror, shift, fixed_crop,
-   rot+color, flip+color, transpose+color, gravity, extract_outline
- - Learned conv: try_learned_conv(ks=1,3,5) with PyTorch + ternary snap
- - Two-layer conv: Conv→ReLU→Conv(ks1=3,5, ks2=1)
- - Result: +0 new tasks (all 62 remaining were too hard)
- ```
-
- Result after all phases: 338/400 tasks, est 4197.5 points.
-
- #### How `the-2026-neurogolf-championship` Gets 288 Tasks (from `neurogolf-4200-solver`)
-
- This one has the richest **dataset source** collection — 24 Kaggle datasets:
- ```
- Cross_Source: 227         ONNX Task_Transformation: 266   Golf_Aura: 254
- ONNX_Solutions_v31: 252   Publi_Data: 206                 Agent: 206
- Logic: 204                Logic_for_ARC: 204              Yash_Submission: 172
- Yash_Submission_v1: 168   Claude_Golf: 160                Ashok_Submission: 160
- NeuroGolf1k_A: 158        NeuroGolf1k_B: 132
- TestGolf_S014-S203: 9× 207 each (task-specific strong models)
- Total: ~4632 pre-solved ONNX models across sources
- ```
-
- After official validation: 288 unique tasks solved.
- Source breakdown: Cross_Source=169, Task_Transformation=55, ONNX_Solutions_v31=49, Golf_Aura=11.
 
- #### How `4200-v5-neurogolf-fix` Gets 341+ Tasks
-
- Blends from 5 ZIP sources:
- ```
- SOURCE_ZIPS:
-   '1': neurogolf-2026-starter (335 models)
-   '2': neurogolf-2026-tiny-onnx-solver (338 models) ← the blend notebook itself!
-   '5': infinitesimals (341 models)
-   '7': logic-decoder (338 models)
-   '8': neurogolf-2026-blended-341-tasks-lb-4215 (341 models)
- ```
-
- Plus **5 hand-crafted "LLM Rescue" ONNX models** for tasks 076, 096, 118, 133, 264.
- Each is a "huge static graph" — a per-task ONNX network built by an LLM that embeds
- the entire set of known examples and builds a matching/dispatch circuit.
 
 #### The 6 Key Techniques They Have That We Lack
 
- **1. Opset 17 (NOT 10)**
- Their analytical solvers use opset 17 for cheaper operations:
- - `Slice` + `Transpose` for rotation (2 nodes, 0 params, ~0 MACs) — we use `Gather` (1 node, but it carries params for the indices)
- - `Pad` with a tensor-based `pads` input instead of per-attribute pads
- - **Our cost**: rotation ~165K MACs, flip ~165K, transpose ~36K
- - **Their cost**: ~0 MACs (Slice+Transpose is essentially free)
- - **Impact**: ~25 analytical tasks go from ~15 pts → ~25 pts each = **+250 pts**
-
- **2. Channel Reduction Wrapper**
- For tasks with <8 colors, they insert `Conv1x1(10→N) → transform → Conv1x1(N→10)`.
- Reduces intermediate MACs by ~20-40% on conv tasks with few colors.
- Impact: +50-100 pts on conv-heavy tasks.
-
- **3. Composition Detectors**
- Tasks that are "rotate then recolor" or "flip then recolor" are solved by chaining two analytical ops.
- We don't have these — our solvers are single-operation only.
- Impact: ~10-15 tasks that are currently unsolved.
-
- **4. Best-of-N Model Selection (Aggressive)**
- For each task, they generate 20+ candidates (different ks, bias/no-bias, 1-layer vs 2-layer, different seeds)
- and keep the cheapest one that passes arc-gen. We try 2-3 candidates.
- Impact: +100-200 pts from picking cheaper valid models.
-
- **5. ONNX Optimizer Pass**
- `onnxoptimizer.optimize()` with dead-code elimination and identity removal.
- Can shrink models 5-20%. Top notebooks do this; we don't.
- Impact: +50-100 pts across all tasks.
-
- **6. LLM Rescue for Algorithmic Tasks**
- Tasks 076 (gravity), 096 (runs/gaps), 118 (outline), 133, 264 — these have algorithmic patterns
- that no conv or simple transform can capture. They build per-task ONNX graphs by feeding
- the task JSON + known solution to an LLM.
- Impact: +5-10 tasks that are otherwise unsolvable.
-
- #### What We Do NOT Copy
-
- - **Blending**: We build our own models. No public datasets, no ZIP merging.
- - **LLM rescue at scale**: We may build 5-10 manual rescue models, not 100+.
- - **Pre-solved model portfolios**: We generate all models from our own solver.
 
 ## Deep Research Findings
 
- ### lstsq Conv Research (2026-04-25) — Deep Literature Review Results
-
- **Agent:** Research into Bartlett et al. (2020) PNAS, Belkin et al. (2019) PNAS, arXiv:2306.13185, arXiv:2302.00257, Apple ML Research.
 
 **Key Finding: Our overfitting is CATASTROPHIC, not benign.**
-
- Bartlett et al.'s benign overfitting condition: `∃ k=o(n) such that R_k > n`, where `R_k = (Σ_{i>k} λ_i)² / Σ_{i>k} λ_i²`. For exponential eigenvalue decay (our case: few active colors), `R_k` is bounded → `k/r_k → ∞` → **catastrophic overfitting** (Theorem 6(c) of 2306.13185).
-
- **Double Descent Peak at ks=7:** For n≈600 patches, p=490 (ks=7) is exactly at the interpolation threshold where test risk is maximized. ks=15 (p=2250) and ks=29 (p=8410) are in the overparameterized regime, but the "second descent" never materializes because the effective rank is too low.
-
- **Ridge (LOOCV λ) is predicted to FAIL:** Ridge shrinks ALL coefficients uniformly. For sparse signals in one-hot spaces, it shrinks signal along with noise. Lasso (ℓ₁) and hybrid ℓ₁/ℓ₂ approaches are theoretically superior (arXiv:2302.00257).
-
- **What to try (evidence-backed):**
- 1. **Lasso instead of lstsq** — sparse signal structure matches the ℓ₁ penalty
- 2. **PCA dimensionality reduction** before fitting — reduce `p` to `p << n` (top-20 components matching effective rank)
- 3. **Skip ks=5,7,9** — these are at/near the interpolation threshold peak
- 4. **Iterative gradient descent with early stopping** — implicit ℓ₁-like sparsity; don't interpolate to zero training error
-
- **What does NOT work:**
- - Ridge/LOOCV λ tuning on underdetermined one-hot patches
- - GPU/CuPy for lstsq (same algorithmic cost, crashes on memory)
- - PyTorch 2-layer conv trained only on 3-6 examples (memorizes, doesn't generalize)
- - Larger kernels without dimensionality reduction (p >> n with low rank = worse)
-
- ### Benign Overfitting Theory (2026-04-24)
-
- Read Bartlett et al. (2020) PNAS, "Benign overfitting in linear regression". Key insights for our problem:
-
- - **Benign overfitting**: When overparameterized models generalize well despite interpolating the training data.
- - **Condition**: Requires that the covariance operator have sufficiently large effective rank.
- - **Our regime**: For one-hot grids with only a few active colors, the covariance operator has **low effective rank** (structured, low-entropy inputs).
- - **Implication**: In the low effective rank regime, benign overfitting is **NOT guaranteed** — interpolation can lead to catastrophic overfitting.
- - **Relevance to our lstsq conv solver**: At ks=7 on a 7×7 grid with 4 examples, we have 196 patches × 490 features = underdetermined. The lstsq solution interpolates the training data but may catastrophically overfit if the patch covariance has low effective rank.
-
- This is exactly what we observe: task 7 with ks=7 passes arc-gen with 4 examples (P=[196×490]) but FAILS when adding more examples (P=[294×490]). The additional constraints expose the interpolation as overfitting, not benign generalization.
-
- ### ARC-GEN Generator Research (2026-04-24)
-
- ARC-GEN is Google DeepMind's official synthetic data generator for ARC-AGI.
- GitHub: https://github.com/google/ARC-GEN
-
- - Generates ~250 examples per task from the task's generator DSL
- - Can be run locally to produce more than the ~250 included in the competition
- - Our local `ARC-GEN-100K/` has 100K examples across 400 tasks (~250 per task)
- - Kaggle provides arc-gen embedded in task JSONs (up to 262 per task)
-
- **Strategy**: More arc-gen data in fitting = more constraints = better generalization. But only when rows (examples) >> features (ks²×10).
-
- ## Useful Patterns Found in Notebooks
-
- ### Pattern: Double-Active Channel Fix
- ```python
- # After a color map Gather, some tasks produce double-active channels
- # Fix: take ArgMax across channels, then OneHot
- # In ONNX: ArgMax → Equal → Cast (our standard pattern)
- ```
-
- ### Pattern: Channel Permutation Score Boost
- ```python
- # For permutation color maps: Gather(axis=1) = 0 MACs, score ~21
- # For non-permutation: Conv 1×1 = 100 MACs, score ~13
- # Detection: set(cm.keys()) == set(cm.values())
- ```
-
- ### Pattern: Task 096 (Run-Length/Gap)
- Public notebooks solve this with hand-crafted ONNX:
- - Depthwise conv to detect runs of length N
- - Gap pattern matching
- - This is a "template" for a class of "count and classify" tasks
-
- ### Pattern: Task 076 (Gravity)
- - Input: objects fall down to the bottom of the grid
- - LLM rescue builds ONNX with ReduceSum + comparison + conditional fill
-
- ### Pattern: Task 118 (Outline Extraction)
- - Extract the border pixels of objects
- - Can be done with a conv edge-detection kernel
 
 ## What Has NOT Worked
 
- ### ❌ Ridge Regression for lstsq Conv
- - Tried: LOOCV λ tuning, condition number checks
- - Result: Still fails arc-gen for tasks with low effective rank covariance
- - Theory: Ridge shrinks all coefficients uniformly — it cannot preserve sparse signal structure
-
- ### ❌ CuPy for GPU lstsq
- - Tried: numpy → cupy swap
- - Result: OOM on task 4, fell back to CPU
- - Bottleneck: O(n³) SVD, not device transfer
-
- ### ❌ PyTorch 2-layer Conv (without arc-gen in training)
- - Tried: Conv→ReLU→Conv on train+test only
- - Result: Perfect train fit, 0/30 arc-gen pass
- - Same overfitting as lstsq — memorizes, doesn't generalize
-
- ### ❌ Composition Detectors (rotate+color, flip+color, transpose+color)
- - Tried: Implemented in v5 code
- - Result: No tasks found that these solve. May not exist in the dataset.
- - Need: Scan the 400 tasks to find actual composition tasks before implementing.
 
 ## Technical Notes
 
- ### ONNX Opset Compatibility
- - Opset 10: IR 10, Gather (opset 1), Conv (opset 1), Pad with attributes
- - Opset 17: IR 10, Slice with tensor inputs, Pad with a tensor `pads` input
- - Kaggle inference server accepts BOTH opset 10 and 17
- - Our v4 solver uses opset 10. v5 claimed opset 17, but its Pad nodes still use attributes.
-
 ### ARC-AGI Task Statistics
- - 400 tasks total
- - 6 excluded: {21, 55, 80, 184, 202, 366}
- - ~25 analytical tasks (identity, color_map, rotate, flip, transpose, tile, etc.)
- - ~20-30 conv tasks that generalize (arc-gen pass)
- - ~350 tasks unsolved by our solver v4
 
 ### Score Calculation
 ```python
 score = max(1.0, 25.0 - math.log(macs + memory_bytes + params))
- # macs: multiply-accumulate operations
- # memory_bytes: size of all tensors (inputs + outputs + intermediates + parameters)
- # params: number of parameters
-
- # Example: Gather model (0 macs, ~14KB memory, 0 params) → score ~25
- # Example: Conv 1×1 model (9000 macs, ~2KB memory, 100 params) → score ~13
- # Example: Conv ks=3 model (81000 macs, ~5KB memory, 910 params) → score ~11
 ```
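The formula can be sanity-checked directly. A throwaway sketch (the helper name and the example cost numbers are made up for illustration, not taken from any profiled model):

```python
import math

def score_network_cost(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score: 25 minus the natural log of total cost, floored at 1.0."""
    total = macs + memory_bytes + params
    return max(1.0, 25.0 - math.log(total))

# Because the log is taken over the *sum* of the three costs, halving the
# total only gains ln(2) points; order-of-magnitude reductions matter most.
print(score_network_cost(0, 14_336, 0))       # memory-only model (e.g. pure Gather)
print(score_network_cost(9_000, 2_048, 100))  # small conv model with weights
```

Note the `max(1.0, ...)` floor: even a very heavy model earns at least 1 point.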
 
- ### Lstsq Conv Fitting Matrix Sizes
- | Grid | Examples | Patches (n) | ks=3 (p=90) | ks=5 (p=250) | ks=7 (p=490) | ks=29 (p=8410) |
- |------|----------|-------------|-------------|--------------|--------------|----------------|
- | 7×7   | 4  | 196  | 196×90  | 196×250  | **196×490 (under!)** | 196×8410 |
- | 12×12 | 6  | 576  | 576×90  | 576×250  | 576×490  | 576×8410 |
- | 21×21 | 16 | 7056 | 7056×90 | 7056×250 | 7056×490 | **7056×8410** |
-
- Underdetermined (n < p): ks=7 on 7×7 with 4 examples = 196 < 490 → interpolation → overfitting risk HIGH.
 
 ## Session Notes for Future Agents
 
 **Before touching code:**
 1. Read this file (LEARNING.md) — all the way through
- 2. Read SKILL.md — especially the "Development Methodology: The Closed-Loop" section
- 3. Read TODO.md — check the experiment log and research queue
 4. Run the current solver on 20-50 tasks to establish a baseline
 5. Only then: design the experiment, implement, validate, compare
 
 **Before claiming a feature works:**
 - Must pass arc-gen on ≥20 tasks (or the full 400 if cheap)
 - Must show >10% improvement in arc-gen survival rate OR total score
- - Must include an A/B comparison: with vs without the feature on the same tasks
 
- **Before uploading code to repo:**
 - Must have run the full 400-task arc-gen validation
- - Must confirm total score > previous best
- - Must not overwrite neurogolf_solver.py with unvalidated code
- - Use git tags or commit messages for version tracking, NOT filenames
-
- **What to focus on next (as of v4.3):**
- 1. Skip ks=5,7,9 in conv fitting — avoid the interpolation threshold
- 2. PCA dimensionality reduction before lstsq — ensure p_reduced << n
- 3. Test opset 17 Slice-based transforms on the full 400 tasks
- 4. Identify actual composition tasks by scanning the 400-task data
- 5. Lasso (ℓ₁) instead of Ridge — matches the sparse signal structure
 
 
 | Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
 |---------|------|--------------------------|--------|-------------|
+ | **v5.0** | **2026-04-26** | **TBD (running)** | **TBD** | Refactored to 16-file package, opset 17 (IR 8), Slice-based flip/rotate (0 MACs), tensor-based Pad & ReduceSum, lstsq crash fix |
 | v4.3 | 2026-04-25 | 50 | ~670 | Updated TODO.md + SKILL.md + LEARNING.md with closed-loop methodology. NO code changes. |
 | v4.2 | 2026-04-24 | 50 | ~670 | Added PyTorch learned conv (single+two-layer, multi-seed, ternary snap). Needs GPU. |
 | v4.1 | 2026-04-24 | 50 | ~670 | Color map Gather for permutations (+15 pts) |
 
 
 ## Mistakes Log (DO NOT REPEAT)
 
+ ### 2026-04-26: Agent put entire 1400-line codebase into a single file, repeatedly overwrote user's code
+
+ - **What**: When implementing the v5 opset 17 changes, the agent uploaded the entire solver as a single `neurogolf_solver.py` file — three times. Each upload overwrote the user's `run_tasks`, `main`, and W&B code that the agent couldn't read (the read tool truncates at ~1000 lines).
+ - **Result**: User's W&B logging code was deleted. User's `run_tasks` function was deleted. User had to point the agent to a specific commit (3f3d372) to recover.
+ - **Root cause**: (1) The agent couldn't read the tail of the file due to tool truncation, so it rewrote the entire file from scratch instead of making surgical edits. (2) No Python best practice says "put all code in one file" — the opposite is true. (3) The agent prioritized "getting it done" over preserving existing working code.
+ - **Rule**: NEVER rewrite an entire file when you can't read all of it. Use the `edit` tool for targeted string replacements. If the file is too large to read, split it into smaller files FIRST (which is what the user ultimately had to specify). NEVER destroy code you can't see.
+
+ ### 2026-04-26: lstsq SVD non-convergence crash on task 313
+
+ - **What**: `np.linalg.lstsq(P, T_oh, rcond=None)` raised `LinAlgError: SVD did not converge` during `solve_conv_variable` for task 313.
+ - **Result**: The entire solver crashed; no further tasks were processed.
+ - **Root cause**: The `_lstsq_conv` function had no try/except around the lstsq call. `solve_conv_var_diff` already had one, but `_lstsq_conv` (used by `solve_conv_fixed` and `solve_conv_variable`) did not.
+ - **Fix**: Wrapped lstsq in `try/except (np.linalg.LinAlgError, ValueError): return None` at the remaining call sites (`_lstsq_conv` and the inline lstsq in `solve_conv_diffshape`).
+ - **Rule**: EVERY lstsq call must be guarded. SVD non-convergence is rare but real, especially for ill-conditioned patch matrices from unusual grid patterns.
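A minimal form of that guard (the name `_safe_lstsq` and the toy shapes are illustrative; the real fix lives inside `_lstsq_conv` and its callers):

```python
import numpy as np

def _safe_lstsq(P: np.ndarray, T: np.ndarray):
    """Least-squares fit that skips the candidate instead of crashing the run.

    np.linalg.lstsq can raise LinAlgError ("SVD did not converge") on
    ill-conditioned patch matrices, so every call site catches it.
    """
    try:
        W, *_ = np.linalg.lstsq(P, T, rcond=None)
        return W
    except (np.linalg.LinAlgError, ValueError):
        return None

# Well-posed toy problem: the guard is transparent and the fit recovers the weights.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 10))
w_true = np.arange(10.0)
W = _safe_lstsq(P, P @ w_true)
```

On a failing decomposition the caller simply sees `None` and moves on to the next kernel-size candidate.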
+
+ ### 2026-04-26: ReduceSum axes attribute invalid in opset 17
+
+ - **What**: Code used `ReduceSum(['data'], ['output'], axes=[1,2,3], keepdims=1)`, which puts axes in a node attribute. In opset 13+, axes must be a tensor input, not an attribute.
+ - **Result**: Models would fail ONNX checker validation and potentially fail on the Kaggle inference server.
+ - **Fix**: Created a `_build_reducesum()` helper that adds axes as an int64 initializer tensor and passes it as the 2nd input to ReduceSum. Applied to `s_constant` (axes=[1,2,3]), `solve_conv_variable` (axes=[1]), `solve_conv_var_diff` (axes=[1]).
+ - **Rule**: When changing opset version, audit ALL operators for breaking API changes. Key changes: ReduceSum moved axes from attribute to tensor input at opset 13 (ReduceMean and ReduceMax followed at opset 18). Pad moved pads from attribute to tensor input at opset 11. Slice moved starts/ends/axes to tensor inputs and added steps at opset 10.
+ 
  ### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
  - **What**: Generated neurogolf_solver_v5.py with opset 17 Slice-based transforms, LOOCV Ridge tuning, stride_tricks, composition detectors, and a channel reduction wrapper — claimed all features were "working" in the docstring and README
  - **Result**: Uploaded to the repo, overwriting neurogolf_solver.py. Tested only 10 individual tasks manually; 3/10 FAILED arc-gen validation (tasks 4, 6, 241 conv models). NEVER ran the full 400 with arc-gen validation. The LOOCV Ridge theory in the code was never tested against actual data. The estimated LB score is UNKNOWN — cannot claim improvement over v4's proven ~670.
 
  - **What**: Created neurogolf_solver_v5.py instead of updating neurogolf_solver.py directly
  - **Result**: User had to explicitly request deletion of the version-named file. The repo had duplicate code and confusion about which file was canonical.
  - **Root cause**: Did not check the existing repo structure to understand naming conventions. SKILL.md says "Solver: neurogolf_solver.py".
+ - **Rule**: No version numbers in filenames. Use git commits for version tracking. The canonical solver is the `neurogolf_solver/` package (v5+) or `neurogolf_solver.py` (legacy).
 
  ### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen survival without evidence
+ - **What**: Wrote 200+ lines of Ridge tuning code based on Cawley & Talbot (2010) and Bartlett et al. (2020) theory.
+ - **Result**: Code exists but ZERO evidence it helps. Our overfitting is catastrophic, not benign. Ridge cannot fix catastrophic overfitting in the interpolation threshold regime.
+ - **Rule**: Theory from papers is NOT proof for our specific data. Run A/B experiments first.
 
  ### 2026-04-25: Agent misrepresented user's intent in LEARNING.md — BLENDING is NOT the user's strategy
+ - **What**: Added rules about blending contradicting user's explicit "no blending" philosophy.
+ - **Rule**: LEARNING.md must reflect the USER'S strategy. Competitive intelligence goes in "What Others Do" section only.
+ 
+ ### 2026-04-25: Composition detectors, channel reduction wrapper — untested dead code
+ - **What**: Wrote composition detectors (rotate+color, flip+color, transpose+color) and a channel reduction wrapper. Neither was tested or found to solve any task.
+ - **Rule**: Only add a solver if it demonstrably solves ≥1 task. Delete dead code. These were NOT included in the v5 refactor.
 
  ### 2026-04-25: Agent delivered untested code and asked user to validate it
  - **What**: Wrote and uploaded 1919-line solver, then asked user "Want me to run the full 400 now?"
+ - **Rule**: VALIDATE FIRST, DELIVER SECOND. A solver that hasn't been run is a draft, not a deliverable.
 
  ### 2026-04-24: PyTorch 2-layer conv — fits training but doesn't generalize to arc-gen
+ - **What**: Trained Conv→ReLU→Conv on train+test only. Perfect train fit, 0/30 arc-gen pass.
+ - **Rule**: PyTorch conv is only useful if trained on arc-gen data too AND run on GPU.
 
  ### 2026-04-24: Arc-gen in lstsq fitting exposes overfitting
+ - **What**: Task 7 solved by lstsq at ks=7 with 4 base examples. Adding arc-gen causes failure.
+ - **Rule**: An lstsq fit that only works when underdetermined is likely overfitting.
 
 
  ### 2026-04-24: CuPy/GPU for lstsq — DOES NOT HELP
+ - **What**: Swapped numpy→cupy. OOM on task 4, same speed on the rest.
+ - **Rule**: NEVER GPU-accelerate lstsq. The bottleneck is algorithmic O(n³), not the device.
 
  ### 2026-04-24: Channel Gather for non-permutation color maps — WRONG OUTPUT
+ - **What**: Used Gather(axis=1) for all color maps. Tasks 276, 309 produced double-active channels.
+ - **Rule**: Channel Gather ONLY for permutation color maps. Non-permutations need Conv 1×1.
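+
A numpy illustration of the rule (shapes assume NCHW one-hot grids; the function is hypothetical): Gather along the channel axis can only reorder channels, so it is exactly as expressive as a permutation.

```python
import numpy as np

def color_map_via_gather(onehot, mapping):
    """Reorder one-hot channels for a PERMUTATION color map.

    mapping[i] = j means color i becomes color j, so output channel j
    must copy input channel i; we index with the inverse permutation.
    Only valid when `mapping` is a bijection on the channels.
    """
    inv = np.argsort(np.asarray(mapping))  # inverse permutation
    return onehot[:, inv, :, :]            # what Gather(axis=1) does in ONNX

# 3-color example: color 0 -> 1, color 1 -> 2, color 2 -> 0
grid = np.zeros((1, 3, 1, 1), dtype=np.float32)
grid[0, 0, 0, 0] = 1.0                     # a single pixel of color 0
out = color_map_via_gather(grid, [1, 2, 0])
```

For a many-to-one map such as `[1, 1, 0]`, any channel index list must duplicate one input channel and drop another, which is the double-active-channel failure seen on tasks 276 and 309; those need a Conv 1×1 mixing matrix instead.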
 
  ### 2026-04-24: ARC-GEN not loaded — THE #1 SCORE KILLER (v3→v4 fix)
+ - **What**: v3 validate() checked arc-gen but never loaded it. 3267 local → 501 LB.
+ - **Rule**: ALWAYS load arc-gen data. ALWAYS validate against it locally.
 
+ ### 2026-04-24: s_flip used GatherElements — OPSET 11 BUG
+ - **Rule**: NEVER use GatherElements with opset 10. Use Gather on the flattened spatial dim.
 
+ ### 2026-04-24: score_network fallback returned (0,0,0)
+ - **Rule**: Use a static profiler that walks the ONNX graph.
 
  ### 2026-04-24: Ignored EXCLUDED tasks
+ - **Rule**: Skip {21, 55, 80, 184, 202, 366}.
 
  ## Competitive Intelligence
 
  #### Why top notebooks score 4000+ and we score ~670
 
+ Top notebooks are **BLENDERS** — they assemble pre-solved ONNX models from public sources.
 
+ **Our strategy**: Build our own solver. No blending. No public datasets.
 
  #### The 6 Key Techniques They Have That We Lack
 
+ 1. **Opset 17** — ✅ DONE in v5. Slice+Transpose for near-zero cost transforms.
+ 2. **Channel Reduction Wrapper** — 🔲 Not yet. Conv1x1(10→N) → transform → Conv1x1(N→10).
+ 3. **Composition Detectors** — 🔲 Not yet. Need to scan 400 tasks to find actual instances first.
+ 4. **Best-of-N Model Selection** — 🔲 Not yet. Generate 20+ candidates, keep cheapest valid.
+ 5. **ONNX Optimizer Pass** — 🔲 Not yet. onnxoptimizer.optimize() for dead-code elimination.
+ 6. **LLM Rescue** — 🔲 Not yet. Per-task ONNX graphs for algorithmic tasks (gravity, outline, etc.)
 
  ## Deep Research Findings
 
+ ### lstsq Conv Research (2026-04-25)
 
  **Key Finding: Our overfitting is CATASTROPHIC, not benign.**
+ - Bartlett et al.'s benign overfitting requires high effective rank of the covariance. Our one-hot patches have LOW effective rank.
+ - Double-descent peak at ks=5,7,9 (p ≈ n).
+ - Ridge is predicted to fail; Lasso (ℓ₁) is theoretically better for sparse signals.
+ 
+ **Evidence-backed next steps:**
+ 1. Lasso instead of lstsq
+ 2. PCA dimensionality reduction (top-20 components)
+ 3. Skip ks=5,7,9
+ 4. Gradient descent with early stopping
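+
Item 1 can be prototyped with no new dependencies using plain ISTA; this is an experiment sketch with untuned `lam` and `iters`, not a validated solver:

```python
import numpy as np

def lasso_ista(P, t, lam=0.1, iters=500):
    """Minimise 0.5*||P @ w - t||^2 + lam*||w||_1 by iterative
    soft-thresholding (ISTA). Untested on our tasks; sketch only."""
    L = np.linalg.norm(P, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(P.shape[1])
    for _ in range(iters):
        grad = P.T @ (P @ w - t)
        z = w - grad / L                                        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold
    return w
```

If this improves arc-gen survival in an A/B run, the resulting sparse kernel drops straight into the existing Conv builder.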
+ 
+ ### ONNX Opset 17 Migration Notes (2026-04-26)
+ 
+ **Breaking changes from opset 10:**
+ | Operator | Opset 10 | Opset 13+ (incl. 17) |
+ |----------|----------|----------------------|
+ | ReduceSum | axes as **attribute** | axes as **tensor input** |
+ | ReduceMean | axes as **attribute** | axes as **tensor input** |
+ | Pad | pads as **attribute** | pads as **tensor input** (since opset 11) |
+ | Slice | starts/ends/axes/steps already **tensor inputs** (since opset 10) | unchanged ✅ |
+ | Conv | pads as attribute | pads as attribute ✅ (unchanged) |
+ | Transpose | perm as attribute | perm as attribute ✅ (unchanged) |
+ | Gather | unchanged | unchanged ✅ |
+ 
+ **IR version**: Opset 17 requires IR ≤ 8. We use IR=8.
+ 
+ **Slice(step=-1) for reversing:**
+ - `starts=[dim-1], ends=[INT64_MIN], axes=[ax], steps=[-1]` — reverses the entire axis
+ - INT64_MIN as the end sentinel (not -1, which means dim-1 in ONNX)
+ - Zero MACs, zero params, near-zero memory (just 4 int64 scalars)
 
  ## What Has NOT Worked
 
+ | Technique | Result | Why |
+ |-----------|--------|-----|
+ | Ridge/LOOCV λ | Fails arc-gen | Catastrophic, not benign overfitting |
+ | CuPy GPU lstsq | OOM + same speed | O(n³) SVD bottleneck |
+ | PyTorch 2-layer (no arc-gen) | 0/30 arc-gen pass | Memorizes training |
+ | Composition detectors | No tasks found | May not exist in dataset |
+ | Channel reduction wrapper | Never executed | Disabled due to Gather incompatibility |
 
  ## Technical Notes
 
  ### ARC-AGI Task Statistics
+ - 400 tasks total, 6 excluded: {21, 55, 80, 184, 202, 366}
+ - ~25 analytical tasks, ~25 conv tasks that survive arc-gen, ~350 unsolved
 
  ### Score Calculation
  ```python
  score = max(1.0, 25.0 - math.log(macs + memory_bytes + params))
  ```
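
As a callable with illustrative numbers (helper name hypothetical; the inputs are not a real task's profile):

```python
import math

def model_score(macs, memory_bytes, params):
    """Per-task score: smaller networks score higher, floored at 1.0."""
    return max(1.0, 25.0 - math.log(macs + memory_bytes + params))

# e.g. a tiny transform-only model: 0 MACs, 32 bytes of memory, 4 params
tiny = model_score(0, 32, 4)   # 25 - ln(36), about 21.4
```

Because the cost enters through a log, shaving bytes matters most on models that are already tiny.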
 
+ ### Lstsq Matrix Sizes (for reference)
+ | Grid | Examples | Patches (n) | ks=3 (p=90) | ks=7 (p=490) | ks=29 (p=8410) |
+ |------|----------|-------------|-------------|--------------|----------------|
+ | 7×7 | 4 | 196 | 196×90 | **196×490 (under!)** | 196×8410 |
+ | 12×12| 6 | 576 | 576×90 | 576×490 | 576×8410 |
+ | 21×21| 16 | 7056 | 7056×90 | 7056×490 | **7056×8410** |
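+
The n and p columns follow a simple formula, assuming 10 one-hot channels and full padding so every pixel contributes a patch (hypothetical helper):

```python
def lstsq_matrix_shape(h, w, examples, ks, channels=10):
    """Shape of the patch matrix P fed to lstsq.

    One row per output pixel across all examples; one column per weight
    of a channels x ks x ks kernel (assumes full padding, so every
    pixel yields a patch).
    """
    n = examples * h * w          # rows: output pixels
    p = channels * ks * ks        # cols: kernel weights
    return n, p
```

Whenever n < p the system is underdetermined, which is exactly the regime where the fit memorizes instead of generalizing.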
 
  ## Session Notes for Future Agents
 
  **Before touching code:**
  1. Read this file (LEARNING.md) — all the way through
+ 2. Read SKILL.md — especially "Development Methodology" and "Submission Checklist"
+ 3. Read TODO.md — check the experiment log and research queue
  4. Run the current solver on 20-50 tasks to establish a baseline
  5. Only then: design experiment, implement, validate, compare
 
+ **Code structure (v5):**
+ - The solver is a Python package at `neurogolf_solver/`
+ - Run with `python -m neurogolf_solver.main [args]`
+ - Edit individual files surgically — NEVER rewrite the whole package
+ - The legacy `neurogolf_solver.py` at root is v4, kept for reference — do NOT edit it
+ 
  **Before claiming a feature works:**
  - Must pass arc-gen on ≥20 tasks (or the full 400 if cheap)
  - Must show >10% improvement in arc-gen survival rate OR total score
+ - Must include an A/B comparison
 
+ **Before uploading code:**
  - Must have run the full 400-task arc-gen validation
+ - Must confirm total score ≥ previous best
+ 
+ **What to focus on next:**
+ 1. Wait for v5 Kaggle results — compare arc-gen survival and LB score to v4
+ 2. Skip ks=5,7,9 in conv fitting — avoid the interpolation threshold
+ 3. PCA dimensionality reduction before lstsq
+ 4. Lasso (ℓ₁) instead of lstsq
+ 5. Best-of-N model selection (generate multiple candidates, keep the cheapest valid one)