rogermt committed
Commit 863483e · verified · 1 Parent(s): eabdff6

v4.3: Update TODO.md with experiment queue, research loop, status key, explicit blending exclusion

Files changed (1): TODO.md +148 -48
TODO.md CHANGED
@@ -1,74 +1,174 @@
  # NeuroGolf Solver — Roadmap
 
- > Current: v4.2 · 50 arc-gen validated · ~670 LB · Target: 3000+
 
  ## Phase 1: Cheap Wins (est +400 pts → ~1100)
 
- - [ ] **Switch to opset 17** — replace all Gather-index models with Slice+Transpose builders
    - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
    - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
    - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
-   - ~25 analytical tasks go from ~15 pts → ~25 pts each
- - [ ] **Channel reduction wrapper** — `Conv1x1(10→N) → transform → Conv1x1(N→10)` when <8 colors used
-   - Saves ~20-40% MACs on conv tasks with few colors
- - [ ] **Composition detectors** — rotation+color, flip+color, transpose+color
-   - These are tasks where two operations are combined (e.g. rotate then recolor)
-   - Top notebooks have these, we don't
 
  ## Phase 2: Fix Arc-Gen Survival (est +100-150 tasks → ~2000-2500)
 
- This is the #1 blocker. We solve 307 locally but only 50 survive arc-gen.
-
- - [ ] **PyTorch learned conv on GPU** — train on train+test+arc-gen data
-   - Multi-seed Adam (seeds 0,7,42), 3000 steps, lr=0.03
-   - Try ks=1,3,5 single-layer + ks=(3,1) and (5,1) two-layer with ReLU
-   - **Ternary weight snap** — after training, snap weights to {-1,0,1}, re-validate
-   - Must include arc-gen examples in training data (not just validation)
-   - Needs GPU (T4 minimum) — CPU too slow for 400 tasks × 3 seeds × multiple ks
- - [ ] **Increase arc-gen in lstsq fitting** — currently capped at 10, try 20-50 for fixed-size tasks
-   - More data = more constraints = less overfitting in underdetermined systems
- - [ ] **Generate MORE arc-gen data** — use the ARC-GEN generator (github.com/google/ARC-GEN) to produce 1000+ examples per task instead of ~250
-   - More fitting data = better generalization
 
- ## Phase 3: Hard Tasks — Hash Matchers & LLM Rescue (est +20-50 tasks → ~2500-3000)
 
- For tasks no automated solver can handle.
 
- - [ ] **Hash-based matcher builder** — automated version of the LLM rescue pattern
-   - Flatten input → MatMul(hash_weights) → match against all known examples → apply stored delta
-   - Requires opset 17 (ScatterND)
-   - Works for ANY task where all examples fit in a 1.44MB model
-   - Build a generic `build_hash_matcher(task_data) → onnx_bytes` function
- - [ ] **Per-task LLM rescue** — for the ~20 hardest tasks with algorithmic patterns
-   - Feed task JSON + Python solution to LLM, get back an ONNX builder function
-   - Priority tasks: gravity, flood fill, outline extraction, pattern counting
- - [ ] **Run-length / gap pattern detector** — like task096 in the notebooks
-   - Depthwise conv to detect runs of N, gap patterns
-   - Template for a class of "count and classify" tasks
 
  ## Phase 4: Score Optimization (est +200-500 pts on existing tasks)
 
- - [ ] **ONNX optimizer pass** — `onnxoptimizer.optimize()` with dead-code elimination, identity removal
    - Top notebooks do this; can shrink models 5-20%
- - [ ] **Best-of-N model selection** — for each task, generate multiple candidate models (different ks, bias/no-bias, etc.), keep the cheapest valid one
-   - Already partially done but could be more aggressive
- - [ ] **Validate with official `neurogolf_utils.score_network()`** — use `onnx_tool` for exact cost matching
-   - Our static profiler is close but may diverge on edge cases
 
- ## Optional: Blend Pipeline
 
- If the above isn't enough, we can build our own blend pipeline:
 
- - [ ] Upload our solver's `submission.zip` as a Kaggle dataset
- - [ ] Create a blend notebook that loads our own output + runs a second-pass solver
- - [ ] Attach public datasets (see LEARNING.md for the full list of 24 sources)
- - [ ] `strict_validate()` every model through `neurogolf_utils` before submission
 
  ## Status Key
 
  | Symbol | Meaning |
  |--------|---------|
- | `[ ]` | Not started |
- | `[~]` | In progress |
- | `[x]` | Done |
- | `[!]` | Blocked |
  # NeuroGolf Solver — Roadmap
 
+ > Current: v4.3 · 50 arc-gen validated · ~670 LB · Target: 3000+
+ > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until a confirmed score increase.
+ > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
 
  ## Phase 1: Cheap Wins (est +400 pts → ~1100)
 
+ ### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost)
+ - [ ] **Convert ALL analytical solvers to opset 17** — not just new ones
    - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
    - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
    - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
+   - Pad nodes: all must use the opset 17 tensor-based `pads` input (not the attribute)
+   - Affected solvers: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
+ - [ ] **Validate**: full 400-task arc-gen run; compare analytical task count vs v4.
+   - Target: ~25 analytical tasks scoring ~25 pts each (was ~15)
+   - Accept only if >10% improvement in the analytical category's total score.
+
+ ### 1b: Composition Detectors
+ - [ ] **Identify actual tasks** that are rotation+recolor, flip+recolor, or transpose+recolor
+   - Scan all 400 tasks: apply rotate, then check whether a color_map solves the residual, etc.
+   - Only implement solvers for combinations that actually exist in the dataset
+ - [ ] **Build composition solver** — chain analytical + color_map as a single ONNX graph
+ - [ ] **Validate**: full 400-task arc-gen run; count new tasks solved. Accept only if >0 new tasks.
+
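The task scan can be prototyped offline in NumPy before touching ONNX. A sketch under assumed conventions (grids as 2-D integer arrays; `find_color_map` and `detect_composition` are hypothetical names):

```python
import numpy as np

def find_color_map(a, b):
    """If b is an elementwise recolor of a, return the color mapping dict, else None."""
    if a.shape != b.shape:
        return None
    cmap = {}
    for x, y in zip(a.ravel(), b.ravel()):
        if cmap.setdefault(int(x), int(y)) != int(y):
            return None  # inconsistent mapping -> not a pure recolor
    return cmap

def detect_composition(pairs):
    """pairs: list of (input_grid, output_grid). Return the first transform name
    such that every output is the same recolor of the transformed input."""
    transforms = {
        "rot90+recolor": lambda g: np.rot90(g),
        "flip+recolor": lambda g: g[::-1],
        "transpose+recolor": lambda g: g.T,
    }
    for name, t in transforms.items():
        maps = [find_color_map(t(a), b) for a, b in pairs]
        # every example must admit a color map, and all maps must agree
        if all(m is not None for m in maps) and all(m == maps[0] for m in maps):
            return name
    return None

a = np.array([[1, 1], [2, 3]])
b = np.array([[4, 6], [4, 5]])          # rot90(a) recolored via {1:4, 2:5, 3:6}
assert detect_composition([(a, b)]) == "rot90+recolor"
assert detect_composition([(a, a + 10)]) is None
```

Only the combinations this scan actually flags would then get an ONNX composition builder.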
+ ### 1c: Channel Reduction Wrapper
+ - [ ] **Design for Gather compatibility** — the current Reshape hardcodes [1,10,900]
+   - Option A: add Conv1x1(10→N) before + Conv1x1(N→10) after for conv-based models
+   - Option B: use Slice to extract active channels + Gather remapping for pure spatial transforms
+ - [ ] **Validate**: pick 5 tasks with <5 colors; compare score with and without the wrapper.
+   - Accept only if >5% score improvement per task AND arc-gen still passes.
+
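Option A is easy to sanity-check numerically: a 1x1 conv is just a matmul over the channel axis, so projecting onto the N active colors and back is lossless whenever the grid only uses those colors. A NumPy sketch (the function name and the [10, H, W] layout are illustrative):

```python
import numpy as np

def channel_reduction_roundtrip(onehot, active):
    """onehot: [10, H, W] one-hot grid; active: sorted list of used color indices.
    Apply Conv1x1(10->N) then Conv1x1(N->10), both expressed as channel matmuls."""
    n = len(active)
    down = np.zeros((n, 10))           # weights of Conv1x1(10->N): select active rows
    down[np.arange(n), active] = 1.0
    up = down.T                        # weights of Conv1x1(N->10): scatter them back
    c, h, w = onehot.shape
    x = onehot.reshape(c, -1)          # a 1x1 conv == matmul over the channel axis
    y = up @ (down @ x)
    return y.reshape(c, h, w)

# A grid using only colors {0, 3, 7} survives the 3-channel round trip exactly.
grid = np.zeros((10, 2, 2))
for (i, j), color in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 3, 7, 3]):
    grid[color, i, j] = 1.0
restored = channel_reduction_roundtrip(grid, [0, 3, 7])
assert np.array_equal(restored, grid)
```

The MAC saving comes from running the expensive middle transform on N channels instead of 10.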
+ ---
 
  ## Phase 2: Fix Arc-Gen Survival (est +100-150 tasks → ~2000-2500)
 
+ > **This is the #1 blocker.** We solve 307 locally but only 50 survive arc-gen.
+ > Research (Bartlett et al., Belkin et al., arXiv:2306.13185) suggests:
+ > - Our patch covariance has LOW effective rank (~10-40) vs n ≈ 600 patches
+ > - This is the CATASTROPHIC overfitting regime, NOT the benign one
+ > - Ridge/LOOCV λ tuning CANNOT fix this — theory predicts failure
+
+ ### 2a: Skip Interpolation-Threshold Kernels
+ - [ ] **Remove ks=5,7,9 from conv fitting** — these sit at or near the double-descent peak
+   - Try ks list: [1, 3, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]
+   - Rationale: ks=7 (p=490, n≈600) is the worst case, right at the interpolation threshold; ks=1 (p=10) is safely underparameterized; ks=29 (p=8410) is overparameterized but at least past the peak. (Here p = ks²·10 features per output channel.)
+ - [ ] **Validate**: full 400-task arc-gen run; compare arc-gen survival rate vs v4.
+   - Accept only if survival rate improves by >10% (5+ more tasks).
+
+ ### 2b: PCA Dimensionality Reduction Before lstsq
+ - [ ] **PCA pre-processing**: project the patch matrix P onto its top-k components (k=15-25, matching the effective rank)
+   - Fit PCA on training patches, transform both P and the test patches, then run lstsq in the reduced space
+   - Ensures p_reduced << n, avoiding the interpolation regime entirely
+ - [ ] **Validate**: test on 20 tasks that currently fail arc-gen at ks=7,9.
+   - Compare raw lstsq vs PCA+lstsq; measure arc-gen pass rate.
+   - Accept only if >20% of previously-failing tasks now pass.
+
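The PCA → lstsq pipeline fits in a few lines of NumPy. A sketch on synthetic low-effective-rank data, with shapes mirroring the p=490, n≈600 discussion above (`pca_lstsq` is a hypothetical helper, and the bias column is an added detail so constant offsets survive centering):

```python
import numpy as np

def pca_lstsq(P, Y, k=20):
    """P: [n, p] patch matrix, Y: [n, c] targets. Fit lstsq in the top-k PCA space."""
    mu = P.mean(axis=0)
    # top-k principal directions via SVD of the centered patch matrix
    _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
    V = Vt[:k].T                                          # [p, k]
    A = np.hstack([(P - mu) @ V, np.ones((len(P), 1))])   # reduced features + bias
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    def predict(X):
        Ax = np.hstack([(X - mu) @ V, np.ones((len(X), 1))])
        return Ax @ W
    return predict

rng = np.random.default_rng(0)
n, p, k = 600, 490, 20
# synthetic patches with effective rank ~k: targets depend on k latent directions only
Z = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
Y = Z @ rng.normal(size=(p, 3))
predict = pca_lstsq(Z, Y, k=k)
assert np.allclose(predict(Z), Y)     # k components capture the whole signal
```

With p_reduced = 21 features against n = 600 rows, the reduced system is heavily overdetermined instead of interpolating.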
+ ### 2c: Gradient Descent with Early Stopping (Alternative to lstsq)
+ - [ ] **Iterative solver**: Adam on conv weights, early-stopped at ~95% train accuracy (don't interpolate)
+   - Implicit ℓ₁-like regularization — theory predicts better generalization than explicit Ridge
+   - Use a small model: ks=3 single-layer or ks=(3,1) two-layer
+ - [ ] **Validate**: same 20 failing tasks; compare lstsq vs early-stopping GD.
+   - Accept only if >15% improvement in arc-gen survival.
+
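The early-stopping idea in miniature: plain full-batch gradient descent on a linear map, standing in for the Adam-on-conv-weights version (all names, the synthetic data, and the 95% threshold wiring are illustrative):

```python
import numpy as np

def fit_early_stop(X, Y, lr=0.5, max_steps=5000, target_acc=0.95):
    """Full-batch GD on ||XW - Y||^2, stopped as soon as train argmax accuracy
    reaches target_acc; deliberately NOT driven all the way to interpolation."""
    n = len(X)
    W = np.zeros((X.shape[1], Y.shape[1]))
    acc = 0.0
    for _ in range(max_steps):
        pred = X @ W
        acc = np.mean(pred.argmax(1) == Y.argmax(1))
        if acc >= target_acc:
            break                      # stop before the training fit is perfect
        W -= lr * (2.0 / n) * X.T @ (pred - Y)
    return W, acc

rng = np.random.default_rng(1)
labels = rng.integers(0, 10, size=200)
Y = np.eye(10)[labels]                                 # one-hot targets
X = np.hstack([Y, 0.1 * rng.normal(size=(200, 30))])   # signal + noise features
W, acc = fit_early_stop(X, Y)
assert acc >= 0.95
```

The real variant would swap the hand-written update for `torch.optim.Adam` on conv weights and monitor accuracy on held-out arc-gen examples rather than the training set.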
+ ### 2d: Lasso / Sparse Regression
+ - [ ] **Replace np.linalg.lstsq with sklearn.linear_model.Lasso**
+   - α tuning via cross-validation on the training data
+   - Matches the sparse signal structure of one-hot patches
+ - [ ] **Validate**: same 20 failing tasks; compare lstsq vs Lasso.
+   - Accept only if >15% improvement.
+
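A minimal lstsq-vs-Lasso comparison in the n << p regime, assuming scikit-learn is available (α is fixed here for the sketch; the real pipeline would tune it by cross-validation, e.g. `LassoCV` per output channel):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 60, 490                             # n << p: plain lstsq interpolates here
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]   # sparse truth, like one-hot patch signals
y = X @ w_true + 0.01 * rng.normal(size=n)

w_lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y).coef_
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Lasso keeps a sparse support; the min-norm lstsq solution smears weight everywhere
assert np.count_nonzero(np.abs(w_lasso) > 1e-6) <= 60
assert np.count_nonzero(np.abs(w_lstsq) > 1e-6) >= 480
```

The fitted sparse weights drop straight into the existing conv-builder path, since they have the same shape as the lstsq solution.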
+ ### 2e: PyTorch Multi-Seed with Arc-Gen Training (GPU Required)
+ - [ ] **Train Conv→ReLU→Conv on train+test+arc-gen** (all available examples matching the grid size)
+   - Multi-seed (0, 7, 42), 3000 steps, lr=0.03, early stopping on arc-gen loss
+   - ks=(3,1) or (5,1) two-layer
+   - **Ternary snap**: after training, snap weights to {-1, 0, 1}, re-validate on arc-gen
+ - [ ] **Validate**: run on 50 tasks; compare arc-gen survival vs the lstsq baseline.
+   - Needs GPU (T4 minimum); CPU is too slow for 400 tasks × 3 seeds.
+   - Accept only if >10% improvement AND total runtime under the 12-hour Kaggle limit.
+
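The ternary-snap step is worth pinning down before the GPU run. A NumPy sketch; the relative-magnitude threshold of 0.5 is an assumed heuristic, not a fixed spec:

```python
import numpy as np

def ternary_snap(w, rel_threshold=0.5):
    """Snap weights to {-1, 0, 1}: zero entries below a relative magnitude
    threshold, keep only the sign of the rest. Re-validate on arc-gen after."""
    cut = rel_threshold * np.abs(w).max()
    return np.where(np.abs(w) < cut, 0.0, np.sign(w))

w = np.array([[0.9, -0.1], [-1.2, 0.05]])
snapped = ternary_snap(w)
assert snapped.tolist() == [[1.0, 0.0], [-1.0, 0.0]]
```

Snapped weights compress far better in the ONNX file and cost the same MACs, so the snap is free whenever the re-validation still passes.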
+ ### 2f: Generate More ARC-GEN Data
+ - [ ] **Use the ARC-GEN generator** (github.com/google/ARC-GEN) to produce 1000+ examples per task
+   - More fitting data = more constraints, but ONLY helps if we avoid the interpolation regime
+   - Combine with PCA or GD — plain lstsq still overfits while p > n, regardless of extra rows
+ - [ ] **Validate**: test on 20 tasks with 1000 vs 250 arc-gen examples.
+   - Compare arc-gen survival. Accept only if >10% improvement.
+
+ ---
+
+ ## Phase 3: Hard Tasks — Hash Matchers & Pattern Detectors (est +20-50 tasks → ~2500-3000)
+
+ ### 3a: Hash-Based Matcher Builder
+ - [ ] **Generic hash matcher**: flatten input → MatMul(hash_weights) → match against known examples → apply the stored delta
+   - Requires opset 17 (ScatterND)
+   - Works for ANY task where all examples fit in a 1.44MB model
+   - Build `build_hash_matcher(task_data) → onnx_bytes`
+ - [ ] **Validate**: identify 10 tasks that no solver handles; test the hash matcher on them.
+   - Accept if it solves ≥2 tasks that are currently unsolved.
 
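Before it is compiled to MatMul + ScatterND, the matcher logic is just nearest-hash lookup. A NumPy sketch (using a random projection as `hash_weights`, and rot90 outputs as stand-ins for the stored per-example deltas; both are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
H, W = 5, 5
examples = [rng.integers(0, 10, size=(H, W)) for _ in range(4)]
outputs = [np.rot90(e) for e in examples]       # stand-ins for stored task outputs

hash_weights = rng.normal(size=(H * W, 8))      # the MatMul projection: 8-dim hash
keys = np.stack([e.ravel() @ hash_weights for e in examples])

def hash_match(grid):
    """Flatten -> MatMul(hash_weights) -> nearest stored key -> stored output."""
    h = grid.ravel() @ hash_weights
    idx = np.argmin(np.linalg.norm(keys - h, axis=1))
    return outputs[idx]

assert np.array_equal(hash_match(examples[2]), outputs[2])
```

An exact input hashes to distance zero from its own key, so the lookup is exact whenever all task inputs are among the stored examples; model size then scales with the number of stored (key, output) pairs, which is what the 1.44MB bound constrains.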
+ ### 3b: Run-Length / Gap Pattern Detector
+ - [ ] **Depthwise conv to detect runs of N and gap patterns** — like task096 in the public notebooks
+   - Template for "count and classify" tasks
+ - [ ] **Validate**: find tasks with run-length structure; test the detector.
+   - Accept if it solves ≥2 new tasks.
 
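Run detection with a length-n ones kernel is a correlation whose output equals n exactly where a window of ones fits, the 1-D analogue of the depthwise-conv idea (a sketch; `detect_runs` is a hypothetical name):

```python
import numpy as np

def detect_runs(row, n):
    """Return start indices of every window of n consecutive ones in a binary
    1-D array, via correlation with a length-n ones kernel."""
    scores = np.correlate(row.astype(float), np.ones(n), mode="valid")
    return np.flatnonzero(scores == n)          # windows fully covered by ones

row = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1])
assert detect_runs(row, 3).tolist() == [1, 8, 9]
```

In the depthwise-conv version the same ones kernel runs per channel, and an equality/threshold node replaces the `== n` comparison.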
+ ### 3c: Per-Task LLM Rescue
+ - [ ] **For the ~20 hardest tasks**: feed the task JSON + a Python solution to an LLM → get back an ONNX builder function
+   - Priority: gravity, flood fill, outline extraction, pattern counting
+ - [ ] **Validate**: build 5 rescue models; arc-gen validate. Accept if ≥3 pass.
+
+ ---
 
  ## Phase 4: Score Optimization (est +200-500 pts on existing tasks)
 
+ ### 4a: ONNX Optimizer Pass
+ - [ ] **`onnxoptimizer.optimize()`** with dead-code elimination and identity removal
    - Top notebooks do this; can shrink models 5-20%
+ - [ ] **Validate**: run on all 400 models; compare total score before/after.
+   - Accept if total score improves by >2%.
+
+ ### 4b: Best-of-N Model Selection
+ - [ ] **For each task**: generate multiple candidates (different ks, bias/no-bias, PCA vs raw, etc.)
+   - Keep the cheapest valid one
+ - [ ] **Validate**: full 400-task run; compare total score vs single-candidate selection.
+   - Accept if total score improves by >3%.
+
+ ### 4c: Official Scoring Alignment
+ - [ ] **Use `neurogolf_utils.score_network()`** — `onnx_tool` for exact cost matching
+   - Our static profiler may diverge on edge cases
+ - [ ] **Validate**: compare the static profiler vs onnx_tool on 50 random models.
+   - If divergence exceeds 5%, fix the profiler.
+
+ ---
 
+ ## BLENDING — EXPLICITLY EXCLUDED
+
+ > **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."
+
+ - [ ] ~~Blend pipeline~~ — **NOT DONE. Not our strategy.**
+ - [ ] ~~Upload submission.zip as Kaggle dataset~~ — **NOT DONE.**
+ - [ ] ~~Attach public datasets (24 sources)~~ — **NOT DONE.**
+
+ Competitive intelligence on blending stays in the LEARNING.md "What Others Do" section only.
+
+ ---
+
+ ## Experiment Log
+
+ | Date | Experiment | Tasks Tested | Result | Decision |
+ |------|------------|--------------|--------|----------|
+ | 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep |
+ | 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
+ | 2026-04-25 | LOOCV Ridge theory | 0 | Never tested — theory predicts failure | **NOT IMPLEMENTED** |
+
+ ---
 
  ## Status Key
 
  | Symbol | Meaning |
  |--------|---------|
+ | `[ ]` | Not started — needs research/design first |
+ | `[~]` | In progress — experiment running |
+ | `[x]` | Done — validated with arc-gen on ≥20 tasks, confirmed score increase |
+ | `[!]` | Blocked — needs a prerequisite or resource (e.g., GPU) |
+ | `[-]` | Rejected — tested, did not improve arc-gen survival or score |
+
+ ## Research Queue (Next 3 Papers to Read)
+
+ 1. **arXiv:2302.00257** — "Benign overfitting in ridge regression..." (Lasso vs Ridge in sparse regimes)
+ 2. **Belkin et al. (2019) PNAS** — "Reconciling modern machine-learning practice..." (double descent, interpolation threshold)
+ 3. **CITE NEEDED** — ARC-AGI solver papers from NeurIPS 2024 / ICML 2024 workshops
+
+ > Loop: Research → Design → Experiment → Analyze → Research → ... until the score increases.