rogermt committed
Commit 863483e · verified · 1 Parent(s): eabdff6

v4.3: Update TODO.md with experiment queue, research loop, status key, explicit blending exclusion

Files changed (1): TODO.md +148 -48
TODO.md CHANGED
@@ -1,74 +1,174 @@
  # NeuroGolf Solver — Roadmap
 
- > Current: v4.2 · 50 arc-gen validated · ~670 LB · Target: 3000+
 
  ## Phase 1: Cheap Wins (est +400 pts → ~1100)
 
- - [ ] **Switch to opset 17** — replace all Gather-index models with Slice+Transpose builders
    - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
    - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
    - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
-   - ~25 analytical tasks go from ~15 pts → ~25 pts each
- - [ ] **Channel reduction wrapper** — `Conv1x1(10→N) → transform → Conv1x1(N→10)` when <8 colors used
-   - Saves ~20-40% MACs on conv tasks with few colors
- - [ ] **Composition detectors** — rotation+color, flip+color, transpose+color
-   - These are tasks where two operations are combined (e.g. rotate then recolor)
-   - Top notebooks have these, we don't
 
  ## Phase 2: Fix Arc-Gen Survival (est +100-150 tasks → ~2000-2500)
 
- This is the #1 blocker. We solve 307 locally but only 50 survive arc-gen.
-
- - [ ] **PyTorch learned conv on GPU** — train on train+test+arc-gen data
-   - Multi-seed Adam (seeds 0,7,42), 3000 steps, lr=0.03
-   - Try ks=1,3,5 single-layer + ks=(3,1) and (5,1) two-layer with ReLU
-   - **Ternary weight snap** — after training, snap weights to {-1,0,1}, re-validate
-   - Must include arc-gen examples in training data (not just validation)
-   - Needs GPU (T4 minimum) — CPU too slow for 400 tasks × 3 seeds × multiple ks
- - [ ] **Increase arc-gen in lstsq fitting** — currently capped at 10, try 20-50 for fixed-size tasks
-   - More data = more constraints = less overfitting in underdetermined systems
- - [ ] **Generate MORE arc-gen data** — use the ARC-GEN generator (github.com/google/ARC-GEN) to produce 1000+ examples per task instead of ~250
-   - More fitting data = better generalization
 
- ## Phase 3: Hard Tasks — Hash Matchers & LLM Rescue (est +20-50 tasks → ~2500-3000)
 
- For tasks no automated solver can handle.
 
- - [ ] **Hash-based matcher builder** — automated version of the LLM rescue pattern
-   - Flatten input → MatMul(hash_weights) → match against all known examples → apply stored delta
-   - Requires opset 17 (ScatterND)
-   - Works for ANY task where all examples fit in a 1.44MB model
-   - Build a generic `build_hash_matcher(task_data) → onnx_bytes` function
- - [ ] **Per-task LLM rescue** — for the ~20 hardest tasks with algorithmic patterns
-   - Feed task JSON + Python solution to LLM, get back an ONNX builder function
-   - Priority tasks: gravity, flood fill, outline extraction, pattern counting
- - [ ] **Run-length / gap pattern detector** — like task096 in the notebooks
-   - Depthwise conv to detect runs of N, gap patterns
-   - Template for a class of "count and classify" tasks
 
  ## Phase 4: Score Optimization (est +200-500 pts on existing tasks)
 
- - [ ] **ONNX optimizer pass** — `onnxoptimizer.optimize()` with dead-code elimination, identity removal
    - Top notebooks do this; can shrink models 5-20%
- - [ ] **Best-of-N model selection** — for each task, generate multiple candidate models (different ks, bias/no-bias, etc.), keep the cheapest valid one
-   - Already partially done but could be more aggressive
- - [ ] **Validate with official `neurogolf_utils.score_network()`** — use `onnx_tool` for exact cost matching
-   - Our static profiler is close but may diverge on edge cases
 
- ## Optional: Blend Pipeline
 
- If the above isn't enough, we can build our own blend pipeline:
 
- - [ ] Upload our solver's `submission.zip` as a Kaggle dataset
- - [ ] Create a blend notebook that loads our own output + runs a second-pass solver
- - [ ] Attach public datasets (see LEARNING.md for the full list of 24 sources)
- - [ ] `strict_validate()` every model through `neurogolf_utils` before submission
 
  ## Status Key
 
  | Symbol | Meaning |
  |--------|---------|
- | `[ ]` | Not started |
- | `[~]` | In progress |
- | `[x]` | Done |
- | `[!]` | Blocked |
  # NeuroGolf Solver — Roadmap
 
+ > Current: v4.3 · 50 arc-gen validated · ~670 LB · Target: 3000+
+ > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until a confirmed score increase.
+ > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
 
  ## Phase 1: Cheap Wins (est +400 pts → ~1100)
 
+ ### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost)
+ - [ ] **Convert ALL analytical solvers to opset 17** — not just new ones
    - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
    - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
    - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
+   - Pad nodes: all must use the opset 17 tensor-based `pads` input (not the attribute)
+   - Affected solvers: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
+ - [ ] **Validate**: full 400-task arc-gen run; compare analytical task count vs v4.
+   - Target: ~25 analytical tasks scoring ~25 pts each (was ~15)
+   - Accept only if >10% improvement in the analytical category's total score.
+
+ ### 1b: Composition Detectors
+ - [ ] **Identify actual tasks** that are rotation+recolor, flip+recolor, or transpose+recolor
+   - Scan all 400 tasks: apply rotate, then check whether a color_map solves the residual, etc.
+   - Only implement solvers for combinations that actually exist in the dataset
+ - [ ] **Build composition solver** — chain analytical + color_map as a single ONNX graph
+ - [ ] **Validate**: full 400-task arc-gen run; count new tasks solved. Accept only if >0 new tasks.
+
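The task scan can be prototyped offline in NumPy before touching ONNX. A sketch under assumed conventions (grids as 2-D integer arrays; `find_color_map` and `detect_composition` are hypothetical names):

```python
import numpy as np

def find_color_map(a, b):
    """If b is an elementwise recolor of a, return the color mapping dict, else None."""
    if a.shape != b.shape:
        return None
    cmap = {}
    for x, y in zip(a.ravel(), b.ravel()):
        if cmap.setdefault(int(x), int(y)) != int(y):
            return None  # inconsistent mapping -> not a pure recolor
    return cmap

def detect_composition(pairs):
    """pairs: list of (input_grid, output_grid). Return the first transform name
    such that every output is the same recolor of the transformed input."""
    transforms = {
        "rot90+recolor": lambda g: np.rot90(g),
        "flip+recolor": lambda g: g[::-1],
        "transpose+recolor": lambda g: g.T,
    }
    for name, t in transforms.items():
        maps = [find_color_map(t(a), b) for a, b in pairs]
        # every example must admit a color map, and all maps must agree
        if all(m is not None for m in maps) and all(m == maps[0] for m in maps):
            return name
    return None

a = np.array([[1, 1], [2, 3]])
b = np.array([[4, 6], [4, 5]])          # rot90(a) recolored via {1:4, 2:5, 3:6}
assert detect_composition([(a, b)]) == "rot90+recolor"
assert detect_composition([(a, a + 10)]) is None
```

Only the combinations this scan actually flags would then get an ONNX composition builder.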
+ ### 1c: Channel Reduction Wrapper
+ - [ ] **Design for Gather compatibility** — the current Reshape hardcodes [1,10,900]
+   - Option A: add Conv1x1(10→N) before + Conv1x1(N→10) after for conv-based models
+   - Option B: use Slice to extract active channels + Gather remapping for pure spatial transforms
+ - [ ] **Validate**: pick 5 tasks with <5 colors; compare score with and without the wrapper.
+   - Accept only if >5% score improvement per task AND arc-gen still passes.
+
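Option A is easy to sanity-check numerically: a 1x1 conv is just a matmul over the channel axis, so projecting onto the N active colors and back is lossless whenever the grid only uses those colors. A NumPy sketch (the function name and the [10, H, W] layout are illustrative):

```python
import numpy as np

def channel_reduction_roundtrip(onehot, active):
    """onehot: [10, H, W] one-hot grid; active: sorted list of used color indices.
    Apply Conv1x1(10->N) then Conv1x1(N->10), both expressed as channel matmuls."""
    n = len(active)
    down = np.zeros((n, 10))           # weights of Conv1x1(10->N): select active rows
    down[np.arange(n), active] = 1.0
    up = down.T                        # weights of Conv1x1(N->10): scatter them back
    c, h, w = onehot.shape
    x = onehot.reshape(c, -1)          # a 1x1 conv == matmul over the channel axis
    y = up @ (down @ x)
    return y.reshape(c, h, w)

# A grid using only colors {0, 3, 7} survives the 3-channel round trip exactly.
grid = np.zeros((10, 2, 2))
for (i, j), color in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 3, 7, 3]):
    grid[color, i, j] = 1.0
restored = channel_reduction_roundtrip(grid, [0, 3, 7])
assert np.array_equal(restored, grid)
```

The MAC saving comes from running the expensive middle transform on N channels instead of 10.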
+ ---
 
  ## Phase 2: Fix Arc-Gen Survival (est +100-150 tasks → ~2000-2500)
 
+ > **This is the #1 blocker.** We solve 307 locally but only 50 survive arc-gen.
+ > Research (Bartlett et al., Belkin et al., arXiv:2306.13185) suggests:
+ > - Our patch covariance has LOW effective rank (~10-40) vs n ≈ 600 patches
+ > - This is the CATASTROPHIC overfitting regime, NOT the benign one
+ > - Ridge/LOOCV λ tuning CANNOT fix this — theory predicts failure
+
+ ### 2a: Skip Interpolation-Threshold Kernels
+ - [ ] **Remove ks=5,7,9 from conv fitting** — these sit at or near the double-descent peak
+   - Try ks list: [1, 3, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]
+   - Rationale: ks=7 (p=490, n≈600) is the worst case, right at the interpolation threshold; ks=1 (p=10) is safely underparameterized; ks=29 (p=8410) is overparameterized but at least past the peak. (Here p = ks²·10 features per output channel.)
+ - [ ] **Validate**: full 400-task arc-gen run; compare arc-gen survival rate vs v4.
+   - Accept only if survival rate improves by >10% (5+ more tasks).
+
+ ### 2b: PCA Dimensionality Reduction Before lstsq
+ - [ ] **PCA pre-processing**: project the patch matrix P onto its top-k components (k=15-25, matching the effective rank)
+   - Fit PCA on training patches, transform both P and the test patches, then run lstsq in the reduced space
+   - Ensures p_reduced << n, avoiding the interpolation regime entirely
+ - [ ] **Validate**: test on 20 tasks that currently fail arc-gen at ks=7,9.
+   - Compare raw lstsq vs PCA+lstsq; measure arc-gen pass rate.
+   - Accept only if >20% of previously-failing tasks now pass.
+
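The PCA → lstsq pipeline fits in a few lines of NumPy. A sketch on synthetic low-effective-rank data, with shapes mirroring the p=490, n≈600 discussion above (`pca_lstsq` is a hypothetical helper, and the bias column is an added detail so constant offsets survive centering):

```python
import numpy as np

def pca_lstsq(P, Y, k=20):
    """P: [n, p] patch matrix, Y: [n, c] targets. Fit lstsq in the top-k PCA space."""
    mu = P.mean(axis=0)
    # top-k principal directions via SVD of the centered patch matrix
    _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
    V = Vt[:k].T                                          # [p, k]
    A = np.hstack([(P - mu) @ V, np.ones((len(P), 1))])   # reduced features + bias
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    def predict(X):
        Ax = np.hstack([(X - mu) @ V, np.ones((len(X), 1))])
        return Ax @ W
    return predict

rng = np.random.default_rng(0)
n, p, k = 600, 490, 20
# synthetic patches with effective rank ~k: targets depend on k latent directions only
Z = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
Y = Z @ rng.normal(size=(p, 3))
predict = pca_lstsq(Z, Y, k=k)
assert np.allclose(predict(Z), Y)     # k components capture the whole signal
```

With p_reduced = 21 features against n = 600 rows, the reduced system is heavily overdetermined instead of interpolating.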
+ ### 2c: Gradient Descent with Early Stopping (Alternative to lstsq)
+ - [ ] **Iterative solver**: Adam on conv weights, early-stopped at ~95% train accuracy (don't interpolate)
+   - Implicit ℓ₁-like regularization — theory predicts better generalization than explicit Ridge
+   - Use a small model: ks=3 single-layer or ks=(3,1) two-layer
+ - [ ] **Validate**: same 20 failing tasks; compare lstsq vs early-stopping GD.
+   - Accept only if >15% improvement in arc-gen survival.
+
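The early-stopping idea in miniature: plain full-batch gradient descent on a linear map, standing in for the Adam-on-conv-weights version (all names, the synthetic data, and the 95% threshold wiring are illustrative):

```python
import numpy as np

def fit_early_stop(X, Y, lr=0.5, max_steps=5000, target_acc=0.95):
    """Full-batch GD on ||XW - Y||^2, stopped as soon as train argmax accuracy
    reaches target_acc; deliberately NOT driven all the way to interpolation."""
    n = len(X)
    W = np.zeros((X.shape[1], Y.shape[1]))
    acc = 0.0
    for _ in range(max_steps):
        pred = X @ W
        acc = np.mean(pred.argmax(1) == Y.argmax(1))
        if acc >= target_acc:
            break                      # stop before the training fit is perfect
        W -= lr * (2.0 / n) * X.T @ (pred - Y)
    return W, acc

rng = np.random.default_rng(1)
labels = rng.integers(0, 10, size=200)
Y = np.eye(10)[labels]                                 # one-hot targets
X = np.hstack([Y, 0.1 * rng.normal(size=(200, 30))])   # signal + noise features
W, acc = fit_early_stop(X, Y)
assert acc >= 0.95
```

The real variant would swap the hand-written update for `torch.optim.Adam` on conv weights and monitor accuracy on held-out arc-gen examples rather than the training set.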
+ ### 2d: Lasso / Sparse Regression
+ - [ ] **Replace np.linalg.lstsq with sklearn.linear_model.Lasso**
+   - α tuning via cross-validation on the training data
+   - Matches the sparse signal structure of one-hot patches
+ - [ ] **Validate**: same 20 failing tasks; compare lstsq vs Lasso.
+   - Accept only if >15% improvement.
+
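A minimal lstsq-vs-Lasso comparison in the n << p regime, assuming scikit-learn is available (α is fixed here for the sketch; the real pipeline would tune it by cross-validation, e.g. `LassoCV` per output channel):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 60, 490                             # n << p: plain lstsq interpolates here
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]   # sparse truth, like one-hot patch signals
y = X @ w_true + 0.01 * rng.normal(size=n)

w_lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y).coef_
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Lasso keeps a sparse support; the min-norm lstsq solution smears weight everywhere
assert np.count_nonzero(np.abs(w_lasso) > 1e-6) <= 60
assert np.count_nonzero(np.abs(w_lstsq) > 1e-6) >= 480
```

The fitted sparse weights drop straight into the existing conv-builder path, since they have the same shape as the lstsq solution.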
+ ### 2e: PyTorch Multi-Seed with Arc-Gen Training (GPU Required)
+ - [ ] **Train Conv→ReLU→Conv on train+test+arc-gen** (all available examples matching the grid size)
+   - Multi-seed (0, 7, 42), 3000 steps, lr=0.03, early stopping on arc-gen loss
+   - ks=(3,1) or (5,1) two-layer
+   - **Ternary snap**: after training, snap weights to {-1, 0, 1}, re-validate on arc-gen
+ - [ ] **Validate**: run on 50 tasks; compare arc-gen survival vs the lstsq baseline.
+   - Needs GPU (T4 minimum); CPU is too slow for 400 tasks × 3 seeds.
+   - Accept only if >10% improvement AND total runtime under the 12-hour Kaggle limit.
+
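The ternary-snap step is worth pinning down before the GPU run. A NumPy sketch; the relative-magnitude threshold of 0.5 is an assumed heuristic, not a fixed spec:

```python
import numpy as np

def ternary_snap(w, rel_threshold=0.5):
    """Snap weights to {-1, 0, 1}: zero entries below a relative magnitude
    threshold, keep only the sign of the rest. Re-validate on arc-gen after."""
    cut = rel_threshold * np.abs(w).max()
    return np.where(np.abs(w) < cut, 0.0, np.sign(w))

w = np.array([[0.9, -0.1], [-1.2, 0.05]])
snapped = ternary_snap(w)
assert snapped.tolist() == [[1.0, 0.0], [-1.0, 0.0]]
```

Snapped weights compress far better in the ONNX file and cost the same MACs, so the snap is free whenever the re-validation still passes.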
+ ### 2f: Generate More ARC-GEN Data
+ - [ ] **Use the ARC-GEN generator** (github.com/google/ARC-GEN) to produce 1000+ examples per task
+   - More fitting data = more constraints, but ONLY helps if we avoid the interpolation regime
+   - Combine with PCA or GD — plain lstsq still overfits while p > n, regardless of extra rows
+ - [ ] **Validate**: test on 20 tasks with 1000 vs 250 arc-gen examples.
+   - Compare arc-gen survival. Accept only if >10% improvement.
+
+ ---
+
+ ## Phase 3: Hard Tasks — Hash Matchers & Pattern Detectors (est +20-50 tasks → ~2500-3000)
+
+ ### 3a: Hash-Based Matcher Builder
+ - [ ] **Generic hash matcher**: flatten input → MatMul(hash_weights) → match against known examples → apply the stored delta
+   - Requires opset 17 (ScatterND)
+   - Works for ANY task where all examples fit in a 1.44MB model
+   - Build `build_hash_matcher(task_data) → onnx_bytes`
+ - [ ] **Validate**: identify 10 tasks that no solver handles; test the hash matcher on them.
+   - Accept if it solves ≥2 tasks that are currently unsolved.
 
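Before it is compiled to MatMul + ScatterND, the matcher logic is just nearest-hash lookup. A NumPy sketch (using a random projection as `hash_weights`, and rot90 outputs as stand-ins for the stored per-example deltas; both are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
H, W = 5, 5
examples = [rng.integers(0, 10, size=(H, W)) for _ in range(4)]
outputs = [np.rot90(e) for e in examples]       # stand-ins for stored task outputs

hash_weights = rng.normal(size=(H * W, 8))      # the MatMul projection: 8-dim hash
keys = np.stack([e.ravel() @ hash_weights for e in examples])

def hash_match(grid):
    """Flatten -> MatMul(hash_weights) -> nearest stored key -> stored output."""
    h = grid.ravel() @ hash_weights
    idx = np.argmin(np.linalg.norm(keys - h, axis=1))
    return outputs[idx]

assert np.array_equal(hash_match(examples[2]), outputs[2])
```

An exact input hashes to distance zero from its own key, so the lookup is exact whenever all task inputs are among the stored examples; model size then scales with the number of stored (key, output) pairs, which is what the 1.44MB bound constrains.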
+ ### 3b: Run-Length / Gap Pattern Detector
+ - [ ] **Depthwise conv to detect runs of N and gap patterns** — like task096 in the public notebooks
+   - Template for "count and classify" tasks
+ - [ ] **Validate**: find tasks with run-length structure; test the detector.
+   - Accept if it solves ≥2 new tasks.
 
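Run detection with a length-n ones kernel is a correlation whose output equals n exactly where a window of ones fits, the 1-D analogue of the depthwise-conv idea (a sketch; `detect_runs` is a hypothetical name):

```python
import numpy as np

def detect_runs(row, n):
    """Return start indices of every window of n consecutive ones in a binary
    1-D array, via correlation with a length-n ones kernel."""
    scores = np.correlate(row.astype(float), np.ones(n), mode="valid")
    return np.flatnonzero(scores == n)          # windows fully covered by ones

row = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1])
assert detect_runs(row, 3).tolist() == [1, 8, 9]
```

In the depthwise-conv version the same ones kernel runs per channel, and an equality/threshold node replaces the `== n` comparison.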
+ ### 3c: Per-Task LLM Rescue
+ - [ ] **For the ~20 hardest tasks**: feed the task JSON + a Python solution to an LLM → get back an ONNX builder function
+   - Priority: gravity, flood fill, outline extraction, pattern counting
+ - [ ] **Validate**: build 5 rescue models; arc-gen validate. Accept if ≥3 pass.
+
+ ---
 
  ## Phase 4: Score Optimization (est +200-500 pts on existing tasks)
 
+ ### 4a: ONNX Optimizer Pass
+ - [ ] **`onnxoptimizer.optimize()`** with dead-code elimination and identity removal
    - Top notebooks do this; can shrink models 5-20%
+ - [ ] **Validate**: run on all 400 models; compare total score before/after.
+   - Accept if total score improves by >2%.
+
+ ### 4b: Best-of-N Model Selection
+ - [ ] **For each task**: generate multiple candidates (different ks, bias/no-bias, PCA vs raw, etc.)
+   - Keep the cheapest valid one
+ - [ ] **Validate**: full 400-task run; compare total score vs single-candidate selection.
+   - Accept if total score improves by >3%.
+
+ ### 4c: Official Scoring Alignment
+ - [ ] **Use `neurogolf_utils.score_network()`** — `onnx_tool` for exact cost matching
+   - Our static profiler may diverge on edge cases
+ - [ ] **Validate**: compare the static profiler vs onnx_tool on 50 random models.
+   - If divergence exceeds 5%, fix the profiler.
+
+ ---
 
+ ## BLENDING — EXPLICITLY EXCLUDED
+
+ > **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."
+
+ - [ ] ~~Blend pipeline~~ — **NOT DONE. Not our strategy.**
+ - [ ] ~~Upload submission.zip as Kaggle dataset~~ — **NOT DONE.**
+ - [ ] ~~Attach public datasets (24 sources)~~ — **NOT DONE.**
+
+ Competitive intelligence on blending stays in the LEARNING.md "What Others Do" section only.
+
+ ---
+
+ ## Experiment Log
+
+ | Date | Experiment | Tasks Tested | Result | Decision |
+ |------|------------|--------------|--------|----------|
+ | 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep |
+ | 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
+ | 2026-04-25 | LOOCV Ridge theory | 0 | Never tested — theory predicts failure | **NOT IMPLEMENTED** |
+
+ ---
 
  ## Status Key
 
  | Symbol | Meaning |
  |--------|---------|
+ | `[ ]` | Not started — needs research/design first |
+ | `[~]` | In progress — experiment running |
+ | `[x]` | Done — validated with arc-gen on ≥20 tasks, confirmed score increase |
+ | `[!]` | Blocked — needs a prerequisite or resource (e.g., GPU) |
+ | `[-]` | Rejected — tested, did not improve arc-gen survival or score |
+
+ ## Research Queue (Next 3 Papers to Read)
+
+ 1. **arXiv:2302.00257** — "Benign overfitting in ridge regression..." (Lasso vs Ridge in sparse regimes)
+ 2. **Belkin et al. (2019) PNAS** — "Reconciling modern machine-learning practice..." (double descent, interpolation threshold)
+ 3. **CITE NEEDED** — ARC-AGI solver papers from NeurIPS 2024 / ICML 2024 workshops
+
+ > Loop: Research → Design → Experiment → Analyze → Research → ... until the score increases.