rogermt committed
Commit 05ef2ec · verified · 1 Parent(s): 909a0d3

Rewrite Phase 3 with research-backed blueprints + honest projections

Based on CompressARC (2512.06104), TRM (2510.04871), NCA (2506.15746),
and ONNX opset 17 operator audit. Realistic score estimates per solver type.
Removed fake excluded tasks references throughout.

Files changed (1): TODO.md (+217 -165)

TODO.md CHANGED
@@ -1,172 +1,238 @@
  # NeuroGolf Solver — Roadmap

- > Current: v5.1 · 49 arc-gen validated (budget=5s) · ~603.6 score · Target: 3000+
  > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
  > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
- > Updated: 2026-04-26 — Exp 3 (PCA/SVD) fully tested on 400 tasks. 0 PCR solves. Architecture mismatch confirmed.

  ---

- ## Phase 1: Cheap Wins (est +400 pts → ~1100)

- ### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost)
- - [ ] **Convert ALL analytical solvers to opset 17** — not just new ones
-   - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
-   - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
-   - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
-   - Pad nodes: all must use opset 17 tensor-based `pads` input (not attribute)
-   - Affected solvers: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
- - [ ] **Validate**: Full 400 arc-gen run. Compare analytical task count vs v4.
-   - Target: ~25 analytical tasks scoring ~25 pts each (was ~15)
-   - Accept only if >10% improvement in analytical category total score.

- ### 1b: Composition Detectors
- - [ ] **Identify actual tasks** that are rotation+recolor, flip+recolor, transpose+recolor
-   - Scan 400 tasks: apply rotate → check if color_map solves, etc.
-   - Only implement solvers for combinations that exist in dataset
- - [ ] **Build composition solver** — chain analytical + color_map as single ONNX graph
- - [ ] **Validate**: Full 400 arc-gen. Count new tasks solved. Accept only if >0 new tasks.

- ### 1c: Channel Reduction Wrapper
- - [ ] **Design for Gather compatibility** — current Reshape hardcodes [1,10,900]
-   - Option A: Add Conv1x1(10→N) before + Conv1x1(N→10) after for conv-based models
-   - Option B: Use Slice to extract active channels + Gather remapping for pure spatial transforms
- - [ ] **Validate**: Pick 5 tasks with <5 colors. Compare score with/without wrapper.
-   - Accept only if >5% score improvement per task AND arc-gen still passes.
  ---

- ## Phase 2: Fix Arc-Gen Survival — EXPERIMENTS COMPLETED

- > **Status:** Exps 0-3 tested. Root cause is architecture mismatch, not regularization.
- > **Action:** Move to Phase 3 (new solver types). Keep PCR code for future Lasso/Ridge experiments.

- ### The Problem (with numbers from conv.py)

- Current `_lstsq_conv()` runs `np.linalg.lstsq(P, T_oh, rcond=None)` — zero regularization.
- v5.1 refactored to composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`.
- PCR (`_solve_weights_pcr`) added as deferred 2nd-pass fallback.

- | Kernel | p (features) | n (patches, 7×7 grid, 4 ex) | p/n | Regime |
- |--------|--------------|-----------------------------|-----|--------|
- | ks=1 | 10 | 196 | 0.05 | ✅ Safe underparameterized |
- | ks=3 | 90 | 196 | 0.46 | ✅ Underparameterized |
- | **ks=5** | **250** | **196** | **1.27** | **❌ INTERPOLATION THRESHOLD** |
- | **ks=7** | **490** | **196** | **2.50** | **❌ PAST THRESHOLD** |
- | ks=11 | 1210 | 196 | 6.17 | Overparameterized |
- | ks=29 | 8410 | 196 | 42.9 | Heavily overparameterized |

- ### Literature Backing

- | Paper | arXiv | Key Finding for Us |
- |-------|-------|--------------------|
- | Nakkiran et al. 2019 (NeurIPS) | `1912.02292` | Test error peaks at p≈n. Correct theory but inapplicable — tasks fail for architecture mismatch, not regularization. |
- | Segert 2023 | `2311.11093` | PCA > Ridge for low-rank covariance. Tested: 0/400 PCR solves. Signal is in the noise dimensions PCA removes. |
- | Zhou & Ge 2023 (NeurIPS) | `2302.00257` | L1 near-minimax for sparse signals. **Untested** — may still help for Exp 5. |
- | Liao & Gu 2024 (CompressARC) | `2512.06104` | Regularization enables ARC generalization. True in their framework (MDL/KL), but conv lstsq is a different beast. |

- ### Experiment Results

- #### Exp 0: Baseline Measurement [x] DONE
- - v5.0 on 400 tasks with budget=5s: **49 solved, 603.6 score**
- - Conv breakdown: 16 conv_var + 8 conv_fixed + 1 conv_diff = 25 conv tasks

- #### Exp 1: Skip ks=5,7,9 [-] REJECTED
- - HURTS 2 solved tasks (322@ks5, 299@ks9), helps 0 new

- #### Exp 2: Best-of-N [~] NEUTRAL
- - No new solves on unsolved tasks. Score optimization only.

- #### Exp 3: PCA / Truncated SVD [-] REJECTED — Confidence: ~~75%~~ → **0%**

- **Full test results (2026-04-26):**

- **Diagnostic on 25 solved conv tasks:**

- | p/n regime | Tasks | PCR at 0.99 | Arc-gen impact |
- |------------|-------|-------------|----------------|
- | p/n < 0.5 (safe) | 17 | Mostly fits train | Already 100% ag — no improvement possible |
- | p/n > 1.0 (danger) | 8 | 4 fail to fit train at ANY threshold | PCR removes dimensions that carry signal |

- **Diagnostic on 345 unsolved tasks (same-shape only, ks≤9):**
- - Only **10 tasks** have any ks where lstsq fits training
- - PCR improves arc-gen accuracy on **4 tasks** (by 3-9%) but **none reach 100%** required for validation
-   - Task 32: lstsq 87.5% → PCR 94.9% (still fails)
-   - Task 389: lstsq 87.2% → PCR 95.7% (still fails)
-   - Task 129: lstsq 59.6% → PCR 63.0% (still fails)
-   - Task 229: lstsq 57.0% → PCR 60.0% (still fails)

- **Full 400-task run with PCR-enhanced solver:**
- - 50 solved (vs 49 baseline) — the +1 is Task 61, a **timing artifact** (took 11.8s, not a PCR solve)
- - **0 tasks solved via PCR path**
- - **0 regressions** on existing 25 conv tasks
- - Code kept: composable primitives useful for future Lasso/Ridge experiments

- **Why PCR failed:**
- 1. For tasks with p/n < 0.5: lstsq already generalizes perfectly. PCR is unnecessary.
- 2. For tasks with p/n > 1.0: interpolating the training signal requires ALL patch dimensions. PCA truncation removes exactly the dimensions that encode the (noisy) signal, causing train_fail.
- 3. For unsolved tasks: most (~335/345) can't be fit by ANY ks — architecture mismatch (conv can't represent the required operation). The 10 that fit have wrong arc-gen behavior because the task requires global reasoning, not local patches.

- #### Exp 4: Increase Arc-Gen Fitting Cap [DEPRIORITIZED]
- > Only works with regularization. Since regularization (Exp 3) didn't help, this is moot.

- #### Exp 5: Lasso (L1) for Large Kernels ⬜ — Confidence: **55%**
- > Still potentially useful — L1 selects sparse features differently from PCA. Untested.
- > But given that only 10/345 unsolved tasks even have lstsq fits, the ceiling is very low.

- #### Exps 6-8: [DEPRIORITIZED]

  ---

- ### Phase 2 Post-Mortem

- **Original projection was wildly optimistic:**

- | Scenario | Projected | Actual |
- |----------|-----------|--------|
- | Exp 1 alone | 60-80 tasks | **HURT** 2 tasks |
- | Exp 1+2+3 | 90-130 tasks | **49 tasks** (no change) |

- **Root cause confirmed:** Architecture mismatch, not regularization. The ~300 unsolved tasks require operations (mode counting, flood fill, outline detection, pattern matching) that NO local convolution can represent, regardless of regularization.

- **Next steps:** Phase 3 (new solver types) or new architectures. The conv solver has reached its ceiling at ~25 tasks.

  ---

- ## Phase 3: Hard Tasks — Hash Matchers & Pattern Detectors (est +20-50 tasks → ~2500-3000)

- ### 3a: Hash-Based Matcher Builder
- - [ ] **Generic hash matcher**: flatten input → MatMul(hash_weights) → match → apply stored delta
-   - Requires opset 17 (ScatterND)
-   - Works for ANY task where all examples fit in a 1.44MB model
-   - Build `build_hash_matcher(task_data) → onnx_bytes`
- - [ ] **Validate**: Identify 10 tasks that no solver handles. Test hash matcher on them.
-   - Accept if it solves ≥2 tasks that are currently unsolved.

- ### 3b: Run-Length / Gap Pattern Detector
- - [ ] **Depthwise conv to detect runs of N, gap patterns** — like task096 in public notebooks
-   - Template for "count and classify" tasks
- - [ ] **Validate**: Find tasks with run-length structure. Test detector.
-   - Accept if it solves ≥2 new tasks.

- ### 3c: Per-Task LLM Rescue
- - [ ] **For ~20 hardest tasks**: feed task JSON + Python solution to LLM → get ONNX builder
-   - Priority: gravity, flood fill, outline extraction, pattern counting
- - [ ] **Validate**: Build 5 rescue models. Arc-gen validate. Accept if ≥3 pass.

  ---

- ## Phase 4: Score Optimization (est +200-500 pts on existing tasks)

- ### 4a: ONNX Optimizer Pass
- - [ ] **`onnxoptimizer.optimize()`** with dead-code elimination, identity removal
-   - Top notebooks do this; it can shrink models 5-20%
- - [ ] **Validate**: Run on all 400 models. Compare total score before/after.
-   - Accept only if total score improves by >2%.

- ### 4b: Official Scoring Alignment
- - [ ] **Use `neurogolf_utils.score_network()`** — `onnx_tool` for exact cost matching
-   - Our static profiler may diverge on edge cases
- - [ ] **Validate**: Compare static profiler vs onnx_tool on 50 random models. If divergence is >5%, fix the profiler.

  ---

 
@@ -174,12 +240,6 @@ PCR (`_solve_weights_pcr`) added as deferred 2nd-pass fallback.

  > **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."

- - ~~Blend pipeline~~ — **NOT DONE. Not our strategy.**
- - ~~Upload submission.zip as Kaggle dataset~~ — **NOT DONE.**
- - ~~Attach public datasets (24 sources)~~ — **NOT DONE.**
-
- Competitive intelligence on blending stays in LEARNING.md "What Others Do" section only.
-
  ---

  ## Experiment Log
@@ -188,52 +248,44 @@ Competitive intelligence on blending stays in LEARNING.md "What Others Do" secti
  |------|-----------|-------------|--------|----------|
  | 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep as baseline |
  | 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
- | 2026-04-26 | v5.0 refactor | 394 | **49 solved, ~603.6 score, budget=5s** | New baseline |
- | 2026-04-26 | Exp 0: Baseline | 25 conv tasks | 24/25 solved, score=253 | Baseline for conv |
- | 2026-04-26 | Exp 1: Skip ks=5,7,9 | 25 conv+30 unsolved | **HURTS 2 solved tasks** | **[-] REJECTED** |
- | 2026-04-26 | Exp 2: Best-of-N | 25 conv+30 unsolved | **No new solves** | **[~] NEUTRAL** |
- | 2026-04-26 | Exp 3: Ridge reg | 4 victims × 5 alphas | **0/4 pass arc-gen** | **[-] REJECTED** |
- | 2026-04-26 | Exp 3: PCA/trunc-SVD (partial) | Task 129 | **0 pass** | **[-] REJECTED for lstsq** |
- | 2026-04-26 | **Exp 3: Full PCA/SVD** | **400 tasks** | **0 PCR solves, 0 regressions, code refactored** | **[-] REJECTED (code kept)** |

- ### CRITICAL FINDING (2026-04-26) — STRENGTHENED

- The "307→50 arc-gen survival gap" is **NOT caused by lstsq overfitting**. Period.

- **Evidence (strengthened with full Exp 3 data):**
- 1. Only **10 of 345** unsolved same-shape tasks pass train-fit at any ks≤9.
- 2. Ridge (L2) on 4 victim tasks × 5 alphas: **zero arc-gen passes**.
- 3. PCA/truncated-SVD on 400 tasks with thresholds {0.999, 0.99, 0.95}: **zero arc-gen validates**.
- 4. PCR improves arc-gen accuracy by 3-9% on 4 unsolved tasks — but 95.7% is the ceiling. 100% is required.
- 5. For tasks where conv IS the right solver (25 tasks), lstsq already generalizes perfectly (100% arc-gen at p/n < 0.5).

- **Root cause:** Architecture mismatch. Tasks that fail arc-gen require operations (mode counting, flood fill, outline detection, conditional logic) that no local convolution can represent.

- **Impact:** Phase 2 regularization experiments are exhausted. Score improvement must come from:
- - Phase 1a: Opset 17 conversions (reduce cost on existing solved tasks)
- - Phase 3: New solver types (hash matchers, pattern detectors, LLM rescue)
- - Phase 4: ONNX optimization + scoring alignment

- ---

- ## Status Key
-
- | Symbol | Meaning |
- |--------|---------|
- | `⬜` / `[ ]` | Not started — designed, ready to implement |
- | `[~]` | In progress — experiment running |
- | `[x]` | Done — validated with arc-gen on ≥20 tasks, confirmed score increase |
- | `[!]` | Blocked — needs prerequisite or resource (e.g., GPU) |
- | `[-]` | Rejected — tested, did not improve arc-gen survival or score |

- ## Research Queue (Papers Read ✅ / To Read)

- 1. ✅ **Nakkiran et al. 2019** (`1912.02292`) — Double descent. Correct theory, inapplicable to our regime.
- 2. ✅ **Segert 2023** (`2311.11093`) — PCA > Ridge. Tested: **0/400 PCR solves**.
- 3. ✅ **Zhou & Ge 2023** (`2302.00257`) — L1 near-minimax for sparse signals. Untested.
- 4. ✅ **Liu et al. 2023** (`2302.01088`) — More rows help only with regularization. Moot since regularization doesn't help.
- 5. ✅ **Liao & Gu 2024** (`2512.06104`) — CompressARC. Different regime (MDL/KL vs conv lstsq).
- 6. ✅ **Ali et al. 2019** — GD early stopping ≡ Ridge (therefore suboptimal here)
- 7. [ ] **ARC Prize 2025 Technical Report** (`2601.10904`) — competition landscape, top approaches

- > Loop: Research → Design → Experiment → Analyze → Research → ... until score increases.
 
  # NeuroGolf Solver — Roadmap

+ > Current: v5.1 · 49 arc-gen validated (budget=5s) · ~604 score · Target: 3000+
  > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
  > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
+ > Updated: 2026-04-26 — Phase 2 (regularization) exhausted. Phase 3 redesigned from literature.
+ > **All 400 tasks count. There are NO excluded tasks.**

  ---

+ ## Current Solver Breakdown (49/400 solved)

+ | Category | Tasks | Avg Score | Solver |
+ |----------|-------|-----------|--------|
+ | Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
+ | Analytical | 24 | ~15.5 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
+ | **Unsolved** | **351** | **1.0** | — |
+ | **Total** | **400** | | **~604** |

+ The 351 unsolved tasks need fundamentally different solver architectures.

+ ---
+
+ ## Phase 1: Score Optimization on Existing Tasks (est +100-200 pts)
+
+ ### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost) ⬜
+ > Reduce MACs on the 24 analytical tasks. Currently score ~15.5 avg, target ~20+.
+
+ - [ ] Convert Gather-based solvers to Slice(step=-1) + Transpose
+   - Affected: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
+ - [ ] Validate: Full 400 arc-gen. Accept if >10% score increase on analytical tasks.
+ - **Estimate:** 24 tasks × (+5 pts avg) = **+120 pts**
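The Slice/Transpose conversions can be sanity-checked outside ONNX: a flip is a step=-1 slice, and a 90° rotation is a transpose followed by a step=-1 slice, so both are pure data movement with no MACs. A minimal numpy sketch of the equivalences (numpy slicing stands in for the ONNX Slice/Transpose ops; the real models operate on the [1,10,H,W] one-hot tensor):

```python
import numpy as np

g = np.arange(12).reshape(3, 4)  # stand-in for one channel of a grid

# Flip = Slice(step=-1) on the width axis: pure data movement, 0 MACs.
flip_h = g[:, ::-1]

# Rotate 90 degrees clockwise = Transpose, then Slice(step=-1): also 0 MACs.
rot90_cw = g.T[:, ::-1]

assert np.array_equal(flip_h, np.fliplr(g))
assert np.array_equal(rot90_cw, np.rot90(g, k=-1))
```

The same two primitives compose into rot180/rot270, which is why the whole analytical family above can drop its Gather machinery.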
+
+ ### 1b: ONNX Optimizer Pass ⬜
+ - [ ] `onnxoptimizer.optimize()` with dead-code elimination
+ - [ ] Validate: Compare scores before/after on all 49 solved tasks.
+ - **Estimate:** 49 tasks × (+1-2 pts avg) = **+50-100 pts**

  ---

+ ## Phase 2: Regularization — EXHAUSTED

+ > Exps 0-3 tested. Root cause is architecture mismatch, not overfitting.
+ > Conv ceiling = ~25 tasks. See Experiment Log below for full data.

+ ---

+ ## Phase 3: New Solver Types (the actual path to 3000+)

+ > **Research basis:** CompressARC (`2512.06104`), TRM (`2510.04871`), NCA (`2506.15746`), ONNX opset 17 operator audit.
+ > **Key insight:** ARC tasks cluster into ~8 families. Each family needs a specialized ONNX architecture. Score = max(1, 25 - ln(MACs + mem + params)), so tiny models score highest.
+ >
+ > **Honest math:** Solving 50 more tasks at ~12 pts avg = +600. Solving 100 more = +1200. To hit 3000 we need ~200 new tasks at ~12 pts avg. That's ambitious but structurally possible.
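The per-solver score estimates in this phase follow directly from the formula above. A quick sanity check, assuming the scorer takes the raw MAC/byte/param counts (mem and params are negligible next to MACs at these sizes):

```python
import math

def est_score(macs, mem=0, params=0):
    # Score = max(1, 25 - ln(MACs + mem + params)), per the formula quoted above
    return max(1.0, 25.0 - math.log(macs + mem + params))

print(round(est_score(240_000), 1))  # ~12.6 -> the "score ~12" gravity/flood estimate
print(round(est_score(16_000), 1))   # ~15.3 -> the "score ~15" edge-detect estimate
print(round(est_score(10_000), 1))   # ~15.8 -> the "score ~16" mode-color estimate
```

The logarithm is why shaving an analytical solver from ~165K MACs to ~0 is worth several points per task.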
 
 
 
 

+ ### Solver Priority Table (ordered by score × expected tasks)

+ | # | Solver | Expected Tasks | Score | Total Pts | Complexity | Key Ops |
+ |---|--------|----------------|-------|-----------|------------|---------|
+ | 1 | **Gravity (4-dir)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 shift kernel) × 30 unrolled steps + Where |
+ | 2 | **Flood Fill (BFS)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 cross kernel) + Clip × 30 steps |
+ | 3 | **Edge/Boundary Detect** | 10-20 | ~13 | 130-260 | Low | Conv(Laplacian/Sobel kernel) + threshold |
+ | 4 | **Composition (transform+recolor)** | 10-15 | ~14 | 140-210 | Low | Chain existing analytical + color_map |
+ | 5 | **Mode/Majority Color** | 5-10 | ~16 | 80-160 | Low | ReduceSum → ArgMax → Expand |
+ | 6 | **Color LUT (10×10 MatMul)** | 10-20 | ~13 | 130-260 | Low | OneHot → MatMul(W_lut) → ArgMax, lstsq-fit W_lut |
+ | 7 | **Object Copy/Offset** | 5-15 | ~12 | 60-180 | High | ScatterND + offset detection |
+ | 8 | **CumSum Analysis** | 5-10 | ~15 | 75-150 | Medium | CumSum for running totals, object extent |

+ **Conservative total: +80-150 tasks, +850-1700 pts → est LB ~1450-2300**
+ **Optimistic total: +150-200 tasks → est LB ~2400-3000**
+
+ ---
+
+ ### 3a: Gravity Solver ⬜ — Confidence: **70%**
+ > Directional pixel propagation. ~30 unrolled steps, 4 directions.
+
+ **ONNX Blueprint:**
+ ```python
+ # Per step: pull the pixel from the gravity direction, fill if empty
+ shift_k = np.zeros((1, 1, 3, 3), dtype=np.float32)
+ shift_k[0, 0, 0, 1] = 1.0  # gravity down: pull from the row above
+ for _ in range(30):
+     nodes += [
+         Conv(cur, shift_k, pads=[1, 1, 1, 1]),  # shifted copy (same-size output)
+         Equal(cur, zero),                       # is the cell empty?
+         Where(is_empty, shifted, cur),          # fill empty cells
+     ]
+ ```
+
+ **Fitting:** For each task, try all 4 directions. Detect "empty color" (usually 0). Validate against arc-gen.
+ **Cost:** ~240K MACs (30 steps × 8100 per Conv), ~4.8KB, score ~12.
+ **Implementation:** ~60 lines in `neurogolf_solver/solvers/gravity.py`
+
+ - [ ] Implement `s_gravity_unrolled(td)` for all 4 directions
+ - [ ] Detect empty color from training examples
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥3 new tasks solved
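The blueprint's step can be prototyped in numpy before building the graph (a reference sketch: `np.roll` plays the role of the shift Conv, `np.where` the Where node; note the fill-if-empty rule fills the whole column below a pixel rather than compacting objects, which is exactly what the graph above computes):

```python
import numpy as np

def gravity_down(grid, empty=0, steps=30):
    """Unrolled gravity-down: each step, every empty cell copies the cell above."""
    g = grid.copy()
    for _ in range(steps):
        shifted = np.roll(g, 1, axis=0)       # shift the grid down one row (the Conv)
        shifted[0, :] = empty                 # nothing enters from above the top edge
        g = np.where(g == empty, shifted, g)  # the Where node: fill empty cells only
    return g
```

Running the same function on the flipped/transposed grid gives the other three directions, which is how `s_gravity_unrolled` would cover all four with one kernel.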
+
+ ---

+ ### 3b: Flood Fill Solver ⬜ — Confidence: **60%**
+ > BFS via unrolled Conv. Seeds propagate through passable cells.
+
+ **ONNX Blueprint:**
+ ```python
+ # 30-step BFS. Seed starts at one color, spreads through another.
+ cross_k = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=np.float32).reshape(1, 1, 3, 3)
+ for _ in range(30):
+     nodes += [
+         Conv(cur, cross_k, pads=[1,1,1,1]),  # expand frontier
+         Clip(expanded, 0, 1),                # saturate
+         Mul(clipped, obstacle_mask),         # block walls
+         Add(cur, masked),                    # accumulate
+         Clip(acc, 0, 1),                     # final saturate
+     ]
+ ```
+
+ **Fitting:** Learn seed_selector (10 weights: which input color is seed) + obstacle_selector (10 weights: which colors are passable). Fit via lstsq on training examples.
+ **Cost:** ~240K MACs, ~4.9KB, score ~12.
+ **Implementation:** ~80 lines in `neurogolf_solver/solvers/flood.py`
+
+ - [ ] Implement `s_flood_fill(td)` with parameterized seed/obstacle selection
+ - [ ] Fit selectors via lstsq
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥2 new tasks solved
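The unrolled BFS can be checked with a numpy stand-in (a sketch: four `np.roll` shifts replace the cross-kernel Conv and `np.clip` the Clip nodes; `seed_mask` and `passable_mask` are the binary maps the fitted selectors would produce):

```python
import numpy as np

def flood(seed_mask, passable_mask, steps=30):
    """Unrolled BFS: dilate the frontier, mask by passable cells, accumulate."""
    reach = np.clip(seed_mask.astype(np.float32), 0, 1)
    for _ in range(steps):
        # Cross-kernel dilation of the frontier (== Conv with the 3x3 cross kernel)
        up = np.roll(reach, -1, 0);    up[-1, :] = 0
        down = np.roll(reach, 1, 0);   down[0, :] = 0
        left = np.roll(reach, -1, 1);  left[:, -1] = 0
        right = np.roll(reach, 1, 1);  right[:, 0] = 0
        expanded = np.clip(up + down + left + right, 0, 1)       # saturate
        reach = np.clip(reach + expanded * passable_mask, 0, 1)  # block walls, accumulate
    return reach
```

With 30 steps the frontier travels at most 30 cells, which covers any 30×30 ARC grid path that does not zigzag more than that; deeper mazes would need more unrolled steps.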

+ ---

+ ### 3c: Edge/Boundary Detection ⬜ — Confidence: **75%**
+ > Laplacian/Sobel convolution to detect boundaries between colors.
+
+ **ONNX Blueprint:**
+ ```python
+ # Laplacian kernel detects any color boundary
+ lap_k = np.array([[0,-1,0],[-1,4,-1],[0,-1,0]], dtype=np.float32)
+ nodes = [
+     Conv(input, idx_w),                      # 1x1 Conv, weights 0..9: one-hot -> [1,1,H,W] color-index map
+     Conv(intensity, lap_k, pads=[1,1,1,1]),  # edge response
+     Greater(Abs(response), threshold),       # binary edge map (response can be negative)
+     Cast(binary, FLOAT),                     # to float
+     # Then: assign edge_color via Mul + Add
+ ]
+ ```
+
+ **Fitting:** Detect edge_color and background_color from training pairs. Many ARC tasks ask "draw the outline of the shape."
+ **Cost:** ~16K MACs, ~1KB, score ~15.
+ **Implementation:** ~40 lines in `neurogolf_solver/solvers/edge.py`
+
+ - [ ] Implement `s_edge_detect(td)` with Laplacian + Sobel variants
+ - [ ] Fit edge/background colors from examples
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥2 new tasks solved
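A numpy reference for the edge map (a sketch: the 4-neighbour Laplacian applied to the color-index map; replicate-padding avoids the spurious border response that zero-padding would give, and the absolute value catches both signs of the boundary response):

```python
import numpy as np

def laplacian_edges(grid):
    """Binary map of pixels touching a color boundary (fires on both sides)."""
    g = grid.astype(np.float32)
    p = np.pad(g, 1, mode='edge')  # replicate-pad so the grid border is quiet
    resp = 4 * g - p[:-2, 1:-1] - p[2:, 1:-1] - p[1:-1, :-2] - p[1:-1, 2:]
    return (np.abs(resp) > 0).astype(np.int64)
```

Because both sides of every boundary fire, the fitting stage still has to decide which side receives edge_color (e.g., only cells whose own color is the shape color).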

+ ---

+ ### 3d: Composition Detectors ⬜ — Confidence: **65%**
+ > Chain existing analytical solvers: rotate+recolor, flip+recolor, etc.

+ **Approach:** For each task, try all (transform × color_map) pairs. If the composition matches all train + arc-gen examples, emit a combined ONNX graph.

+ - [ ] Scan 400 tasks: for each, apply all transforms, then check if color_map fixes remainder
+ - [ ] Build ONNX graph that chains transform + color_map nodes
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥3 new tasks solved
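The scan step can be sketched as a brute-force search (hypothetical helper; `find_composition` returns the first transform under which a single consistent color map explains every training pair):

```python
import numpy as np

TRANSFORMS = {
    'identity':  lambda g: g,
    'rot90':     lambda g: np.rot90(g, -1),
    'rot180':    lambda g: np.rot90(g, 2),
    'rot270':    lambda g: np.rot90(g, 1),
    'flip_h':    lambda g: g[:, ::-1],
    'flip_v':    lambda g: g[::-1, :],
    'transpose': lambda g: g.T,
}

def find_composition(pairs):
    """Return (transform_name, color_map) solving every (input, output) pair, or None."""
    for name, t in TRANSFORMS.items():
        cmap, ok = {}, True
        for inp, out in pairs:
            ti = t(inp)
            if ti.shape != out.shape:
                ok = False
                break
            for a, b in zip(ti.ravel().tolist(), out.ravel().tolist()):
                if cmap.setdefault(a, b) != b:  # color a must always map to the same b
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return name, cmap
    return None
```

A hit gives both halves of the combined graph directly: the transform picks the existing analytical builder, and `cmap` parameterizes the color_map node.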
 
 
 

+ ---

+ ### 3e: Mode/Majority Color Solver ⬜ — Confidence: **80%**
+ > Output = most common color in input (or region).

+ **ONNX Blueprint:**
+ ```python
+ # ~543 bytes, 13 params, ~10K MACs, score ~16
+ nodes = [
+     ReduceSum(input, axes=[2, 3]),  # sum over spatial dims -> [1,10] histogram
+     ArgMax(hist, axis=1),           # most common color index
+     # Expand to full grid, one-hot encode
+ ]
+ ```

+ **Fitting:** Check training pairs: does output = constant fill of mode color? Also try per-row/per-col mode.
+ **Implementation:** ~30 lines

+ - [ ] Implement `s_mode_color(td)` — global, per-row, per-col variants
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥1 new task solved
 
187
  ---
188
 
189
+ ### 3f: Color LUT (10Γ—10 MatMul) ⬜ β€” Confidence: **70%**
190
+ > General color→color mapping via learned 10×10 weight matrix.
191
 
192
+ Already have `s_color_map` for permutations + Conv 1Γ—1 for non-permutations. This extends to position-dependent color transforms by stacking spatial features.
 
 
 
 
193
 
194
+ **Fitting:** `W_lut = lstsq(OneHot(input_pixels), OneHot(output_pixels))`
195
 
196
+ - [ ] Implement `s_color_lut(td)` using OneHot β†’ MatMul β†’ ArgMax
197
+ - [ ] Compare with existing color_map solver β€” keep if it solves additional tasks
198
+ - [ ] Validate on 400 tasks
199
+ - **Accept if:** β‰₯2 new tasks beyond existing color_map
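The lstsq fit is the conv solver's trick restricted to 1×1 context (a sketch with a hypothetical `fit_color_lut`; for colors never seen in training, lstsq's minimum-norm solution leaves an all-zero LUT row, so their mapping silently defaults to color 0):

```python
import numpy as np

def fit_color_lut(inp, out, n_colors=10):
    X = np.eye(n_colors)[inp.ravel()]  # OneHot(input_pixels), one row per pixel
    Y = np.eye(n_colors)[out.ravel()]  # OneHot(output_pixels)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                           # the 10x10 LUT for the MatMul node

inp = np.array([[1, 2], [2, 0]])
out = np.array([[5, 7], [7, 0]])       # consistent mapping 1->5, 2->7, 0->0
W = fit_color_lut(inp, out)
pred = (np.eye(10)[inp.ravel()] @ W).argmax(axis=1).reshape(inp.shape)
```

If the training pairs are color-consistent, `pred` reproduces `out` exactly; if they are not, the residual from lstsq is a cheap rejection test before arc-gen validation.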

  ---

+ ### 3g: CumSum-Based Analysis ⬜ — Confidence: **50%**
+ > Running sums for object extent, counting, filling. Key op from CompressARC.

+ **ONNX Blueprint:**
+ ```python
+ # CumSum along axis 2 (rows) -> running sum down each column
+ axis_tensor = from_array(np.array(2, dtype=np.int64), 'axis')
+ nodes = [CumSum(input_channel, axis_tensor)]
+ ```

+ **Use cases:** "Fill everything below the topmost pixel of each color", "count pixels per row", object bounding boxes.

+ - [ ] Prototype CumSum-based solver for specific task families
+ - [ ] Validate on 400 tasks
+ - **Accept if:** ≥1 new task solved
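The first use case has a one-line numpy analogue (a sketch; axis 0 here corresponds to axis 2 of the [1,10,H,W] tensor in the blueprint):

```python
import numpy as np

def fill_below_topmost(grid, color):
    """Paint every cell at or below the topmost occurrence of `color` in its column."""
    mask = (grid == color)
    below = np.cumsum(mask, axis=0) > 0  # CumSum down each column: >0 from the first hit on
    result = grid.copy()
    result[below] = color
    return result
```

Thresholding the running sum at other values gives the counting variants ("the N-th pixel per row"), and a CumSum from both ends brackets an object's extent for bounding boxes.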
 

  ---

+ ## Phase 4: Score Optimization (est +50-100 pts)

+ ### 4a: Best-of-N Model Selection ⬜
+ > For each task, try ALL ks values + ALL solver types, keep cheapest valid model.

+ - [ ] Refactor `solve_task` to collect all valid candidates, pick lowest cost
+ - [ ] Validate: Compare total score before/after
+ - **Accept if:** ≥3% total score improvement
+
+ ### 4b: Official Scoring Alignment ⬜
+ > Use `onnx_tool` for exact cost matching with Kaggle scorer.
+
+ - [ ] Compare static profiler vs onnx_tool on all solved models
+ - [ ] Fix divergences
+ - **Accept if:** divergence <2% on all models

  ---

  > **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."

  ---

  ## Experiment Log

  |------|-----------|-------------|--------|----------|
  | 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep as baseline |
  | 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
+ | 2026-04-26 | v5.0 refactor | 400 | **49 solved, ~603.6 score, budget=5s** | New baseline |
+ | 2026-04-26 | Exp 1: Skip ks=5,7,9 | 55 | **HURTS 2 solved tasks** | **[-] REJECTED** |
+ | 2026-04-26 | Exp 2: Best-of-N | 55 | **No new solves** | **[~] NEUTRAL** |
+ | 2026-04-26 | Exp 3: Ridge reg | 4 victims | **0/4 pass arc-gen** | **[-] REJECTED** |
+ | 2026-04-26 | **Exp 3: Full PCA/SVD** | **400 tasks** | **0 PCR solves, 0 regressions** | **[-] REJECTED** |

+ ### CRITICAL FINDING (2026-04-26)

+ The 351 unsolved tasks fail because **conv is the wrong architecture**, not because of bad regularization. Score improvement requires new solver types (Phase 3), not fixing conv.

+ ---

+ ## Realistic Projections

+ | Milestone | Solved | Score | How |
+ |-----------|--------|-------|-----|
+ | **Current** | **49** | **~604** | — |
+ | + Phase 1 (score opt) | 49 | ~750-800 | Opset 17 conversions + ONNX optimizer |
+ | + 3c edge detect | 55-65 | ~900-1000 | Laplacian/Sobel conv |
+ | + 3d composition | 60-75 | ~1000-1150 | Transform+recolor chains |
+ | + 3a gravity | 70-90 | ~1150-1400 | 4-dir unrolled Conv+Where |
+ | + 3b flood fill | 80-110 | ~1300-1700 | Unrolled BFS |
+ | + 3e-g (mode, LUT, cumsum) | 90-130 | ~1500-2000 | Various analytical |
+ | **Stretch: all Phase 3** | **130-200** | **~1800-2800** | Everything above working |

+ **3000+ requires ~200+ solved tasks.** Achievable only if most Phase 3 solvers work AND we find additional task families to target. Honest range: **1500-2500 LB.**

+ ---

+ ## Research Queue

+ 1. ✅ Nakkiran 2019 — double descent (inapplicable)
+ 2. ✅ Segert 2023 — PCA > Ridge (0/400 PCR solves)
+ 3. ✅ CompressARC 2024 — MDL principle, CumMax/ReduceSum architecture
+ 4. ✅ TRM 2025 — recursive reasoning, 45% ARC-AGI-1
+ 5. ✅ NCA 2025 — cellular automata, fails at global coordination
+ 6. ✅ ARC Prize 2025 Tech Report — competition landscape
+ 7. [ ] **Task taxonomy:** Classify all 351 unsolved tasks by family → prioritize solvers
+ 8. [ ] **Top Kaggle non-blending notebooks** — implementation details

+ > **Next action:** Classify the 351 unsolved tasks to validate the Phase 3 task count estimates before building anything.