rogermt committed on
Commit
97af2d2
·
verified ·
1 Parent(s): 17e36c1

Rewrite Phase 3: merged expert + original solvers, organized by architecture type, honest estimates

Files changed (1)
  1. TODO.md +113 -216
TODO.md CHANGED
@@ -1,291 +1,188 @@
1
  # NeuroGolf Solver — Roadmap
2
 
3
- > Current: v5.1 · 49 arc-gen validated (budget=5s) · ~604 score · Target: 3000+
4
  > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
5
  > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
6
- > Updated: 2026-04-26 - Phase 2 (regularization) exhausted. Phase 3 redesigned from literature.
7
- > **All 400 tasks count. There are NO excluded tasks.**
8
 
9
  ---
10
 
11
- ## Current Solver Breakdown (49/400 solved)
12
 
13
- | Category | Tasks | Avg Score | Solver |
14
- |----------|-------|-----------|--------|
15
- | Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
16
- | Analytical | 24 | ~15.5 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
17
- | **Unsolved** | **351** | **1.0** | |
18
- | **Total** | **400** | | **~604** |
19
-
20
- The 351 unsolved tasks need fundamentally different solver architectures.
21
 
22
  ---
23
 
24
- ## Phase 1: Score Optimization on Existing Tasks (est +100-200 pts)
25
-
26
- ### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost) ⬜
27
- > Reduce MACs on the 24 analytical tasks. Currently score ~15.5 avg, target ~20+.
28
 
29
- - [ ] Convert Gather-based solvers to Slice(step=-1) + Transpose
30
- - Affected: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
31
- - [ ] Validate: Full 400 arc-gen. Accept if >10% score increase on analytical tasks.
32
- - **Estimate:** 24 tasks × (+5 pts avg) = **+120 pts**
33
 
34
  ### 1b: ONNX Optimizer Pass ⬜
35
- - [ ] `onnxoptimizer.optimize()` with dead-code elimination
36
- - [ ] Validate: Compare scores before/after on all 49 solved tasks.
37
- - **Estimate:** 49 tasks × (+1-2 pts avg) = **+50-100 pts**
38
 
39
  ---
40
 
41
  ## Phase 2: Regularization — EXHAUSTED
42
 
43
- > Exps 0-3 tested. Root cause is architecture mismatch, not overfitting.
44
- > Conv ceiling = ~25 tasks. See Experiment Log below for full data.
45
 
46
  ---
47
 
48
- ## Phase 3: New Solver Types (the actual path to 3000+)
49
-
50
- > **Research basis:** CompressARC (`2512.06104`), TRM (`2510.04871`), NCA (`2506.15746`), ONNX opset 17 operator audit.
51
- > **Key insight:** ARC tasks cluster into ~8 families. Each family needs a specialized ONNX architecture. Score = max(1, 25 - ln(MACs + mem + params)), so tiny models score highest.
52
- >
53
- > **Honest math:** Solving 50 more tasks at ~12 pts avg = +600. Solving 100 more = +1200. To hit 3000 we need ~200 new tasks at ~12 pts avg. That's ambitious but structurally possible.
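
The scoring rule quoted above can be checked numerically. The sketch below is a minimal illustration, assuming `mem` is measured in bytes and that the three cost terms are simply summed as written; the function name `neurogolf_score` is hypothetical and this is not the official Kaggle scorer.

```python
import math


def neurogolf_score(macs: float, mem_bytes: float, params: float) -> float:
    """Per-task score from the quoted rule: max(1, 25 - ln(MACs + mem + params))."""
    cost = macs + mem_bytes + params
    if cost <= 0:
        # ln is undefined here; how the official scorer treats this case is not stated.
        raise ValueError("cost must be positive")
    return max(1.0, 25.0 - math.log(cost))


# Gravity estimate used in this roadmap: ~240K MACs plus ~4.8 KB.
print(round(neurogolf_score(240_000, 4_800, 0), 2))
# Edge-detect estimate used in this roadmap: ~16K MACs plus ~1 KB.
print(round(neurogolf_score(16_000, 1_000, 0), 2))
```

Under these assumptions the two estimates come out at roughly 12.6 and 15.3, in line with the "score ~12" and "score ~15" figures quoted for those solvers. Because the cost enters through a logarithm, halving MACs gains only about 0.7 points per task.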
54
 
55
- ### Solver Priority Table (ordered by score × expected tasks)
 
56
 
57
- | # | Solver | Expected Tasks | Score | Total Pts | Complexity | Key Ops |
58
- |---|--------|---------------|-------|-----------|------------|---------|
59
- | 1 | **Gravity (4-dir)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 shift kernel) × 30 unrolled steps + Where |
60
- | 2 | **Flood Fill (BFS)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 cross kernel) + Clip × 30 steps |
61
- | 3 | **Edge/Boundary Detect** | 10-20 | ~13 | 130-260 | Low | Conv(Laplacian/Sobel kernel) + threshold |
62
- | 4 | **Composition (transform+recolor)** | 10-15 | ~14 | 140-210 | Low | Chain existing analytical + color_map |
63
- | 5 | **Mode/Majority Color** | 5-10 | ~16 | 80-160 | Low | ReduceSum → ArgMax → Expand |
64
- | 6 | **Color LUT (10×10 MatMul)** | 10-20 | ~13 | 130-260 | Low | OneHot → MatMul(W_lut) → ArgMax, lstsq-fit W_lut |
65
- | 7 | **Object Copy/Offset** | 5-15 | ~12 | 60-180 | High | ScatterND + offset detection |
66
- | 8 | **CumSum Analysis** | 5-10 | ~15 | 75-150 | Medium | CumSum for running totals, object extent |
67
 
68
- **Conservative total: +80-150 tasks, +850-1700 pts → est LB ~1450-2300**
69
- **Optimistic total: +150-200 tasks → est LB ~2400-3000**
70
 
71
- ---
72
 
73
- ### 3a: Gravity Solver - Confidence: **70%**
74
- > Directional pixel propagation. ~30 unrolled steps, 4 directions.
75
-
76
- **ONNX Blueprint:**
77
- ```python
78
- # Per step: pull pixel from direction, fill if empty
79
- shift_k = np.zeros((1,1,3,3), dtype=np.float32)
80
- shift_k[0,0,0,1] = 1.0 # gravity down: pull from row above
81
- for i in range(30):
82
- nodes += [
83
- Conv(cur, shift_k, pads=[1,1,0,0]), # shifted copy
84
- Equal(cur, zero), # is cell empty?
85
- Where(is_empty, shifted, cur), # fill empty cells
86
- ]
87
- ```
88
-
89
- **Fitting:** For each task, try all 4 directions. Detect "empty color" (usually 0). Validate against arc-gen.
90
- **Cost:** ~240K MACs (30 steps × 8100 per Conv), ~4.8KB, score ~12.
91
- **Implementation:** ~60 lines in `neurogolf_solver/solvers/gravity.py`
92
-
93
- - [ ] Implement `s_gravity_unrolled(td)` for all 4 directions
94
- - [ ] Detect empty color from training examples
95
- - [ ] Validate on 400 tasks
96
- - **Accept if:** ≥3 new tasks solved
97
 
98
  ---
99
 
100
- ### 3b: Flood Fill Solver ⬜ — Confidence: **60%**
101
- > BFS via unrolled Conv. Seeds propagate through passable cells.
102
-
103
- **ONNX Blueprint:**
104
- ```python
105
- # 30-step BFS. Seed starts at one color, spreads through another.
106
- cross_k = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=np.float32).reshape(1,1,3,3)
107
- for i in range(30):
108
- nodes += [
109
- Conv(cur, cross_k, pads=[1,1,1,1]), # expand frontier
110
- Clip(expanded, 0, 1), # saturate
111
- Mul(clipped, obstacle_mask), # block walls
112
- Add(cur, masked), # accumulate
113
- Clip(sum, 0, 1), # final saturate
114
- ]
115
- ```
116
-
117
- **Fitting:** Learn seed_selector (10 weights: which input color is seed) + obstacle_selector (10 weights: which colors are passable). Fit via lstsq on training examples.
118
- **Cost:** ~240K MACs, ~4.9KB, score ~12.
119
- **Implementation:** ~80 lines in `neurogolf_solver/solvers/flood.py`
120
-
121
- - [ ] Implement `s_flood_fill(td)` with parameterized seed/obstacle selection
122
- - [ ] Fit selectors via lstsq
123
- - [ ] Validate on 400 tasks
124
- - **Accept if:** ≥2 new tasks solved
125
 
126
- ---
127
 
128
- ### 3c: Edge/Boundary Detection - Confidence: **75%**
129
- > Laplacian/Sobel convolution to detect boundaries between colors.
130
-
131
- **ONNX Blueprint:**
132
- ```python
133
- # Laplacian kernel detects any color boundary
134
- lap_k = np.array([[0,-1,0],[-1,4,-1],[0,-1,0]], dtype=np.float32)
135
- nodes = [
136
- ReduceSum(input, axes=[1]), # collapse channels to [1,1,H,W] intensity
137
- Conv(intensity, lap_k, pads=[1,1,1,1]), # edge response
138
- Greater(response, threshold), # binary edge map
139
- Cast(binary, FLOAT), # to float
140
- # Then: assign edge_color via Mul + Add
141
- ]
142
- ```
143
-
144
- **Fitting:** Detect edge_color and background_color from training pairs. Many ARC tasks ask "draw the outline of the shape."
145
- **Cost:** ~16K MACs, ~1KB, score ~15.
146
- **Implementation:** ~40 lines in `neurogolf_solver/solvers/edge.py`
147
-
148
- - [ ] Implement `s_edge_detect(td)` with Laplacian + Sobel variants
149
- - [ ] Fit edge/background colors from examples
150
- - [ ] Validate on 400 tasks
151
- - **Accept if:** ≥2 new tasks solved
152
 
153
  ---
154
 
155
- ### 3d: Composition Detectors ⬜ — Confidence: **65%**
156
- > Chain existing analytical solvers: rotate+recolor, flip+recolor, etc.
157
 
158
- **Approach:** For each task, try all (transform × color_map) pairs. If the composition matches all train+arc-gen examples, emit combined ONNX graph.
159
 
160
- - [ ] Scan 400 tasks: for each, apply all transforms, then check if color_map fixes remainder
161
- - [ ] Build ONNX graph that chains transform + color_map nodes
162
- - [ ] Validate on 400 tasks
163
- - **Accept if:** ≥3 new tasks solved
 
164
 
165
  ---
166
 
167
- ### 3e: Mode/Majority Color Solver ⬜ — Confidence: **80%**
168
- > Output = most common color in input (or region).
169
 
170
- **ONNX Blueprint:**
171
- ```python
172
- # ~543 bytes, 13 params, ~10K MACs, score ~16
173
- nodes = [
174
- ReduceSum(input, axes=[2,3]), # sum over spatial → [1,10] histogram
175
- ArgMax(hist, axis=1), # most common color index
176
- # Expand to full grid, one-hot encode
177
- ]
178
- ```
179
 
180
- **Fitting:** Check training pairs: does output = constant fill of mode color? Also try per-row/per-col mode.
181
- **Implementation:** ~30 lines
182
-
183
- - [ ] Implement `s_mode_color(td)` global, per-row, per-col variants
184
- - [ ] Validate on 400 tasks
185
- - **Accept if:** ≥1 new task solved
186
 
187
  ---
188
 
189
- ### 3f: Color LUT (10×10 MatMul) ⬜ — Confidence: **70%**
190
- > General color→color mapping via learned 10×10 weight matrix.
191
-
192
- Already have `s_color_map` for permutations + Conv 1×1 for non-permutations. This extends to position-dependent color transforms by stacking spatial features.
193
 
194
- **Fitting:** `W_lut = lstsq(OneHot(input_pixels), OneHot(output_pixels))`
195
 
196
- - [ ] Implement `s_color_lut(td)` using OneHot → MatMul → ArgMax
197
- - [ ] Compare with existing color_map solver — keep if it solves additional tasks
198
- - [ ] Validate on 400 tasks
199
- - **Accept if:** ≥2 new tasks beyond existing color_map
 
200
 
201
  ---
202
 
203
- ### 3g: CumSum-Based Analysis - Confidence: **50%**
204
- > Running sums for object extent, counting, filling. Key op from CompressARC.
205
-
206
- **ONNX Blueprint:**
207
- ```python
208
- # CumSum along axis 2 (rows) → running sum per column
209
- axis_tensor = from_array(np.int64(2), 'axis')
210
- nodes = [CumSum(input_channel, axis_tensor)]
211
- ```
212
 
213
- **Use cases:** "Fill everything below the topmost pixel of each color", "count pixels per row", object bounding boxes.
 
 
 
 
214
 
215
- - [ ] Prototype CumSum-based solver for specific task families
216
- - [ ] Validate on 400 tasks
217
- - **Accept if:** ≥1 new task solved
218
-
219
- ---
220
 
221
- ## Phase 4: Score Optimization (est +50-100 pts)
 
222
 
223
- ### 4a: Best-of-N Model Selection
224
- > For each task, try ALL ks values + ALL solver types, keep cheapest valid model.
225
 
226
- - [ ] Refactor `solve_task` to collect all valid candidates, pick lowest cost
227
- - [ ] Validate: Compare total score before/after
228
- - **Accept if:** ≥3% total score improvement
229
 
230
- ### 4b: Official Scoring Alignment ⬜
231
- > Use `onnx_tool` for exact cost matching with Kaggle scorer.
232
 
233
- - [ ] Compare static profiler vs onnx_tool on all solved models
234
- - [ ] Fix divergences
235
- - **Accept if:** divergence <2% on all models
236
 
237
- ---
238
 
239
- ## BLENDING EXPLICITLY EXCLUDED
 
 
 
 
 
240
 
241
- > **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."
 
 
 
 
 
242
 
243
  ---
244
 
245
- ## Experiment Log
246
 
247
- | Date | Experiment | Tasks Tested | Result | Decision |
248
- |------|-----------|-------------|--------|----------|
249
- | 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep as baseline |
250
- | 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
251
- | 2026-04-26 | v5.0 refactor | 400 | **49 solved, ~603.6 score, budget=5s** | New baseline |
252
- | 2026-04-26 | Exp 1: Skip ks=5,7,9 | 55 | **HURTS 2 solved tasks** | **[-] REJECTED** |
253
- | 2026-04-26 | Exp 2: Best-of-N | 55 | **No new solves** | **[~] NEUTRAL** |
254
- | 2026-04-26 | Exp 3: Ridge reg | 4 victims | **0/4 pass arc-gen** | **[-] REJECTED** |
255
- | 2026-04-26 | **Exp 3: Full PCA/SVD** | **400 tasks** | **0 PCR solves, 0 regressions** | **[-] REJECTED** |
256
 
257
- ### CRITICAL FINDING (2026-04-26)
258
 
259
- The 351 unsolved tasks fail because **conv is the wrong architecture**, not because of bad regularization. Score improvement requires new solver types (Phase 3), not fixing conv.
260
 
261
  ---
262
 
263
- ## Realistic Projections
264
-
265
- | Milestone | Solved | Score | How |
266
- |-----------|--------|-------|-----|
267
- | **Current** | **49** | **~604** | — |
268
- | + Phase 1 (score opt) | 49 | ~750-800 | Opset 17 conversions + ONNX optimizer |
269
- | + 3c edge detect | 55-65 | ~900-1000 | Laplacian/Sobel conv |
270
- | + 3d composition | 60-75 | ~1000-1150 | Transform+recolor chains |
271
- | + 3a gravity | 70-90 | ~1150-1400 | 4-dir unrolled Conv+Where |
272
- | + 3b flood fill | 80-110 | ~1300-1700 | Unrolled BFS |
273
- | + 3e-g (mode, LUT, cumsum) | 90-130 | ~1500-2000 | Various analytical |
274
- | **Stretch: all Phase 3** | **130-200** | **~1800-2800** | Everything above working |
275
 
276
- **3000+ requires ~200+ solved tasks.** Achievable only if most Phase 3 solvers work AND we find additional task families to target. Honest range: **1500-2500 LB.**
 
 
 
 
 
 
277
 
278
  ---
279
 
280
  ## Research Queue
281
 
282
- 1. ✅ Nakkiran 2019: double descent (inapplicable)
283
- 2. ✅ Segert 2023: PCA > Ridge (0/400 PCR solves)
284
- 3. ✅ CompressARC 2024: MDL principle, CumMax/ReduceSum architecture
285
- 4. ✅ TRM 2025: recursive reasoning, 45% ARC-AGI-1
286
- 5. ✅ NCA 2025: cellular automata, fails at global coordination
287
- 6. ARC Prize 2025 Tech Report: competition landscape
288
- 7. [ ] **Task taxonomy:** Classify all 351 unsolved tasks by family → prioritize solvers
289
- 8. [ ] **Top Kaggle non-blending notebooks** — implementation details
290
-
291
- > **Next action:** Classify the 351 unsolved tasks to validate the Phase 3 task count estimates before building anything.
 
1
  # NeuroGolf Solver — Roadmap
2
 
3
+ > Current: v5.2 · 51 Kaggle validated · LB 594.84 · Target: 3000+
4
  > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
5
  > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
6
+ > Updated: 2026-04-27 - LB 594.84 confirmed. Phase 3 redesigned from expert review + literature.
7
+ > **All 400 tasks count. There are NO excluded tasks. Unsolved = 1.0 pt (Kaggle adds automatically).**
8
 
9
  ---
10
 
11
+ ## Current Solver Breakdown (51/400 solved, LB 594.84)
12
 
13
+ | Category | Tasks | Solvers |
14
+ |----------|-------|---------|
15
+ | Conv (lstsq) | 25 | conv_fixed, conv_var, conv_diff, conv_var_diff |
16
+ | Analytical | 24 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
17
+ | Gravity | 1 | gravity_unrolled (Task 78) |
18
+ | Mode fill | 1 | mode_fill (Task 129) |
19
+ | **Unsolved** | **349** | — |
 
20
 
21
  ---
22
 
23
+ ## Phase 1: Score Optimization on Existing Tasks
 
 
 
24
 
25
+ ### 1a: Opset 17 Slice-Based Analytical Solvers ⬜
26
+ > Convert Gather-based solvers to Slice(step=-1) + Transpose for ~0 MACs.
 
 
27
 
28
  ### 1b: ONNX Optimizer Pass ⬜
29
+ > `onnxoptimizer.optimize()` for dead-code elimination.
 
 
30
 
31
  ---
32
 
33
  ## Phase 2: Regularization — EXHAUSTED
34
 
35
+ > Exps 0-3 tested. Architecture mismatch, not overfitting. Conv ceiling = ~25 tasks.
 
36
 
37
  ---
38
 
39
+ ## Phase 3: New Solver Types
 
 
 
 
 
40
 
41
+ > Organized by architecture type. Each solver is a separate .py file.
42
+ > **Build rule:** Scan for matches FIRST, build only what has hits, validate on arc-gen.
43
 
44
+ ---
 
 
 
 
 
 
 
 
 
45
 
46
+ ### Category A: Static Spatial Remapping (Gather/Slice/Pad)
 
47
 
48
+ These are cheap, zero/low-MAC solvers that use precomputed index mappings. Highest score per task. Build these first.
49
 
50
+ | # | Solver | Pattern | Key Ops | Status |
51
+ |---|--------|---------|---------|--------|
52
+ | A1 | `extract_inner` | Remove N-pixel border frame → smaller output | Gather | ⬜ |
53
+ | A2 | `add_border` | Add constant-color border → larger output | Gather+const | ⬜ |
54
+ | A3 | `pad_align` | Input pasted into larger canvas at fixed offset | Gather+const | ⬜ |
55
+ | A4 | `downsample_stride` | `out[r,c] = inp[r*sH, c*sW]` | Gather | ⬜ |
56
+ | A5 | `extract_and_tile` | Find smallest repeating unit, tile to fill output | Gather | ⬜ |
57
+ | A6 | `sparse_fill` | Each non-zero pixel becomes NxN block | Gather | ⬜ |
58
+ | A7 | `symmetry_complete` | Mirror sparse data to complete L-R or T-B symmetry | Gather | ⬜ |
59
+ | A8 | `multi_stamp` | Union of shifted copies of input at fixed offsets | Gather+Add | ⬜ |
60
+ | A9 | `affine_remap` | General integer coordinate remap: stride+offset, axis swap | Gather | ⬜ |
61
+ | A10 | `crop_paste` | Crop from input, paste at different position in output | Gather+const | ⬜ |
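
The "precomputed index mapping" idea behind Category A can be sketched in NumPy before any ONNX graph is built. This is an illustrative sketch only: the helper names `build_index` and `apply_index` are hypothetical, grids are plain 2-D integer arrays rather than one-hot tensors, and the flat index array stands in for the constant that a Gather node would consume.

```python
import numpy as np


def build_index(in_shape, out_shape, src_of):
    """Flat input index for each output cell; -1 marks a cell filled by a constant."""
    in_h, in_w = in_shape
    out_h, out_w = out_shape
    idx = np.full(out_h * out_w, -1, dtype=np.int64)
    for r in range(out_h):
        for c in range(out_w):
            sr, sc = src_of(r, c)
            if 0 <= sr < in_h and 0 <= sc < in_w:
                idx[r * out_w + c] = sr * in_w + sc
    return idx


def apply_index(grid, idx, out_shape, fill=0):
    """Gather from the flattened input; cells with index -1 receive `fill`."""
    flat = grid.reshape(-1)
    gathered = flat[np.where(idx >= 0, idx, 0)]
    return np.where(idx >= 0, gathered, fill).reshape(out_shape)


def extract_inner(grid, n):
    """A1: drop an n-pixel frame."""
    h, w = grid.shape
    out_shape = (h - 2 * n, w - 2 * n)
    idx = build_index(grid.shape, out_shape, lambda r, c: (r + n, c + n))
    return apply_index(grid, idx, out_shape)


def add_border(grid, n, color):
    """A2: surround the input with an n-pixel constant-color frame."""
    h, w = grid.shape
    out_shape = (h + 2 * n, w + 2 * n)
    idx = build_index(grid.shape, out_shape, lambda r, c: (r - n, c - n))
    return apply_index(grid, idx, out_shape, fill=color)


def downsample_stride(grid, s_h, s_w):
    """A4: out[r, c] = inp[r * s_h, c * s_w]."""
    h, w = grid.shape
    out_shape = ((h + s_h - 1) // s_h, (w + s_w - 1) // s_w)
    idx = build_index(grid.shape, out_shape, lambda r, c: (r * s_h, c * s_w))
    return apply_index(grid, idx, out_shape)
```

Since every solver in this sketch differs only in its `src_of` function, a task scan can test each candidate mapping against the training pairs first and emit a graph only for mappings that match.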
 
 
 
 
 
 
 
 
 
 
 
 
62
 
63
  ---
64
 
65
+ ### Category B: Channel/Color Operations
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
66
 
67
+ Color-level transforms that work in the 10-channel one-hot space.
68
 
69
+ | # | Solver | Pattern | Key Ops | Status |
70
+ |---|--------|---------|---------|--------|
71
+ | B1 | `channel_filter` | Keep only certain colors, rest → background | Mul(mask [1,10,1,1]) | ⬜ |
72
+ | B2 | `overlay_constant` | Input + fixed pixel pattern overlaid | Add or Where + constant tensor | ⬜ |
73
+ | B3 | `fill_bg_with_mode` | Background pixels filled with dominant color, non-bg unchanged | ReduceSum→ArgMax→Where | ⬜ |
74
+ | B4 | `row_mode_fill` | Each row filled with its dominant color | ReduceSum(width)→ArgMax→Tile(width) | ⬜ |
75
+ | B5 | `col_mode_fill` | Each column filled with its dominant color | ReduceSum(height)→ArgMax→Tile(height) | ⬜ |
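
The ReduceSum→ArgMax pattern in rows B3 and B4 can be prototyped in NumPy on the 10-channel one-hot layout. The sketch below is a rough reference under two assumptions that the table does not state: ties go to the lowest color index (NumPy's `argmax` behaviour), and for `fill_bg_with_mode` the background color itself is excluded from the vote.

```python
import numpy as np


def one_hot(grid, num_colors=10):
    """Encode an [H, W] integer grid as a [C, H, W] float one-hot tensor."""
    return (np.arange(num_colors)[:, None, None] == grid[None]).astype(np.float32)


def row_mode_fill(grid, num_colors=10):
    """B4: fill each row with its most frequent color."""
    counts = one_hot(grid, num_colors).sum(axis=2)  # ReduceSum over width -> [C, H]
    mode = counts.argmax(axis=0)                    # ArgMax over channels -> [H]
    return np.tile(mode[:, None], (1, grid.shape[1]))


def fill_bg_with_mode(grid, bg=0, num_colors=10):
    """B3: replace background pixels with the dominant non-background color."""
    counts = one_hot(grid, num_colors).sum(axis=(1, 2))  # global histogram -> [C]
    counts[bg] = 0  # assumption: background does not vote
    return np.where(grid == bg, counts.argmax(), grid)
```

`col_mode_fill` follows by summing over axis 1 instead of axis 2 and tiling along the height.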
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  ---
78
 
79
+ ### Category C: Composition / Chaining
 
80
 
81
+ Chain two existing solvers. If transform(input) → intermediate, and color_map(intermediate) → output, emit one combined graph.
82
 
83
+ | # | Solver | Pattern | Key Ops | Status |
84
+ |---|--------|---------|---------|--------|
85
+ | C1 | `transform_then_recolor` | rotate/flip/transpose + color_map | Chain existing | ⬜ |
86
+ | C2 | `crop_then_transform` | fixed_crop + rotate/flip | Chain existing | ⬜ |
87
+ | C3 | `recolor_then_tile` | color_map + tile/upscale | Chain existing | ⬜ |
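
A scan for C1 matches might look like the following NumPy sketch. It assumes a task is given as a list of `(input, output)` integer grids; the names `TRANSFORMS`, `infer_color_map`, and `find_transform_then_recolor` are hypothetical and do not refer to existing code in the repository.

```python
import numpy as np

# Candidate geometric transforms, tried in this order.
TRANSFORMS = {
    "identity": lambda g: g,
    "rot90": lambda g: np.rot90(g, 1),
    "rot180": lambda g: np.rot90(g, 2),
    "rot270": lambda g: np.rot90(g, 3),
    "flip_lr": np.fliplr,
    "flip_ud": np.flipud,
    "transpose": lambda g: g.T,
}


def infer_color_map(src, dst):
    """Return a color -> color dict consistent with src -> dst pixelwise, else None."""
    if src.shape != dst.shape:
        return None
    mapping = {}
    for a, b in zip(src.ravel().tolist(), dst.ravel().tolist()):
        if mapping.setdefault(a, b) != b:
            return None
    return mapping


def find_transform_then_recolor(pairs):
    """Return (transform_name, color_map) that explains every pair, or None."""
    for name, fn in TRANSFORMS.items():
        merged = {}
        ok = True
        for inp, out in pairs:
            cmap = infer_color_map(fn(inp), out)
            if cmap is None:
                ok = False
                break
            for k, v in cmap.items():
                if merged.setdefault(k, v) != v:  # maps must agree across pairs
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return name, merged
    return None
```

The first matching transform wins, so `identity` plus a color map is reported before any geometric candidate. A hit here is only a candidate: it still needs arc-gen validation before the combined graph is kept.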
88
 
89
  ---
90
 
91
+ ### Category D: Unrolled Propagation (Conv+Where loops)
 
92
 
93
+ Dynamic solvers that need N unrolled steps. Higher MAC cost (~8-12 score).
 
 
 
 
 
 
 
 
94
 
95
+ | # | Solver | Pattern | Key Ops | Status |
96
+ |---|--------|---------|---------|--------|
97
+ | D1 | `gravity_unrolled` | Directional compaction, 4 dirs × 10 bg colors | Conv+Where ×N steps | ✅ Task 78 |
98
+ | D2 | `flood_fill` | BFS: seed spreads through passable cells | Conv+Clip+Mul ×N steps | ⬜ |
99
+ | D3 | `edge_detect` | Laplacian/Sobel boundary detection | Conv(3×3)+Abs+Greater | ✅ built, 0 matches |
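
One unrolled step of directional compaction (row D1) can be prototyped in NumPy. This is a sketch of the general technique, not the code of `gravity_unrolled`: it handles the downward direction only, works on an integer grid rather than one-hot channels, and uses array shifts where an ONNX graph would use a Conv with a shift kernel and `np.where` where it would use Where.

```python
import numpy as np


def gravity_down_step(cur, bg=0):
    """Move every non-background pixel down one cell if the cell below is empty."""
    occupied = cur != bg
    below_empty = np.zeros_like(occupied)
    below_empty[:-1] = ~occupied[1:]        # bottom row can never move
    moves_out = occupied & below_empty      # pixels that leave their cell

    above = np.full_like(cur, bg)
    above[1:] = cur[:-1]                    # color of the cell directly above
    arrives = np.zeros_like(occupied)
    arrives[1:] = moves_out[:-1]            # cells that receive a falling pixel

    # Receiving cells take the color from above; vacated cells become background.
    return np.where(arrives, above, np.where(moves_out, bg, cur))


def gravity_down(grid, steps, bg=0):
    """Apply a fixed number of steps, mirroring an unrolled (loop-free) graph."""
    cur = grid.copy()
    for _ in range(steps):
        cur = gravity_down_step(cur, bg)
    return cur
```

The other three directions can reuse the same step by rotating the grid with `np.rot90` before and after. The number of steps is fixed at build time, so it must be large enough for the tallest grid the task can produce; each extra step adds MACs and lowers the score.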
 
100
 
101
  ---
102
 
103
+ ### Category E: Global Aggregation
 
 
 
104
 
105
+ Solvers that compute a global statistic and broadcast it.
106
 
107
+ | # | Solver | Pattern | Key Ops | Status |
108
+ |---|--------|---------|---------|--------|
109
+ | E1 | `mode_fill` | Output = solid fill of most common input color | ReduceSum→ArgMax→Expand | ✅ Task 129 |
110
+ | E2 | `cumsum_fill` | Running sums for object extent, directional filling | CumSum | ⬜ |
111
+ | E3 | `bbox_crop_pad` | Find bounding box via ReduceSum+ArgMax, crop+pad | ReduceSum→ArgMax→Slice→Pad | ⬜ |
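
The global-statistic pattern in rows E1 and E3 can be sketched in NumPy as follows. The sketch assumes integer grids with a single known background color, covers only the crop half of `bbox_crop_pad`, and uses hypothetical function names; the reductions and `argmax` calls stand in for the ReduceSum and ArgMax nodes listed in the table.

```python
import numpy as np


def mode_fill(grid, num_colors=10):
    """E1: solid fill with the most common input color (ties go to the lowest index)."""
    counts = np.bincount(grid.ravel(), minlength=num_colors)
    return np.full_like(grid, counts.argmax())


def bbox_crop(grid, bg=0):
    """E3 (crop only): tightest box around all non-background pixels."""
    mask = grid != bg
    rows = mask.sum(axis=1) > 0   # which rows contain any foreground
    cols = mask.sum(axis=0) > 0   # which columns contain any foreground
    if not rows.any():
        return grid[:0, :0]       # empty grid when there is no foreground
    top = int(rows.argmax())                      # first foreground row
    bottom = len(rows) - int(rows[::-1].argmax()) # one past the last foreground row
    left = int(cols.argmax())
    right = len(cols) - int(cols[::-1].argmax())
    return grid[top:bottom, left:right]
```

The reversed `argmax` trick finds the last occupied row or column without a loop, which is the property that makes the same computation expressible with static ONNX ops.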
112
 
113
  ---
114
 
115
+ ### Build Order (highest expected ROI first)
 
 
 
 
 
 
 
 
116
 
117
+ **Wave 1 - Static remapping (Category A):** Cheapest to build, highest score per task, most likely to have matches. ~1 day.
118
+ 1. A1 `extract_inner` + A2 `add_border` (border ops)
119
+ 2. A5 `extract_and_tile` + A6 `sparse_fill` (pattern ops)
120
+ 3. A3 `pad_align` + A4 `downsample_stride` (placement ops)
121
+ 4. A7 `symmetry_complete` (symmetry)
122
 
123
+ **Wave 2 - Color/channel ops (Category B):** Builds on mode_fill. ~0.5 day.
124
+ 5. B1 `channel_filter` + B3 `fill_bg_with_mode`
125
+ 6. B4 `row_mode_fill` + B5 `col_mode_fill`
 
 
126
 
127
+ **Wave 3 — Composition (Category C):** Chains existing solvers, no new ONNX ops. ~0.5 day.
128
+ 7. C1 `transform_then_recolor`
129
 
130
+ **Wave 4 — Propagation (Category D):** More complex, lower score. ~1 day.
131
+ 8. D2 `flood_fill`
132
 
133
+ **Wave 5 - Global aggregation (Category E):** Needs careful design. ~1 day.
134
+ 9. E2 `cumsum_fill` + E3 `bbox_crop_pad`
 
135
 
136
+ ---
 
137
 
138
+ ### Honest Projections
 
 
139
 
140
+ I will NOT repeat the Phase 2 mistake of projecting fantasy numbers. Here's what I know:
141
 
142
+ - **51 tasks solved today.** LB 594.84.
143
+ - **Each Wave:** Might add 2-10 tasks. Might add 0. We don't know until we scan and test.
144
+ - **The only reliable estimate:** Gravity added 1 task. Mode fill added 1 task. Edge detect added 0. Hit rate so far: ~1 new task per solver built.
145
+ - **If hit rate holds:** 20 new solvers × ~1 task each = ~20 new tasks → ~70 solved → LB ~800-900.
146
+ - **If some solvers hit 5+ tasks:** Could reach 100-120 solved → LB ~1200-1500.
147
+ - **3000+ requires a fundamentally different approach** (test-time training, learned architectures) that we're not doing.
148
 
149
+ | Scenario | Solved | Est LB | Confidence |
150
+ |----------|--------|--------|------------|
151
+ | Wave 1 only | 55-65 | 650-800 | 60% |
152
+ | Wave 1+2 | 60-75 | 750-950 | 50% |
153
+ | Wave 1+2+3 | 65-85 | 850-1100 | 40% |
154
+ | All waves | 70-120 | 900-1500 | 30% |
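
The arithmetic behind the hit-rate bullets can be made explicit. The sketch below assumes that each newly solved task replaces the automatic 1.0 pt for an unsolved task with its own score, so the net gain per task is `avg_score - 1`; the average score per new task is an input, not a measured value, and the function names are hypothetical.

```python
def projected_lb(current_lb: float, new_tasks: int, avg_score: float) -> float:
    """Leaderboard estimate after solving `new_tasks` more tasks at `avg_score` each."""
    return current_lb + new_tasks * (avg_score - 1.0)


def implied_avg_score(current_lb: float, new_tasks: int, target_lb: float) -> float:
    """Average per-task score needed to reach `target_lb` with `new_tasks` new solves."""
    return (target_lb - current_lb) / new_tasks + 1.0


# 20 new tasks from the current LB of 594.84, at an assumed 12 pts each.
print(round(projected_lb(594.84, 20, 12.0), 2))
# Per-task averages implied by the ~800-900 range quoted for 20 new tasks.
print(round(implied_avg_score(594.84, 20, 800), 2), round(implied_avg_score(594.84, 20, 900), 2))
```

Under this model, the "~800-900" range for 20 new tasks corresponds to an average of roughly 11 to 16 points per new task, so the projection leans on cheap Category A solvers rather than unrolled Category D ones.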
155
 
156
  ---
157
 
158
+ ## Phase 4: Score Optimization
159
 
160
+ ### 4a: Best-of-N Model Selection
161
+ ### 4b: Official Scoring Alignment (onnx_tool) ⬜
 
 
 
 
 
 
 
162
 
163
+ ---
164
 
165
+ ## BLENDING EXPLICITLY EXCLUDED
166
 
167
  ---
168
 
169
+ ## Experiment Log
 
 
 
 
 
 
 
 
 
 
 
170
 
171
+ | Date | Experiment | Result | Decision |
172
+ |------|-----------|--------|----------|
173
+ | 2026-04-24 | v4.2 baseline | 50 arc-gen, LB ~501 | Baseline |
174
+ | 2026-04-26 | v5.0 refactor | 49 solved, ~604 score | New baseline |
175
+ | 2026-04-26 | Exp 1-3 (regularization) | 0 improvement | **EXHAUSTED** |
176
+ | 2026-04-26 | v5.2 gravity+mode | +2 tasks (78, 129) | ✅ Kept |
177
+ | 2026-04-27 | **v5.2 Kaggle submission** | **51 solved, LB 594.84** | **Current best** |
178
 
179
  ---
180
 
181
  ## Research Queue
182
 
183
+ 1. ✅ CompressARC: CumMax/ReduceSum architecture
184
+ 2. ✅ TRM: recursive reasoning
185
+ 3. ✅ ARC Prize 2025 Tech Report
186
+ 4. ✅ Expert review #1: Phase 3 solver list (pad_align, crop_paste, downsample, etc.)
187
+ 5. ✅ Expert review #2: 6 concrete solvers with code (extract_inner, add_border, etc.)
188
+ 6. [ ] **Task taxonomy scan:** for each Wave 1 solver, count matching unsolved tasks before building