	# NeuroGolf Solver — Roadmap

	> Current: v5.2 · 51 Kaggle validated · LB 594.84 · Target: 3000+
	> Philosophy: Research → Design → Experiment → Analyze → Research loop until confirmed score increase.
	> Rule: NEVER claim a feature works without full arc-gen validation on representative tasks.
	> Updated: 2026-04-27 — LB 594.84 confirmed. Phase 3 redesigned from expert review + literature.
	> All 400 tasks count. There are NO excluded tasks. Unsolved = 1.0 pt (Kaggle adds automatically).

	---

	## Current Solver Breakdown (51/400 solved, LB 594.84)

	| Category | Tasks | Solvers |
	|----------|-------|---------|
	| Conv (lstsq) | 25 | conv_fixed, conv_var, conv_diff, conv_var_diff |
	| Analytical | 24 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
	| Gravity | 1 | gravity_unrolled (Task 78) |
	| Mode fill | 1 | mode_fill (Task 129) |
	| Unsolved | 349 | n/a |

	---

	## Phase 1: Score Optimization on Existing Tasks

	### 1a: Opset 17 Slice-Based Analytical Solvers ⬜
	> Convert Gather-based solvers to Slice(step=-1) + Transpose for ~0 MACs.
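
The idea behind 1a can be checked outside ONNX first. The NumPy sketch below is only an illustration under assumptions: the `[1, 10, H, W]` one-hot layout is taken from the Category B notes, and the random grid is made up. It shows that a flip written as an index gather gives the same tensor as a negative-step slice, and that a clockwise rotation reduces to a transpose followed by such a slice.

```python
import numpy as np

# Hypothetical one-hot input: batch 1, 10 color channels, 3x4 grid.
rng = np.random.default_rng(0)
grid = rng.integers(0, 10, size=(3, 4))
x = np.eye(10, dtype=np.float32)[grid].transpose(2, 0, 1)[None]  # [1, 10, H, W]

# Gather-style horizontal flip: explicit reversed column indices.
cols = np.arange(x.shape[3])[::-1]
flip_gather = np.take(x, cols, axis=3)

# Slice-style flip: a negative step along the width axis.
flip_slice = x[:, :, :, ::-1]

# Rotate 90 degrees clockwise: swap H and W, then reverse the new width axis.
rot_slice = x.transpose(0, 1, 3, 2)[:, :, :, ::-1]

print(np.array_equal(flip_gather, flip_slice))
```

Whether the Slice form actually scores better than the Gather form still has to be confirmed with the official cost counter.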

	### 1b: ONNX Optimizer Pass ⬜
	> `onnxoptimizer.optimize()` for dead-code elimination.

	---

	## Phase 2: Regularization — EXHAUSTED

	> Exps 0-3 tested. Architecture mismatch, not overfitting. Conv ceiling = ~25 tasks.

	---

	## Phase 3: New Solver Types

	> Organized by architecture type. Each solver is a separate .py file.
	> Build rule: Scan for matches FIRST, build only what has hits, validate on arc-gen.

	---

	### Category A: Static Spatial Remapping (Gather/Slice/Pad)

	These are cheap, zero/low-MAC solvers that use precomputed index mappings. Highest score per task. Build these first.

	| # | Solver | Pattern | Key Ops | Status |
	|---|--------|---------|---------|--------|
	| A1 | `extract_inner` | Remove N-pixel border frame → smaller output | Gather | ⬜ |
	| A2 | `add_border` | Add constant-color border → larger output | Gather+const | ⬜ |
	| A3 | `pad_align` | Input pasted into larger canvas at fixed offset | Gather+const | ⬜ |
	| A4 | `downsample_stride` | `out[r,c] = inp[r*sH, c*sW]` | Gather | ⬜ |
	| A5 | `extract_and_tile` | Find smallest repeating unit, tile to fill output | Gather | ⬜ |
	| A6 | `sparse_fill` | Each non-zero pixel becomes NxN block | Gather | ⬜ |
	| A7 | `symmetry_complete` | Mirror sparse data to complete L-R or T-B symmetry | Gather | ⬜ |
	| A8 | `multi_stamp` | Union of shifted copies of input at fixed offsets | Gather+Add | ⬜ |
	| A9 | `affine_remap` | General integer coordinate remap: stride+offset, axis swap | Gather | ⬜ |
	| A10 | `crop_paste` | Crop from input, paste at different position in output | Gather+const | ⬜ |
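
As a reference for what "precomputed index mappings" means here, the following NumPy sketch expresses A1, A2 and A4 on plain color grids. It is a semantic reference only, not the ONNX graph: the helper names (`gather_2d`, `extract_inner_idx`, `downsample_idx`) and the 6x6 test grid are hypothetical.

```python
import numpy as np

def gather_2d(grid, row_idx, col_idx):
    """Static remap: out[r, c] = grid[row_idx[r], col_idx[c]]."""
    return grid[np.ix_(row_idx, col_idx)]

def extract_inner_idx(h, w, n):
    """Index lists that drop an n-pixel border frame (A1)."""
    return np.arange(n, h - n), np.arange(n, w - n)

def downsample_idx(h, w, s_h, s_w):
    """Index lists for out[r, c] = inp[r * s_h, c * s_w] (A4)."""
    return np.arange(0, h, s_h), np.arange(0, w, s_w)

def add_border(grid, n, color):
    """A2: constant canvas with the input pasted at offset (n, n)."""
    h, w = grid.shape
    out = np.full((h + 2 * n, w + 2 * n), color, dtype=grid.dtype)
    out[n:n + h, n:n + w] = grid
    return out

grid = np.arange(36).reshape(6, 6) % 10          # made-up 6x6 grid
inner = gather_2d(grid, *extract_inner_idx(6, 6, 1))
small = gather_2d(grid, *downsample_idx(6, 6, 2, 2))
framed = add_border(inner, 1, 0)
print(inner.shape, small.shape, framed.shape)
```

Because the index lists depend only on the grid size and the solver parameters, they can be computed once at build time and stored as constants.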

	---

	### Category B: Channel/Color Operations

	Color-level transforms that work in the 10-channel one-hot space.

	| # | Solver | Pattern | Key Ops | Status |
	|---|--------|---------|---------|--------|
	| B1 | `channel_filter` | Keep only certain colors, rest → background | Mul(mask [1,10,1,1]) | ⬜ |
	| B2 | `overlay_constant` | Input + fixed pixel pattern overlaid | Add or Where + constant tensor | ⬜ |
	| B3 | `fill_bg_with_mode` | Background pixels filled with dominant color, non-bg unchanged | ReduceSum→ArgMax→Where | ⬜ |
	| B4 | `row_mode_fill` | Each row filled with its dominant color | ReduceSum(width)→ArgMax→Tile(width) | ⬜ |
	| B5 | `col_mode_fill` | Each column filled with its dominant color | ReduceSum(height)→ArgMax→Tile(height) | ⬜ |
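
A rough NumPy sketch of B1 and B4 in the one-hot space, to pin down the intended semantics before any graph is built. Assumptions: the `[1, 10, H, W]` layout, background color 0 by default, and ties in the row mode going to the lowest color index (that is simply what `argmax` does). The `one_hot` helper and the sample grid are illustrative only.

```python
import numpy as np

def one_hot(grid):
    """Encode an H x W grid of colors 0-9 as a [1, 10, H, W] float tensor."""
    return np.eye(10, dtype=np.float32)[grid].transpose(2, 0, 1)[None]

def channel_filter(x, keep, bg=0):
    """B1: zero every channel not in `keep`, then send emptied pixels to bg."""
    mask = np.zeros((1, 10, 1, 1), dtype=np.float32)
    mask[0, list(keep), 0, 0] = 1.0
    kept = x * mask                                 # Mul(mask [1,10,1,1])
    empty = 1.0 - kept.sum(axis=1, keepdims=True)   # 1 where a pixel was removed
    kept[:, bg:bg + 1] += empty
    return kept

def row_mode_fill(x):
    """B4: fill each row with its most frequent color."""
    counts = x.sum(axis=3)                          # ReduceSum over width -> [1, 10, H]
    mode = counts.argmax(axis=1)                    # ArgMax over channels -> [1, H]
    rows = np.eye(10, dtype=np.float32)[mode]       # [1, H, 10]
    rows = rows.transpose(0, 2, 1)[..., None]       # [1, 10, H, 1]
    return np.repeat(rows, x.shape[3], axis=3)      # Tile along width

grid = np.array([[1, 1, 2], [3, 0, 3], [5, 5, 5]])
x = one_hot(grid)
filtered = channel_filter(x, keep={1, 3})
filled = row_mode_fill(x)
print(filtered.argmax(axis=1)[0])
print(filled.argmax(axis=1)[0])
```

`col_mode_fill` (B5) would follow the same pattern with the reduction over height instead of width.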

	---

	### Category C: Composition / Chaining

	Chain two existing solvers. If transform(input) → intermediate, and color_map(intermediate) → output, emit one combined graph.

	| # | Solver | Pattern | Key Ops | Status |
	|---|--------|---------|---------|--------|
	| C1 | `transform_then_recolor` | rotate/flip/transpose + color_map | Chain existing | ⬜ |
	| C2 | `crop_then_transform` | fixed_crop + rotate/flip | Chain existing | ⬜ |
	| C3 | `recolor_then_tile` | color_map + tile/upscale | Chain existing | ⬜ |
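
One possible way to detect a C1 match is to try each spatial transform and check whether a single consistent color map explains the remainder. The sketch below assumes tasks are given as lists of `(input, output)` NumPy grids; the `TRANSFORMS` table and both function names are hypothetical and do not refer to existing solver code.

```python
import numpy as np

# Candidate spatial transforms (illustrative subset).
TRANSFORMS = {
    "identity": lambda g: g,
    "flip_lr": lambda g: g[:, ::-1],
    "flip_ud": lambda g: g[::-1, :],
    "transpose": lambda g: g.T,
    "rot90": lambda g: np.rot90(g),
}

def infer_color_map(src, dst):
    """Return a color -> color dict mapping src onto dst, or None if inconsistent."""
    if src.shape != dst.shape:
        return None
    mapping = {}
    for a, b in zip(src.ravel().tolist(), dst.ravel().tolist()):
        if mapping.setdefault(a, b) != b:
            return None
    return mapping

def find_transform_then_recolor(pairs):
    """Find (transform name, color map) explaining every pair, else None."""
    for name, fn in TRANSFORMS.items():
        merged = {}
        ok = True
        for inp, out in pairs:
            m = infer_color_map(fn(inp), out)
            if m is None or any(merged.setdefault(k, v) != v for k, v in m.items()):
                ok = False
                break
        if ok:
            return name, merged
    return None

inp = np.array([[1, 1, 2], [3, 3, 3]])
out = np.array([[2, 5, 5], [3, 3, 3]])   # flip left-right, then recolor 1 -> 5
print(find_transform_then_recolor([(inp, out)]))
```

A match found on the training pairs is only a candidate; it still needs arc-gen validation before it counts.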

	---

	### Category D: Unrolled Propagation (Conv+Where loops)

	Dynamic solvers that need N unrolled steps. The higher MAC cost lowers the per-task score to roughly 8-12 points.

	| # | Solver | Pattern | Key Ops | Status |
	|---|--------|---------|---------|--------|
	| D1 | `gravity_unrolled` | Directional compaction, 4 dirs × 10 bg colors | Conv+Where ×N steps | ✅ Task 78 |
	| D2 | `flood_fill` | BFS: seed spreads through passable cells | Conv+Clip+Mul ×N steps | ⬜ |
	| D3 | `edge_detect` | Laplacian/Sobel boundary detection | Conv(3×3)+Abs+Greater | ✅ built, 0 matches |
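
The Conv+Clip+Mul loop planned for D2 can be prototyped in NumPy before committing to a graph. This is a minimal sketch under assumptions: 4-connectivity, binary `seed` and `passable` masks, and a fixed step count chosen by the caller. The function names and the small example grid are made up.

```python
import numpy as np

def neighbor_sum(a):
    """Sum of the 4 edge neighbors with zero padding (a plus-shaped 3x3 Conv)."""
    p = np.pad(a, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def flood_fill_unrolled(seed, passable, steps):
    """Spread `seed` through `passable` cells for a fixed number of steps."""
    filled = seed * passable
    for _ in range(steps):
        grown = np.clip(filled + neighbor_sum(filled), 0, 1)  # Conv + Clip
        filled = grown * passable                              # Mul by the mask
    return filled

passable = np.array([[1, 1, 0, 1],
                     [0, 1, 0, 1],
                     [1, 1, 0, 1]])
seed = np.zeros_like(passable)
seed[0, 0] = 1
print(flood_fill_unrolled(seed, passable, steps=4))
```

Each step is one unrolled layer, so the step count trades coverage against cost; a safe but expensive upper bound is the number of cells in the grid.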

	---

	### Category E: Global Aggregation

	Solvers that compute a global statistic and broadcast it.

	| # | Solver | Pattern | Key Ops | Status |
	|---|--------|---------|---------|--------|
	| E1 | `mode_fill` | Output = solid fill of most common input color | ReduceSum→ArgMax→Expand | ✅ Task 129 |
	| E2 | `cumsum_fill` | Running sums for object extent, directional filling | CumSum | ⬜ |
	| E3 | `bbox_crop_pad` | Find bounding box via ReduceSum+ArgMax, crop+pad | ReduceSum→ArgMax→Slice→Pad | ⬜ |
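
For reference, here is how E1 and the bounding-box part of E3 could look in NumPy. This is a sketch, not the shipped `mode_fill` solver: it assumes the `[1, 10, H, W]` one-hot layout for E1, a plain color grid with background 0 for E3, and at least one non-background cell. The padding step of E3 is left out.

```python
import numpy as np

def one_hot(grid):
    """Encode an H x W grid of colors 0-9 as a [1, 10, H, W] float tensor."""
    return np.eye(10, dtype=np.float32)[grid].transpose(2, 0, 1)[None]

def mode_fill(x):
    """E1: solid fill with the most common color."""
    counts = x.sum(axis=(2, 3))              # ReduceSum over H, W -> [1, 10]
    mode = int(counts.argmax(axis=1)[0])     # ArgMax over channels
    out = np.zeros_like(x)
    out[:, mode] = 1.0                       # Expand to every pixel
    return out

def bbox_crop(grid, bg=0):
    """E3 (crop only): bounding box of non-background cells."""
    fg = grid != bg
    rows = fg.sum(axis=1) > 0                # ReduceSum over width
    cols = fg.sum(axis=0) > 0                # ReduceSum over height
    top = int(rows.argmax())                 # first occupied row
    bottom = len(rows) - int(rows[::-1].argmax())   # exclusive end
    left = int(cols.argmax())
    right = len(cols) - int(cols[::-1].argmax())
    return grid[top:bottom, left:right]

colors = np.array([[2, 2, 1], [2, 4, 1]])
shape = np.array([[0, 0, 0, 0],
                  [0, 3, 3, 0],
                  [0, 0, 3, 0],
                  [0, 0, 0, 0]])
print(mode_fill(one_hot(colors)).argmax(axis=1)[0])
print(bbox_crop(shape))
```

The bounding box here depends on the input, so unlike Category A the slice bounds cannot be baked in as constants; that is the part that needs careful design.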

	---

	### Build Order (highest expected ROI first)

	Wave 1 — Static remapping (Category A): Cheapest to build, highest score per task, most likely to have matches. ~1 day.
	1. A1 `extract_inner` + A2 `add_border` (border ops)
	2. A5 `extract_and_tile` + A6 `sparse_fill` (pattern ops)
	3. A3 `pad_align` + A4 `downsample_stride` (placement ops)
	4. A7 `symmetry_complete` (symmetry)

	Wave 2 — Color/channel ops (Category B): Builds on mode_fill. ~0.5 day.
	5. B1 `channel_filter` + B3 `fill_bg_with_mode`
	6. B4 `row_mode_fill` + B5 `col_mode_fill`

	Wave 3 — Composition (Category C): Chains existing solvers, no new ONNX ops. ~0.5 day.
	7. C1 `transform_then_recolor`

	Wave 4 — Propagation (Category D): More complex, lower score. ~1 day.
	8. D2 `flood_fill`

	Wave 5 — Global aggregation (Category E): Needs careful design. ~1 day.
	9. E2 `cumsum_fill` + E3 `bbox_crop_pad`

	---

	### Honest Projections

	I will NOT repeat the Phase 2 mistake of projecting fantasy numbers. Here's what I know:

	- 51 tasks solved today. LB 594.84.
	- Each Wave: Might add 2-10 tasks. Might add 0. We don't know until we scan and test.
	- The only reliable estimate: Gravity added 1 task. Mode fill added 1 task. Edge detect added 0. Hit rate so far: 2 new tasks from 3 solvers built (~0.7 per solver, rounded to ~1 below).
	- If hit rate holds: 20 new solvers × ~1 task each = ~20 new tasks → ~70 solved → LB ~800-900.
	- If some solvers hit 5+ tasks: Could reach 100-120 solved → LB ~1200-1500.
	- 3000+ requires a fundamentally different approach (test-time training, learned architectures) that we're not doing.

	| Scenario | Solved | Est LB | Confidence |
	|----------|--------|--------|------------|
	| Wave 1 only | 55-65 | 650-800 | 60% |
	| Wave 1+2 | 60-75 | 750-950 | 50% |
	| Wave 1+2+3 | 65-85 | 850-1100 | 40% |
	| All waves | 70-120 | 900-1500 | 30% |
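
The arithmetic behind these ranges can be made explicit. The sketch below is an assumption-heavy back-of-envelope model, not the official scoring formula: it supposes that newly solved tasks score like the current average, and it exposes the credit for unsolved tasks as a parameter (`unsolved_credit`, a hypothetical name). With `unsolved_credit=0.0` the current LB works out to about 11.66 points per solved task, and 70 solved tasks project to roughly 816, which falls inside the ~800-900 estimate above.

```python
def avg_points_per_solved(lb, solved, total=400, unsolved_credit=0.0):
    """Average points per solved task implied by a leaderboard score."""
    return (lb - unsolved_credit * (total - solved)) / solved

def project_lb(new_solved, avg_points, total=400, unsolved_credit=0.0):
    """Projected LB if `new_solved` tasks each earn `avg_points` on average."""
    return new_solved * avg_points + unsolved_credit * (total - new_solved)

avg = avg_points_per_solved(594.84, 51)
for n in (70, 100, 120):
    print(n, round(project_lb(n, avg)))
```

Category D solvers are expected to score below this average, so the projection is more likely to be optimistic than pessimistic for later waves.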

	---

	## Phase 4: Score Optimization

	### 4a: Best-of-N Model Selection ⬜
	### 4b: Official Scoring Alignment (onnx_tool) ⬜

	---

	## BLENDING — EXPLICITLY EXCLUDED

	---

	## Experiment Log

	| Date | Experiment | Result | Decision |
	|------|-----------|--------|----------|
	| 2026-04-24 | v4.2 baseline | 50 arc-gen, LB ~501 | Baseline |
	| 2026-04-26 | v5.0 refactor | 49 solved, ~604 score | New baseline |
	| 2026-04-26 | Exp 1-3 (regularization) | 0 improvement | EXHAUSTED |
	| 2026-04-26 | v5.2 gravity+mode | +2 tasks (78, 129) | ✅ Kept |
	| 2026-04-27 | v5.2 Kaggle submission | 51 solved, LB 594.84 | Current best |

	---

	## Research Queue

	1. ✅ CompressARC — CumMax/ReduceSum architecture
	2. ✅ TRM — recursive reasoning
	3. ✅ ARC Prize 2025 Tech Report
	4. ✅ Expert review #1 — Phase 3 solver list (pad_align, crop_paste, downsample, etc.)
	5. ✅ Expert review #2 — 6 concrete solvers with code (extract_inner, add_border, etc.)
	6. ⬜ Task taxonomy scan: for each Wave 1 solver, count matching unsolved tasks before building
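
A taxonomy scan of this kind could be as small as the sketch below. It assumes unsolved tasks are available as a dict from task id to a list of `(input, output)` NumPy grids; that format, the detector functions, and the toy tasks are all hypothetical, and real detectors would need to search over their parameters (border width, stride) rather than use fixed defaults.

```python
import numpy as np

def is_extract_inner(inp, out, n=1):
    """True if out is inp with an n-pixel frame removed."""
    h, w = inp.shape
    return h > 2 * n and w > 2 * n and np.array_equal(inp[n:-n, n:-n], out)

def is_downsample_stride(inp, out, s=2):
    """True if out[r, c] == inp[r * s, c * s] everywhere."""
    return np.array_equal(inp[::s, ::s], out)

def scan(tasks, detectors):
    """Map each detector name to the task ids whose pairs all match."""
    hits = {name: [] for name in detectors}
    for task_id, pairs in tasks.items():
        for name, detect in detectors.items():
            if all(detect(inp, out) for inp, out in pairs):
                hits[name].append(task_id)
    return hits

a = np.arange(36).reshape(6, 6) % 10      # made-up grid for toy tasks
tasks = {
    "t1": [(a, a[1:-1, 1:-1])],
    "t2": [(a, a[::2, ::2])],
    "t3": [(a, a)],
}
detectors = {
    "extract_inner": is_extract_inner,
    "downsample_stride": is_downsample_stride,
}
hits = scan(tasks, detectors)
print(hits)
```

Solvers whose hit list comes back empty can be dropped from the wave before any ONNX work starts.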
