rogermt
/

neurogolf-solver


rogermt committed 12 days ago

Commit

97af2d2

verified ·

1 Parent(s): 17e36c1

Rewrite Phase 3: merged expert + original solvers, organized by architecture type, honest estimates

Files changed (1)

TODO.md +113 -216

@@ -1,291 +1,188 @@
 # NeuroGolf Solver — Roadmap
-> Current: v5.1 · 49 arc-gen validated (budget=5s) · ~604 score · Target: 3000+
 > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
 > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
-> Updated: 2026-04-26 — Phase 2 (regularization) exhausted. Phase 3 redesigned from literature.
-> **All 400 tasks count. There are NO excluded tasks.**
 ---
-## Current Solver Breakdown (49/400 solved)
-| Category | Tasks | Avg Score | Solver |
-|----------|-------|-----------|--------|
-| Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
-| Analytical | 24 | ~15.5 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
-| **Unsolved** | **351** | **1.0** | — |
-| **Total** | **400** | | **~604** |
-The 351 unsolved tasks need fundamentally different solver architectures.
 ---
-## Phase 1: Score Optimization on Existing Tasks (est +100-200 pts)
-### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost) ⬜
-> Reduce MACs on the 24 analytical tasks. Currently score ~15.5 avg, target ~20+.
-- [ ] Convert Gather-based solvers to Slice(step=-1) + Transpose
-  - Affected: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
-- [ ] Validate: Full 400 arc-gen. Accept if >10% score increase on analytical tasks.
-- **Estimate:** 24 tasks × (+5 pts avg) = **+120 pts**
 ### 1b: ONNX Optimizer Pass ⬜
-- [ ] `onnxoptimizer.optimize()` with dead-code elimination
-- [ ] Validate: Compare scores before/after on all 49 solved tasks.
-- **Estimate:** 49 tasks × (+1-2 pts avg) = **+50-100 pts**
 ---
 ## Phase 2: Regularization — EXHAUSTED
-> Exps 0-3 tested. Root cause is architecture mismatch, not overfitting.
-> Conv ceiling = ~25 tasks. See Experiment Log below for full data.
 ---
-## Phase 3: New Solver Types (the actual path to 3000+)
-> **Research basis:** CompressARC (`2512.06104`), TRM (`2510.04871`), NCA (`2506.15746`), ONNX opset 17 operator audit.
-> **Key insight:** ARC tasks cluster into ~8 families. Each family needs a specialized ONNX architecture. Score = max(1, 25 - ln(MACs + mem + params)), so tiny models score highest.
->
-> **Honest math:** Solving 50 more tasks at ~12 pts avg = +600. Solving 100 more = +1200. To hit 3000 we need ~200 new tasks at ~12 pts avg. That's ambitious but structurally possible.
-### Solver Priority Table (ordered by score × expected tasks)
-| # | Solver | Expected Tasks | Score | Total Pts | Complexity | Key Ops |
-|---|--------|---------------|-------|-----------|------------|---------|
-| 1 | **Gravity (4-dir)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 shift kernel) × 30 unrolled steps + Where |
-| 2 | **Flood Fill (BFS)** | 10-20 | ~12 | 120-240 | Medium | Conv(3×3 cross kernel) + Clip × 30 steps |
-| 3 | **Edge/Boundary Detect** | 10-20 | ~13 | 130-260 | Low | Conv(Laplacian/Sobel kernel) + threshold |
-| 4 | **Composition (transform+recolor)** | 10-15 | ~14 | 140-210 | Low | Chain existing analytical + color_map |
-| 5 | **Mode/Majority Color** | 5-10 | ~16 | 80-160 | Low | ReduceSum → ArgMax → Expand |
-| 6 | **Color LUT (10×10 MatMul)** | 10-20 | ~13 | 130-260 | Low | OneHot → MatMul(W_lut) → ArgMax, lstsq-fit W_lut |
-| 7 | **Object Copy/Offset** | 5-15 | ~12 | 60-180 | High | ScatterND + offset detection |
-| 8 | **CumSum Analysis** | 5-10 | ~15 | 75-150 | Medium | CumSum for running totals, object extent |
-**Conservative total: +80-150 tasks, +850-1700 pts → est LB ~1450-2300**
-**Optimistic total: +150-200 tasks → est LB ~2400-3000**
----
-### 3a: Gravity Solver ⬜ — Confidence: **70%**
-> Directional pixel propagation. ~30 unrolled steps, 4 directions.
-**ONNX Blueprint:**
-```python
-# Per step: pull pixel from direction, fill if empty
-shift_k = np.zeros((1,1,3,3), dtype=np.float32)
-shift_k[0,0,0,1] = 1.0  # gravity down: pull from row above
-for i in range(30):
-    nodes += [
-        Conv(cur, shift_k, pads=[1,1,0,0]),  # shifted copy
-        Equal(cur, zero),                      # is cell empty?
-        Where(is_empty, shifted, cur),         # fill empty cells
-    ]
-```
-**Fitting:** For each task, try all 4 directions. Detect "empty color" (usually 0). Validate against arc-gen.
-**Cost:** ~240K MACs (30 steps × 8100 per Conv), ~4.8KB, score ~12.
-**Implementation:** ~60 lines in `neurogolf_solver/solvers/gravity.py`
-- [ ] Implement `s_gravity_unrolled(td)` for all 4 directions
-- [ ] Detect empty color from training examples
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥3 new tasks solved
 ---
-### 3b: Flood Fill Solver ⬜ — Confidence: **60%**
-> BFS via unrolled Conv. Seeds propagate through passable cells.
-**ONNX Blueprint:**
-```python
-# 30-step BFS. Seed starts at one color, spreads through another.
-cross_k = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=np.float32).reshape(1,1,3,3)
-for i in range(30):
-    nodes += [
-        Conv(cur, cross_k, pads=[1,1,1,1]),  # expand frontier
-        Clip(expanded, 0, 1),                  # saturate
-        Mul(clipped, obstacle_mask),           # block walls
-        Add(cur, masked),                      # accumulate
-        Clip(sum, 0, 1),                       # final saturate
-    ]
-```
-**Fitting:** Learn seed_selector (10 weights: which input color is seed) + obstacle_selector (10 weights: which colors are passable). Fit via lstsq on training examples.
-**Cost:** ~240K MACs, ~4.9KB, score ~12.
-**Implementation:** ~80 lines in `neurogolf_solver/solvers/flood.py`
-- [ ] Implement `s_flood_fill(td)` with parameterized seed/obstacle selection
-- [ ] Fit selectors via lstsq
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥2 new tasks solved
----
-### 3c: Edge/Boundary Detection ⬜ — Confidence: **75%**
-> Laplacian/Sobel convolution to detect boundaries between colors.
-**ONNX Blueprint:**
-```python
-# Laplacian kernel detects any color boundary
-lap_k = np.array([[0,-1,0],[-1,4,-1],[0,-1,0]], dtype=np.float32)
-nodes = [
-    ReduceSum(input, axes=[1]),          # collapse channels to [1,1,H,W] intensity
-    Conv(intensity, lap_k, pads=[1,1,1,1]),  # edge response
-    Greater(response, threshold),        # binary edge map
-    Cast(binary, FLOAT),                 # to float
-    # Then: assign edge_color via Mul + Add
-]
-```
-**Fitting:** Detect edge_color and background_color from training pairs. Many ARC tasks ask "draw the outline of the shape."
-**Cost:** ~16K MACs, ~1KB, score ~15.
-**Implementation:** ~40 lines in `neurogolf_solver/solvers/edge.py`
-- [ ] Implement `s_edge_detect(td)` with Laplacian + Sobel variants
-- [ ] Fit edge/background colors from examples
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥2 new tasks solved
 ---
-### 3d: Composition Detectors ⬜ — Confidence: **65%**
-> Chain existing analytical solvers: rotate+recolor, flip+recolor, etc.
-**Approach:** For each task, try all (transform × color_map) pairs. If the composition matches all train+arc-gen examples, emit combined ONNX graph.
-- [ ] Scan 400 tasks: for each, apply all transforms, then check if color_map fixes remainder
-- [ ] Build ONNX graph that chains transform + color_map nodes
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥3 new tasks solved
 ---
-### 3e: Mode/Majority Color Solver ⬜ — Confidence: **80%**
-> Output = most common color in input (or region).
-**ONNX Blueprint:**
-```python
-# ~543 bytes, 13 params, ~10K MACs, score ~16
-nodes = [
-    ReduceSum(input, axes=[2,3]),  # sum over spatial → [1,10] histogram
-    ArgMax(hist, axis=1),          # most common color index
-    # Expand to full grid, one-hot encode
-]
-```
-**Fitting:** Check training pairs: does output = constant fill of mode color? Also try per-row/per-col mode.
-**Implementation:** ~30 lines
-- [ ] Implement `s_mode_color(td)` — global, per-row, per-col variants
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥1 new task solved
 ---
-### 3f: Color LUT (10×10 MatMul) ⬜ — Confidence: **70%**
-> General color→color mapping via learned 10×10 weight matrix.
-Already have `s_color_map` for permutations + Conv 1×1 for non-permutations. This extends to position-dependent color transforms by stacking spatial features.
-**Fitting:** `W_lut = lstsq(OneHot(input_pixels), OneHot(output_pixels))`
-- [ ] Implement `s_color_lut(td)` using OneHot → MatMul → ArgMax
-- [ ] Compare with existing color_map solver — keep if it solves additional tasks
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥2 new tasks beyond existing color_map
 ---
-### 3g: CumSum-Based Analysis ⬜ — Confidence: **50%**
-> Running sums for object extent, counting, filling. Key op from CompressARC.
-**ONNX Blueprint:**
-```python
-# CumSum along axis 2 (rows) → running sum per column
-axis_tensor = from_array(np.int64(2), 'axis')
-nodes = [CumSum(input_channel, axis_tensor)]
-```
-**Use cases:** "Fill everything below the topmost pixel of each color", "count pixels per row", object bounding boxes.
-- [ ] Prototype CumSum-based solver for specific task families
-- [ ] Validate on 400 tasks
-- **Accept if:** ≥1 new task solved
----
-## Phase 4: Score Optimization (est +50-100 pts)
-### 4a: Best-of-N Model Selection ⬜
-> For each task, try ALL ks values + ALL solver types, keep cheapest valid model.
-- [ ] Refactor `solve_task` to collect all valid candidates, pick lowest cost
-- [ ] Validate: Compare total score before/after
-- **Accept if:** ≥3% total score improvement
-### 4b: Official Scoring Alignment ⬜
-> Use `onnx_tool` for exact cost matching with Kaggle scorer.
-- [ ] Compare static profiler vs onnx_tool on all solved models
-- [ ] Fix divergences
-- **Accept if:** divergence <2% on all models
----
-## BLENDING — EXPLICITLY EXCLUDED
-> **User's competitive philosophy**: "I am writing my own models no blending. This is major flaw in the competition loophole."
 ---
-## Experiment Log
-| Date | Experiment | Tasks Tested | Result | Decision |
-|------|-----------|-------------|--------|----------|
-| 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep as baseline |
-| 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
-| 2026-04-26 | v5.0 refactor | 400 | **49 solved, ~603.6 score, budget=5s** | New baseline |
-| 2026-04-26 | Exp 1: Skip ks=5,7,9 | 55 | **HURTS 2 solved tasks** | **[-] REJECTED** |
-| 2026-04-26 | Exp 2: Best-of-N | 55 | **No new solves** | **[~] NEUTRAL** |
-| 2026-04-26 | Exp 3: Ridge reg | 4 victims | **0/4 pass arc-gen** | **[-] REJECTED** |
-| 2026-04-26 | **Exp 3: Full PCA/SVD** | **400 tasks** | **0 PCR solves, 0 regressions** | **[-] REJECTED** |
-### CRITICAL FINDING (2026-04-26)
-The 351 unsolved tasks fail because **conv is the wrong architecture**, not because of bad regularization. Score improvement requires new solver types (Phase 3), not fixing conv.
 ---
-## Realistic Projections
-| Milestone | Solved | Score | How |
-|-----------|--------|-------|-----|
-| **Current** | **49** | **~604** | — |
-| + Phase 1 (score opt) | 49 | ~750-800 | Opset 17 conversions + ONNX optimizer |
-| + 3c edge detect | 55-65 | ~900-1000 | Laplacian/Sobel conv |
-| + 3d composition | 60-75 | ~1000-1150 | Transform+recolor chains |
-| + 3a gravity | 70-90 | ~1150-1400 | 4-dir unrolled Conv+Where |
-| + 3b flood fill | 80-110 | ~1300-1700 | Unrolled BFS |
-| + 3e-g (mode, LUT, cumsum) | 90-130 | ~1500-2000 | Various analytical |
-| **Stretch: all Phase 3** | **130-200** | **~1800-2800** | Everything above working |
-**3000+ requires ~200+ solved tasks.** Achievable only if most Phase 3 solvers work AND we find additional task families to target. Honest range: **1500-2500 LB.**
 ---
 ## Research Queue
-1. ✅ Nakkiran 2019 — double descent (inapplicable)
-2. ✅ Segert 2023 — PCA > Ridge (0/400 PCR solves)
-3. ✅ CompressARC 2024 — MDL principle, CumMax/ReduceSum architecture
-4. ✅ TRM 2025 — recursive reasoning, 45% ARC-AGI-1
-5. ✅ NCA 2025 — cellular automata, fails at global coordination
-6. ✅ ARC Prize 2025 Tech Report — competition landscape
-7. [ ] **Task taxonomy:** Classify all 351 unsolved tasks by family → prioritize solvers
-8. [ ] **Top Kaggle non-blending notebooks** — implementation details
-> **Next action:** Classify the 351 unsolved tasks to validate the Phase 3 task count estimates before building anything.

 # NeuroGolf Solver — Roadmap
+> Current: v5.2 · 51 Kaggle validated · LB 594.84 · Target: 3000+
 > Philosophy: **Research → Design → Experiment → Analyze → Research** loop until confirmed score increase.
 > Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**
+> Updated: 2026-04-27 — LB 594.84 confirmed. Phase 3 redesigned from expert review + literature.
+> **All 400 tasks count. There are NO excluded tasks. Unsolved = 1.0 pt (Kaggle adds automatically).**
 ---
+## Current Solver Breakdown (51/400 solved, LB 594.84)
+| Category | Tasks | Solvers |
+|----------|-------|---------|
+| Conv (lstsq) | 25 | conv_fixed, conv_var, conv_diff, conv_var_diff |
+| Analytical | 24 | identity, constant, color_map, transpose, flip, rotate, shift, tile, upscale, mirror, concat, spatial_gather, etc. |
+| Gravity | 1 | gravity_unrolled (Task 78) |
+| Mode fill | 1 | mode_fill (Task 129) |
+| **Unsolved** | **349** | — |
 ---
+## Phase 1: Score Optimization on Existing Tasks
+### 1a: Opset 17 Slice-Based Analytical Solvers ⬜
+> Convert Gather-based solvers to Slice(step=-1) + Transpose for ~0 MACs.
 ### 1b: ONNX Optimizer Pass ⬜
+> `onnxoptimizer.optimize()` for dead-code elimination.
 ---
 ## Phase 2: Regularization — EXHAUSTED
+> Exps 0-3 tested. Architecture mismatch, not overfitting. Conv ceiling = ~25 tasks.
 ---
+## Phase 3: New Solver Types
+> Organized by architecture type. Each solver is a separate .py file.
+> **Build rule:** Scan for matches FIRST, build only what has hits, validate on arc-gen.
+---
+### Category A: Static Spatial Remapping (Gather/Slice/Pad)
+These are cheap, zero/low-MAC solvers that use precomputed index mappings. Highest score per task. Build these first.
+| # | Solver | Pattern | Key Ops | Status |
+|---|--------|---------|---------|--------|
+| A1 | `extract_inner` | Remove N-pixel border frame → smaller output | Gather | ⬜ |
+| A2 | `add_border` | Add constant-color border → larger output | Gather+const | ⬜ |
+| A3 | `pad_align` | Input pasted into larger canvas at fixed offset | Gather+const | ⬜ |
+| A4 | `downsample_stride` | `out[r,c] = inp[r*sH, c*sW]` | Gather | ⬜ |
+| A5 | `extract_and_tile` | Find smallest repeating unit, tile to fill output | Gather | ⬜ |
+| A6 | `sparse_fill` | Each non-zero pixel becomes NxN block | Gather | ⬜ |
+| A7 | `symmetry_complete` | Mirror sparse data to complete L-R or T-B symmetry | Gather | ⬜ |
+| A8 | `multi_stamp` | Union of shifted copies of input at fixed offsets | Gather+Add | ⬜ |
+| A9 | `affine_remap` | General integer coordinate remap: stride+offset, axis swap | Gather | ⬜ |
+| A10 | `crop_paste` | Crop from input, paste at different position in output | Gather+const | ⬜ |
 ---
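The Category A solvers all reduce to a fixed index lookup. A minimal NumPy sketch of A4 and A6 follows, assuming grids are plain 2-D integer arrays rather than the one-hot tensors the real ONNX graphs use; the function names mirror the table, but the bodies are illustrative and not taken from the repository.

```python
import numpy as np

def downsample_stride(grid, s_h, s_w):
    """A4: out[r, c] = inp[r * s_h, c * s_w], expressed as a gather."""
    rows = np.arange(0, grid.shape[0], s_h)   # precomputed row indices
    cols = np.arange(0, grid.shape[1], s_w)   # precomputed column indices
    return grid[np.ix_(rows, cols)]

def sparse_fill(grid, n):
    """A6: every pixel becomes an n x n block (gather with repeated indices)."""
    rows = np.repeat(np.arange(grid.shape[0]), n)
    cols = np.repeat(np.arange(grid.shape[1]), n)
    return grid[np.ix_(rows, cols)]

grid = np.array([[1, 0, 2, 0],
                 [0, 0, 0, 0],
                 [3, 0, 4, 0],
                 [0, 0, 0, 0]])
small = downsample_stride(grid, 2, 2)   # keeps rows 0, 2 and columns 0, 2
big = sparse_fill(small, 2)             # each kept pixel expands to 2 x 2
```

Because the index arrays depend only on the grid shape and the fitted stride or block size, they can be baked into the model as constants, which is why this family is expected to cost close to zero MACs.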
+### Category B: Channel/Color Operations
+Color-level transforms that work in the 10-channel one-hot space.
+| # | Solver | Pattern | Key Ops | Status |
+|---|--------|---------|---------|--------|
+| B1 | `channel_filter` | Keep only certain colors, rest → background | Mul(mask [1,10,1,1]) | ⬜ |
+| B2 | `overlay_constant` | Input + fixed pixel pattern overlaid | Add or Where + constant tensor | ⬜ |
+| B3 | `fill_bg_with_mode` | Background pixels filled with dominant color, non-bg unchanged | ReduceSum→ArgMax→Where | ⬜ |
+| B4 | `row_mode_fill` | Each row filled with its dominant color | ReduceSum(width)→ArgMax→Tile(width) | ⬜ |
+| B5 | `col_mode_fill` | Each column filled with its dominant color | ReduceSum(height)→ArgMax→Tile(height) | ⬜ |
 ---
+### Category C: Composition / Chaining
+Chain two existing solvers. If transform(input) → intermediate, and color_map(intermediate) → output, emit one combined graph.
+| # | Solver | Pattern | Key Ops | Status |
+|---|--------|---------|---------|--------|
+| C1 | `transform_then_recolor` | rotate/flip/transpose + color_map | Chain existing | ⬜ |
+| C2 | `crop_then_transform` | fixed_crop + rotate/flip | Chain existing | ⬜ |
+| C3 | `recolor_then_tile` | color_map + tile/upscale | Chain existing | ⬜ |
 ---
+### Category D: Unrolled Propagation (Conv+Where loops)
+Dynamic solvers that need N unrolled steps. Higher MAC cost (~8-12 score).
+| # | Solver | Pattern | Key Ops | Status |
+|---|--------|---------|---------|--------|
+| D1 | `gravity_unrolled` | Directional compaction, 4 dirs × 10 bg colors | Conv+Where ×N steps | ✅ Task 78 |
+| D2 | `flood_fill` | BFS: seed spreads through passable cells | Conv+Clip+Mul ×N steps | ⬜ |
+| D3 | `edge_detect` | Laplacian/Sobel boundary detection | Conv(3×3)+Abs+Greater | ✅ built, 0 matches |
 ---
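The "~8-12 score" figure for unrolled solvers follows from the scoring rule quoted in the previous revision of this roadmap, `Score = max(1, 25 - ln(MACs + mem + params))`. A short sketch of that arithmetic is below; treating `mem` as a byte count and leaving `params` at zero where the roadmap gives no figure are assumptions, and the official Kaggle scorer may count costs differently.

```python
import math

def score(macs, mem_bytes, params):
    """Score rule as quoted in the earlier roadmap revision; floor of 1 point."""
    return max(1.0, 25.0 - math.log(macs + mem_bytes + params))

# Cost figures taken from the earlier revision's solver notes.
gravity = score(240_000, 4_800, 0)     # 30 unrolled Conv steps, ~4.8KB
edge = score(16_000, 1_000, 0)         # single Laplacian Conv, ~1KB
mode_fill = score(10_000, 543, 13)     # ReduceSum -> ArgMax -> Expand
```

The logarithm means each tenfold cost increase removes about 2.3 points, so a 30-step unrolled loop lands roughly three points below a single-pass aggregation solver.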
+### Category E: Global Aggregation
+Solvers that compute a global statistic and broadcast it.
+| # | Solver | Pattern | Key Ops | Status |
+|---|--------|---------|---------|--------|
+| E1 | `mode_fill` | Output = solid fill of most common input color | ReduceSum→ArgMax→Expand | ✅ Task 129 |
+| E2 | `cumsum_fill` | Running sums for object extent, directional filling | CumSum | ⬜ |
+| E3 | `bbox_crop_pad` | Find bounding box via ReduceSum+ArgMax, crop+pad | ReduceSum→ArgMax→Slice→Pad | ⬜ |
 ---
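The E1 op chain (ReduceSum→ArgMax→Expand) can be mimicked in NumPy on a one-hot tensor. This is a sketch under the assumption that inputs are shaped `[1, 10, H, W]`; the helper `one_hot` is hypothetical and the actual ONNX graph in the repository may be wired differently.

```python
import numpy as np

def one_hot(grid, n_colors=10):
    """Encode a 2-D integer grid as a [1, n_colors, H, W] float tensor."""
    colors = np.arange(n_colors)[None, :, None, None]
    return (colors == grid[None, None]).astype(np.float32)

def mode_fill(x):
    """Fill the whole grid with the most common input color."""
    hist = x.sum(axis=(2, 3))                # ReduceSum over H, W -> [1, 10]
    mode = hist.argmax(axis=1)               # ArgMax -> dominant color index
    out = np.zeros_like(x)
    out[np.arange(x.shape[0]), mode] = 1.0   # Expand: set that channel everywhere
    return out

grid = np.array([[3, 3, 0],
                 [3, 1, 0]])
filled = mode_fill(one_hot(grid)).argmax(axis=1)[0]   # decode back to color ids
```

Ties are resolved toward the lowest color index here because `argmax` returns the first maximum; whether that matches any given task is something only validation can show.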
+### Build Order (highest expected ROI first)
+**Wave 1 — Static remapping (Category A):** Cheapest to build, highest score per task, most likely to have matches. ~1 day.
+1. A1 `extract_inner` + A2 `add_border` (border ops)
+2. A5 `extract_and_tile` + A6 `sparse_fill` (pattern ops)
+3. A3 `pad_align` + A4 `downsample_stride` (placement ops)
+4. A7 `symmetry_complete` (symmetry)
+**Wave 2 — Color/channel ops (Category B):** Builds on mode_fill. ~0.5 day.
+5. B1 `channel_filter` + B3 `fill_bg_with_mode`
+6. B4 `row_mode_fill` + B5 `col_mode_fill`
+**Wave 3 — Composition (Category C):** Chains existing solvers, no new ONNX ops. ~0.5 day.
+7. C1 `transform_then_recolor`
+**Wave 4 — Propagation (Category D):** More complex, lower score. ~1 day.
+8. D2 `flood_fill`
+**Wave 5 — Global aggregation (Category E):** Needs careful design. ~1 day.
+9. E2 `cumsum_fill` + E3 `bbox_crop_pad`
+---
+### Honest Projections
+I will NOT repeat the Phase 2 mistake of projecting fantasy numbers. Here's what I know:
+- **51 tasks solved today.** LB 594.84.
+- **Each Wave:** Might add 2-10 tasks. Might add 0. We don't know until we scan and test.
+- **The only reliable estimate:** Gravity added 1 task. Mode fill added 1 task. Edge detect added 0. Hit rate so far: 2 new tasks from 3 solvers built, i.e. just under 1 new task per solver.
+- **If hit rate holds:** 20 new solvers × ~1 task each = ~20 new tasks → ~70 solved → LB ~800-900.
+- **If some solvers hit 5+ tasks:** Could reach 100-120 solved → LB ~1200-1500.
+- **3000+ requires a fundamentally different approach** (test-time training, learned architectures) that we're not doing.
+| Scenario | Solved | Est LB | Confidence |
+|----------|--------|--------|------------|
+| Wave 1 only | 55-65 | 650-800 | 60% |
+| Wave 1+2 | 60-75 | 750-950 | 50% |
+| Wave 1+2+3 | 65-85 | 850-1100 | 40% |
+| All waves | 70-120 | 900-1500 | 30% |
 ---
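The projections above can be reproduced with one line of arithmetic: every unsolved task already contributes 1.0 pt, so a newly solved task adds its score minus 1. The sketch below takes the header's figures (LB 594.84, 51 solved, 349 unsolved) at face value; the 12-point average for new tasks is an assumed input, not a measured one.

```python
def project_lb(current_lb, new_tasks, avg_score, unsolved_score=1.0):
    """Each newly solved task replaces its 1.0 pt placeholder with avg_score."""
    return current_lb + new_tasks * (avg_score - unsolved_score)

def implied_avg_solved(current_lb, solved, unsolved, unsolved_score=1.0):
    """Average leaderboard score per solved task implied by the totals."""
    return (current_lb - unsolved * unsolved_score) / solved

at_12_pts = project_lb(594.84, new_tasks=20, avg_score=12.0)
current_avg = implied_avg_solved(594.84, solved=51, unsolved=349)
at_current_avg = project_lb(594.84, new_tasks=20, avg_score=current_avg)
```

With these inputs, 20 new tasks at 12 points each land near 815, inside the ~800-900 range quoted above. If new tasks instead score like the current solved average implied by the totals (about 4.8), the same 20 tasks would land near 671, so the assumed per-task score matters as much as the hit rate.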
+## Phase 4: Score Optimization
+### 4a: Best-of-N Model Selection ⬜
+### 4b: Official Scoring Alignment (onnx_tool) ⬜
+---
+## BLENDING — EXPLICITLY EXCLUDED
 ---
+## Experiment Log
+| Date | Experiment | Result | Decision |
+|------|-----------|--------|----------|
+| 2026-04-24 | v4.2 baseline | 50 arc-gen, LB ~501 | Baseline |
+| 2026-04-26 | v5.0 refactor | 49 solved, ~604 score | New baseline |
+| 2026-04-26 | Exp 1-3 (regularization) | 0 improvement | **EXHAUSTED** |
+| 2026-04-26 | v5.2 gravity+mode | +2 tasks (78, 129) | ✅ Kept |
+| 2026-04-27 | **v5.2 Kaggle submission** | **51 solved, LB 594.84** | **Current best** |
 ---
 ## Research Queue
+1. ✅ CompressARC — CumMax/ReduceSum architecture
+2. ✅ TRM — recursive reasoning
+3. ✅ ARC Prize 2025 Tech Report
+4. ✅ Expert review #1 — Phase 3 solver list (pad_align, crop_paste, downsample, etc.)
+5. ✅ Expert review #2 — 6 concrete solvers with code (extract_inner, add_border, etc.)
+6. [ ] **Task taxonomy scan** — for each Wave 1 solver, count matching unsolved tasks before building