---
name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 17, IR 8, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---

# NeuroGolf Solver

## Development Methodology: The Closed Loop

```
Research → Design → Experiment → Analyze → Research → ...
```

**Rule: loop until there is a CONFIRMED increase in arc-gen-validated score.**

| Phase | What | Exit Criteria |
|-------|------|---------------|
| **Research** | Read papers, understand theory, find what works in similar regimes | Have a testable hypothesis with cited evidence |
| **Design** | Write MINIMAL code to test the hypothesis | Code is <200 lines, focused on ONE feature |
| **Experiment** | Run on a representative task sample (≥20 tasks, or all 400 if cheap) | Full arc-gen validation completed |
| **Analyze** | Compare with/without the feature. Measure tasks solved, arc-gen survival, and total score | Data shows a >10% improvement in arc-gen survival rate OR total score |
| **Research** | If the experiment failed: why? Read more papers. If it succeeded: can it combine with other wins? | Next hypothesis ready |

**Critical rules:**
- NEVER write >200 lines without running them first
- NEVER claim a feature "works" until it is arc-gen validated on ≥20 tasks
- NEVER upload unvalidated code to the repo
- Theory from papers is NOT proof for our data — always test
- If a feature shows no improvement after testing, DELETE it — don't leave dead code
- Make surgical edits to individual files — NEVER rewrite the entire codebase in one shot

## Quick Reference

- **Repo**: `rogermt/neurogolf-solver`
- **Current version**: v5.2 — 52 solved, ~710 score, est. LB ~1058
- **Previous best on Kaggle**: v4.3 — 50 arc-gen-validated tasks, est. LB ~670
- **Kaggle runtime**: 12 hours per submission
- **Target**: 3000+ LB (our own solver, no blending)
- **Detailed history, mistakes, analysis**: see `LEARNING.md`
- **Roadmap & experiment queue**: see `TODO.md`

## 1. Competition Rules

| Item | Value |
|------|-------|
| Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
| Opset | 17 (IR 8). Opset 10 also accepted on Kaggle |
| **Max .onnx file size** | **1.44 MB per ONNX file** (not the submission zip) |
| Static shapes | **All tensors and parameters must have statically defined shapes** |
| Banned ops | **Loop, Scan, NonZero, Unique, Script, Function** |
| Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
| Tasks | **All 400 count. There are NO excluded tasks. Unsolved = 1.0 pt.** |
| Validation | Models are checked against **train + test + arc-gen** (ALL splits) |
| Submission | `submission.zip` with `task001.onnx`–`task400.onnx` + optional `submission.csv` |

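The scoring rule penalizes every MAC, byte, and parameter equally inside the logarithm. A quick sanity-check sketch in Python (the official `neurogolf_utils.py` is what actually measures the cost terms):

```python
import math

def task_score(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score from the rules table: lower total cost gives a higher
    score, floored at 1.0 (the same score an unsolved task receives)."""
    cost = macs + memory_bytes + params
    return max(1.0, 25.0 - math.log(cost))

print(round(task_score(500, 1000, 500), 2))  # total cost 2000 -> 17.4
```

Because the cost terms are summed before the log, shaving a model that is already tiny matters far more than shaving a large one.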
## 2. ARC-GEN Data — THE Critical Factor

**A model that passes train+test but fails arc-gen scores ZERO on Kaggle.**

- Kaggle tasks at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contain `{"train":[], "test":[], "arc-gen":[]}`
- Up to 262 arc-gen examples per task (100K total)
- Locally, ARC-GEN lives in `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` pairs — merge it into the task data
- Conv fitting: include arc-gen examples **only when grid sizes match** train/test (otherwise lstsq fails)
- Validation: always check against at least `arc-gen[:30]`

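A minimal sketch of the local merge described above (the function name is illustrative; the real loader lives in `data_loader.py`):

```python
import json
from pathlib import Path

def load_task_with_arcgen(task_path: str, arcgen_path: str) -> dict:
    """Load a task JSON and merge the local ARC-GEN examples under "arc-gen",
    mirroring the Kaggle task layout {"train": [...], "test": [...], "arc-gen": [...]}."""
    task = json.loads(Path(task_path).read_text())
    examples = json.loads(Path(arcgen_path).read_text())  # list of {input, output}
    task.setdefault("arc-gen", []).extend(examples)
    return task
```

Validation then runs over at least `task["arc-gen"][:30]`.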
## 3. Architecture

### Package Structure (v5.2)
```
neurogolf_solver/
├── constants.py        # Grid dims, opset, limits (NO excluded tasks)
├── config.py           # Runtime providers, opset factory
├── data_loader.py      # Task loading, one-hot, example extraction
├── validators.py       # Model validation against all splits
├── profiler.py         # Static cost profiler (onnx_tool fallback)
├── onnx_helpers.py     # Opset 17 builders: Slice, Pad, ReduceSum, mk()
├── gather_helpers.py   # Gather-based spatial remapping models
├── submission.py       # run_tasks (W&B logging), zip/csv generation
├── main.py             # Entry point with argparse
└── solvers/
    ├── analytical.py       # identity, constant, color_map, transpose
    ├── geometric.py        # flip, rotate, shift, crop, gravity (detect only)
    ├── tiling.py           # tile, upscale, mirror, concat, spatial_gather
    ├── conv.py             # lstsq conv (fixed, variable, diffshape, var_diff) + PCR fallback
    ├── gravity.py          # Unrolled bubble-sort gravity (Conv+Where, 4 dirs) — Task 78
    ├── edge.py             # Laplacian edge detection (0 matches currently)
    ├── mode.py             # Mode fill (ReduceSum→ArgMax→Expand) — Task 129
    └── solver_registry.py  # ANALYTICAL_SOLVERS list + solve_task()
```

Run with: `python -m neurogolf_solver.main [args]`

### Solver Pipeline
```
1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
   identity → constant → color_map → transpose → flip → rotate →
   shift → tile → upscale → kronecker → nonuniform_scale →
   mirror_h → mirror_v → quad_mirror → concat → concat_enhanced →
   diagonal_tile → fixed_crop → spatial_gather → varshape_spatial_gather →
   gravity_unrolled → edge_detect → mode_fill

2. Conv solvers (lstsq fitted, validated against arc-gen, PCR fallback):
   conv_fixed     — Slice→Conv→ArgMax→Equal+Cast→Pad
   conv_variable  — Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
   conv_diffshape — Slice→Conv→Slice(crop)→ArgMax→Equal+Cast→Pad
   conv_var_diff  — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
```

### ONNX Building Rules (opset 17)
- **All shapes must be static** — no dynamic dimensions
- **Max 1.44 MB per .onnx file** — checked by the Kaggle validator
- **Slice(step=-1)** for flip/rotate — zero MACs; replaces Gather for these transforms
- **Gather** (opset 1) for spatial remapping — used by concat, spatial_gather, mirrors, etc.
- **NEVER** use GatherElements (opset 11)
- **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
- **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
- **Conv 1×1** for non-permutation color maps (has MACs but is correct)
- **ReduceSum** takes its axes as a **tensor input** (opset 13+ requirement)
- **Pad** takes a tensor-based `pads` input (opset 11+ requirement)
- **lstsq calls** must be wrapped in `try/except (LinAlgError, ValueError)` — SVD can fail to converge
- **ArgMax + Equal+Cast** before Pad to ensure a clean one-hot in the padded region (gravity solver lesson)

### Conv Fitting

**Conv ceiling: ~25 tasks.** Regularization (Ridge, PCA/SVD, skip-ks) was tested and rejected.
Root cause: architecture mismatch — most unsolved tasks need non-local ops, not local conv patches.

**Current fitting strategy (v5.1+):**
- Composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
- PCR fallback via `_solve_weights_pcr` (deferred second pass; 0 new solves but no regressions)
- Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
- Try no-bias first, then bias
- lstsq wrapped in try/except for SVD non-convergence
- **Validate against arc-gen BEFORE accepting** — reject on failure
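
Conceptually, the lstsq fit treats each k×k patch of the one-hot input as a feature vector and solves a single linear system for all 10 output channels at once. A standalone, unvectorized sketch (the real pipeline factors this into `_build_patch_matrix` / `_solve_weights` and is much faster):

```python
import numpy as np

def fit_conv_weights(X: np.ndarray, Y: np.ndarray, k: int):
    """Fit a k x k, 10-channel-to-10-channel conv by least squares.
    X, Y: (N, 10, H, W) one-hot inputs and targets with matching grid sizes.
    Returns weights shaped (out_ch, in_ch, k, k), or None if lstsq fails."""
    pad = k // 2
    N, C, H, W = X.shape
    Xp = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    # One row of C*k*k patch features per output pixel.
    rows = []
    for n in range(N):
        for i in range(H):
            for j in range(W):
                rows.append(Xp[n, :, i:i + k, j:j + k].ravel())
    A = np.asarray(rows)                         # (N*H*W, C*k*k)
    B = Y.transpose(0, 2, 3, 1).reshape(-1, C)   # (N*H*W, C)
    try:
        W_ls, *_ = np.linalg.lstsq(A, B, rcond=None)
    except (np.linalg.LinAlgError, ValueError):
        return None  # SVD can fail to converge; caller falls through to PCR
    return W_ls.T.reshape(C, C, k, k)
```

The fitted weights drop straight into a Conv initializer; the candidate model is then validated against train, test, and arc-gen before being accepted.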

### New Solver Architectures (v5.2)

**gravity.py** — unrolled bubble-sort via Conv+Where
- 4 directions × 10 bg colors, max(IH,IW) steps
- Per step: 2× Conv(3×3 shift), 3× ReduceSum, 3× Greater, 2× And, 2× Where
- Final: ArgMax + Equal+Cast + Pad (clean one-hot)
- Cost: ~16M (10×10 grid), score ~8.4
- **Validated: Task 78 (direction=up, bg=0)**

**edge.py** — Laplacian conv boundary detection
- Conv 1×1 (channel collapse) → Conv 3×3 (Laplacian) → Abs → Greater → And → Where
- Cost: ~16K MACs, score ~15
- **0 matches currently** — the edge definition may be too strict

**mode.py** — global majority-color fill
- Slice → ReduceSum(axes=[2,3]) → ArgMax → Equal+Cast → Expand → Pad
- Cost: ~2K, score ~19.5
- **Validated: Task 129**
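
A NumPy reference for the mode-fill graph may help when debugging it (illustrative only; the solver emits the ONNX ops listed above):

```python
import numpy as np

def mode_fill_reference(x: np.ndarray, h: int, w: int) -> np.ndarray:
    """NumPy equivalent of the mode-fill pipeline on a one-hot (1,10,30,30)
    tensor: Slice the h x w content region, count cells per color channel
    (ReduceSum over axes [2,3]), pick the majority channel (ArgMax), rebuild
    a one-hot vector (Equal+Cast), broadcast it over the region (Expand),
    and zero-pad back to 30x30 (Pad)."""
    region = x[:, :, :h, :w]                                   # Slice
    counts = region.sum(axis=(2, 3))                           # ReduceSum -> (1, 10)
    mode = counts.argmax(axis=1)                               # ArgMax -> (1,)
    onehot = (np.arange(10) == mode[:, None]).astype(x.dtype)  # Equal + Cast
    out = np.zeros_like(x)                                     # Pad background
    out[:, :, :h, :w] = onehot[:, :, None, None]               # Expand into the region
    return out
```

Every op in the chain has a fixed output shape, which is what keeps the model fully static and cheap (~2K cost).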

## 4. Performance

**The lstsq conv solver is the speed bottleneck.** Use `--conv_budget` to cap the time per task (5 s locally, 60 s on Kaggle).

**Do NOT** try to GPU-accelerate lstsq. The bottleneck is algorithmic (O(n³) SVD), not the device.

## 5. Score Accounting (v5.2)

| Category | Tasks | Avg Score | Notes |
|----------|-------|-----------|-------|
| Analytical | 24 | ~16 | identity, constant, color_map, transpose, flip, rotate, shift, tile, mirrors, etc. |
| Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
| Gravity | 1 | 8.4 | Task 78 |
| Mode fill | 1 | 19.5 | Task 129 |
| Timing artifact | 1 | 8.2 | Task 61 (conv_var, only on slow hardware) |
| **Unsolved** | **348** | **1.0** | Minimum score |
| **Total** | **52/400** | | **~710 (solved) + 348 (unsolved × 1.0) ≈ 1058 est. LB** |

### Path to 3000+
1. ✅ ARC-GEN validation (v4)
2. ✅ New analytical solvers (v4)
3. ✅ Opset 17 Slice-based transforms (v5)
4. ✅ lstsq crash fix + modular package (v5)
5. ✅ PCR fallback in conv (v5.1 — 0 new solves but clean code)
6. ✅ Gravity solver (v5.2 — Task 78)
7. ✅ Mode fill solver (v5.2 — Task 129)
8. 🔲 **Phase 3 solvers**: flood fill, composition, color LUT, CumSum — see TODO.md
9. 🔲 **Phase 1a**: Opset 17 conversions for existing analytical tasks (score optimization)
10. 🔲 **Phase 4**: ONNX optimizer, best-of-N selection

**Blending is EXPLICITLY excluded** — the user's competitive philosophy.

## 6. Submission Checklist

Before submitting to Kaggle:
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] **All 400 tasks attempted** (no exclusions)
- [ ] No GatherElements in any model
- [ ] No banned ops (Loop, Scan, NonZero, Unique, Script, Function)
- [ ] All tensor shapes are static
- [ ] **Each .onnx file < 1.44 MB**
- [ ] Local estimated score calculated and compared to the expected LB
- [ ] **A/B test**: ran both old and new solvers on the same tasks; the new solver scores higher

## 7. Files & Locations

| Location | Path | Notes |
|----------|------|-------|
| HF Repo | `rogermt/neurogolf-solver` | All code + data |
| **Solver package** | `neurogolf_solver/` | **v5.2 — 19 files, modular** |
| Legacy monolith | `neurogolf_solver.py` | v4, kept for reference — do not edit |
| Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
| ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
| Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
| Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | Task JSONs with arc-gen |
| Roadmap | `TODO.md` | Experiment queue with status key |
| Learning | `LEARNING.md` | Knowledge accumulation — read before coding |

## 8. LEARNING.md Maintenance Rules

`LEARNING.md` is the knowledge-accumulation file. Update it when:
- A bug is found and fixed — add it to the Mistakes Log with the root cause
- A new approach is tried — record what worked, what didn't, and why
- Competition analysis reveals new insights — add them to Competitive Intelligence
- A version milestone is reached — update the Version History table
- Performance is measured — add concrete numbers

Structure: reverse-chronological within sections (newest entries first). Always include dates and version numbers.