---
name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 17, IR 8, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---

# NeuroGolf Solver

## Development Methodology: The Closed-Loop

```
Research → Design → Experiment → Analyze → Research → ...
```

**Rule: Loop until we have a CONFIRMED increase in arc-gen validated score.**

| Phase | What | Exit Criteria |
|-------|------|---------------|
| **Research** | Read papers, understand theory, find what works in similar regimes | Have a testable hypothesis with cited evidence |
| **Design** | Write MINIMAL code to test the hypothesis | Code is <200 lines, focused on ONE feature |
| **Experiment** | Run on representative task sample (≥20 tasks, or all 400 if cheap) | Full arc-gen validation completed |
| **Analyze** | Compare with/without feature. Measure: tasks solved, arc-gen survival, total score | Data shows >10% improvement in arc-gen survival rate OR total score |
| **Research** | If failed: why? Read more papers. If succeeded: can we combine with other wins? | Next hypothesis ready |

**Critical rules:**

- NEVER write >200 lines without running them first
- NEVER claim a feature "works" until arc-gen validated on ≥20 tasks
- NEVER upload code to the repo that hasn't been validated
- Theory from papers is NOT proof for our data — always test
- If a feature shows no improvement after testing, DELETE it — don't leave dead code
- Make surgical edits to individual files — NEVER rewrite the entire codebase in one shot

## Quick Reference

- **Repo**: `rogermt/neurogolf-solver`
- **Current version**: v5.2 — 52 solved, ~710 score, est LB ~1058
- **Previous best on Kaggle**: v4.3 — 50 arc-gen-validated tasks, est LB ~670
- **Kaggle runtime**: 12 hours for submission
- **Target**: 3000+ LB (our own solver, no blending)
- **Detailed history, mistakes, analysis**: see `LEARNING.md`
- **Roadmap & experiment queue**: see `TODO.md`

## 1. Competition Rules

| Item | Value |
|------|-------|
| Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
| Opset | 17 (IR 8). Opset 10 also accepted on Kaggle |
| **Max .onnx file size** | **1.44 MB per ONNX file** (not the submission zip) |
| Static shapes | **All tensors and parameters must have statically defined shapes** |
| Banned ops | **Loop, Scan, NonZero, Unique, Script, Function** |
| Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
| Tasks | **All 400 count. There are NO excluded tasks. Unsolved = 1.0 pt.** |
| Validation | Models checked against **train + test + arc-gen** (ALL splits) |
| Submission | `submission.zip` with `task001.onnx`–`task400.onnx` + optional `submission.csv` |

## 2. ARC-GEN Data — THE Critical Factor

**A model that passes train+test but fails arc-gen scores ZERO on Kaggle.**

- Kaggle tasks at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contain `{"train":[], "test":[], "arc-gen":[]}`
- Up to 262 arc-gen examples per task (100K total)
- Locally: ARC-GEN in `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` — merge into the task data
- Conv fitting: include arc-gen examples **only when grid sizes match** train/test (otherwise lstsq fails)
- Validation: always check against `arc-gen[:30]` minimum
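
A minimal sketch of that local arc-gen check. The helper names `to_one_hot` and `passes_arc_gen` are illustrative, not names from the package (the real logic lives in `data_loader.py` and `validators.py`), and the equality test assumes padded cells are encoded as color 0 on both sides:

```python
import numpy as np
import onnxruntime as ort

def to_one_hot(grid):
    """Pad a grid to 30x30 (background 0) and one-hot encode as float32 [1,10,30,30]."""
    g = np.zeros((30, 30), dtype=np.int64)
    arr = np.array(grid, dtype=np.int64)
    g[:arr.shape[0], :arr.shape[1]] = arr
    onehot = (np.arange(10)[:, None, None] == g[None, :, :]).astype(np.float32)
    return onehot[None]  # [1,10,30,30]

def passes_arc_gen(model_path, task, limit=30):
    """Return True if the model reproduces the first `limit` arc-gen pairs."""
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    for ex in task.get("arc-gen", [])[:limit]:
        x = to_one_hot(ex["input"])
        y = sess.run(["output"], {"input": x})[0]
        pred = y.argmax(axis=1)[0]                       # [30,30] color indices
        want = to_one_hot(ex["output"]).argmax(axis=1)[0]
        if not np.array_equal(pred, want):
            return False
    return True
```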
## 3. Architecture

### Package Structure (v5.2)

```
neurogolf_solver/
├── constants.py        # Grid dims, opset, limits (NO excluded tasks)
├── config.py           # Runtime providers, opset factory
├── data_loader.py      # Task loading, one-hot, example extraction
├── validators.py       # Model validation against all splits
├── profiler.py         # Static cost profiler (onnx_tool fallback)
├── onnx_helpers.py     # Opset 17 builders: Slice, Pad, ReduceSum, mk()
├── gather_helpers.py   # Gather-based spatial remapping models
├── submission.py       # run_tasks (W&B logging), zip/csv generation
├── main.py             # Entry point with argparse
└── solvers/
    ├── analytical.py        # identity, constant, color_map, transpose
    ├── geometric.py         # flip, rotate, shift, crop, gravity (detect only)
    ├── tiling.py            # tile, upscale, mirror, concat, spatial_gather
    ├── conv.py              # lstsq conv (fixed, variable, diffshape, var_diff) + PCR fallback
    ├── gravity.py           # Unrolled bubble-sort gravity (Conv+Where, 4 dirs) — Task 78
    ├── edge.py              # Laplacian edge detection (0 matches currently)
    ├── mode.py              # Mode fill (ReduceSum→ArgMax→Expand) — Task 129
    └── solver_registry.py   # ANALYTICAL_SOLVERS list + solve_task()
```

Run with: `python -m neurogolf_solver.main [args]`

### Solver Pipeline

```
1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
   identity → constant → color_map → transpose → flip → rotate → shift →
   tile → upscale → kronecker → nonuniform_scale → mirror_h → mirror_v →
   quad_mirror → concat → concat_enhanced → diagonal_tile → fixed_crop →
   spatial_gather → varshape_spatial_gather → gravity_unrolled →
   edge_detect → mode_fill

2. Conv solvers (lstsq fitted, validated against arc-gen, PCR fallback):
   conv_fixed     — Slice→Conv→ArgMax→Equal+Cast→Pad
   conv_variable  — Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
   conv_diffshape — Slice→Conv→Slice(crop)→ArgMax→Equal+Cast→Pad
   conv_var_diff  — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
```

### ONNX Building Rules (opset 17)

- **All shapes must be static** — no dynamic dimensions
- **Max 1.44 MB per .onnx file** — checked by the Kaggle validator
- **Slice(step=-1)** for flip/rotate — zero MACs, replaces Gather for these transforms
- **Gather** (opset 1) for spatial remapping — used by concat, spatial_gather, mirrors, etc.
- **NEVER** use GatherElements (opset 11)
- **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
- **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
- **Conv 1×1** for non-permutation color maps (has MACs but correct)
- **ReduceSum** with axes as a **tensor input** (opset 13+ requirement)
- **Pad** with a tensor-based `pads` input (opset 11+ requirement)
- **lstsq calls** must be wrapped in `try/except (LinAlgError, ValueError)` — SVD can fail to converge
- **ArgMax + Equal+Cast** before Pad to ensure a clean one-hot in the padded region (gravity solver lesson)

### Conv Fitting

**Conv ceiling: ~25 tasks.** Regularization (Ridge, PCA/SVD, skip-ks): all tested and rejected. Root cause: architecture mismatch — most unsolved tasks need non-local ops, not local conv patches.
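
For orientation, a stripped-down sketch of the lstsq kernel fit the conv solvers are built on. This standalone function is an illustrative reconstruction, not the package implementation (the real composable primitives are listed in the strategy notes below), and it assumes same-shaped one-hot inputs and outputs:

```python
import numpy as np

def fit_conv_kernel(inputs, outputs, k):
    """Fit a kxk, 10->10 channel conv by least squares over all example pixels.

    inputs/outputs: lists of one-hot arrays shaped [10, H, W] (same H, W per pair).
    Returns weights shaped [out_ch=10, in_ch=10, k, k], or None if lstsq fails.
    """
    pad = k // 2
    rows, targets = [], []
    for x, y in zip(inputs, outputs):
        _, h, w = x.shape
        xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
        for i in range(h):
            for j in range(w):
                rows.append(xp[:, i:i + k, j:j + k].reshape(-1))  # 10*k*k features
                targets.append(y[:, i, j])                        # 10 targets
    A = np.asarray(rows, dtype=np.float64)
    B = np.asarray(targets, dtype=np.float64)
    try:
        W, *_ = np.linalg.lstsq(A, B, rcond=None)                 # [10*k*k, 10]
    except (np.linalg.LinAlgError, ValueError):
        return None                                               # SVD failed to converge
    return W.T.reshape(10, 10, k, k)                              # ONNX Conv weight layout
```

The fitted weights drop straight into a Conv node's weight initializer; acceptance still hinges on the arc-gen check noted in the strategy below.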
**Current fitting strategy (v5.1+):**

- Composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
- PCR fallback via `_solve_weights_pcr` (deferred 2nd pass, 0 new solves but no regressions)
- Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
- Try no-bias first, then bias
- lstsq wrapped in try/except for SVD non-convergence
- **Validate against arc-gen BEFORE accepting** — reject if it fails

### New Solver Architectures (v5.2)

**gravity.py** — Unrolled bubble-sort via Conv+Where

- 4 directions × 10 bg colors, max(IH,IW) steps
- Per step: 2× Conv(3×3 shift), 3× ReduceSum, 3× Greater, 2× And, 2× Where
- Final: ArgMax + Equal+Cast + Pad (clean one-hot)
- Cost: ~16M (10×10 grid), score ~8.4
- **Validated: Task 78 (direction=up, bg=0)**

**edge.py** — Laplacian conv boundary detection

- Conv 1×1 (channel collapse) → Conv 3×3 (Laplacian) → Abs → Greater → And → Where
- Cost: ~16K MACs, score ~15
- **0 matches currently** — edge definition may be too strict

**mode.py** — Global majority color fill

- Slice → ReduceSum(axes=[2,3]) → ArgMax → Equal+Cast → Expand → Pad
- Cost: ~2K, score ~19.5
- **Validated: Task 129**

## 4. Performance

**The lstsq conv solver is the speed bottleneck.** Use `--conv_budget` to cap time per task (5s locally, 60s on Kaggle).

**Do NOT** try to GPU-accelerate lstsq. The bottleneck is algorithmic (O(n³) SVD), not the device.

## 5. Score Accounting (v5.2)

| Category | Tasks | Avg Score | Notes |
|----------|-------|-----------|-------|
| Analytical | 24 | ~16 | identity, constant, color_map, transpose, flip, rotate, shift, tile, mirrors, etc. |
| Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
| Gravity | 1 | 8.4 | Task 78 |
| Mode fill | 1 | 19.5 | Task 129 |
| Timing artifact | 1 | 8.2 | Task 61 (conv_var, only on slow hardware) |
| **Unsolved** | **348** | **1.0** | Minimum score |
| **Total** | **52/400** | | **~710 solved + 348 = ~1058 est LB** |

### Path to 3000+

1. ✅ ARC-GEN validation (v4)
2. ✅ New analytical solvers (v4)
3. ✅ Opset 17 Slice-based transforms (v5)
4. ✅ lstsq crash fix + modular package (v5)
5. ✅ PCR fallback in conv (v5.1 — 0 new solves but clean code)
6. ✅ Gravity solver (v5.2 — Task 78)
7. ✅ Mode fill solver (v5.2 — Task 129)
8. 🔲 **Phase 3 solvers**: flood fill, composition, color LUT, CumSum — see TODO.md
9. 🔲 **Phase 1a**: Opset 17 conversions for existing analytical tasks (score optimization)
10. 🔲 **Phase 4**: ONNX optimizer, best-of-N selection

**Blending is EXPLICITLY excluded** — user's competitive philosophy.

## 6. Submission Checklist

Before submitting to Kaggle:

- [ ] All models validated against train + test + arc-gen (locally)
- [ ] **All 400 tasks attempted** (no exclusions)
- [ ] No GatherElements in any model
- [ ] No banned ops (Loop, Scan, NonZero, Unique, Script, Function)
- [ ] All tensor shapes are static
- [ ] **Each .onnx file < 1.44 MB**
- [ ] Local estimated score calculated and compared to expected LB
- [ ] **A/B test**: ran both old and new solver on the same tasks, new solver scores higher
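
A minimal pre-flight sketch covering the file-level items in the checklist above (missing files, size limit, banned ops, GatherElements). It is an illustrative standalone script, assuming built models in a local `models/` directory; shape and split validation still go through `validators.py`:

```python
import os
import onnx

BANNED = {"Loop", "Scan", "NonZero", "Unique", "Script", "Function"}
MAX_BYTES = int(1.44 * 1024 * 1024)   # per-file limit; exact byte definition assumed here

def preflight(model_dir="models"):
    """Return a list of problems found across task001.onnx ... task400.onnx."""
    problems = []
    for i in range(1, 401):
        path = os.path.join(model_dir, f"task{i:03d}.onnx")
        if not os.path.exists(path):
            problems.append(f"task{i:03d}: missing")
            continue
        if os.path.getsize(path) > MAX_BYTES:
            problems.append(f"task{i:03d}: over 1.44 MB")
        ops = {node.op_type for node in onnx.load(path).graph.node}
        bad = ops & (BANNED | {"GatherElements"})
        if bad:
            problems.append(f"task{i:03d}: forbidden ops {sorted(bad)}")
    return problems
```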
## 7. Files & Locations

| Location | Path | Notes |
|----------|------|-------|
| HF Repo | `rogermt/neurogolf-solver` | All code + data |
| **Solver package** | `neurogolf_solver/` | **v5.2 — 19 files, modular** |
| Legacy monolith | `neurogolf_solver.py` | v4, kept for reference — do not edit |
| Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
| ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
| Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
| Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | task JSONs with arc-gen |
| Roadmap | `TODO.md` | Experiment queue with status key |
| Learning | `LEARNING.md` | Knowledge accumulation — read before coding |

## 8. LEARNING.md Maintenance Rules

`LEARNING.md` is the knowledge accumulation file. Update it when:

- A bug is found and fixed — add to the Mistakes Log with the root cause
- A new approach is tried — record what worked, what didn't, and why
- Competition analysis reveals new insights — add to Competitive Intelligence
- Version milestones — update the Version History table
- Performance measurements — add concrete numbers

Structure: chronological within sections, newest entries first. Always include dates and version numbers.