---
name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 17, IR 8, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---
# NeuroGolf Solver
## Development Methodology: The Closed-Loop
```
Research → Design → Experiment → Analyze → Research → ...
```
**Rule: Loop until we have a CONFIRMED increase in the arc-gen-validated score.**
| Phase | What | Exit Criteria |
|-------|------|---------------|
| **Research** | Read papers, understand theory, find what works in similar regimes | Have a testable hypothesis with cited evidence |
| **Design** | Write MINIMAL code to test the hypothesis | Code is <200 lines, focused on ONE feature |
| **Experiment** | Run on representative task sample (≥20 tasks, or all 400 if cheap) | Full arc-gen validation completed |
| **Analyze** | Compare with/without feature. Measure: tasks solved, arc-gen survival, total score | Data shows >10% improvement in arc-gen survival rate OR total score |
| **Research** | If failed: why? Read more papers. If succeeded: can we combine with other wins? | Next hypothesis ready |
**Critical rules:**
- NEVER write >200 lines without running them first
- NEVER claim a feature "works" until it is arc-gen-validated on ≥20 tasks
- NEVER upload code to repo that hasn't been validated
- Theory from papers is NOT proof for our data — always test
- If a feature shows no improvement after testing, DELETE it — don't leave dead code
- Make surgical edits to individual files — NEVER rewrite the entire codebase in one shot
## Quick Reference
- **Repo**: `rogermt/neurogolf-solver`
- **Current version**: v5.2 — 52 solved, ~710 score, est LB ~1058
- **Previous best on Kaggle**: v4.3 — 50 arc-gen-validated tasks, est LB ~670
- **Kaggle runtime**: 12 hours for submission
- **Target**: 3000+ LB (our own solver, no blending)
- **Detailed history, mistakes, analysis**: see `LEARNING.md`
- **Roadmap & experiment queue**: see `TODO.md`
## 1. Competition Rules
| Item | Value |
|------|-------|
| Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
| Opset | 17 (IR 8). Opset 10 also accepted on Kaggle |
| **Max .onnx file size** | **1.44 MB per ONNX file** (not submission zip) |
| Static shapes | **All tensors and parameters must have statically-defined shapes** |
| Banned ops | **Loop, Scan, NonZero, Unique, Script, Function** |
| Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
| Tasks | **All 400 count. There are NO excluded tasks. Unsolved = 1.0 pt.** |
| Validation | Models checked against **train + test + arc-gen** (ALL splits) |
| Submission | `submission.zip` with `task001.onnx` … `task400.onnx` + optional `submission.csv` |
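The scoring row above can be sketched in a few lines (a minimal sketch; `task_score` is an illustrative name, not part of the official `neurogolf_utils.py`):

```python
import math

def task_score(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score: lower total cost -> higher score, floored at 1.0."""
    return max(1.0, 25.0 - math.log(macs + memory_bytes + params))
```

Because the cost terms sit inside a log, shaving an op matters far more on tiny models than on large ones.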
## 2. ARC-GEN Data — THE Critical Factor
**A model that passes train+test but fails arc-gen scores ZERO on Kaggle.**
- Kaggle tasks at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contain `{"train":[], "test":[], "arc-gen":[]}`
- Up to 262 arc-gen examples per task (100K total)
- Locally: ARC-GEN in `ARC-GEN-100K/{hex_id}.json` as list of `{input, output}` — merge into task data
- Conv fitting: include arc-gen examples **only when grid sizes match** train/test (otherwise lstsq fails)
- Validation: always check against `arc-gen[:30]` minimum
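The local merge step above can be sketched as follows (a sketch only: the `ARC-GEN-100K/{hex_id}.json` layout is as described above, and the function name is illustrative):

```python
import json
from pathlib import Path

def load_task_with_arcgen(task_path: str, arcgen_dir: str, hex_id: str) -> dict:
    """Load a task JSON and merge in local ARC-GEN examples.

    ARC-GEN files are assumed to be a JSON list of {"input", "output"}
    pairs; the merged result mirrors the Kaggle task layout
    {"train": [], "test": [], "arc-gen": []}.
    """
    task = json.loads(Path(task_path).read_text())
    gen_file = Path(arcgen_dir) / f"{hex_id}.json"
    if gen_file.exists():
        task["arc-gen"] = json.loads(gen_file.read_text())
    else:
        task.setdefault("arc-gen", [])
    return task
```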
## 3. Architecture
### Package Structure (v5.2)
```
neurogolf_solver/
├── constants.py # Grid dims, opset, limits (NO excluded tasks)
├── config.py # Runtime providers, opset factory
├── data_loader.py # Task loading, one-hot, example extraction
├── validators.py # Model validation against all splits
├── profiler.py # Static cost profiler (onnx_tool fallback)
├── onnx_helpers.py # Opset 17 builders: Slice, Pad, ReduceSum, mk()
├── gather_helpers.py # Gather-based spatial remapping models
├── submission.py # run_tasks (W&B logging), zip/csv generation
├── main.py # Entry point with argparse
└── solvers/
├── analytical.py # identity, constant, color_map, transpose
├── geometric.py # flip, rotate, shift, crop, gravity (detect only)
├── tiling.py # tile, upscale, mirror, concat, spatial_gather
├── conv.py # lstsq conv (fixed, variable, diffshape, var_diff) + PCR fallback
├── gravity.py # Unrolled bubble-sort gravity (Conv+Where, 4 dirs) — Task 78
├── edge.py # Laplacian edge detection (0 matches currently)
├── mode.py # Mode fill (ReduceSum→ArgMax→Expand) — Task 129
└── solver_registry.py # ANALYTICAL_SOLVERS list + solve_task()
```
Run with: `python -m neurogolf_solver.main [args]`
### Solver Pipeline
```
1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
identity → constant → color_map → transpose → flip → rotate →
shift → tile → upscale → kronecker → nonuniform_scale →
mirror_h → mirror_v → quad_mirror → concat → concat_enhanced →
diagonal_tile → fixed_crop → spatial_gather → varshape_spatial_gather →
gravity_unrolled → edge_detect → mode_fill
2. Conv solvers (lstsq fitted, validated against arc-gen, PCR fallback):
conv_fixed — Slice→Conv→ArgMax→Equal+Cast→Pad
conv_variable — Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
conv_diffshape — Slice→Conv→Slice(crop)→ArgMax→Equal+Cast→Pad
conv_var_diff — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
```
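The two-stage pipeline amounts to a first-validated-wins loop: cheap analytical solvers run first, conv solvers only if none match. A minimal sketch, assuming each solver returns an ONNX model or `None` and `validate` checks all splits (the real dispatch lives in `solvers/solver_registry.py`; names here are illustrative):

```python
def solve_task(task, analytical_solvers, conv_solvers, validate):
    """Try analytical solvers first, then conv solvers.

    A candidate is accepted only if `validate` confirms it on ALL splits
    (train + test + arc-gen); otherwise fall through to the next solver.
    """
    for solver in analytical_solvers + conv_solvers:
        model = solver(task)              # ONNX model or None
        if model is not None and validate(model, task):
            return model                  # first validated match wins
    return None                           # unsolved -> scores the 1.0 floor
```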
### ONNX Building Rules (opset 17)
- **All shapes must be static** — no dynamic dimensions
- **Max 1.44 MB per .onnx file** — checked by Kaggle validator
- **Slice(step=-1)** for flip/rotate — zero MACs, replaces Gather for these transforms
- **Gather** (opset 1) for spatial remapping — used by concat, spatial_gather, mirrors, etc.
- **NEVER** use GatherElements (opset 11)
- **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
- **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
- **Conv 1×1** for non-permutation color maps (has MACs but correct)
- **ReduceSum** with axes as **tensor input** (opset 13+ requirement)
- **Pad** with tensor-based `pads` input (opset 11+ requirement)
- **lstsq calls** must be wrapped in `try/except (LinAlgError, ValueError)` — SVD can fail to converge
- **ArgMax + Equal+Cast** before Pad to ensure clean one-hot in padded region (gravity solver lesson)
### Conv Fitting
**Conv ceiling: ~25 tasks.** Regularization (Ridge, PCA/SVD, skip-ks) all tested and rejected.
Root cause: architecture mismatch — most unsolved tasks need non-local ops, not local conv patches.
**Current fitting strategy (v5.1+):**
- Composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
- PCR fallback via `_solve_weights_pcr` (deferred 2nd pass, 0 new solves but no regressions)
- Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
- Try no-bias first, then bias
- lstsq wrapped in try/except for SVD non-convergence
- **Validate against arc-gen BEFORE accepting** — reject if fails
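The lstsq-with-guard rule above can be sketched like this (illustrative name; the real code lives in the `_solve_weights` family):

```python
import numpy as np

def fit_conv_weights(patches: np.ndarray, targets: np.ndarray):
    """Least-squares conv fit, guarded per the rule above.

    A non-converging SVD (or a shape mismatch) rejects this kernel-size /
    bias variant instead of crashing the whole run.
    """
    try:
        w, *_ = np.linalg.lstsq(patches, targets, rcond=None)
        return w
    except (np.linalg.LinAlgError, ValueError):
        return None  # reject candidate, caller tries the next variant
```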
### New Solver Architectures (v5.2)
**gravity.py** — Unrolled bubble-sort via Conv+Where
- 4 directions × 10 bg colors, max(IH,IW) steps
- Per step: 2× Conv(3×3 shift), 3× ReduceSum, 3× Greater, 2× And, 2× Where
- Final: ArgMax + Equal+Cast + Pad (clean one-hot)
- Cost: ~16M (10×10 grid), score ~8.4
- **Validated: Task 78 (direction=up, bg=0)**
**edge.py** — Laplacian conv boundary detection
- Conv 1×1 (channel collapse) → Conv 3×3 (Laplacian) → Abs → Greater → And → Where
- Cost: ~16K MACs, score ~15
- **0 matches currently** — edge definition may be too strict
**mode.py** — Global majority color fill
- Slice → ReduceSum(axes=[2,3]) → ArgMax → Equal+Cast → Expand → Pad
- Cost: ~2K, score ~19.5
- **Validated: Task 129**
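The semantics of the mode-fill graph can be checked against a NumPy reference (a sketch of the ReduceSum→ArgMax→Equal+Cast→Expand core only; the Slice crop and final Pad from the real graph are omitted):

```python
import numpy as np

def mode_fill_reference(x: np.ndarray) -> np.ndarray:
    """NumPy reference for mode fill: x is one-hot [1,10,30,30];
    the output fills the whole grid with the majority color."""
    counts = x.sum(axis=(2, 3), keepdims=True)      # ReduceSum(axes=[2,3])
    mode = counts.argmax(axis=1, keepdims=True)     # ArgMax over channels
    chans = np.arange(10).reshape(1, 10, 1, 1)
    onehot = (chans == mode).astype(np.float32)     # Equal + Cast
    return np.broadcast_to(onehot, x.shape).copy()  # Expand
```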
## 4. Performance
**The lstsq conv solver is the speed bottleneck.** Use `--conv_budget` to cap time per task (5s locally, 60s on Kaggle).
**Do NOT** try to GPU-accelerate lstsq. The bottleneck is algorithmic (O(n³) SVD), not device.
## 5. Score Accounting (v5.2)
| Category | Tasks | Avg Score | Notes |
|----------|-------|-----------|-------|
| Analytical | 24 | ~16 | identity, constant, color_map, transpose, flip, rotate, shift, tile, mirrors, etc. |
| Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
| Gravity | 1 | 8.4 | Task 78 |
| Mode fill | 1 | 19.5 | Task 129 |
| Timing artifact | 1 | 8.2 | Task 61 (conv_var, only on slow hardware) |
| **Unsolved** | **348** | **1.0** | Minimum score |
| **Total** | **52/400** | | **~710 from solved + 348 × 1.0 unsolved ≈ 1058 est LB** |
### Path to 3000+
1. ✅ ARC-GEN validation (v4)
2. ✅ New analytical solvers (v4)
3. ✅ Opset 17 Slice-based transforms (v5)
4. ✅ lstsq crash fix + modular package (v5)
5. ✅ PCR fallback in conv (v5.1 — 0 new solves but clean code)
6. ✅ Gravity solver (v5.2 — Task 78)
7. ✅ Mode fill solver (v5.2 — Task 129)
8. 🔲 **Phase 3 solvers**: flood fill, composition, color LUT, CumSum — see TODO.md
9. 🔲 **Phase 1a**: Opset 17 conversions for existing analytical tasks (score optimization)
10. 🔲 **Phase 4**: ONNX optimizer, best-of-N selection
**Blending is EXPLICITLY excluded** — user's competitive philosophy.
## 6. Submission Checklist
Before submitting to Kaggle:
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] **All 400 tasks attempted** (no exclusions)
- [ ] No GatherElements in any model
- [ ] No banned ops (Loop, Scan, NonZero, Unique, Script, Function)
- [ ] All tensor shapes are static
- [ ] **Each .onnx file < 1.44 MB**
- [ ] Local estimated score calculated and compared to expected LB
- [ ] **A/B test**: ran both old and new solver on same tasks, new solver scores higher
## 7. Files & Locations
| Location | Path | Notes |
|----------|------|-------|
| HF Repo | `rogermt/neurogolf-solver` | All code + data |
| **Solver package** | `neurogolf_solver/` | **v5.2 — 19 files, modular** |
| Legacy monolith | `neurogolf_solver.py` | v4, kept for reference — do not edit |
| Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
| ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
| Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
| Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | task JSONs with arc-gen |
| Roadmap | `TODO.md` | Experiment queue with status key |
| Learning | `LEARNING.md` | Knowledge accumulation — read before coding |
## 8. LEARNING.md Maintenance Rules
`LEARNING.md` is the knowledge accumulation file. Update it when:
- A bug is found and fixed — add to Mistakes Log with root cause
- A new approach is tried — record what worked, what didn't, and why
- Competition analysis reveals new insights — add to Competitive Intelligence
- Version milestones — update the Version History table
- Performance measurements — add concrete numbers
Structure: chronological within sections, newest entries first. Always include dates and version numbers.