---
name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 17, IR 8, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---

# NeuroGolf Solver

## Development Methodology: The Closed Loop

```
Research → Design → Experiment → Analyze → Research → ...
```

**Rule: loop until there is a CONFIRMED increase in arc-gen-validated score.**

| Phase | What | Exit Criteria |
|-------|------|---------------|
| **Research** | Read papers, understand theory, find what works in similar regimes | Have a testable hypothesis with cited evidence |
| **Design** | Write MINIMAL code to test the hypothesis | Code is <200 lines, focused on ONE feature |
| **Experiment** | Run on a representative task sample (≥20 tasks, or all 400 if cheap) | Full arc-gen validation completed |
| **Analyze** | Compare with/without the feature. Measure tasks solved, arc-gen survival, and total score | Data shows a >10% improvement in arc-gen survival rate OR total score |
| **Research** | If the experiment failed: why? Read more papers. If it succeeded: can it combine with other wins? | Next hypothesis ready |

**Critical rules:**
- NEVER write >200 lines without running them first
- NEVER claim a feature "works" until it is arc-gen validated on ≥20 tasks
- NEVER upload unvalidated code to the repo
- Theory from papers is NOT proof for our data — always test
- If a feature shows no improvement after testing, DELETE it — don't leave dead code
- Make surgical edits to individual files — NEVER rewrite the entire codebase in one shot

## Quick Reference

- **Repo**: `rogermt/neurogolf-solver`
- **Current version**: v5.2 — 52 solved, ~710 score, est. LB ~1058
- **Previous best on Kaggle**: v4.3 — 50 arc-gen-validated tasks, est. LB ~670
- **Kaggle runtime**: 12 hours per submission
- **Target**: 3000+ LB (our own solver, no blending)
- **Detailed history, mistakes, analysis**: see `LEARNING.md`
- **Roadmap & experiment queue**: see `TODO.md`

## 1. Competition Rules

| Item | Value |
|------|-------|
| Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
| Opset | 17 (IR 8). Opset 10 also accepted on Kaggle |
| **Max .onnx file size** | **1.44 MB per ONNX file** (not the submission zip) |
| Static shapes | **All tensors and parameters must have statically defined shapes** |
| Banned ops | **Loop, Scan, NonZero, Unique, Script, Function** |
| Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
| Tasks | **All 400 count. There are NO excluded tasks. Unsolved = 1.0 pt.** |
| Validation | Models are checked against **train + test + arc-gen** (ALL splits) |
| Submission | `submission.zip` with `task001.onnx`–`task400.onnx` + optional `submission.csv` |

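The scoring rule penalizes every MAC, byte, and parameter equally inside the logarithm. A quick sanity-check sketch in Python (the official `neurogolf_utils.py` is what actually measures the cost terms):

```python
import math

def task_score(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score from the rules table: lower total cost gives a higher
    score, floored at 1.0 (the same score an unsolved task receives)."""
    cost = macs + memory_bytes + params
    return max(1.0, 25.0 - math.log(cost))

print(round(task_score(500, 1000, 500), 2))  # total cost 2000 -> 17.4
```

Because the cost terms are summed before the log, shaving a model that is already tiny matters far more than shaving a large one.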
## 2. ARC-GEN Data — THE Critical Factor

**A model that passes train+test but fails arc-gen scores ZERO on Kaggle.**

- Kaggle tasks at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contain `{"train":[], "test":[], "arc-gen":[]}`
- Up to 262 arc-gen examples per task (100K total)
- Locally, ARC-GEN lives in `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` pairs — merge it into the task data
- Conv fitting: include arc-gen examples **only when grid sizes match** train/test (otherwise lstsq fails)
- Validation: always check against at least `arc-gen[:30]`

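A minimal sketch of the local merge described above (the function name is illustrative; the real loader lives in `data_loader.py`):

```python
import json
from pathlib import Path

def load_task_with_arcgen(task_path: str, arcgen_path: str) -> dict:
    """Load a task JSON and merge the local ARC-GEN examples under "arc-gen",
    mirroring the Kaggle task layout {"train": [...], "test": [...], "arc-gen": [...]}."""
    task = json.loads(Path(task_path).read_text())
    examples = json.loads(Path(arcgen_path).read_text())  # list of {input, output}
    task.setdefault("arc-gen", []).extend(examples)
    return task
```

Validation then runs over at least `task["arc-gen"][:30]`.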
## 3. Architecture

### Package Structure (v5.2)
```
neurogolf_solver/
├── constants.py        # Grid dims, opset, limits (NO excluded tasks)
├── config.py           # Runtime providers, opset factory
├── data_loader.py      # Task loading, one-hot, example extraction
├── validators.py       # Model validation against all splits
├── profiler.py         # Static cost profiler (onnx_tool fallback)
├── onnx_helpers.py     # Opset 17 builders: Slice, Pad, ReduceSum, mk()
├── gather_helpers.py   # Gather-based spatial remapping models
├── submission.py       # run_tasks (W&B logging), zip/csv generation
├── main.py             # Entry point with argparse
└── solvers/
    ├── analytical.py       # identity, constant, color_map, transpose
    ├── geometric.py        # flip, rotate, shift, crop, gravity (detect only)
    ├── tiling.py           # tile, upscale, mirror, concat, spatial_gather
    ├── conv.py             # lstsq conv (fixed, variable, diffshape, var_diff) + PCR fallback
    ├── gravity.py          # Unrolled bubble-sort gravity (Conv+Where, 4 dirs) — Task 78
    ├── edge.py             # Laplacian edge detection (0 matches currently)
    ├── mode.py             # Mode fill (ReduceSum→ArgMax→Expand) — Task 129
    └── solver_registry.py  # ANALYTICAL_SOLVERS list + solve_task()
```

Run with: `python -m neurogolf_solver.main [args]`

### Solver Pipeline
```
1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
   identity → constant → color_map → transpose → flip → rotate →
   shift → tile → upscale → kronecker → nonuniform_scale →
   mirror_h → mirror_v → quad_mirror → concat → concat_enhanced →
   diagonal_tile → fixed_crop → spatial_gather → varshape_spatial_gather →
   gravity_unrolled → edge_detect → mode_fill

2. Conv solvers (lstsq fitted, validated against arc-gen, PCR fallback):
   conv_fixed     — Slice→Conv→ArgMax→Equal+Cast→Pad
   conv_variable  — Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
   conv_diffshape — Slice→Conv→Slice(crop)→ArgMax→Equal+Cast→Pad
   conv_var_diff  — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
```

### ONNX Building Rules (opset 17)
- **All shapes must be static** — no dynamic dimensions
- **Max 1.44 MB per .onnx file** — checked by the Kaggle validator
- **Slice(step=-1)** for flip/rotate — zero MACs; replaces Gather for these transforms
- **Gather** (opset 1) for spatial remapping — used by concat, spatial_gather, mirrors, etc.
- **NEVER** use GatherElements (opset 11)
- **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
- **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
- **Conv 1×1** for non-permutation color maps (has MACs but is correct)
- **ReduceSum** takes its axes as a **tensor input** (opset 13+ requirement)
- **Pad** takes a tensor-based `pads` input (opset 11+ requirement)
- **lstsq calls** must be wrapped in `try/except (LinAlgError, ValueError)` — SVD can fail to converge
- **ArgMax + Equal+Cast** before Pad to ensure a clean one-hot in the padded region (gravity solver lesson)

### Conv Fitting

**Conv ceiling: ~25 tasks.** Regularization (Ridge, PCA/SVD, skip-ks) was tested and rejected.
Root cause: architecture mismatch — most unsolved tasks need non-local ops, not local conv patches.

**Current fitting strategy (v5.1+):**
- Composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
- PCR fallback via `_solve_weights_pcr` (deferred second pass; 0 new solves but no regressions)
- Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
- Try no-bias first, then bias
- lstsq wrapped in try/except for SVD non-convergence
- **Validate against arc-gen BEFORE accepting** — reject on failure
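
Conceptually, the lstsq fit treats each k×k patch of the one-hot input as a feature vector and solves a single linear system for all 10 output channels at once. A standalone, unvectorized sketch (the real pipeline factors this into `_build_patch_matrix` / `_solve_weights` and is much faster):

```python
import numpy as np

def fit_conv_weights(X: np.ndarray, Y: np.ndarray, k: int):
    """Fit a k x k, 10-channel-to-10-channel conv by least squares.
    X, Y: (N, 10, H, W) one-hot inputs and targets with matching grid sizes.
    Returns weights shaped (out_ch, in_ch, k, k), or None if lstsq fails."""
    pad = k // 2
    N, C, H, W = X.shape
    Xp = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    # One row of C*k*k patch features per output pixel.
    rows = []
    for n in range(N):
        for i in range(H):
            for j in range(W):
                rows.append(Xp[n, :, i:i + k, j:j + k].ravel())
    A = np.asarray(rows)                         # (N*H*W, C*k*k)
    B = Y.transpose(0, 2, 3, 1).reshape(-1, C)   # (N*H*W, C)
    try:
        W_ls, *_ = np.linalg.lstsq(A, B, rcond=None)
    except (np.linalg.LinAlgError, ValueError):
        return None  # SVD can fail to converge; caller falls through to PCR
    return W_ls.T.reshape(C, C, k, k)
```

The fitted weights drop straight into a Conv initializer; the candidate model is then validated against train, test, and arc-gen before being accepted.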

### New Solver Architectures (v5.2)

**gravity.py** — unrolled bubble-sort via Conv+Where
- 4 directions × 10 bg colors, max(IH,IW) steps
- Per step: 2× Conv(3×3 shift), 3× ReduceSum, 3× Greater, 2× And, 2× Where
- Final: ArgMax + Equal+Cast + Pad (clean one-hot)
- Cost: ~16M (10×10 grid), score ~8.4
- **Validated: Task 78 (direction=up, bg=0)**

**edge.py** — Laplacian conv boundary detection
- Conv 1×1 (channel collapse) → Conv 3×3 (Laplacian) → Abs → Greater → And → Where
- Cost: ~16K MACs, score ~15
- **0 matches currently** — the edge definition may be too strict

**mode.py** — global majority-color fill
- Slice → ReduceSum(axes=[2,3]) → ArgMax → Equal+Cast → Expand → Pad
- Cost: ~2K, score ~19.5
- **Validated: Task 129**
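
A NumPy reference for the mode-fill graph may help when debugging it (illustrative only; the solver emits the ONNX ops listed above):

```python
import numpy as np

def mode_fill_reference(x: np.ndarray, h: int, w: int) -> np.ndarray:
    """NumPy equivalent of the mode-fill pipeline on a one-hot (1,10,30,30)
    tensor: Slice the h x w content region, count cells per color channel
    (ReduceSum over axes [2,3]), pick the majority channel (ArgMax), rebuild
    a one-hot vector (Equal+Cast), broadcast it over the region (Expand),
    and zero-pad back to 30x30 (Pad)."""
    region = x[:, :, :h, :w]                                   # Slice
    counts = region.sum(axis=(2, 3))                           # ReduceSum -> (1, 10)
    mode = counts.argmax(axis=1)                               # ArgMax -> (1,)
    onehot = (np.arange(10) == mode[:, None]).astype(x.dtype)  # Equal + Cast
    out = np.zeros_like(x)                                     # Pad background
    out[:, :, :h, :w] = onehot[:, :, None, None]               # Expand into the region
    return out
```

Every op in the chain has a fixed output shape, which is what keeps the model fully static and cheap (~2K cost).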

## 4. Performance

**The lstsq conv solver is the speed bottleneck.** Use `--conv_budget` to cap the time per task (5 s locally, 60 s on Kaggle).

**Do NOT** try to GPU-accelerate lstsq. The bottleneck is algorithmic (O(n³) SVD), not the device.

## 5. Score Accounting (v5.2)

| Category | Tasks | Avg Score | Notes |
|----------|-------|-----------|-------|
| Analytical | 24 | ~16 | identity, constant, color_map, transpose, flip, rotate, shift, tile, mirrors, etc. |
| Conv (lstsq) | 25 | ~10.5 | conv_fixed, conv_var, conv_diff, conv_var_diff |
| Gravity | 1 | 8.4 | Task 78 |
| Mode fill | 1 | 19.5 | Task 129 |
| Timing artifact | 1 | 8.2 | Task 61 (conv_var, only on slow hardware) |
| **Unsolved** | **348** | **1.0** | Minimum score |
| **Total** | **52/400** | | **~710 (solved) + 348 (unsolved × 1.0) ≈ 1058 est. LB** |

### Path to 3000+
1. ✅ ARC-GEN validation (v4)
2. ✅ New analytical solvers (v4)
3. ✅ Opset 17 Slice-based transforms (v5)
4. ✅ lstsq crash fix + modular package (v5)
5. ✅ PCR fallback in conv (v5.1 — 0 new solves but clean code)
6. ✅ Gravity solver (v5.2 — Task 78)
7. ✅ Mode fill solver (v5.2 — Task 129)
8. 🔲 **Phase 3 solvers**: flood fill, composition, color LUT, CumSum — see TODO.md
9. 🔲 **Phase 1a**: Opset 17 conversions for existing analytical tasks (score optimization)
10. 🔲 **Phase 4**: ONNX optimizer, best-of-N selection

**Blending is EXPLICITLY excluded** — the user's competitive philosophy.

## 6. Submission Checklist

Before submitting to Kaggle:
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] **All 400 tasks attempted** (no exclusions)
- [ ] No GatherElements in any model
- [ ] No banned ops (Loop, Scan, NonZero, Unique, Script, Function)
- [ ] All tensor shapes are static
- [ ] **Each .onnx file < 1.44 MB**
- [ ] Local estimated score calculated and compared to the expected LB
- [ ] **A/B test**: ran both old and new solvers on the same tasks; the new solver scores higher

## 7. Files & Locations

| Location | Path | Notes |
|----------|------|-------|
| HF Repo | `rogermt/neurogolf-solver` | All code + data |
| **Solver package** | `neurogolf_solver/` | **v5.2 — 19 files, modular** |
| Legacy monolith | `neurogolf_solver.py` | v4, kept for reference — do not edit |
| Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
| ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
| Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
| Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | Task JSONs with arc-gen |
| Roadmap | `TODO.md` | Experiment queue with status key |
| Learning | `LEARNING.md` | Knowledge accumulation — read before coding |

## 8. LEARNING.md Maintenance Rules

`LEARNING.md` is the knowledge-accumulation file. Update it when:
- A bug is found and fixed — add it to the Mistakes Log with the root cause
- A new approach is tried — record what worked, what didn't, and why
- Competition analysis reveals new insights — add them to Competitive Intelligence
- A version milestone is reached — update the Version History table
- Performance is measured — add concrete numbers

Structure: reverse-chronological within sections (newest entries first). Always include dates and version numbers.