rogermt committed on
Commit 72d0404 · verified · 1 Parent(s): bc5d5ee

Update SKILL.md for v5 refactored package

Files changed (1)
  1. SKILL.md +67 -40
SKILL.md CHANGED
@@ -1,6 +1,6 @@
 ---
 name: neurogolf-solver
-description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 10/17, IR 10, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
 ---

 # NeuroGolf Solver
@@ -25,14 +25,15 @@ Research → Design → Experiment → Analyze → Research → ...
 - NEVER write >200 lines without running them first
 - NEVER claim a feature "works" until arc-gen validated on ≥20 tasks
 - NEVER upload code to repo that hasn't been validated
-- NEVER overwrite neurogolf_solver.py with unvalidated code
 - Theory from papers is NOT proof for our data — always test
 - If a feature shows no improvement after testing, DELETE it — don't leave dead code

 ## Quick Reference

 - **Repo**: `rogermt/neurogolf-solver`
-- **Current version**: v4.3 — 50 arc-gen-validated tasks, est LB ~670
 - **Kaggle runtime**: 12 hours for submission
 - **Target**: 3000+ LB (our own solver, no blending)
 - **Detailed history, mistakes, analysis**: see `LEARNING.md`
@@ -43,7 +44,7 @@ Research → Design → Experiment → Analyze → Research → ...
 | Item | Value |
 |------|-------|
 | Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
-| Opset | 10 (IR 10). **Opset 17 also works on Kaggle** |
 | Max file size | 1.44 MB per model |
 | Banned ops | Loop, Scan, NonZero, Unique, Script, Function |
 | Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
@@ -63,6 +64,28 @@ Research → Design → Experiment → Analyze → Research → ...

 ## 3. Architecture

 ### Solver Pipeline
 ```
 1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
@@ -78,67 +101,71 @@ Research → Design → Experiment → Analyze → Research → ...
    conv_var_diff — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
 ```

-### ONNX Building Rules
-- **Gather** (opset 1) for spatial remapping — NEVER use GatherElements (opset 11)
 - **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
 - **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
 - **Conv 1×1** for non-permutation color maps (has MACs but correct)
-- **ReduceSum(input, axes=[1])** for variable-shape mask
-- **Pad** (opset 17): use tensor-based `pads` input, NOT attribute-based (opset 10 style)

 ### Conv Fitting — THE #1 BLOCKER

-**We solve 307 locally but only 50 survive arc-gen. This is CATASTROPHIC overfitting, not a hyperparameter problem.**

 - Patch matrix P has n rows (patches) and p columns (10×ks² features)
-- For ks=7 on 7×7 grid: n≈196, p=490 → underdetermined → min-norm among infinite fits → overfits
-- For ks=7 on 21×21 grid: n≈7056, p=490 → determined, but arc-gen still fails
-- **Root cause**: LOW effective rank of patch covariance (~10-40) due to few active colors → noise concentrates in low-rank directions
 - **Double descent**: ks=5,7,9 are at/near interpolation threshold where test error PEAKS

-**Current fitting strategy (v4.2):**
 - lstsq on train+test (+arc-gen when same grid size, capped at 10 examples)
 - Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
 - Try no-bias first, then bias
 - **Validate against arc-gen BEFORE accepting** — reject if fails

-**What does NOT help lstsq overfitting:**
-- ❌ Ridge/LOOCV λ tuning — theory predicts failure for low effective rank (Bartlett et al., arXiv:2306.13185)
 - ❌ More arc-gen examples in lstsq — adding constraints to underdetermined system doesn't fix wrong model
 - ❌ GPU/CuPy for lstsq — same O(n³) cost, crashes on memory

 **What MIGHT help (evidence-backed, needs testing):**
 - 🔲 Skip ks=5,7,9 — avoid interpolation threshold (double descent peak)
 - 🔲 PCA dimensionality reduction — project to top-20 components, ensure p_reduced << n
-- 🔲 Lasso (ℓ₁) instead of lstsq — matches sparse signal structure (arXiv:2302.00257)
 - 🔲 Gradient descent with early stopping — implicit regularization, don't interpolate
-- 🔲 PyTorch conv trained on arc-gen data — needs GPU, multi-seed, ternary snap

 ## 4. Performance

-**The lstsq conv solver is the speed bottleneck.** For ks=29 on 21×21 grids with 16 examples: 7056×8410 matrix SVD. This is pure math cost — moving to GPU (CuPy) doesn't help.

-**Do NOT** try to GPU-accelerate lstsq. Use `--conv_budget` to cap time per task (10-20s locally, 60s on Kaggle's 12hr runtime). The real win is more analytical solvers + fixing arc-gen survival, not faster conv.

-## 5. Score Accounting (v4.2)

-| Category | Tasks | Avg Score | Total |
-|----------|-------|-----------|-------|
-| Analytical (gather, rotate, etc.) | 25 | ~16 | ~400 |
-| Conv (arc-gen validated) | 25 | ~11 | ~275 |
-| Unsolved | 344 | 1.0 | 344 |
-| **Estimated LB** | | | **~670** |

 ### Path to 3000+
-1. ✅ ARC-GEN validation (fixed: +155 pts by eliminating 0-scoring models)
-2. ✅ New analytical solvers: shift, mirror, crop, quad_mirror (+8 tasks)
-3. ✅ Color map Gather for permutations (+15 pts)
-4. 🔲 **Phase 1: Cheap wins** — opset 17 transforms, channel reduction, composition detectors
-5. 🔲 **Phase 2: Fix arc-gen survival** — PCA, Lasso, skip bad ks, GD with early stopping
-6. 🔲 **Phase 3: Hard tasks** — hash matchers, run-length detectors, LLM rescue
-7. 🔲 **Phase 4: Score optimization** — ONNX optimizer, best-of-N selection
-
-**Blending with public datasets is EXPLICITLY excluded** — user's competitive philosophy. See LEARNING.md "What Others Do" for market intelligence only.

 ## 6. Submission Checklist
 
@@ -147,8 +174,8 @@ Before submitting to Kaggle:
 - [ ] EXCLUDED tasks {21,55,80,184,202,366} not included
 - [ ] No GatherElements in any model
 - [ ] No banned ops
-- [ ] Each .onnx < 1.44 MB, submission.zip < 1.44 MB
-- [ ] submission.csv generated
 - [ ] Local estimated score calculated and compared to expected LB
 - [ ] **A/B test**: ran both old and new solver on same tasks, new solver scores higher
 
@@ -157,12 +184,12 @@ Before submitting to Kaggle:
 | Location | Path | Notes |
 |----------|------|-------|
 | HF Repo | `rogermt/neurogolf-solver` | All code + data |
-| Solver | `neurogolf_solver.py` | v4.2 (repo has unvalidated v5 code at 1919 lines — needs revert or validation) |
 | Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
 | ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
 | Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
 | Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | task JSONs with arc-gen |
-| Local ARC data | `ARC-AGI/data/training/` | 400 hex-named JSONs |
 | Roadmap | `TODO.md` | Experiment queue with status key |
 | Learning | `LEARNING.md` | Knowledge accumulation — read before coding |
 
@@ -175,4 +202,4 @@ Before submitting to Kaggle:
 - Version milestones — update the Version History table
 - Performance measurements — add concrete numbers

-Structure: chronological within sections, newest entries first. Always include dates and version numbers. The goal is that a fresh agent with zero context can read LEARNING.md and understand every mistake to avoid and every technique that works.
 
@@ -1,6 +1,6 @@
 ---
 name: neurogolf-solver
+description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 17, IR 8, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
 ---

 # NeuroGolf Solver
 
@@ -25,14 +25,15 @@ Research → Design → Experiment → Analyze → Research → ...
 - NEVER write >200 lines without running them first
 - NEVER claim a feature "works" until arc-gen validated on ≥20 tasks
 - NEVER upload code to repo that hasn't been validated
 - Theory from papers is NOT proof for our data — always test
 - If a feature shows no improvement after testing, DELETE it — don't leave dead code
+- Make surgical edits to individual files — NEVER rewrite the entire codebase in one shot

 ## Quick Reference

 - **Repo**: `rogermt/neurogolf-solver`
+- **Current version**: v5 — refactored package, opset 17, currently running on Kaggle
+- **Previous best**: v4.3 — 50 arc-gen-validated tasks, est LB ~670
 - **Kaggle runtime**: 12 hours for submission
 - **Target**: 3000+ LB (our own solver, no blending)
 - **Detailed history, mistakes, analysis**: see `LEARNING.md`
 
@@ -43,7 +44,7 @@ Research → Design → Experiment → Analyze → Research → ...
 | Item | Value |
 |------|-------|
 | Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
+| Opset | 17 (IR 8). Opset 10 also accepted on Kaggle |
 | Max file size | 1.44 MB per model |
 | Banned ops | Loop, Scan, NonZero, Unique, Script, Function |
 | Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
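The scoring rule in the table reduces to a one-liner; here is a minimal sketch (`task_score` is an illustrative name, the official implementation is the Kaggle scoring lib `neurogolf_utils.py`):

```python
import math

def task_score(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score: lower total model cost gives a higher score, floored at 1.0."""
    cost = macs + memory_bytes + params
    return max(1.0, 25.0 - math.log(cost))
```

The floor kicks in once ln(cost) exceeds 24 (cost above roughly 2.6e10), so an unsolved task still contributes 1.0; shaving MACs, memory, and params on solved tasks is where the leaderboard points are.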
 
@@ -63,6 +64,28 @@ Research → Design → Experiment → Analyze → Research → ...

 ## 3. Architecture

+### Package Structure (v5)
+```
+neurogolf_solver/
+├── constants.py          # Grid dims, opset, excluded tasks, limits
+├── config.py             # Runtime providers, opset factory
+├── data_loader.py        # Task loading, one-hot, example extraction
+├── validators.py         # Model validation against all splits
+├── profiler.py           # Static cost profiler (onnx_tool fallback)
+├── onnx_helpers.py       # Opset 17 builders: Slice, Pad, ReduceSum, mk()
+├── gather_helpers.py     # Gather-based spatial remapping models
+├── submission.py         # run_tasks (W&B logging), zip/csv generation
+├── main.py               # Entry point with argparse
+└── solvers/
+    ├── analytical.py        # identity, constant, color_map, transpose
+    ├── geometric.py         # flip, rotate, shift, crop, gravity
+    ├── tiling.py            # tile, upscale, mirror, concat, spatial_gather
+    ├── conv.py              # lstsq conv (fixed, variable, diffshape, var_diff)
+    └── solver_registry.py   # ANALYTICAL_SOLVERS list + solve_task()
+```
+
+Run with: `python -m neurogolf_solver.main [args]`
+
 ### Solver Pipeline
 ```
 1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
 
    conv_var_diff — Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
 ```

+### ONNX Building Rules (opset 17)
+- **Slice(step=-1)** for flip/rotate — zero MACs, replaces Gather for these transforms
+- **Gather** (opset 1) for spatial remapping — used by concat, spatial_gather, mirrors, etc.
+- **NEVER** use GatherElements (opset 11)
 - **Equal+Cast** for one-hot — NEVER use OneHot (no CUDA kernel)
 - **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
 - **Conv 1×1** for non-permutation color maps (has MACs but correct)
+- **ReduceSum** with axes as **tensor input** (opset 13+ requirement)
+- **Pad** with tensor-based `pads` input (opset 11+ requirement)
+- **lstsq calls** must be wrapped in `try/except (LinAlgError, ValueError)` — SVD can fail to converge
 
 ### Conv Fitting — THE #1 BLOCKER

+**We solve 307 locally but only ~50 survive arc-gen. This is CATASTROPHIC overfitting.**

 - Patch matrix P has n rows (patches) and p columns (10×ks² features)
+- **Root cause**: LOW effective rank of patch covariance (~10-40) due to few active colors
 - **Double descent**: ks=5,7,9 are at/near interpolation threshold where test error PEAKS

+**Current fitting strategy (v5):**
 - lstsq on train+test (+arc-gen when same grid size, capped at 10 examples)
 - Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
 - Try no-bias first, then bias
+- lstsq wrapped in try/except for SVD non-convergence
 - **Validate against arc-gen BEFORE accepting** — reject if fails

+**What does NOT help:**
+- ❌ Ridge/LOOCV λ tuning — theory predicts failure for low effective rank
 - ❌ More arc-gen examples in lstsq — adding constraints to underdetermined system doesn't fix wrong model
 - ❌ GPU/CuPy for lstsq — same O(n³) cost, crashes on memory

 **What MIGHT help (evidence-backed, needs testing):**
 - 🔲 Skip ks=5,7,9 — avoid interpolation threshold (double descent peak)
 - 🔲 PCA dimensionality reduction — project to top-20 components, ensure p_reduced << n
+- 🔲 Lasso (ℓ₁) instead of lstsq — matches sparse signal structure
 - 🔲 Gradient descent with early stopping — implicit regularization, don't interpolate
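The guarded lstsq from the fitting strategy above might look like the following sketch (`fit_kernel` is an illustrative name, not the actual `solvers/conv.py` API):

```python
import numpy as np

def fit_kernel(P: np.ndarray, Y: np.ndarray):
    """Least-squares fit of a linear patch-to-pixel map.

    P: (n, 10*ks*ks) one-hot patch features; Y: (n, 10) one-hot targets.
    Returns the weight matrix, or None when the SVD fails to converge,
    so the caller rejects that kernel size instead of crashing the run.
    """
    try:
        W, *_ = np.linalg.lstsq(P, Y, rcond=None)
        return W
    except (np.linalg.LinAlgError, ValueError):
        return None
```

The returned weights are what gets reshaped into a Conv kernel; a `None` result simply moves the search on to the next kernel size.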
 

 ## 4. Performance

+**The lstsq conv solver is the speed bottleneck.** Use `--conv_budget` to cap time per task (30s locally, 60s on Kaggle).

+**Do NOT** try to GPU-accelerate lstsq. The bottleneck is algorithmic (O(n³) SVD), not device.
 
+## 5. Score Accounting

+| Category | Tasks (v4) | Avg Score | Notes |
+|----------|------------|-----------|-------|
+| Analytical (Slice/Gather) | ~25 | ~13-21 | v5 Slice-based should be ~20-25 |
+| Conv (arc-gen validated) | ~25 | ~11 | Unchanged in v5 |
+| Unsolved | ~350 | 1.0 | Minimum score |
+| **v4 Est LB** | | | **~670** |
+| **v5 Est LB** | | | **TBD (running)** |
157
  ### Path to 3000+
158
+ 1. βœ… ARC-GEN validation (v4: +155 pts)
159
+ 2. βœ… New analytical solvers: shift, mirror, crop, quad_mirror (v4: +8 tasks)
160
+ 3. βœ… Color map Gather for permutations (v4: +15 pts)
161
+ 4. βœ… Opset 17 Slice-based flip/rotate (v5: ~0 MACs for these transforms)
162
+ 5. βœ… Refactored to modular package (v5)
163
+ 6. βœ… lstsq crash fix β€” try/except for SVD non-convergence (v5)
164
+ 7. πŸ”² **Fix arc-gen survival** β€” PCA, Lasso, skip bad ks, GD with early stopping
165
+ 8. πŸ”² **Hard tasks** β€” hash matchers, run-length detectors, LLM rescue
166
+ 9. πŸ”² **Score optimization** β€” ONNX optimizer, best-of-N selection, channel reduction
167
+
168
+ **Blending is EXPLICITLY excluded** β€” user's competitive philosophy.
169
 
170
  ## 6. Submission Checklist

@@ -147,8 +174,8 @@ Before submitting to Kaggle:
 - [ ] EXCLUDED tasks {21,55,80,184,202,366} not included
 - [ ] No GatherElements in any model
 - [ ] No banned ops
+- [ ] Each .onnx < 1.44 MB
+- [ ] submission.zip generated and < 1.44 MB
 - [ ] Local estimated score calculated and compared to expected LB
 - [ ] **A/B test**: ran both old and new solver on same tasks, new solver scores higher
 
 
@@ -157,12 +184,12 @@ Before submitting to Kaggle:
 | Location | Path | Notes |
 |----------|------|-------|
 | HF Repo | `rogermt/neurogolf-solver` | All code + data |
+| **Solver package** | `neurogolf_solver/` | **v5 — 16 files, modular** |
+| Legacy monolith | `neurogolf_solver.py` | v4, kept for reference — do not edit |
 | Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
 | ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
 | Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
 | Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | task JSONs with arc-gen |
 | Roadmap | `TODO.md` | Experiment queue with status key |
 | Learning | `LEARNING.md` | Knowledge accumulation — read before coding |

@@ -175,4 +202,4 @@ Before submitting to Kaggle:
 - Version milestones — update the Version History table
 - Performance measurements — add concrete numbers

+Structure: chronological within sections, newest entries first. Always include dates and version numbers.