Split SKILL.md (rules/quick-ref) + LEARNING.md (history/mistakes/analysis)
SKILL.md
CHANGED
@@ -3,310 +3,127 @@ name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 10, IR 10, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---
-# NeuroGolf Solver
-
-## 1. Competition Rules
-
-- **Input**: `"input"` float32 `[1, 10, 30, 30]` – one-hot encoded grid (10 color channels, 30×30 spatial)
-- **Output**: `"output"` float32 `[1, 10, 30, 30]` – same format
-- **Opset**: 10, IR version: 10 (but opset 17 ALSO works on Kaggle – see §3)
-- **Max file size**: 1.44 MB per model (floppy disk limit)
-- **Banned ops**: Loop, Scan, NonZero, Unique, Script, Function
-
-- `submission.zip` containing `task001.onnx` through `task400.onnx`
-- Models must pass validation against ALL examples: **train + test + arc-gen**
-- Optional: `submission.csv` with columns `task_id, total_cost`
-
-### ARC-GEN Data (CRITICAL)
-On Kaggle, each task JSON at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contains:
-```json
-{"train": [...], "test": [...], "arc-gen": [...]}
-```
-The `arc-gen` key has **up to 262 additional examples per task** (100K total across 400 tasks) generated by Google's ARC-GEN system. **Models are validated against ALL splits, including arc-gen.** A model that passes train+test but fails arc-gen scores ZERO on Kaggle.
-
-Locally, ARC-GEN data lives in separate files at `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` dicts and must be merged with the ARC-AGI task data.
-
-## 2. Current State (v3 → v4 in progress)
-
-### v3 Results: 307/400 solved locally, LB score ~501 (NOT ~3267)
-The massive gap (3267 local vs 501 LB) means **most of our conv models fail ARC-GEN validation on Kaggle**. The conv is fitted on ~6 train+test examples but must generalize to ~250 arc-gen examples of varying sizes. Many don't.
-
-### Solver Breakdown (v3)
-```
-conv_var: 125, conv_fixed: 107, conv_diff: 39, spatial_gather: 16,
-concat: 5, color_map: 4, concat_enhanced: 4, rotate: 3,
-transpose: 2, upscale: 1, varshape_spatial_gather: 1
-```
-
-### Repository
-- HF: `rogermt/neurogolf-solver`
-- Files: `neurogolf_solver.py`, `neurogolf_utils.py` (official Kaggle utils), `ARC-GEN-100K.zip`, `neurogolf-2026-solver-notebooks.zip`
-
-## 3. Key Differences: Our Solver vs High-Scoring Notebooks
-
-### The 4200-point notebook (`neurogolf-2026-tiny-onnx-solver`)
-This is a **BLEND notebook** – it does NOT solve tasks from scratch. It:
-1. **Phase 1**: Loads 12+ other notebooks' `submission.zip` files as inputs
-2. For each task, picks the cheapest valid model across all sources
-3. **Phase 2**: Tries loose ONNX files from dataset inputs
-4. **Phase 3**: Runs its own solver only on remaining unsolved tasks
-5. Validates EVERYTHING against train+test+arc-gen before including
-6. Result: 338/400 solved, est. score 4197.5
-
-**Critical insight**: The 4200 score comes from BLENDING many solutions, not from a single solver. The solver itself adds 0 new tasks in Phase 3; all 338 come from other notebooks' pre-built models.
-
-### The championship notebook (`the-2026-neurogolf-championship`)
-Also a blend, but with its own solver. Key differences from ours:
-- Uses **opset 17** (not 10!) – works fine on Kaggle
-- Has **shift detector**, **gravity detector**, **mirror detectors**, **fixed crop detector**, **outline detector**
-- Has **composition detectors**: rotation+color, transpose+color, flip+color
-- Has **channel reduction**: reduces 10→N channels for fewer colors → cheaper models
-- Uses **PyTorch learned conv**: multi-seed Adam training, ternary weight snapping
-- Uses **two-layer conv**: Conv→ReLU→Conv for complex patterns
-- Validates against `train + arc-gen[:30]` (capped at 30 arc-gen examples)
-- Result: 288 from own solver + more from blended inputs
-
-### What they have that we don't
-| Feature | Them | Us |
-|---------|------|-----|
-| ARC-GEN validation | ✅ validate against arc-gen | ❌ v3 ignores arc-gen |
-| ARC-GEN in fitting | ✅ uses arc-gen[:3] in detectors | ❌ fits only train+test |
-| Opset 17 | ✅ uses freely | ❌ stuck on opset 10 |
-| Shift detector | ✅ | ❌ |
-| Gravity detector | ✅ | ❌ |
-| Mirror detectors | ✅ (h, v, quad) | ❌ |
-| Fixed crop detector | ✅ | ❌ |
-| Extract outline | ✅ | ❌ |
-| Composition (rot+color) | ✅ | ❌ |
-| Channel reduction | ✅ (fewer channels = cheaper) | ❌ |
-| PyTorch learned conv | ✅ (multi-seed, ternary snap) | ❌ (lstsq only) |
-| Two-layer conv | ✅ (Conv→ReLU→Conv) | ❌ |
-| Blend from other notebooks | ✅ (12+ sources) | ❌ |
-
-## 4. The Submission Score Gap Problem
-
-### Why LB = 501 when local = 3267
-Our 307 solved tasks generate valid ONNX models locally. But on Kaggle:
-1. Models are validated against `train + test + arc-gen` (all splits)
-2. Conv models fitted on 6 examples often fail on 250+ arc-gen examples
-3. Failed models score 0 (not even the 1.0 minimum)
-4. Likely only ~40-50 of our 307 models actually pass on Kaggle
-
-### The fix priority
-1. **Validate locally against arc-gen** before submitting – only include models that pass
-2. **Include arc-gen examples in conv fitting** – more data = better generalization
-3. **Add more analytical solvers** (shift, mirror, gravity, crop) – these always generalize
-4. **Try opset 17** – unlocks more ops, may work fine on Kaggle
-
-## 5. Architecture & Code Structure
-
-### `neurogolf_solver.py` structure
-```
-Constants: BATCH=1, CH=10, GH=GW=30
-EXCLUDED_TASKS = {21, 55, 80, 184, 202, 366}
-
-load_tasks_dir(data_dir, arcgen_dir)  # Load + merge ARC-GEN
-to_onehot(grid)                       # Grid → [1,10,30,30]
-validate(path, td)                    # Check model on ALL splits
-score_network(path)                   # MACs + memory + params
-
-Analytical Solvers (priority order):
-  identity → constant → color_map → transpose → flip → rotate →
-  tile → upscale → kronecker → concat → concat_enhanced →
-  diagonal_tile → spatial_gather → varshape_spatial_gather
-
-Conv Solvers:
-  solve_conv_fixed()     – Fixed same-shape: Slice→Conv→ArgMax→Equal+Cast→Pad
-  solve_conv_variable()  – Variable same-shape: Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
-  solve_conv_diffshape() – Fixed diff-shape (output ≤ input)
-  solve_conv_var_diff()  – Variable diff-shape (output ≤ input)
-
-Main: solve_task() → run_tasks() → generate submission.zip + submission.csv
-```
-
-### ONNX Building Patterns (opset 10)
-```python
-from onnx import helper, TensorProto
-import numpy as np
-
-DT = TensorProto.FLOAT  # float32
-
-# Model skeleton
-def mk(nodes, inits=None):
-    x = helper.make_tensor_value_info("input", DT, [1, 10, 30, 30])
-    y = helper.make_tensor_value_info("output", DT, [1, 10, 30, 30])
-    g = helper.make_graph(nodes, "g", [x], [y], initializer=inits or [])
-    return helper.make_model(g, ir_version=10, opset_imports=[helper.make_opsetid("", 10)])
-
-# One-hot via Equal+Cast (NOT OneHot – has CUDA issues):
-classes = np.arange(10).reshape(1, 10, 1, 1)
-#   Equal(argmax_output, classes) → Cast(to=FLOAT)
-
-# Spatial remap via Gather (NOT GatherElements – requires opset 11!):
-#   Reshape([1,10,30,30] → [1,10,900]) → Gather(axis=2, indices=[900]) → Reshape back
-
-# Conv pattern:
-#   Conv(input, W, kernel_shape=[ks,ks], pads=[pad]*4) → ArgMax → Equal+Cast → Mul(mask)
-```
-
-| Op | Min opset | Notes |
-|----|-----------|-------|
-| OneHot | 9 | ⚠️ No CUDA kernel. Use Equal+Cast instead |
-| Conv | 1 | ✅ Safe |
-| ArgMax | 1 | ✅ Safe |
-| ReduceSum | 1 | ✅ Safe |
-| Pad | 2 (opset 10 syntax) | ✅ Use `pads` attribute for opset 10 |
-| Slice | 10 | ✅ With starts/ends as inputs |
-| Tile | 6 | ✅ Safe |
-| ScatterElements | 11 | ⚠️ Requires opset 11+ |
-
-## 6. Conv Fitting
-
-### lstsq fitting
-```python
-patches = []   # [N, 10*ks*ks] feature vectors
-targets = []   # [N] integer class labels
-P, T_oh = build_from_examples(exs)
-WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]   # Closed-form optimal weights
-if np.argmax(P @ WT, 1) == T: SUCCESS          # Perfect fit check (pseudocode)
-```
-
-### PyTorch learned conv
-```python
-# Train with MSE or cross-entropy, export with torch.onnx.export(model, dummy, path, opset_version=10)
-# Then add argmax+equal+cast+mask post-processing in ONNX manually
-```
-- Can fit nonlinear patterns lstsq can't
-- Multi-seed training (0, 7, 42) for robustness
-- Ternary weight snapping: round weights to {-1, 0, 1} for smaller models
-
-### ARC-GEN for conv fitting
-The conv MUST generalize to arc-gen examples. Two approaches:
-1. **Include arc-gen in fitting data** – use `train + test + arc-gen[:20]` for lstsq
-2. **Validate against arc-gen after fitting** – only accept if it passes all splits
-
-## 7. Unsolved Tasks (94 in v3)
-
-### Categories
-| Category | Count | Why Unsolved |
-|----------|-------|--------------|
-| Variable diff-shape (output smaller) | ~60 | Output shape depends on input content |
-| Variable diff-shape (output larger) | ~17 | Same problem |
-| Same-shape, complex pattern | ~10 | Need larger kernels or multi-layer |
-| Fixed diff-shape, output larger | ~7 | Input-content-dependent patterns |
-
-### Fundamental Blocker
-Variable-shape tasks where output size depends on input CONTENT cannot be solved with a static ONNX graph. The only workaround: the conv learns to put valid content in the right region, masked by an input-derived spatial mask.
-
-## 8. Mistakes Log (DO NOT REPEAT)
-
-### GatherElements (opset 11) – Fixed in v3
-`GatherElements` requires opset 11. It works on Kaggle's old ORT but fails on ORT 1.25+. Replaced with `Gather` (opset 1) using 1D indices on the flattened spatial dim.
-
-### s_flip still used GatherElements – Fixed in v4
-The `s_flip` solver was still using `GatherElements`. It must use `_build_gather_model()` instead.
-
-### ARC-GEN not loaded – The #1 score killer
-v3 had `if 'arc-gen' in td` in validate() but never loaded arc-gen data into `td`. So validation always passed (there was no arc-gen to check), while Kaggle validated against arc-gen and most conv models failed.
-
-### Conv fitted on too few examples
-Fitting on 6 train+test examples overfits to a small sample. Must include arc-gen examples in the fitting data for better generalization.
-
-### No submission.csv
-Kaggle may need submission.csv alongside submission.zip.
-
-### Wrong score_network without onnx_tool
-Our fallback `score_network` returned `(0, 0, 0)` instead of real costs. Need a static profiler that matches Kaggle's calculation.
-
-### Ignored EXCLUDED tasks
-Wasted time trying to solve tasks 21, 55, 80, 184, 202, 366, which are officially excluded.
-
-## 9. Competitive Strategy
-
-### Path to 4800+ LB score
-1. **Fix ARC-GEN validation** – immediately recover ~200 points from models that actually work
-2. **Add missing analytical solvers** (shift, mirror, gravity, crop, composition) – +20-30 tasks at ~13 points each
-3. **PyTorch multi-layer conv** – solve 5-10 more complex same-shape tasks
-4. **Channel reduction** – reduce cost of existing solutions by 30-50%
-5. **Blend with other notebooks** – the 4200 notebook proves this is the meta-strategy
-
-### Quick wins
-- Transpose: score=25.0 (cost=0, just permute dims) – already have
-- Identity: score=25.0 – already have
-- Color map via channel Gather: cheaper than Conv 1×1 (params+nbytes only, no MACs)
-- Analytical solvers: ~13 points each (cost ≈ 165K)
-- Small conv (ks=1): ~11-13 points
-- Large conv (ks=29): ~7 points
-
-## 10. Data & File Locations
-
-### On Kaggle
-```
-/kaggle/input/competitions/neurogolf-2026/
-  task001.json ... task400.json   (with train+test+arc-gen)
-  neurogolf_utils/neurogolf_utils.py
-```
-
-### Locally
-```
-ARC-AGI/data/training/   # 400 hex-named .json files (train+test only)
-ARC-GEN-100K/            # 400 hex-named .json files (arc-gen examples)
-neurogolf-solver/
-  neurogolf_solver.py    # Main solver
-  neurogolf_utils.py     # Official Kaggle utils (needs onnx_tool, IPython)
-```
-
-### ARC-GEN file format
-```python
-# ARC-GEN-100K/{hex_id}.json is a LIST of examples:
-[{"input": [[...]], "output": [[...]]}, ...]
-# Must be merged into task data as td['arc-gen'] = list_of_examples
-```
-
-### ARC-GEN GitHub generator
-https://github.com/google/ARC-GEN – can generate MORE examples per task if needed.
-
-## 11. Reference Notebooks (in repo as neurogolf-2026-solver-notebooks.zip)
-
-| Notebook | LB Score | Tasks | Key Technique |
-|----------|----------|-------|---------------|
-| neurogolf-2026-tiny-onnx-solver | ~4200 | 338 | Mega-blend of 12+ notebooks |
-| 4200-v5-neurogolf-fix | ~5700 est | 341 | Same blend, manual LLM rescue tasks |
-| the-2026-neurogolf-championship | ~3200 est | 288 | Own solver + blend |
-| neurogolf-logic-driven-ensembling | – | 401 | Pure ensembling from zips |
-
-## 12. Testing Checklist
-
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] EXCLUDED tasks {21,55,80,184,202,366} not included
-- [ ] No GatherElements
-- [ ] No banned ops
-- [ ] Each .onnx < 1.44 MB
-- [ ] submission.zip < 1.44 MB total
- [ ] submission.csv generated
-- [ ] Local estimated score calculated
|
+# NeuroGolf Solver
+
+## Quick Reference
+
+- **Repo**: `rogermt/neurogolf-solver`
+- **Current version**: v4.1 – 50 arc-gen-validated tasks, est LB ~670
+- **Kaggle runtime**: 12 hours for submission
+- **Target**: 4800+ LB (first page)
+- **Detailed history, mistakes, analysis**: see `LEARNING.md`
+
+## 1. Competition Rules
+
+| Item | Value |
+|------|-------|
+| Input/Output | `"input"`/`"output"` float32 `[1,10,30,30]` one-hot |
+| Opset | 10 (IR 10). Opset 17 also works on Kaggle |
+| Max file size | 1.44 MB per model |
+| Banned ops | Loop, Scan, NonZero, Unique, Script, Function |
+| Scoring | `max(1.0, 25.0 - ln(MACs + memory + params))` per task |
+| Excluded tasks | {21, 55, 80, 184, 202, 366} – skip these |
+| Validation | Models checked against **train + test + arc-gen** (ALL splits) |
+| Submission | `submission.zip` with `task001.onnx`–`task400.onnx` + optional `submission.csv` |
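The scoring rule in the table is easy to sanity-check numerically. A minimal sketch, assuming the frontmatter formula is exact (the `task_score` helper is ours, not the competition API):

```python
import math

def task_score(macs: int, memory_bytes: int, params: int) -> float:
    # Per-task score: max(1, 25 - ln(MACs + memory_bytes + params))
    cost = macs + memory_bytes + params
    return max(1.0, 25.0 - math.log(cost))

# A cheap analytical model (cost ~165K) lands near the ~13 points quoted elsewhere:
print(round(task_score(0, 160_000, 5_000), 1))  # → 13.0
```

Note how flat the log makes things: halving cost buys only ln 2 ≈ 0.7 points, which is why eliminating zero-scoring models matters far more than shaving MACs.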
+
+## 2. ARC-GEN Data – THE Critical Factor
+
+**A model that passes train+test but fails arc-gen scores ZERO on Kaggle.**
+
+- Kaggle tasks at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contain `{"train": [...], "test": [...], "arc-gen": [...]}`
+- Up to 262 arc-gen examples per task (100K total)
+- Locally: ARC-GEN lives in `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` dicts – merge it into the task data
+- Conv fitting: include arc-gen examples **only when grid sizes match** train/test (otherwise lstsq fails)
+- Validation: always check against `arc-gen[:30]` minimum
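The local merge described above can be sketched as follows (the helper name and fallback behavior are ours, not the solver's actual API):

```python
import json
from pathlib import Path

def load_task_with_arcgen(task_path: str, arcgen_dir: str) -> dict:
    # Load an ARC-AGI task file ({"train": [...], "test": [...]})
    td = json.loads(Path(task_path).read_text())
    # Merge the matching ARC-GEN-100K/{hex_id}.json (a bare list of examples)
    gen_file = Path(arcgen_dir) / Path(task_path).name
    td["arc-gen"] = json.loads(gen_file.read_text()) if gen_file.exists() else []
    return td
```

Always setting the `arc-gen` key, even to an empty list, avoids the v3 failure mode where validation silently skipped a missing key.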
+
+## 3. Architecture
+
+### Solver Pipeline
```
+1. Analytical solvers (instant, zero/low cost, always arc-gen safe):
+   identity → constant → color_map → transpose → flip → rotate →
+   shift → tile → upscale → kronecker → nonuniform_scale →
+   mirror_h → mirror_v → quad_mirror → concat → concat_enhanced →
+   diagonal_tile → fixed_crop → spatial_gather → varshape_spatial_gather
+
+2. Conv solvers (lstsq fitted, validated against arc-gen):
+   conv_fixed     – Slice→Conv→ArgMax→Equal+Cast→Pad
+   conv_variable  – Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
+   conv_diffshape – Slice→Conv→Slice(crop)→ArgMax→Equal+Cast→Pad
+   conv_var_diff  – Conv(30×30)→ArgMax→Equal+Cast→Mul(input_mask)
```
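The priority order above amounts to a first-match dispatch loop; a minimal sketch (the signature and names are illustrative – the real `solve_task()` also tracks cost and time budgets):

```python
def solve_task(td, analytical_solvers, conv_solvers, validate):
    # Try solvers in priority order; accept the first model that
    # validates on ALL splits (train + test + arc-gen).
    for solver in list(analytical_solvers) + list(conv_solvers):
        model_path = solver(td)            # returns an .onnx path or None
        if model_path and validate(model_path, td):
            return model_path
    return None                            # unsolved: the task earns no score
```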
+
+### ONNX Building Rules
+- **Gather** (opset 1) for spatial remapping – NEVER use GatherElements (opset 11)
+- **Equal+Cast** for one-hot – NEVER use OneHot (no CUDA kernel)
+- **Channel Gather** for permutation color maps (0 MACs, score ~21 vs ~13 for Conv 1×1)
+- **Conv 1×1** for non-permutation color maps (has MACs but correct)
+- **ReduceSum(input, axes=[1])** for the variable-shape mask
+
+### Conv Fitting Strategy
+- lstsq on train+test (+arc-gen when same grid size, capped at 10 examples)
+- Kernel sizes: [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
+- Try no-bias first, then bias
+- **Validate against arc-gen BEFORE accepting** – reject if it fails
+- Bottleneck is algorithmic (O(n³) SVD), NOT device – GPU/CuPy doesn't help, just crashes
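The fit-then-check half of this strategy can be sketched as below (`fit_conv_weights` is an illustrative name; arc-gen validation still happens separately on the built ONNX model):

```python
import numpy as np

def fit_conv_weights(P: np.ndarray, T: np.ndarray):
    # P: [N, 10*ks*ks] flattened input patches; T: [N] target class labels.
    T_oh = np.eye(10, dtype=np.float64)[T]             # one-hot targets [N, 10]
    WT, *_ = np.linalg.lstsq(P, T_oh, rcond=None)      # closed-form weights
    if np.array_equal(np.argmax(P @ WT, axis=1), T):   # keep only perfect fits
        return WT
    return None                                        # reject: would fail validation
```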
+
+## 4. Performance Bottleneck
+
+**The lstsq conv solver is the speed bottleneck.** For ks=29 on 21×21 grids with 16 examples, that is a 7056×8410 matrix SVD. This is pure math cost – moving to GPU (CuPy) doesn't help because:
+1. Same O(n³) algorithmic cost
+2. GPU memory fills up (~1GB for large matrices) and crashes
+3. Falls back to CPU anyway after a CUDA error
+
+**Do NOT** try to GPU-accelerate lstsq. Use `--conv_budget` to cap time per task (10-20s locally, 60s on Kaggle's 12hr runtime). The real win is more analytical solvers, not faster conv.
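The 7056×8410 figure follows directly from how the patch matrix is built; a quick arithmetic check:

```python
def lstsq_shape(n_examples: int, h: int, w: int, ks: int) -> tuple:
    # One row per output pixel, one column per weight of a 10-channel ks×ks patch.
    return (n_examples * h * w, 10 * ks * ks)

print(lstsq_shape(16, 21, 21, 29))  # → (7056, 8410), the worst case cited above
```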
+
+## 5. Score Accounting (v4.1)
+
+| Category | Tasks | Avg Score | Total |
+|----------|-------|-----------|-------|
+| Analytical (gather, rotate, etc.) | 25 | ~16 | ~400 |
+| Conv (arc-gen validated) | 25 | ~11 | ~275 |
+| Unsolved (no passing model → no score) | 344 | 0 | 0 |
+| **Estimated LB** | | | **~670** |
+
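A quick cross-check of the solved rows, under the assumption (stated in the LB-gap analysis) that tasks without a passing model contribute nothing:

```python
# Rough LB estimate from the accounting table above.
analytical = 25 * 16   # ~25 analytical tasks at ~16 points
conv       = 25 * 11   # ~25 validated conv tasks at ~11 points
estimate   = analytical + conv
print(estimate)  # → 675, i.e. the "~670" ballpark
```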
+### Path to 4800+
+1. ✅ ARC-GEN validation (fixed: +155 pts by eliminating 0-scoring models)
+2. ✅ New analytical solvers: shift, mirror, crop, quad_mirror (+8 tasks)
+3. ✅ Color map Gather for permutations (+15 pts)
+4. 🔲 PyTorch multi-layer conv with ternary snap (est +20-50 tasks)
+5. 🔲 Channel reduction (fewer colors → cheaper models)
+6. 🔲 Composition detectors: rot+color, flip+color, transpose+color
+7. 🔲 Blend with other notebooks on Kaggle (the meta-strategy for 4000+)
+
+## 6. Submission Checklist
+
+Before submitting to Kaggle:
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] EXCLUDED tasks {21,55,80,184,202,366} not included
+- [ ] No GatherElements in any model
+- [ ] No banned ops
+- [ ] Each .onnx < 1.44 MB, submission.zip < 1.44 MB
- [ ] submission.csv generated
+- [ ] Local estimated score calculated and compared to expected LB
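Several of these checks can be automated before upload. A sketch (the byte limit and the `taskNNN.onnx` filename parsing are assumptions; adjust to the official checker if it differs):

```python
import os
import zipfile

FLOPPY_BYTES = 1_440_000  # assumed 1.44 MB limit in decimal bytes
EXCLUDED = {21, 55, 80, 184, 202, 366}

def check_submission(zip_path: str) -> list:
    # Return a list of human-readable checklist violations (empty = clean).
    problems = []
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            task_id = int(info.filename[4:7])     # 'taskNNN.onnx' → NNN
            if task_id in EXCLUDED:
                problems.append(f"{info.filename}: excluded task")
            if info.file_size > FLOPPY_BYTES:
                problems.append(f"{info.filename}: over 1.44 MB")
    if os.path.getsize(zip_path) > FLOPPY_BYTES:
        problems.append("submission.zip over 1.44 MB")
    return problems
```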
+
+## 7. Files & Locations
+
+| Location | Path | Notes |
+|----------|------|-------|
+| HF Repo | `rogermt/neurogolf-solver` | All code + data |
+| Solver | `neurogolf_solver.py` | v4.1, 1270 lines |
+| Official utils | `neurogolf_utils.py` | Kaggle scoring lib (needs onnx_tool) |
+| ARC-GEN data | `ARC-GEN-100K.zip` | 400 files, 100K examples |
+| Notebooks | `neurogolf-2026-solver-notebooks.zip` | 5 reference notebooks |
+| Kaggle data | `/kaggle/input/competitions/neurogolf-2026/` | task JSONs with arc-gen |
+| Local ARC data | `ARC-AGI/data/training/` | 400 hex-named JSONs |
+
+## 8. LEARNING.md Maintenance Rules
+
+`LEARNING.md` is the knowledge-accumulation file. Update it when:
+- A bug is found and fixed – add it to the Mistakes Log with its root cause
+- A new approach is tried – record what worked, what didn't, and why
+- Competition analysis reveals new insights – add them to Competitive Intelligence
+- A version milestone is reached – update the Version History table
+- Performance is measured – record concrete numbers
+
+Structure: chronological within sections, newest entries first. Always include dates and version numbers. The goal is that a fresh agent with zero context can read LEARNING.md and understand every mistake to avoid and every technique that works.