Committed by rogermt
Commit c228c49 · verified · 1 parent: feaf007

Add SKILL.md - complete knowledge base for NeuroGolf solver

Files changed (1): SKILL.md (added, +312 lines)
---
name: neurogolf-solver
description: Build and improve an ONNX model generator for the NeuroGolf Championship (Kaggle). Produces 400 tiny ONNX models (opset 10, IR 10, input/output [1,10,30,30] one-hot float32) for ARC-AGI tasks. Scoring = max(1, 25 - ln(MACs + memory_bytes + params)). Lower cost = higher score. Use this skill whenever working on this competition, debugging submission failures, or starting a fresh session.
---

# NeuroGolf Solver — Complete Knowledge Base

## 1. Competition Format

### What is NeuroGolf?
The IJCAI-ECAI 2026 NeuroGolf Challenge on Kaggle. You build 400 tiny ONNX neural networks, one per ARC-AGI task. Each network transforms a one-hot encoded grid into another grid. Scoring rewards small, efficient networks.

### ONNX Model Spec
- **Input**: `"input"` float32 `[1, 10, 30, 30]` — one-hot encoded grid (10 color channels, 30×30 spatial)
- **Output**: `"output"` float32 `[1, 10, 30, 30]` — same format
- **Opset**: 10, IR version: 10 (but opset 17 ALSO works on Kaggle — see §3)
- **Max file size**: 1.44 MB per model (floppy disk limit)
- **Banned ops**: Loop, Scan, NonZero, Unique, Script, Function

### Scoring Formula
```
score_per_task = max(1.0, 25.0 - ln(MACs + memory_bytes + params))
total_score = sum(score_per_task for all 400 tasks)
```
- Unsolved tasks score 1.0 (not 0!)
- Max possible per task: 25.0 (cost = 0, e.g. Identity)
- **Excluded tasks**: {21, 55, 80, 184, 202, 366} — officially excluded, score 0 regardless

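The formula above can be sketched in Python (the `score_per_task` helper name is mine; the zero-cost case follows the Identity example above):

```python
import math

def score_per_task(macs: int, memory_bytes: int, params: int) -> float:
    """Per-task score: max(1, 25 - ln(total cost)); cost 0 earns the full 25."""
    cost = macs + memory_bytes + params
    if cost <= 0:
        return 25.0  # e.g. a zero-cost Identity model
    return max(1.0, 25.0 - math.log(cost))
```

Note how flat the logarithm makes the curve: a cost around 165K still scores about 13, so savings matter most near the very low end.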
### Submission Format
- `submission.zip` containing `task001.onnx` through `task400.onnx`
- Models must pass validation against ALL examples: **train + test + arc-gen**
- Optional: `submission.csv` with columns `task_id, total_cost`

### ARC-GEN Data (CRITICAL)
On Kaggle, each task JSON at `/kaggle/input/competitions/neurogolf-2026/taskNNN.json` contains:
```json
{"train": [...], "test": [...], "arc-gen": [...]}
```
The `arc-gen` key has **up to 262 additional examples per task** (100K total across 400 tasks) generated by Google's ARC-GEN system. **Models are validated against ALL splits including arc-gen.** A model that passes train+test but fails arc-gen scores ZERO on Kaggle.

Locally, ARC-GEN data lives in separate files at `ARC-GEN-100K/{hex_id}.json` as a list of `{input, output}` dicts and must be merged with the ARC-AGI task data.

## 2. Current State (v3 → v4 in progress)

### v3 Results: 307/400 solved locally, LB score ~501 (NOT ~3267)
The massive gap (3267 local vs 501 LB) means **most of our conv models fail ARC-GEN validation on Kaggle**. Each conv is fitted on ~6 train+test examples but must generalize to ~250 arc-gen examples of varying sizes. Many don't.

### Solver Breakdown (v3)
```
conv_var: 125, conv_fixed: 107, conv_diff: 39, spatial_gather: 16,
concat: 5, color_map: 4, concat_enhanced: 4, rotate: 3,
transpose: 2, upscale: 1, varshape_spatial_gather: 1
```

### Repository
- HF: `rogermt/neurogolf-solver`
- Files: `neurogolf_solver.py`, `neurogolf_utils.py` (official Kaggle utils), `ARC-GEN-100K.zip`, `neurogolf-2026-solver-notebooks.zip`

## 3. Key Differences: Our Solver vs High-Scoring Notebooks

### The 4200-point notebook (`neurogolf-2026-tiny-onnx-solver`)
This is a **BLEND notebook** — it does NOT solve tasks from scratch. It:
1. **Phase 1**: Loads 12+ other notebooks' `submission.zip` files as inputs
2. For each task, picks the cheapest valid model across all sources
3. **Phase 2**: Tries loose ONNX files from dataset inputs
4. **Phase 3**: Runs its own solver only on remaining unsolved tasks
5. Validates EVERYTHING against train+test+arc-gen before including
6. Result: 338/400 solved, est. score 4197.5

**Critical insight**: The 4200 score comes from BLENDING many solutions, not from a single solver. The solver itself adds 0 new tasks in Phase 3; all 338 come from other notebooks' pre-built models.

### The championship notebook (`the-2026-neurogolf-championship`)
Also a blend but with its own solver. Key differences from ours:
- Uses **opset 17** (not 10!) — works fine on Kaggle
- Has **shift detector**, **gravity detector**, **mirror detectors**, **fixed crop detector**, **outline detector**
- Has **composition detectors**: rotation+color, transpose+color, flip+color
- Has **channel reduction**: reduces 10→N channels for fewer colors → cheaper models
- Uses **PyTorch learned conv**: multi-seed Adam training, ternary weight snapping
- Uses **two-layer conv**: Conv→ReLU→Conv for complex patterns
- Validates against `train + arc-gen[:30]` (capped at 30 arc-gen examples)
- Result: 288 from own solver + more from blended inputs

### What they have that we don't
| Feature | Them | Us |
|---------|------|----|
| ARC-GEN validation | ✅ validate against arc-gen | ❌ v3 ignores arc-gen |
| ARC-GEN in fitting | ✅ uses arc-gen[:3] in detectors | ❌ fits only train+test |
| Opset 17 | ✅ uses freely | ❌ stuck on opset 10 |
| Shift detector | ✅ | ❌ |
| Gravity detector | ✅ | ❌ |
| Mirror detectors | ✅ (h, v, quad) | ❌ |
| Fixed crop detector | ✅ | ❌ |
| Extract outline | ✅ | ❌ |
| Composition (rot+color) | ✅ | ❌ |
| Channel reduction | ✅ (fewer channels = cheaper) | ❌ |
| PyTorch learned conv | ✅ (multi-seed, ternary snap) | ❌ (lstsq only) |
| Two-layer conv | ✅ (Conv→ReLU→Conv) | ❌ |
| Blend from other notebooks | ✅ (12+ sources) | ❌ |

## 4. The Submission Score Gap Problem

### Why LB = 501 when local = 3267
Our 307 solved tasks generate ONNX models locally. But on Kaggle:
1. Models are validated against `train + test + arc-gen` (all splits)
2. Conv models fitted on 6 examples often fail on 250+ arc-gen examples
3. Failed models score 0 (not even the 1.0 minimum)
4. Likely only ~40-50 of our 307 models actually pass on Kaggle

### The fix priority
1. **Validate locally against arc-gen** before submitting — only include models that pass
2. **Include arc-gen examples in conv fitting** — more data = better generalization
3. **Add more analytical solvers** (shift, mirror, gravity, crop) — these always generalize
4. **Try opset 17** — unlocks more ops, may work fine on Kaggle

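Fix #1 can be sketched as a split-agnostic local validator. `run_model` stands in for a wrapped onnxruntime session; all names here are illustrative rather than the solver's actual API:

```python
import numpy as np

def to_onehot(grid, ch=10, gh=30, gw=30):
    """One-hot encode a grid into float32 [1, ch, gh, gw], zero-padded."""
    a = np.zeros((1, ch, gh, gw), dtype=np.float32)
    g = np.asarray(grid, dtype=np.int64)
    h, w = g.shape
    a[0, g, np.arange(h)[:, None], np.arange(w)[None, :]] = 1.0
    return a

def passes_all_splits(run_model, td, splits=("train", "test", "arc-gen")):
    """True only if the model reproduces every example in every split.

    run_model: callable mapping a [1,10,30,30] array to a [1,10,30,30] array.
    """
    for split in splits:
        for ex in td.get(split, []):
            pred = run_model(to_onehot(ex["input"]))
            got = np.argmax(pred, axis=1)[0]        # [30,30] color indices
            target = np.asarray(ex["output"])
            h, w = target.shape
            if not np.array_equal(got[:h, :w], target):
                return False
    return True
```

With onnxruntime, `run_model` would be `lambda x: sess.run(["output"], {"input": x})[0]` for an `InferenceSession` loaded from the candidate .onnx file.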
## 5. Architecture & Code Structure

### `neurogolf_solver.py` structure
```
Constants: BATCH=1, CH=10, GH=GW=30
EXCLUDED_TASKS = {21, 55, 80, 184, 202, 366}

load_tasks_dir(data_dir, arcgen_dir)  # Load + merge ARC-GEN
to_onehot(grid)                       # Grid → [1,10,30,30]
validate(path, td)                    # Check model on ALL splits
score_network(path)                   # MACs + memory + params

Analytical Solvers (priority order):
  identity → constant → color_map → transpose → flip → rotate →
  tile → upscale → kronecker → concat → concat_enhanced →
  diagonal_tile → spatial_gather → varshape_spatial_gather

Conv Solvers:
  solve_conv_fixed()    — Fixed same-shape: Slice→Conv→ArgMax→Equal+Cast→Pad
  solve_conv_variable() — Variable same-shape: Conv(30×30)→ArgMax→Equal+Cast→Mul(mask)
  solve_conv_diffshape()— Fixed diff-shape (output ≤ input)
  solve_conv_var_diff() — Variable diff-shape (output ≤ input)

Main: solve_task() → run_tasks() → generate submission.zip + submission.csv
```

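The `solve_task()` flow amounts to "try solvers in priority order, keep the cheapest validated model". A minimal sketch, with `validate` and `score_network` injected as parameters so the snippet stands alone (the real function's signature in `neurogolf_solver.py` may differ):

```python
def solve_task(td, solvers, validate, score_network):
    """Try each solver in priority order; keep the cheapest model that
    validates against every split. Returns (cost, model_path) or None."""
    best = None
    for solver in solvers:
        path = solver(td)                # a solver returns an .onnx path or None
        if path is None or not validate(path, td):
            continue
        cost = score_network(path)       # MACs + memory_bytes + params
        if best is None or cost < best[0]:
            best = (cost, path)
    return best
```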
141
+ ### ONNX Building Patterns (opset 10)
142
+ ```python
143
+ # Model skeleton
144
+ def mk(nodes, inits=None):
145
+ x = helper.make_tensor_value_info("input", DT, [1,10,30,30])
146
+ y = helper.make_tensor_value_info("output", DT, [1,10,30,30])
147
+ g = helper.make_graph(nodes, "g", [x], [y], initializer=inits or [])
148
+ return helper.make_model(g, ir_version=10, opset_imports=[helper.make_opsetid("", 10)])
149
+
150
+ # One-hot via Equal+Cast (NOT OneHot — has CUDA issues)
151
+ classes = np.arange(10).reshape(1,10,1,1)
152
+ Equal(argmax_output, classes) → Cast(to=FLOAT)
153
+
154
+ # Spatial remap via Gather (NOT GatherElements — requires opset 11!)
155
+ Reshape([1,10,30,30] → [1,10,900]) → Gather(axis=2, indices=[900]) → Reshape back
156
+
157
+ # Conv pattern
158
+ Conv(input, W, kernel_shape=[ks,ks], pads=[pad]*4) → ArgMax → Equal+Cast → Mul(mask)
159
+
160
+ # Mask for variable-shape: ReduceSum(input, axes=[1], keepdims=1) gives 1 where content exists
161
+ ```

### Critical Op Compatibility
| Op | Opset Required | Notes |
|----|----------------|-------|
| Gather | 1 | ✅ Safe. Use axis=2 on flattened [1,10,900] |
| GatherElements | 11 | ❌ DO NOT USE with opset 10. Will fail on ORT 1.25+ |
| OneHot | 9 | ⚠️ No CUDA kernel. Use Equal+Cast instead |
| Conv | 1 | ✅ Safe |
| ArgMax | 1 | ✅ Safe |
| ReduceSum | 1 | ✅ Safe |
| Pad | 2 (opset-10 syntax) | ✅ Use the `pads` attribute for opset 10 |
| Slice | 10 | ✅ With starts/ends as inputs |
| Tile | 6 | ✅ Safe |
| ScatterElements | 11 | ⚠️ Requires opset 11+ |

## 6. Conv Fitting: lstsq vs PyTorch

### Current: lstsq (single-layer, closed-form)
```python
import numpy as np

# P:    [N, 10*ks*ks] patch feature vectors
# T_oh: [N, 10] one-hot targets;  T: [N] integer class labels
# (build_from_examples is the solver's own patch-extraction helper)
P, T_oh, T = build_from_examples(exs)
WT = np.linalg.lstsq(P, T_oh, rcond=None)[0]   # closed-form optimal weights
if (np.argmax(P @ WT, axis=1) == T).all():     # perfect-fit check
    ...  # SUCCESS: emit a Conv node with weights WT
```
- Fast, deterministic, optimal for the linear case
- FAILS when: pattern is nonlinear, too few examples, kernel too small

### Needed: PyTorch gradient descent (multi-layer)
```python
import torch
import torch.nn as nn

class TinyARC(nn.Module):
    def __init__(self, hidden=32, ks=5):
        super().__init__()
        self.conv1 = nn.Conv2d(10, hidden, ks, padding=ks // 2)
        self.conv2 = nn.Conv2d(hidden, 10, ks, padding=ks // 2)

    def forward(self, x):
        return self.conv2(torch.relu(self.conv1(x)))

# Train with MSE or cross-entropy, export with
# torch.onnx.export(model, dummy, path, opset_version=10),
# then add argmax+equal+cast+mask post-processing in ONNX manually.
```
- Can fit nonlinear patterns lstsq can't
- Multi-seed training (0, 7, 42) for robustness
- Ternary weight snapping: round weights to {-1, 0, 1} for smaller models

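The snapping step can be sketched as follows (the 0.33 threshold is an assumption, not the championship notebook's actual value; a real implementation would re-validate the snapped weights before accepting them):

```python
import numpy as np

def ternary_snap(w, threshold=0.33):
    """Round weights to {-1, 0, 1}: zero out small magnitudes, keep signs.

    `threshold` is the fraction of the max magnitude below which a weight
    snaps to 0 (an illustrative guess).
    """
    w = np.asarray(w, dtype=np.float32)
    scale = np.abs(w).max()
    if scale == 0.0:
        return w.copy()
    snapped = np.where(np.abs(w) < threshold * scale, 0.0, np.sign(w))
    return snapped.astype(np.float32)
```

Fewer distinct weight values means the model compresses better and often prunes whole taps to zero, shrinking both params and MACs.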
### ARC-GEN for conv fitting
The conv MUST generalize to arc-gen examples. Two approaches:
1. **Include arc-gen in fitting data** — use `train + test + arc-gen[:20]` for lstsq
2. **Validate against arc-gen after fitting** — only accept if it passes all splits

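Approach 1 is essentially a one-liner; a sketch (`fitting_examples` and the cap of 20 are illustrative names/values):

```python
def fitting_examples(td, arcgen_cap=20):
    """Examples used to fit a conv: train + test + a capped arc-gen slice."""
    return (td.get("train", []) + td.get("test", [])
            + td.get("arc-gen", [])[:arcgen_cap])
```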
## 7. Unsolved Tasks (94 in v3)

### Categories
| Category | Count | Why Unsolved |
|----------|-------|--------------|
| Variable diff-shape (output smaller) | ~60 | Output shape depends on input content |
| Variable diff-shape (output larger) | ~17 | Same problem |
| Same-shape, complex pattern | ~10 | Need larger kernels or multi-layer |
| Fixed diff-shape, output larger | ~7 | Input-content-dependent patterns |

### Fundamental Blocker
Variable-shape tasks where output size depends on input CONTENT cannot be solved with a static ONNX graph. The only workaround: the conv learns to put valid content in the right region, masked by an input-derived spatial mask.

## 8. Mistakes Log (DO NOT REPEAT)

### GatherElements (opset 11) — Fixed in v3
`GatherElements` requires opset 11. It works on Kaggle's old ORT but fails on ORT 1.25+. Replaced with `Gather` (opset 1) using 1D indices on the flattened spatial dim.

### s_flip still used GatherElements — Fixed in v4
The `s_flip` solver was still using `GatherElements`. Must use `_build_gather_model()` instead.

### ARC-GEN not loaded — The #1 score killer
v3 had `if 'arc-gen' in td` in validate() but never loaded arc-gen data into `td`. So validation always passed (no arc-gen to check), but Kaggle validated against arc-gen and most conv models failed.

### Conv fitted on too few examples
Fitting on 6 train+test examples overfits to a small sample. Must include arc-gen examples in the fitting data for better generalization.

### No submission.csv
Kaggle may need submission.csv alongside submission.zip.

### Wrong score_network without onnx_tool
Our fallback `score_network` returned `(0, 0, 0)` instead of real costs. Need a static profiler that matches Kaggle's calculation.

### Ignored EXCLUDED tasks
Wasted time trying to solve tasks 21, 55, 80, 184, 202, 366, which are officially excluded.

## 9. Competitive Strategy

### Path to 4800+ LB score
1. **Fix ARC-GEN validation** — immediately recover ~200 points from models that actually work
2. **Add missing analytical solvers** (shift, mirror, gravity, crop, composition) — +20-30 tasks, ~13 points each
3. **PyTorch multi-layer conv** — solve 5-10 more complex same-shape tasks
4. **Channel reduction** — reduce cost of existing solutions by 30-50%
5. **Blend with other notebooks** — the 4200 notebook proves this is the meta-strategy

### Quick wins
- Transpose: score = 25.0 (cost = 0, just permute dims) — already have
- Identity: score = 25.0 — already have
- Color map via channel Gather: cheaper than Conv 1×1 (params+nbytes only, no MACs)
- Analytical solvers: ~13 points each (cost ≈ 165K)
- Small conv (ks=1): ~11-13 points
- Large conv (ks=29): ~7 points

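The color-map quick win first needs the mapping inferred from examples. A sketch (`infer_color_map` is illustrative, not the solver's actual detector; it requires the map to be injective, since a channel Gather cannot merge two input channels into one output channel):

```python
import numpy as np

def infer_color_map(examples):
    """Return Gather indices idx (len 10) such that output channel c is taken
    from input channel idx[c], or None if examples disagree or the map is
    not injective. Unused colors keep the identity mapping."""
    mapping = {}
    for ex in examples:
        inp, out = np.asarray(ex["input"]), np.asarray(ex["output"])
        if inp.shape != out.shape:
            return None
        for a, b in zip(inp.ravel().tolist(), out.ravel().tolist()):
            if mapping.setdefault(a, b) != b:
                return None  # same input color maps to two outputs
    if len(set(mapping.values())) != len(mapping):
        return None  # not injective: Gather can't merge channels
    idx = np.arange(10)
    for a, b in mapping.items():
        idx[b] = a
    return idx
```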
## 10. Data & File Locations

### On Kaggle
```
/kaggle/input/competitions/neurogolf-2026/
    task001.json ... task400.json       (with train+test+arc-gen)
    neurogolf_utils/neurogolf_utils.py
```

### Locally
```
ARC-AGI/data/training/   # 400 hex-named .json files (train+test only)
ARC-GEN-100K/            # 400 hex-named .json files (arc-gen examples)
neurogolf-solver/
    neurogolf_solver.py  # Main solver
    neurogolf_utils.py   # Official Kaggle utils (needs onnx_tool, IPython)
```

### ARC-GEN file format
```python
# ARC-GEN-100K/{hex_id}.json is a LIST of examples:
[{"input": [[...]], "output": [[...]]}, ...]
# Must be merged into task data as td['arc-gen'] = list_of_examples
```

### ARC-GEN GitHub generator
https://github.com/google/ARC-GEN — can generate MORE examples per task if needed.

## 11. Reference Notebooks (in repo as neurogolf-2026-solver-notebooks.zip)

| Notebook | LB Score | Tasks | Key Technique |
|----------|----------|-------|---------------|
| neurogolf-2026-tiny-onnx-solver | ~4200 | 338 | Mega-blend of 12+ notebooks |
| 4200-v5-neurogolf-fix | ~5700 est | 341 | Same blend, manual LLM rescue tasks |
| the-2026-neurogolf-championship | ~3200 est | 288 | Own solver + blend |
| neurogolf-logic-driven-ensembling | — | 401 | Pure ensembling from zips |

## 12. Testing Checklist

Before any Kaggle submission:
- [ ] All models validated against train + test + arc-gen (locally)
- [ ] EXCLUDED tasks {21,55,80,184,202,366} not included
- [ ] No GatherElements (opset 11) in any model
- [ ] No banned ops (Loop, Scan, NonZero, Unique)
- [ ] Each .onnx file < 1.44 MB
- [ ] submission.zip < 1.44 MB total
- [ ] submission.csv generated
- [ ] Local estimated score calculated with static profiler
- [ ] Compared local score vs expected LB (should be close now)
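
The op-related checklist items can be automated with a small scan (names and structure are mine; `op_types` would come from `[n.op_type for n in model.graph.node]`):

```python
BANNED_OPS = {"Loop", "Scan", "NonZero", "Unique"}
OPSET11_ONLY = {"GatherElements", "ScatterElements"}

def check_ops(op_types, opset=10):
    """Return a list of problem descriptions for a model's node op types."""
    problems = []
    for op in op_types:
        if op in BANNED_OPS:
            problems.append(f"banned op: {op}")
        elif opset < 11 and op in OPSET11_ONLY:
            problems.append(f"{op} needs opset 11 but model declares {opset}")
    return problems
```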