rogermt committed · Commit 1b5636f · verified · 1 parent: b311dde

Move own-solver/LEARNING.md to own-solver/

Files changed (1):
  1. own-solver/LEARNING.md (+237, -0)

own-solver/LEARNING.md (ADDED):
# NeuroGolf Solver — Learning & History

> This file accumulates everything learned across sessions. Read it to avoid repeating mistakes and to understand what techniques work. Newest entries first within each section.

## Version History

| Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
|---------|------|---------------------------|--------|-------------|
| **v5.2** | **2026-04-26** | **52 locally; REJECTED on Kaggle** | **~710 (local)** | gravity.py (Task 78), mode.py (Task 129), edge.py (0 matches). **Kaggle rejected the submission — profiler/validation gap.** |
| v5.1 | 2026-04-26 | 49 | ~604 | Exp 3: PCA/SVD gave 0 PCR solves. Refactored conv.py into composable primitives. |
| v5.0 | 2026-04-26 | 49 | ~604 | Refactored into a 16-file package, opset 17 (IR 8), Slice-based flip/rotate, lstsq crash fix |
| v4.3 | 2026-04-25 | 50 | ~670 | Updated docs. NO code changes. |
| v4.2 | 2026-04-24 | 50 | ~670 | PyTorch learned conv. Needs GPU. |
| v4.1 | 2026-04-24 | 50 | ~670 | Color-map Gather for permutations (+15 pts) |
| v4.0 | 2026-04-24 | 50 | ~656 | ARC-GEN validation, new analytical solvers, static profiler, submission.csv |
| v3 | 2026-04-24 | 307 (local) / ~40 (LB) | 501 | concat_enhanced, varshape_spatial_gather, conv_var_diff |
| v2 | prior | 294 (local) | unknown | spatial_gather, variable-shape conv, diff-shape conv |
| v1 | prior | 128 | unknown | Conv solver only |

## Mistakes Log (DO NOT REPEAT)

### 2026-04-26: Agent replaced user's score_network (onnx_tool) with a silent fallback — CAUSED KAGGLE REJECTION

- **What**: The v5 refactor created `profiler.py` with a `_static_profile()` fallback that runs when `onnx_tool` is not installed. The fallback is wrapped in a bare `except: pass`, so if `onnx_tool` fails on a model (dynamic shapes, unsupported ops, opset 17 issues), the code **silently** falls through to a crude static profiler that returns fake scores instead of surfacing the error.
- **Result**: The user's v5.2 submission was **rejected by Kaggle**. The 49 previously accepted tasks worked, but the 3 new models (gravity.py, edge.py, mode.py) likely failed `onnx_tool.loadmodel()` shape inference or profiling. The local static profiler returned numbers that looked valid, so the user had no warning before submitting.
- **Root cause**:
  1. The user originally coded `score_network` to call `neurogolf_utils.score_network()` directly — which uses `onnx_tool` and surfaces errors properly.
  2. The agent's v5 refactor wrapped it in `try/except: pass` and added the `_static_profile()` fallback.
  3. `_static_profile()` only counts Conv MACs (missing ReduceSum, Where, MatMul, etc.), only counts initializer bytes, and does NOT verify static shapes or check `onnx_tool` compatibility.
  4. The fallback **hides failures** — models that Kaggle's `score_network` would reject appear to score fine locally.
- **The official validation pipeline** (from `neurogolf_utils.py`):
  1. `check_network(filename)` — file size ≤ 1.44 MB
  2. `onnxruntime.InferenceSession(filename)` — model loads
  3. `verify_subset(session, examples)` — correct outputs on all splits
  4. `score_network(filename)` — uses `onnx_tool.loadmodel()` → `g.shape_infer()` → `g.profile()` → checks `g.valid_profile`, banned ops (UPPERCASE), negative memory. Returns `(None, None, None)` if ANY of these fail → model is NOT READY for submission.
- **What the static profiler gets wrong**:
  - Only counts Conv MACs — the gravity model has Conv+ReduceSum+Where+Greater+And+Not per step, all uncounted
  - The banned-op check uses the mixed-case set `{'Loop', 'Scan', ...}`, but Kaggle checks `op_type.upper()` against `["LOOP", "SCAN", ...]`
  - No `onnx.checker.check_model()` call
  - No static shape verification
  - No `onnx_tool` compatibility check
- **Rule**: NEVER silently fall back to a weaker validator. If the official scoring tool fails on a model, that model MUST be treated as unsolved. Surface the error; don't hide it.
- **Rule**: NEVER change the user's validation pipeline without understanding what it does. The user's `score_network` call was correct — it used `onnx_tool` directly.

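A minimal sketch of the rule applied in code (the function and its name are illustrative, not the actual `profiler.py` API): score with the official tool and fail loudly, never fall back.

```python
def score_strict(score_network, model_path):
    """Score a model with the official tool, or fail loudly.

    `score_network` is any callable with the official contract:
    returns (macs, memory, params), or (None, None, None) on failure.
    Raises instead of falling back: a model the official profiler
    cannot score must be treated as unsolved, never approximated.
    """
    macs, memory, params = score_network(model_path)
    if macs is None or memory is None or params is None:
        raise RuntimeError(
            f"{model_path}: score_network returned (None, None, None) "
            "- model is NOT scoreable and must be treated as unsolved"
        )
    return macs, memory, params
```

A solver registry would catch the `RuntimeError` and move on to the next candidate solver rather than keeping an unscoreable model.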
### Fix Plan (must be done before next submission)

1. **profiler.py**: Remove the silent fallback. If `onnx_tool` is available, use it. If it returns `(None, None, None)`, the model is REJECTED (unsolved). If `onnx_tool` is not installed, print a loud WARNING that scores are approximate and may not match Kaggle.

2. **validators.py**: Add a `check_network()` equivalent — file size check (already done), `onnx.checker.check_model()`, banned-op scan (UPPERCASE comparison), static shape verification on all tensors.

3. **solver_registry.py**: After a model passes `validate()` (correct outputs), also run `score_network()` from the profiler. If it returns `(None, None, None)`, treat the model as failed and try the next solver. This catches models that produce correct outputs but can't be scored by Kaggle.

4. **main.py**: `--strict_size` already stops on oversized files. Add `--strict_score` (default True) — stop if any solved model returns `(None, None, None)` from `score_network()`.

5. **Test on Kaggle notebook**: Before submitting, run `neurogolf_utils.verify_network()` on ALL solved models in a Kaggle notebook. This is the ONLY way to be sure — local testing without `onnx_tool` cannot catch all failure modes.

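Item 2's banned-op scan must compare in uppercase, since Kaggle uppercases `op_type` before checking. A sketch of the fixed scan (the banned list comes from the pipeline notes above; feeding it the graph's `op_type` strings is left to the caller):

```python
# Banned ops exactly as Kaggle checks them: compare op_type.upper()
# against an UPPERCASE list, so 'Loop', 'loop', and 'LOOP' all match.
BANNED_OPS = {"LOOP", "SCAN", "NONZERO", "UNIQUE", "SCRIPT", "FUNCTION"}

def find_banned_ops(op_types):
    """Return the (original-case) op types that Kaggle would reject."""
    return [op for op in op_types if op.upper() in BANNED_OPS]
```

With a real model this would be called as `find_banned_ops(n.op_type for n in model.graph.node)`.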
### 2026-04-26: Agent put the entire 1400-line codebase into a single file, repeatedly overwriting the user's code

- **What**: When implementing the v5 opset 17 changes, the agent uploaded the entire solver as a single `neurogolf_solver.py` file — three times. Each upload overwrote the user's `run_tasks`, `main`, and W&B code, which the agent couldn't read (the read tool truncates at ~1000 lines).
- **Result**: The user's W&B logging code was deleted. The user's `run_tasks` function was deleted. The user had to point the agent to a specific commit (3f3d372) to recover.
- **Root cause**: (1) The agent couldn't read the tail of the file due to tool truncation, so it rewrote the entire file from scratch instead of making surgical edits. (2) The agent prioritized "getting it done" over preserving existing working code.
- **Rule**: NEVER rewrite an entire file when you can't read all of it. Make surgical edits. NEVER destroy code you can't see.

### 2026-04-26: lstsq SVD non-convergence crash on task 313

- **What**: `np.linalg.lstsq(P, T_oh, rcond=None)` raised `LinAlgError: SVD did not converge`.
- **Fix**: Wrapped lstsq in `try/except (LinAlgError, ValueError): return None` at all call sites.
- **Rule**: EVERY lstsq call must be guarded.

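The guard, as a self-contained sketch (the helper name is illustrative; every lstsq call site should follow this shape):

```python
import numpy as np

def solve_weights_safe(P, T_oh):
    """Least-squares solve that returns None instead of crashing.

    The SVD inside lstsq can fail to converge on ill-conditioned
    patch matrices (seen on task 313), so the call is guarded and
    a failure is reported as "this solver cannot fit the task".
    """
    try:
        WT, *_ = np.linalg.lstsq(P, T_oh, rcond=None)
    except (np.linalg.LinAlgError, ValueError):
        return None
    return WT
```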
### 2026-04-26: ReduceSum axes attribute invalid in opset 17

- **What**: Code passed axes as an attribute instead of as a tensor input (an opset 13+ requirement).
- **Fix**: Created a `_build_reducesum()` helper with axes as an int64 initializer tensor.
- **Rule**: Audit ALL operators for breaking API changes when changing opset.

### 2026-04-26: Fake excluded tasks {21, 55, 80, 184, 202, 366}

- **What**: The agent added 6 "excluded" tasks to constants.py. There are NO excluded tasks — all 400 count.
- **Fix**: `EXCLUDED_TASKS = set()`
- **Rule**: All 400 tasks must be attempted. Do not invent exclusions.

### 2026-04-26: est_lb inflated by adding unsolved × 1.0

- **What**: `est_lb = total_score + unsolved_count * 1.0` double-counted unsolved task scores.
- **Fix**: Report only the solved score. Unsolved tasks get 1.0 on Kaggle automatically.
- **Rule**: est_lb should reflect only what we earn from solved tasks.

### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
- **Rule**: NEVER mark a feature as done until it is validated against full arc-gen.

### 2026-04-25: Agent created a version-named file, violating project convention
- **Rule**: No version numbers in filenames.

### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen without evidence
- **Rule**: Theory from papers is NOT proof. Test first.

### 2026-04-25: Agent misrepresented the user's intent — BLENDING is NOT the strategy
- **Rule**: LEARNING.md reflects the USER'S strategy.

### 2026-04-25: Composition detectors, channel reduction wrapper — untested dead code
- **Rule**: Only add a solver if it demonstrably solves ≥1 task.

### 2026-04-25: Agent delivered untested code and asked the user to validate it
- **Rule**: VALIDATE FIRST, DELIVER SECOND.

### 2026-04-24: PyTorch 2-layer conv — fits training but doesn't generalize to arc-gen
### 2026-04-24: Arc-gen in lstsq fitting exposes overfitting
### 2026-04-24: CuPy/GPU for lstsq — DOES NOT HELP
### 2026-04-24: Channel Gather for non-permutation color maps — WRONG OUTPUT
### 2026-04-24: ARC-GEN not loaded — THE #1 SCORE KILLER (v3→v4 fix)
### 2026-04-24: s_flip used GatherElements — OPSET 11 BUG
### 2026-04-24: score_network fallback returned (0, 0, 0)
### 2026-04-24: Ignored EXCLUDED tasks

## Competitive Intelligence

### What Others Do (For Awareness Only — We Do NOT Blend)

#### Why top notebooks score 4000+ and we score ~670

Top notebooks are **BLENDERS** — they assemble pre-solved ONNX models from public sources.

**Our strategy**: Build our own solver. No blending. No public datasets.

#### The 6 Key Techniques They Have That We Lack

1. **Opset 17** — ✅ DONE in v5. Slice+Transpose for near-zero-cost transforms.
2. **Channel Reduction Wrapper** — 🔲 Not yet. Conv1x1(10→N) → transform → Conv1x1(N→10).
3. **Composition Detectors** — 🔲 Not yet. Need to scan the 400 tasks to find actual instances first.
4. **Best-of-N Model Selection** — 🔲 Not yet. Generate 20+ candidates, keep the cheapest valid one.
5. **ONNX Optimizer Pass** — 🔲 Not yet. onnxoptimizer.optimize() for dead-code elimination.
6. **LLM Rescue** — 🔲 Not yet. Per-task ONNX graphs for algorithmic tasks (gravity, outline, etc.)

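Technique 2 can be pictured in plain NumPy before committing to ONNX: a 1x1 conv is a per-pixel matmul over channels, so the wrapper is reduce, transform, expand. A conceptual sketch (not implemented in the solver; shapes and names are illustrative):

```python
import numpy as np

def channel_reduction_wrapper(grid_onehot, W_down, transform, W_up):
    """grid_onehot: (10, H, W) one-hot color channels.
    W_down: (N, 10) reduces 10 channels to N (the 1x1 'down' conv);
    W_up:   (10, N) expands back to 10.  Everything between the two
    1x1 convs runs on N channels, shrinking its MACs and params."""
    h = np.einsum("nc,chw->nhw", W_down, grid_onehot)  # Conv1x1(10->N)
    h = transform(h)                                    # cheap middle op
    return np.einsum("cn,nhw->chw", W_up, h)            # Conv1x1(N->10)
```

With `W_down = W_up.T` chosen as identity-like matrices and `transform` the identity, the wrapper round-trips the grid unchanged, which is the sanity check to run before inserting a real transform.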
## Deep Research Findings

### Exp 3: PCA/Truncated SVD Before lstsq — FULL RESULTS (2026-04-26)

**Implementation:** Refactored conv.py into composable primitives:
- `_build_patch_matrix(exs, ks, bias, full_30)` → P, T, T_oh
- `_solve_weights(P, T, T_oh)` → WT via raw lstsq
- `_solve_weights_pcr(P, T, T_oh, thresholds)` → WT via PCA regression
- `_extract_weights(WT, ks, bias)` → Wconv, B for ONNX

**Full 400-task run:** 0 PCR solves, 0 regressions.

**Conclusion:** Architecture mismatch, not regularization. Regularization experiments are exhausted.

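For the record, the PCR variant that was tried amounts to solving in the top singular subspace of the patch matrix. A minimal sketch of the idea (illustrative, not the exact `_solve_weights_pcr` code; a single energy threshold stands in for the thresholds list):

```python
import numpy as np

def solve_weights_pcr(P, T_oh, energy=0.99):
    """Principal-component regression: keep the top-k singular
    directions of P holding `energy` of the spectral mass, solve
    in that subspace, and map the solution back to feature space."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(frac, energy)) + 1
    # Truncated pseudo-inverse solve: W = V_k diag(1/s_k) U_k^T T_oh
    return Vt[:k].T @ ((U[:, :k].T @ T_oh) / s[:k, None])
```

With `energy` close to 1 this degenerates to the plain lstsq solution, which matches the observed result: truncation never helped, because the failures were architectural rather than noise-driven.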
### ONNX Opset 17 Migration Notes (2026-04-26)

**Breaking changes from opset 10:**

| Operator | Opset 10 | Opset 13+ (incl. 17) |
|----------|----------|----------------------|
| ReduceSum | axes as **attribute** | axes as **tensor input** |
| ReduceMean | axes as **attribute** | axes as **tensor input** |
| Pad | pads as **attribute** | pads as **tensor input** (since opset 11) |
| Slice | attributes until opset 9; tensor inputs from opset 10 | **steps** as optional 5th tensor input (since opset 10) |

### Official Scoring Pipeline (from neurogolf_utils.py) — READ BEFORE CODING

```python
# This is what Kaggle runs. Our validator MUST match this.

def check_network(filename):
    # 1. File must exist
    # 2. File size ≤ 1.44 MB (1.44 * 1024 * 1024 bytes)
    ...

def score_network(filename):
    # Uses onnx_tool.loadmodel() → shape_infer() → profile()
    # Checks: g.valid_profile (static shapes required)
    # Checks: op_type.upper() not in ["LOOP","SCAN","NONZERO","UNIQUE","SCRIPT","FUNCTION"]
    # Checks: g.nodemap[key].memory >= 0
    # Returns (macs, memory, params), or (None, None, None) on ANY failure
    # (None, None, None) = "Your network performance could not be measured" = REJECTED
    ...

def verify_network(network, task_num, examples):
    # 1. onnx.save → check_network (size)
    # 2. InferenceSession (loads ok?)
    # 3. verify_subset on train+test (correct outputs?)
    # 4. verify_subset on arc-gen (correct outputs?)
    # 5. score_network (scoreable by onnx_tool?)
    # ALL must pass for "IS READY for submission"
    ...
```

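A local pre-flight mirroring step 1 of that pipeline costs nothing and catches oversize files before upload. A sketch (the size limit is taken from `check_network` above; the function name is illustrative):

```python
import os

MAX_BYTES = int(1.44 * 1024 * 1024)  # 1.44 MB, per check_network

def preflight_size(path):
    """Replicate check_network's file checks locally:
    the file must exist and fit under the size cap."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path}: {size} bytes exceeds {MAX_BYTES}")
    return size
```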
## What Has NOT Worked

| Technique | Result | Why |
|-----------|--------|-----|
| **PCA/Truncated SVD (Exp 3)** | **0/400 PCR solves** | **Signal in noise dims; unsolved tasks = architecture mismatch** |
| **Silent profiler fallback** | **Kaggle rejection** | **Hides onnx_tool failures, returns fake scores** |
| Ridge/LOOCV λ | Fails arc-gen | Catastrophic, not benign, overfitting |
| Skip ks=5,7,9 (Exp 1) | Hurts 2 tasks | Some tasks genuinely need interpolation-regime ks |
| CuPy GPU lstsq | OOM + same speed | O(n³) SVD bottleneck |
| PyTorch 2-layer (no arc-gen) | 0/30 arc-gen pass | Memorizes training |

## Technical Notes

### ARC-AGI Task Statistics
- 400 tasks total. NO excluded tasks — all 400 count.
- ~25 analytical tasks and ~25 conv tasks survive arc-gen; ~350 unsolved

### Score Calculation (official, from neurogolf_utils.py)

```python
# Uses onnx_tool for exact MACs/memory/params — NOT our static profiler
macs, memory, params = score_network(filename)  # onnx_tool based
points = max(1.0, 25.0 - math.log(macs + memory + params))
```

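For intuition, a worked example of that formula (natural log, per `math.log`; the wrapper function is illustrative):

```python
import math

def points(macs, memory, params):
    # Official formula: 25 minus the natural log of the total cost,
    # floored at 1.0 so every solved task earns at least one point.
    return max(1.0, 25.0 - math.log(macs + memory + params))

# A total cost of ~22,000 earns about 15 points; below the floor,
# cutting cost by a factor of e (~2.72x) gains exactly one point.
```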
## Session Notes for Future Agents

**Before touching code:**
1. Read this file (LEARNING.md) — all the way through
2. Read SKILL.md — especially "Development Methodology" and "Submission Checklist"
3. Read TODO.md — check the experiment log and research queue
4. Run the current solver on 20-50 tasks to establish a baseline
5. Only then: design the experiment, implement, validate, compare

**Code structure (v5.2):**
- The solver is a Python package at `neurogolf_solver/`
- Run with `python -m neurogolf_solver.main [args]`
- Solvers live in separate files: `analytical.py`, `geometric.py`, `tiling.py`, `conv.py`, `gravity.py`, `edge.py`, `mode.py`
- Edit individual files surgically — NEVER rewrite the whole package
- The legacy `neurogolf_solver.py` at the root is v4, kept for reference — do NOT edit it

**CRITICAL: Scoring & Validation:**
- The ONLY reliable scoring is `neurogolf_utils.score_network()`, which uses `onnx_tool`
- `profiler.py`'s `_static_profile()` is a fallback that DOES NOT match Kaggle scoring
- Before submitting: run `neurogolf_utils.verify_network()` on ALL solved models in a Kaggle notebook
- If `score_network` returns `(None, None, None)`, the model is REJECTED — do not submit it

**Before claiming a feature works:**
- Must pass arc-gen on ≥20 tasks (or the full 400 if cheap)
- Must pass `neurogolf_utils.verify_network()` — not just our own validate()
- Must include an A/B comparison

**Before uploading code:**
- Must have run the full 400-task arc-gen validation
- Must confirm the total score ≥ previous best
- NEVER change the scoring/validation pipeline without understanding what it does