rogermt committed
Commit 1963657 · verified · 1 Parent(s): 2231985

LEARNING.md: log profiler.py silent fallback mistake + Kaggle rejection + fix plan

  1. LEARNING.md +112 -121
LEARNING.md CHANGED
@@ -6,56 +6,99 @@
6
 
7
  | Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
8
  |---------|------|--------------------------|--------|-------------|
9
- | **v5.1** | **2026-04-26** | **49** | **~603.6** | Exp 3: PCA/SVD tested on 400 tasks, 0 PCR solves. Refactored conv.py into composable primitives. PCR fallback added (deferred 2nd pass). No regressions. |
10
- | v5.0 | 2026-04-26 | 49 | ~603.6 | Refactored to 16-file package, opset 17 (IR 8), Slice-based flip/rotate (0 MACs), tensor-based Pad & ReduceSum, lstsq crash fix |
11
- | v4.3 | 2026-04-25 | 50 | ~670 | Updated TODO.md + SKILL.md + LEARNING.md with closed-loop methodology. NO code changes. |
12
- | v4.2 | 2026-04-24 | 50 | ~670 | Added PyTorch learned conv (single+two-layer, multi-seed, ternary snap). Needs GPU. |
 
13
  | v4.1 | 2026-04-24 | 50 | ~670 | Color map Gather for permutations (+15 pts) |
14
- | v4.0 | 2026-04-24 | 50 | ~656 | ARC-GEN validation, new analytical solvers, s_flip fix, static profiler, submission.csv |
15
- | v3 | 2026-04-24 | 307 (local) / ~40 (LB) | 501 | Added concat_enhanced, varshape_spatial_gather, conv_var_diff |
16
  | v2 | prior | 294 (local) | unknown | Spatial_gather, variable-shape conv, diff-shape conv |
17
  | v1 | prior | 128 | unknown | Conv solver only |
18
 
19
  ## Mistakes Log (DO NOT REPEAT)
20
 
21
  ### 2026-04-26: Agent put entire 1400-line codebase into a single file, repeatedly overwrote user's code
22
 
23
  - **What**: When implementing v5 opset 17 changes, agent uploaded the entire solver as a single `neurogolf_solver.py` file, three times. Each upload overwrote the user's `run_tasks`, `main`, and W&B code that the agent couldn't read (the read tool truncates at ~1000 lines).
24
  - **Result**: User's W&B logging code was deleted. User's `run_tasks` function was deleted. User had to point agent to a specific commit (3f3d372) to recover.
25
- - **Root cause**: (1) Agent couldn't read the tail of the file due to tool truncation, so it rewrote the entire file from scratch instead of making surgical edits. (2) No Python best practice says "put all code in one file"; the opposite is true. (3) Agent prioritized "getting it done" over preserving existing working code.
26
- - **Rule**: NEVER rewrite an entire file when you can't read all of it. Use the `edit` tool for targeted string replacements. If the file is too large to read, split it into smaller files FIRST (which is what the user ultimately had to specify). NEVER destroy code you can't see.
27
 
28
  ### 2026-04-26: lstsq SVD non-convergence crash on task 313
29
 
30
- - **What**: `np.linalg.lstsq(P, T_oh, rcond=None)` raised `LinAlgError: SVD did not converge` during `solve_conv_variable` for task 313.
31
- - **Result**: Entire solver crashed, no further tasks processed.
32
- - **Root cause**: The `_lstsq_conv` function had no try/except around the lstsq call. `solve_conv_var_diff` already had one, but `_lstsq_conv` (used by `solve_conv_fixed` and `solve_conv_variable`) did not.
33
- - **Fix**: Wrapped lstsq in `try/except (np.linalg.LinAlgError, ValueError): return None` in all three call sites (`_lstsq_conv`, `solve_conv_diffshape` inline lstsq).
34
- - **Rule**: EVERY lstsq call must be guarded. SVD non-convergence is rare but real, especially for ill-conditioned patch matrices from unusual grid patterns.
35
 
36
  ### 2026-04-26: ReduceSum axes attribute invalid in opset 17
37
 
38
- - **What**: Code used `ReduceSum(['data'], ['output'], axes=[1,2,3], keepdims=1)` which puts axes as a node attribute. In opset 13+, axes must be a tensor input, not an attribute.
39
- - **Result**: Models would fail ONNX checker validation and potentially fail on Kaggle inference server.
40
- - **Fix**: Created `_build_reducesum()` helper that adds axes as an int64 initializer tensor and passes it as the 2nd input to ReduceSum. Applied to `s_constant` (axes=[1,2,3]), `solve_conv_variable` (axes=[1]), `solve_conv_var_diff` (axes=[1]).
41
- - **Rule**: When changing opset version, audit ALL operators for breaking API changes. Key opset 13 changes: ReduceSum, ReduceMean, ReduceMax all moved axes from attribute to tensor input. Pad moved pads from attribute to tensor input at opset 11. Slice added steps input at opset 13.
42
 
43
  ### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
44
- - **What**: Generated neurogolf_solver_v5.py with opset 17 Slice-based transforms, LOOCV Ridge tuning, stride_tricks, composition detectors, channel reduction wrapper, and claimed all features were "working" in the docstring and README
45
- - **Result**: Uploaded to repo, overwrote neurogolf_solver.py. Tested only 10 individual tasks manually. 3/10 FAILED arc-gen validation (tasks 4, 6, 241 conv models). NEVER ran full 400 with arc-gen validation. LOOCV Ridge theory in code was never tested against actual data. Estimated LB score is UNKNOWN; cannot claim improvement over v4's proven ~670.
46
- - **Rule**: NEVER mark a feature as done until it is validated against full arc-gen data on a representative sample of tasks.
47
 
48
- ### 2026-04-25: Agent created version-named file (neurogolf_solver_v5.py) violating project convention
49
- - **Rule**: No version numbers in filenames. Use git commits for version tracking.
50
 
51
- ### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen survival without evidence
52
- - **Rule**: Theory from papers is NOT proof for our specific data. Run A/B experiments first.
53
 
54
- ### 2026-04-25: Agent misrepresented user's intent: BLENDING is NOT the user's strategy
55
- - **Rule**: LEARNING.md must reflect the USER'S strategy. Competitive intelligence goes in "What Others Do" section only.
56
 
57
  ### 2026-04-25: Composition detectors, channel reduction wrapper: untested dead code
58
- - **Rule**: Only add a solver if it demonstrably solves ≥1 task. Delete dead code.
59
 
60
  ### 2026-04-25: Agent delivered untested code and asked user to validate it
61
  - **Rule**: VALIDATE FIRST, DELIVER SECOND.
@@ -98,56 +141,9 @@ Top notebooks are **BLENDERS**: they assemble pre-solved ONNX models from pub
98
  - `_solve_weights_pcr(P, T, T_oh, thresholds)` → WT via PCA regression
99
  - `_extract_weights(WT, ks, bias)` → Wconv, B for ONNX
100
 
101
- All 4 conv solvers use deferred 2-pass design:
102
- - Pass 1: raw lstsq at all ks (identical behavior to baseline)
103
- - Pass 2: PCR on ks values where lstsq fit train but failed arc-gen validation
104
-
105
- **PCR algorithm:**
106
- ```python
107
- U, s, Vt = SVD(P)
108
- cumvar = cumsum(s²) / sum(s²)
109
- for thresh in [0.999, 0.99, 0.95]:
110
- k = searchsorted(cumvar, thresh) + 1
111
- k = max(k, 5)
112
- P_red = U[:,:k] * s[:k] # project to top-k components
113
- w_red = lstsq(P_red, T_oh)
114
- w_full = Vt[:k].T @ w_red # map back to full p-dim
115
- ```
116
-
117
- **Diagnostic results on 25 solved conv tasks:**
118
 
119
- | p/n regime | # Tasks | PCR train-fit? | Arc-gen impact |
120
- |------------|---------|----------------|----------------|
121
- | < 0.5 | 17 | Yes (0.99 thresh) | Already 100%, no improvement |
122
- | 0.5-1.0 | 0 | N/A | N/A |
123
- | > 1.0 | 8 | 4/8 fail at ALL thresholds | PCR removes signal-carrying dimensions |
124
-
125
- Key observation: at p/n > 1.0, the "noise" dimensions PCA removes actually carry part of the training signal. Truncation causes train_fail: the model can't even fit training data after dimensionality reduction.
126
-
127
- **Diagnostic results on 345 unsolved tasks (same-shape, ks≤9):**
128
-
129
- - Only **10 tasks** have any ks where lstsq fits training
130
- - PCR improves arc-gen on **4 tasks** but none reach 100%:
131
- - Task 32: 87.5% → 94.9% (+7.4%)
132
- - Task 389: 87.2% → 95.7% (+8.5%)
133
- - Task 129: 59.6% → 63.0% (+3.4%)
134
- - Task 229: 57.0% → 60.0% (+3.0%)
135
-
136
- **Full 400-task run:** 0 PCR solves, 0 regressions, 49/49 baseline tasks preserved.
137
-
138
- **Why it failed:** Three distinct failure modes:
139
- 1. **p/n < 0.5 (17/25 solved tasks):** lstsq already generalizes perfectly. PCR is unnecessary overhead.
140
- 2. **p/n > 1.0 (8/25 solved tasks):** Signal requires ALL dimensions. PCA truncation destroys the training fit. The minimum-norm solution from lstsq distributes weight across ALL singular vectors, and removing any causes prediction errors.
141
- 3. **335/345 unsolved tasks:** No ks fits training at all. The task requires non-local operations (flood fill, mode counting, conditional logic) that conv can't represent regardless of regularization.
142
-
143
- **Conclusion:** The "overfitting hypothesis" from Nakkiran 2019 was correct in theory but inapplicable. The tasks where conv fails arc-gen fail because conv is architecturally wrong, not because of bad regularization. Regularization experiments (Ridge, PCA, skip-ks) are exhausted.
144
-
145
- ### lstsq Conv Research (2026-04-25)
146
-
147
- **Key Finding: Our overfitting is CATASTROPHIC, not benign.**
148
- - Bartlett et al. benign overfitting requires high effective rank of covariance. Our one-hot patches have LOW effective rank.
149
- - Double descent peak at ks=5,7,9 (p ≈ n).
150
- - Ridge predicted to fail; Lasso (ℓ₁) theoretically better for sparse signals.
151
 
152
  ### ONNX Opset 17 Migration Notes (2026-04-26)
153
 
@@ -158,61 +154,56 @@ Key observation: at p/n > 1.0, the "noise" dimensions PCA removes actually carry
158
  | ReduceMean | axes as **attribute** | axes as **tensor input** |
159
  | Pad | pads as **attribute** | pads as **tensor input** (since opset 11) |
160
  | Slice | no steps input | **steps** added as 5th tensor input |
161
- | Conv | pads as attribute | pads as attribute ✅ (unchanged) |
162
- | Transpose | perm as attribute | perm as attribute ✅ (unchanged) |
163
- | Gather | unchanged | unchanged ✅ |
164
 
165
- **IR version**: Opset 17 requires IR ≤ 8. We use IR=8.
166
 
167
- **Slice(step=-1) for reversing:**
168
- - `starts=[dim-1], ends=[INT64_MIN], axes=[ax], steps=[-1]`: reverses entire axis
169
- - INT64_MIN as end sentinel (not -1, which means dim-1 in ONNX)
170
- - Zero MACs, zero params, near-zero memory (just 4 int64 scalars)
171
 
172
  ## What Has NOT Worked
173
 
174
  | Technique | Result | Why |
175
  |-----------|--------|-----|
176
  | **PCA/Truncated SVD (Exp 3)** | **0/400 PCR solves** | **Signal in noise dims; unsolved tasks = architecture mismatch** |
 
177
  | Ridge/LOOCV λ | Fails arc-gen | Catastrophic, not benign overfitting |
178
  | Skip ks=5,7,9 (Exp 1) | Hurts 2 tasks | Some tasks genuinely need interpolation-regime ks |
179
  | CuPy GPU lstsq | OOM + same speed | O(n³) SVD bottleneck |
180
  | PyTorch 2-layer (no arc-gen) | 0/30 arc-gen pass | Memorizes training |
181
- | Composition detectors | No tasks found | May not exist in dataset |
182
- | Channel reduction wrapper | Never executed | Disabled due to Gather incompatibility |
183
 
184
  ## Technical Notes
185
 
186
  ### ARC-AGI Task Statistics
187
- - 400 tasks total, 6 excluded: {21, 55, 80, 184, 202, 366}
188
- - ~25 analytical tasks, ~25 conv tasks that survive arc-gen, ~350 unsolved
189
 
190
- ### Score Calculation
191
  ```python
192
- score = max(1.0, 25.0 - math.log(macs + memory_bytes + params))
 
 
193
  ```
194
 
195
- ### Conv Solver SVD Spectrum Analysis (Exp 3 data, 2026-04-26)
196
-
197
- Effective rank at 99% variance for solved conv tasks:
198
- | Task | ks | n patches | p features | p/n | eff_rank_99 | arc-gen acc |
199
- |------|----|-----------|-----------:|----:|------------:|------------:|
200
- | 171 | 3 | 799 | 90 | 0.11 | 5 | 100% |
201
- | 120 | 3 | 4103 | 90 | 0.02 | 22 | 100% |
202
- | 305 | 9 | 3584 | 810 | 0.23 | 416 | 100% |
203
- | 60 | 11 | 715 | 1210 | 1.69 | 245 | 98.5% |
204
- | 136 | 15 | 1400 | 2250 | 1.61 | 237 | 99.6% |
205
- | 322 | 5 | 126 | 250 | 1.98 | 100 | 97.0% |
206
-
207
- Key pattern: tasks with p/n < 0.5 → 100% arc-gen. Tasks with p/n > 1.0 → 97-99.6% arc-gen. The 0.4-3% error is the interpolation-regime overfitting, but it still passes validation.
208
-
209
- ### Lstsq Matrix Sizes (for reference)
210
- | Grid | Examples | Patches (n) | ks=3 (p=90) | ks=7 (p=490) | ks=29 (p=8410) |
211
- |------|----------|-------------|-------------|--------------|----------------|
212
- | 7×7 | 4 | 196 | 196×90 | **196×490 (under!)** | 196×8410 |
213
- | 12×12| 6 | 576 | 576×90 | 576×490 | 576×8410 |
214
- | 21×21| 16 | 7056 | 7056×90 | 7056×490 | **7056×8410** |
215
-
216
  ## Session Notes for Future Agents
217
 
218
  **Before touching code:**
@@ -222,25 +213,25 @@ Key pattern: tasks with p/n < 0.5 → 100% arc-gen. Tasks with p/n > 1.0 → 97-
222
  4. Run the current solver on 20-50 tasks to establish baseline
223
  5. Only then: design experiment, implement, validate, compare
224
 
225
- **Code structure (v5.1):**
226
  - The solver is a Python package at `neurogolf_solver/`
227
  - Run with `python -m neurogolf_solver.main [args]`
228
- - **conv.py** now has composable primitives: `_build_patch_matrix` + `_solve_weights` + `_extract_weights`
229
- - To add new fitting methods: implement `_solve_weights_XXX(P, T, T_oh)` returning WT or None
230
  - Edit individual files surgically; NEVER rewrite the whole package
231
  - The legacy `neurogolf_solver.py` at root is v4, kept for reference; do NOT edit it
232
 
 
 
 
 
 
 
233
  **Before claiming a feature works:**
234
  - Must pass arc-gen on ≥20 tasks (or full 400 if cheap)
235
- - Must show >10% improvement in arc-gen survival rate OR total score
236
  - Must include A/B comparison
237
 
238
  **Before uploading code:**
239
  - Must have run full 400-task arc-gen validation
240
  - Must confirm total score ≥ previous best
241
-
242
- **What to focus on next (post Exp 3):**
243
- 1. **Phase 3: New solver types**: hash matchers, pattern detectors, LLM rescue
244
- 2. **Phase 1a: Opset 17 analytical conversions**: reduce cost on existing 24 analytical tasks
245
- 3. **Phase 4: ONNX optimizer**: reduce cost on all 49 solved tasks
246
- 4. Lasso (Exp 5) is low priority: only 10 unsolved tasks even have lstsq fits, ceiling is very low
 
6
 
7
  | Version | Date | Tasks (arc-gen validated) | Est LB | Key Changes |
8
  |---------|------|--------------------------|--------|-------------|
9
+ | **v5.2** | **2026-04-26** | **52 locally, REJECTED on Kaggle** | **~710 (local)** | gravity.py (Task 78), mode.py (Task 129), edge.py (0 matches). **Kaggle rejected submission: profiler/validation gap.** |
10
+ | v5.1 | 2026-04-26 | 49 | ~604 | Exp 3: PCA/SVD 0 PCR solves. Refactored conv.py composable primitives. |
11
+ | v5.0 | 2026-04-26 | 49 | ~604 | Refactored to 16-file package, opset 17 (IR 8), Slice-based flip/rotate, lstsq crash fix |
12
+ | v4.3 | 2026-04-25 | 50 | ~670 | Updated docs. NO code changes. |
13
+ | v4.2 | 2026-04-24 | 50 | ~670 | PyTorch learned conv. Needs GPU. |
14
  | v4.1 | 2026-04-24 | 50 | ~670 | Color map Gather for permutations (+15 pts) |
15
+ | v4.0 | 2026-04-24 | 50 | ~656 | ARC-GEN validation, new analytical solvers, static profiler, submission.csv |
16
+ | v3 | 2026-04-24 | 307 (local) / ~40 (LB) | 501 | concat_enhanced, varshape_spatial_gather, conv_var_diff |
17
  | v2 | prior | 294 (local) | unknown | Spatial_gather, variable-shape conv, diff-shape conv |
18
  | v1 | prior | 128 | unknown | Conv solver only |
19
 
20
  ## Mistakes Log (DO NOT REPEAT)
21
 
22
+ ### 2026-04-26: Agent replaced user's score_network (onnx_tool) with silent fallback: CAUSED KAGGLE REJECTION
23
+
24
+ - **What**: The v5 refactor created `profiler.py` with a `_static_profile()` fallback that runs when `onnx_tool` is not installed. The fallback is wrapped in a bare `except: pass`, so if `onnx_tool` fails on a model (dynamic shapes, unsupported ops, opset 17 issues), the code **silently** falls through to a crude static profiler that returns fake scores instead of surfacing the error.
25
+ - **Result**: User's v5.2 submission was **rejected by Kaggle**. The 49 previously-accepted tasks worked, but the 3 new models (gravity.py, edge.py, mode.py) likely failed `onnx_tool.loadmodel()` shape inference or profiling. The local static profiler returned numbers that looked valid, so the user had no warning before submitting.
26
+ - **Root cause**:
27
+ 1. User originally coded `score_network` to call `neurogolf_utils.score_network()` directly, which uses `onnx_tool` and surfaces errors properly.
28
+ 2. Agent's v5 refactor wrapped it in `try/except: pass` and added `_static_profile()` fallback.
29
+ 3. `_static_profile()` only counts Conv MACs (misses ReduceSum, Where, MatMul, etc.), only counts initializer bytes, and does NOT verify static shapes or check `onnx_tool` compatibility.
30
+ 4. The fallback **hides failures**: models that Kaggle's `score_network` would reject appear to score fine locally.
31
+ - **The official validation pipeline** (from `neurogolf_utils.py`):
32
+ 1. `check_network(filename)`: file size ≤ 1.44MB
33
+ 2. `onnxruntime.InferenceSession(filename)`: model loads
34
+ 3. `verify_subset(session, examples)`: correct outputs on all splits
35
+ 4. `score_network(filename)`: uses `onnx_tool.loadmodel()` → `g.shape_infer()` → `g.profile()` → checks `g.valid_profile`, banned ops (UPPERCASE), negative memory. Returns `(None, None, None)` if ANY of these fail → model is NOT READY for submission.
36
+ - **What the static profiler gets wrong**:
37
+ - Only counts Conv MACs: gravity model has Conv+ReduceSum+Where+Greater+And+Not per step, all uncounted
38
+ - Banned op check uses mixed-case `{'Loop', 'Scan', ...}` but Kaggle checks `op_type.upper()` against `["LOOP", "SCAN", ...]`
39
+ - No `onnx.checker.check_model()` call
40
+ - No static shape verification
41
+ - No `onnx_tool` compatibility check
42
+ - **Rule**: NEVER silently fall back to a weaker validator. If the official scoring tool fails on a model, that model MUST be treated as unsolved. Surface the error, don't hide it.
43
+ - **Rule**: NEVER change the user's validation pipeline without understanding what it does. The user's `score_network` call was correct: it used `onnx_tool` directly.
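
The case-sensitivity gap in the banned-op check can be illustrated with a minimal sketch. The banned list is copied from the `score_network` notes in this file; the helper name `find_banned_ops` and the plain list of op-type strings are assumptions for illustration (a real validator would read `node.op_type` from the ONNX graph).

```python
# Banned op names as listed in the score_network notes (uppercase).
BANNED_OPS = {"LOOP", "SCAN", "NONZERO", "UNIQUE", "SCRIPT", "FUNCTION"}


def find_banned_ops(op_types):
    """Return the op types that match the banned list, ignoring case.

    `op_types` is any iterable of op-type strings, for example
    [node.op_type for node in model.graph.node] in a real validator.
    """
    return [op for op in op_types if op.upper() in BANNED_OPS]


if __name__ == "__main__":
    ops = ["Conv", "ReduceSum", "Loop", "Where", "NonZero"]
    print(find_banned_ops(ops))
```

Comparing `op.upper()` against an uppercase set means `NonZero` and `NONZERO` are both caught, which a mixed-case set such as `{'Loop', 'Scan', ...}` would not guarantee.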
44
+
45
+ ### Fix Plan (must be done before next submission):
46
+
47
+ 1. **profiler.py**: Remove silent fallback. If `onnx_tool` is available, use it. If it returns `(None, None, None)`, the model is REJECTED (unsolved). If `onnx_tool` is not installed, print a loud WARNING that scores are approximate and may not match Kaggle.
48
+
49
+ 2. **validators.py**: Add `check_network()` equivalent: file size check (already done), `onnx.checker.check_model()`, banned op scan (UPPERCASE comparison), static shape verification on all tensors.
50
+
51
+ 3. **solver_registry.py**: After a model passes `validate()` (correct outputs), also run `score_network()` from profiler. If it returns `(None, None, None)` → treat model as failed, try next solver. This catches models that produce correct outputs but can't be scored by Kaggle.
52
+
53
+ 4. **main.py**: `--strict_size` already stops on oversized files. Add `--strict_score` (default True): stop if any solved model returns `(None, None, None)` from `score_network()`.
54
+
55
+ 5. **Test on Kaggle notebook**: Before submitting, run `neurogolf_utils.verify_network()` on ALL solved models in a Kaggle notebook. This is the ONLY way to be sure: local testing without `onnx_tool` cannot catch all failure modes.
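
One way to read step 1 of the fix plan is a thin wrapper that never swallows a scoring failure. The following is a sketch only: `strict_score` and `ScoreRejected` are hypothetical names, and the official scorer is passed in as `score_fn` so that nothing here assumes more about `neurogolf_utils.score_network` than its `(macs, memory, params)` or `(None, None, None)` return value described in this file.

```python
class ScoreRejected(Exception):
    """Raised when the official scorer cannot measure a model."""


def strict_score(score_fn, filename):
    """Call the scorer and refuse to fall back when it fails.

    `score_fn` stands in for neurogolf_utils.score_network. Exceptions
    raised by `score_fn` propagate unchanged: there is no bare `except`.
    """
    macs, memory, params = score_fn(filename)
    if macs is None or memory is None or params is None:
        # (None, None, None) means the model could not be measured.
        raise ScoreRejected(f"{filename}: scorer returned None; treat as unsolved")
    return macs, memory, params


if __name__ == "__main__":
    # Stand-in scorers for demonstration only.
    print(strict_score(lambda name: (10, 20, 30), "ok.onnx"))
    try:
        strict_score(lambda name: (None, None, None), "bad.onnx")
    except ScoreRejected as err:
        print("rejected:", err)
```

A caller such as the solver registry could catch `ScoreRejected`, mark the model as failed, and move on to the next solver, which matches step 3 of the plan.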
56
+
57
  ### 2026-04-26: Agent put entire 1400-line codebase into a single file, repeatedly overwrote user's code
58
 
59
  - **What**: When implementing v5 opset 17 changes, agent uploaded the entire solver as a single `neurogolf_solver.py` file, three times. Each upload overwrote the user's `run_tasks`, `main`, and W&B code that the agent couldn't read (the read tool truncates at ~1000 lines).
60
  - **Result**: User's W&B logging code was deleted. User's `run_tasks` function was deleted. User had to point agent to a specific commit (3f3d372) to recover.
61
+ - **Root cause**: (1) Agent couldn't read the tail of the file due to tool truncation, so it rewrote the entire file from scratch instead of making surgical edits. (2) Agent prioritized "getting it done" over preserving existing working code.
62
+ - **Rule**: NEVER rewrite an entire file when you can't read all of it. Make surgical edits. NEVER destroy code you can't see.
63
 
64
  ### 2026-04-26: lstsq SVD non-convergence crash on task 313
65
 
66
+ - **What**: `np.linalg.lstsq(P, T_oh, rcond=None)` raised `LinAlgError: SVD did not converge`.
67
+ - **Fix**: Wrapped lstsq in `try/except (LinAlgError, ValueError): return None` in all call sites.
68
+ - **Rule**: EVERY lstsq call must be guarded.
 
 
69
 
70
  ### 2026-04-26: ReduceSum axes attribute invalid in opset 17
71
 
72
+ - **What**: Code used axes as attribute instead of tensor input (opset 13+ requirement).
73
+ - **Fix**: Created `_build_reducesum()` helper with axes as int64 initializer tensor.
74
+ - **Rule**: Audit ALL operators for breaking API changes when changing opset.
75
+
76
+ ### 2026-04-26: Fake excluded tasks {21, 55, 80, 184, 202, 366}
77
+
78
+ - **What**: Agent added 6 "excluded" tasks to constants.py. There are NO excluded tasks; all 400 count.
79
+ - **Fix**: `EXCLUDED_TASKS = set()`
80
+ - **Rule**: All 400 tasks must be attempted. Do not invent exclusions.
81
+
82
+ ### 2026-04-26: est_lb inflated by adding unsolved×1.0
83
+
84
+ - **What**: `est_lb = total_score + unsolved_count * 1.0` double-counted unsolved task scores.
85
+ - **Fix**: Report only solved score. Unsolved tasks get 1.0 on Kaggle automatically.
86
+ - **Rule**: est_lb should reflect only what we earn from solved tasks.
87
 
88
  ### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
89
+ - **Rule**: NEVER mark a feature as done until validated against full arc-gen.
 
 
90
 
91
+ ### 2026-04-25: Agent created version-named file violating project convention
92
+ - **Rule**: No version numbers in filenames.
93
 
94
+ ### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen without evidence
95
+ - **Rule**: Theory from papers is NOT proof. Test first.
96
 
97
+ ### 2026-04-25: Agent misrepresented user's intent: BLENDING is NOT the strategy
98
+ - **Rule**: LEARNING.md reflects USER'S strategy.
99
 
100
  ### 2026-04-25: Composition detectors, channel reduction wrapper: untested dead code
101
+ - **Rule**: Only add a solver if it demonstrably solves ≥1 task.
102
 
103
  ### 2026-04-25: Agent delivered untested code and asked user to validate it
104
  - **Rule**: VALIDATE FIRST, DELIVER SECOND.
 
141
  - `_solve_weights_pcr(P, T, T_oh, thresholds)` → WT via PCA regression
142
  - `_extract_weights(WT, ks, bias)` → Wconv, B for ONNX
143
 
144
+ **Full 400-task run:** 0 PCR solves, 0 regressions.
145
 
146
+ **Conclusion:** Architecture mismatch, not regularization. Regularization experiments exhausted.
147
 
148
  ### ONNX Opset 17 Migration Notes (2026-04-26)
149
 
 
154
  | ReduceMean | axes as **attribute** | axes as **tensor input** |
155
  | Pad | pads as **attribute** | pads as **tensor input** (since opset 11) |
156
  | Slice | no steps input | **steps** added as 5th tensor input |
 
 
 
157
 
158
+ ### Official Scoring Pipeline (from neurogolf_utils.py): READ BEFORE CODING
159
 
160
+ ```python
161
+ # This is what Kaggle runs. Our validator MUST match this.
162
+ def check_network(filename):
163
+ # 1. File must exist
164
+ # 2. File size ≤ 1.44MB (1.44 * 1024 * 1024 bytes)
165
+
166
+ def score_network(filename):
167
+ # Uses onnx_tool.loadmodel() → shape_infer() → profile()
168
+ # Checks: g.valid_profile (static shapes required)
169
+ # Checks: op_type.upper() not in ["LOOP","SCAN","NONZERO","UNIQUE","SCRIPT","FUNCTION"]
170
+ # Checks: g.nodemap[key].memory >= 0
171
+ # Returns (macs, memory, params) or (None, None, None) on ANY failure
172
+ # (None, None, None) = "Your network performance could not be measured" = REJECTED
173
+
174
+ def verify_network(network, task_num, examples):
175
+ # 1. onnx.save → check_network (size)
176
+ # 2. InferenceSession (loads ok?)
177
+ # 3. verify_subset on train+test (correct outputs?)
178
+ # 4. verify_subset on arc-gen (correct outputs?)
179
+ # 5. score_network (scoreable by onnx_tool?)
180
+ # ALL must pass for "IS READY for submission"
181
+ ```
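
The file-size gate described for `check_network` can be approximated locally with a few lines. This is a sketch under the assumption that the limit is exactly `1.44 * 1024 * 1024` bytes, as stated in the notes; `within_size_limit` and `demo` are hypothetical names, not functions from `neurogolf_utils.py`.

```python
import os
import tempfile

# Limit quoted in the notes: 1.44 MB expressed in bytes.
MAX_BYTES = 1.44 * 1024 * 1024


def within_size_limit(filename, max_bytes=MAX_BYTES):
    """Return True if the file exists and is no larger than max_bytes."""
    return os.path.isfile(filename) and os.path.getsize(filename) <= max_bytes


def demo():
    """Exercise the check on a tiny file, a missing file, and a tight limit."""
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "small.onnx")
        with open(small, "wb") as fh:
            fh.write(b"\0" * 16)  # 16-byte placeholder, not a real model
        missing = os.path.join(tmp, "missing.onnx")
        return (
            within_size_limit(small),
            within_size_limit(missing),
            within_size_limit(small, max_bytes=8),
        )


if __name__ == "__main__":
    print(demo())
```

A size check like this covers only the first gate; loading, output verification, and `score_network` still have to pass before a model is ready.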
182
 
183
  ## What Has NOT Worked
184
 
185
  | Technique | Result | Why |
186
  |-----------|--------|-----|
187
  | **PCA/Truncated SVD (Exp 3)** | **0/400 PCR solves** | **Signal in noise dims; unsolved tasks = architecture mismatch** |
188
+ | **Silent profiler fallback** | **Kaggle rejection** | **Hides onnx_tool failures, returns fake scores** |
189
  | Ridge/LOOCV λ | Fails arc-gen | Catastrophic, not benign overfitting |
190
  | Skip ks=5,7,9 (Exp 1) | Hurts 2 tasks | Some tasks genuinely need interpolation-regime ks |
191
  | CuPy GPU lstsq | OOM + same speed | O(n³) SVD bottleneck |
192
  | PyTorch 2-layer (no arc-gen) | 0/30 arc-gen pass | Memorizes training |
 
 
193
 
194
  ## Technical Notes
195
 
196
  ### ARC-AGI Task Statistics
197
+ - 400 tasks total. NO excluded tasks; all 400 count.
198
+ - ~25 analytical tasks, ~25 conv tasks survive arc-gen, ~350 unsolved
199
 
200
+ ### Score Calculation (official, from neurogolf_utils.py)
201
  ```python
202
+ # Uses onnx_tool for exact MACs/memory/params, NOT our static profiler
203
+ macs, memory, params = score_network(filename) # onnx_tool based
204
+ points = max(1.0, 25.0 - math.log(macs + memory + params))
205
  ```
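
To make the formula concrete, here is a small worked sketch. It assumes `math.log` is the natural logarithm (as in Python's standard library) and that the three inputs are the values returned by `score_network`; the function name `points` is chosen for illustration.

```python
import math


def points(macs, memory, params):
    """Points for one task: max(1.0, 25 - ln(macs + memory + params)).

    The combined cost must be positive, since math.log(0) raises ValueError.
    """
    cost = macs + memory + params
    return max(1.0, 25.0 - math.log(cost))


if __name__ == "__main__":
    print(points(1, 0, 0))           # cost of 1 gives the ceiling of 25.0
    print(points(1000, 1000, 1000))  # 25 - ln(3000)
    print(points(10**12, 0, 0))      # large cost is clamped to the floor of 1.0
```

Because the penalty is logarithmic, halving the total cost adds about 0.69 points (ln 2) to a task's score, regardless of the starting cost.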
206
 
207
  ## Session Notes for Future Agents
208
 
209
  **Before touching code:**
 
213
  4. Run the current solver on 20-50 tasks to establish baseline
214
  5. Only then: design experiment, implement, validate, compare
215
 
216
+ **Code structure (v5.2):**
217
  - The solver is a Python package at `neurogolf_solver/`
218
  - Run with `python -m neurogolf_solver.main [args]`
219
+ - Solvers in separate files: `analytical.py`, `geometric.py`, `tiling.py`, `conv.py`, `gravity.py`, `edge.py`, `mode.py`
 
220
  - Edit individual files surgically; NEVER rewrite the whole package
221
  - The legacy `neurogolf_solver.py` at root is v4, kept for reference; do NOT edit it
222
 
223
+ **CRITICAL: Scoring & Validation:**
224
+ - The ONLY reliable scoring is `neurogolf_utils.score_network()` which uses `onnx_tool`
225
+ - `profiler.py`'s `_static_profile()` is a fallback that DOES NOT match Kaggle scoring
226
+ - Before submitting: run `neurogolf_utils.verify_network()` on ALL solved models in a Kaggle notebook
227
+ - If `score_network` returns `(None, None, None)`, the model is REJECTED; do not submit it
228
+
229
  **Before claiming a feature works:**
230
  - Must pass arc-gen on ≥20 tasks (or full 400 if cheap)
231
+ - Must pass `neurogolf_utils.verify_network()`, not just our own validate()
232
  - Must include A/B comparison
233
 
234
  **Before uploading code:**
235
  - Must have run full 400-task arc-gen validation
236
  - Must confirm total score ≥ previous best
237
+ - NEVER change the scoring/validation pipeline without understanding what it does