rogermt committed · Commit 641e63d · verified · 1 Parent(s): 0316872

Upload LEARNING.md with huggingface_hub

Files changed (1):
  1. LEARNING.md +48 -0
LEARNING.md CHANGED
@@ -15,6 +15,54 @@
 
 ## Mistakes Log (DO NOT REPEAT)
 
+ ### 2026-04-25: Agent wrote 1919 lines of v5 code WITHOUT running full 400-task arc-gen validation
+ - **What**: Generated neurogolf_solver_v5.py with opset 17 Slice-based transforms, LOOCV Ridge tuning, stride_tricks, composition detectors, and a channel reduction wrapper, then claimed in the docstring and README that all features were "working".
+ - **Result**: Uploaded to the repo, overwriting neurogolf_solver.py. Only 10 individual tasks were tested manually, and 3 of those 10 (the conv models for tasks 4, 6, and 241) FAILED arc-gen validation. The full 400 were NEVER run with arc-gen validation, and the LOOCV Ridge theory in the code was never tested against actual data. The estimated LB score is UNKNOWN; no improvement over v4's proven ~670 can be claimed.
+ - **Root cause**: Prioritized "completing the todo list" over validating each feature. Wrote code based on theory from LEARNING.md without verifying it actually improves scores. Did not read the SKILL.md "Submission Checklist" section before starting.
+ - **Rule**: NEVER mark a feature as done until it is validated against full arc-gen data on a representative sample of tasks. NEVER overwrite the working solver without proof that the new version outperforms it on arc-gen (a minimal gate is sketched after this entry).
+
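Such a gate is cheap to write. A minimal sketch, assuming hypothetical hooks `solve(task_id)` (returns a grid→grid callable or None) and `arc_gen_cases(task_id)` (yields held-out input/output grids); the repo's real entry points may differ:

```python
def validate_all(task_ids, solve, arc_gen_cases):
    """Return the ids of tasks whose model passes every arc-gen case."""
    passed = []
    for tid in task_ids:
        predict = solve(tid)          # grid -> grid callable, or None
        if predict is None:
            continue
        if all(predict(x) == y for x, y in arc_gen_cases(tid)):
            passed.append(tid)
    return passed

# Ship gate: the new solver must keep every task the proven version solved.
# assert set(validate_all(range(400), v5_solve, load_arc_gen)) >= set(V4_PASSED)
```
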
+ ### 2026-04-25: Agent created version-named file (neurogolf_solver_v5.py) violating project convention
+ - **What**: Created neurogolf_solver_v5.py instead of updating neurogolf_solver.py directly.
+ - **Result**: The repo carried duplicate code and it was unclear which file was canonical; the user had to explicitly request deletion of the version-named file.
+ - **Root cause**: Did not check the existing repo structure to learn its naming conventions. SKILL.md says "Solver: neurogolf_solver.py".
+ - **Rule**: No version numbers in filenames. Always update neurogolf_solver.py in place. Tag versions in git or use commit history.
+
+ ### 2026-04-25: Agent claimed LOOCV Ridge tuning would improve arc-gen survival without evidence
+ - **What**: Wrote 200+ lines of Ridge tuning code (_tune_ridge_loocv, condition number checks, effective rank diagnostics) based on theory from Cawley & Talbot (2010) and Bartlett et al. (2020). Claimed in the docstring: "LOOCV Ridge tuning in _lstsq_conv with condition number check + SVD-based λ auto-tune".
+ - **Result**: The code exists in the solver, but there is ZERO evidence it actually helps. The lstsq problem is NOT benign overfitting; it is catastrophic overfitting, because the ARC patch covariance has LOW effective rank (structured, low-entropy inputs with only a few active colors), and Ridge cannot fix catastrophic overfitting in the interpolation-threshold regime (p ≈ n). No A/B test was performed.
+ - **Root cause**: Applied theory from papers without understanding the empirical regime. ARC tasks have only a few active colors → patch covariance has few dominant eigenvalues → noise concentrates in low-rank directions → catastrophic, not benign, overfitting. The LEARNING.md "Benign Overfitting Theory" section states this explicitly, yet the agent ignored it while writing the code.
+ - **Rule**: Theory from papers is NOT proof for our specific data. Run A/B experiments (with vs. without the feature on the same tasks) and measure arc-gen survival rate; only keep features that show >10% improvement on a test set. If LEARNING.md says a regime is "catastrophic", do not write code that assumes "benign". For reference, the LOOCV criterion itself is sketched after this entry.
+
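For the record, the closed-form criterion that LOOCV Ridge tuning targets fits in a dozen lines. This is an illustrative sketch (not the solver's _tune_ridge_loocv), and it degrades exactly where this entry says it will, as the hat-matrix diagonal approaches 1 near p ≈ n:

```python
import numpy as np

def loocv_ridge_lambda(X, y, lambdas=np.logspace(-6, 3, 20)):
    """Pick λ by closed-form leave-one-out CV: e_i = (y_i - ŷ_i) / (1 - H_ii)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uy = U.T @ y
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        d = s**2 / (s**2 + lam)               # per-direction shrinkage factors
        y_hat = U @ (d * Uy)
        h = np.einsum("ij,j,ij->i", U, d, U)  # diagonal of the hat matrix
        # Near the interpolation threshold (p ≈ n) h -> 1 and the criterion
        # blows up: the catastrophic regime described above.
        err = np.mean(((y - y_hat) / (1.0 - h)) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```
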
+ ### 2026-04-25: Agent ignored the #1 competitive insight from LEARNING.md: BLENDING
+ - **What**: The LEARNING.md Competitive Intelligence section clearly states: "The top notebooks are BLENDERS, not solvers." The top notebook (4200-v5) solves 341 tasks by blending 5 ZIP sources plus 5 manual LLM rescues; our solver v4 solves 50. Yet the agent focused entirely on solver improvements and wrote zero blending code.
+ - **Result**: Zero blending code written and zero exploration of Kaggle public datasets. Continued optimizing a ~50-task solver instead of building a 300+-task blend pipeline. The target is 4800+ LB; our current path is stuck at ~670.
+ - **Root cause**: Did not read the full LEARNING.md before planning. Did not understand that 4000+ LB requires ~300+ tasks solved, and that our solver alone cannot reach that.
+ - **Rule**: ALWAYS read the full LEARNING.md before starting work. If the analysis says "blending is the meta-game", start with blending; the core merge step is mechanically trivial (see the sketch after this entry). Do NOT ignore empirical competitive intelligence. The TODO.md "Blend Pipeline" section exists for a reason.
+
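A minimal sketch of that merge step, assuming each source is a submission-style JSON mapping task id → answer (file names below are hypothetical; real sources would be the public ZIP outputs discussed above):

```python
import json
from pathlib import Path

# Sources in priority order: the first file to answer a task wins.
SOURCES = ["our_solver.json", "public_nb_a.json", "public_nb_b.json"]

blended = {}
for src in SOURCES:
    for task_id, answer in json.loads(Path(src).read_text()).items():
        blended.setdefault(task_id, answer)  # keep the highest-priority answer

Path("submission.json").write_text(json.dumps(blended))
print(f"blend covers {len(blended)} tasks")
```
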
+ ### 2026-04-25: Agent's composition detectors (rotate+color, flip+color, transpose+color) are untested
+ - **What**: Wrote s_composition_rotate_color, s_composition_flip_color, and s_composition_transpose_color with complex ONNX graph-chaining code (~150 lines).
+ - **Result**: No known task that these solve, and no hit found on the 10-task sample; they may never trigger on any real task. Convoluted code that increases solver complexity for zero proven gain.
+ - **Root cause**: Added features from the TODO.md checklist without checking whether they solve actual tasks in the dataset.
+ - **Rule**: Only add a solver if it demonstrably solves at least 1 task that no other solver handles. Test on the full 400 before keeping it. Delete dead code. A cheap numpy pre-check (sketched below) would have answered "does this ever trigger?" before any ONNX code was written.
+
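A pre-check needs no ONNX at all. A minimal numpy sketch for the rotate+color case (function and argument names are illustrative, not the solver's):

```python
import numpy as np

def fits_rotate_color(pairs):
    """Return (k, color_map) if every (input, output) pair is rot90^k followed
    by one consistent color substitution, else None."""
    for k in (1, 2, 3):
        mapping, ok = {}, True
        for inp, out in pairs:
            rot = np.rot90(inp, k)
            if rot.shape != out.shape:
                ok = False
                break
            for a, b in zip(rot.ravel().tolist(), out.ravel().tolist()):
                if mapping.setdefault(a, b) != b:  # contradictory color map
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return k, mapping
    return None
```

Running this over all 400 tasks' train pairs would have shown in seconds whether the detectors ever trigger.
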
+ ### 2026-04-25: Agent's channel reduction wrapper is DISABLED in the code it wrote
+ - **What**: Wrote _build_channel_reduced_model and _try_channel_reduction with extensive comments claiming "Channel reduction wrapper for tasks with <8 colors".
+ - **Result**: The wrapper is bypassed: it returns the raw model unmodified. The code claims to add channel reduction but is a no-op; ~80 lines of complex ONNX graph manipulation that never execute.
+ - **Root cause**: Knew channel reduction breaks Gather-based models (their Reshape hardcodes the shape [1, 10, 900]; see the sketch below), but wrote the feature anyway and left it disabled with a comment instead of fixing or deleting it.
+ - **Rule**: Do not write features and then disable them. Either make them work or delete them. Dead code is technical debt.
+
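The clash is visible in a few lines of numpy (30×30 grids assumed here, since 10 × 900 matches a one-hot [1, 10, 30·30] layout):

```python
import numpy as np

x = np.zeros((1, 10, 30, 30))  # full 10-channel one-hot encoding
reduced = x[:, [0, 3, 5]]      # keep only the task's 3 active colors
x.reshape(1, 10, 900)          # fine: 1*10*900 == x.size
reduced.reshape(1, 10, 900)    # ValueError: the hardcoded 10 no longer matches
```
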
+ ### 2026-04-25: Agent's opset 17 Slice-based transforms are PARTIALLY validated only
+ - **What**: Wrote _build_slice_flip_model, _build_slice_transpose_model, and _build_slice_rotate_model. Claimed "Slice-based analytical solvers: rotation, flip, transpose (near-zero cost)".
+ - **Result**: Tested tasks 179 (transpose, score 20.03) and 380 (rotate, score 19.81), which pass arc-gen. But those are only 2 of ~25 analytical candidates. The full 400 were NEVER run to verify that all analytical solvers still work under opset 17, and s_tile, s_upscale, s_concat, etc. were not converted to the opset 17 Pad format and may break.
+ - **Root cause**: Tested 2 tasks and declared the feature working. Did not verify on all analytical task candidates, and did not convert ALL Pad nodes across ALL solvers to the opset 17 tensor-based format.
+ - **Rule**: A feature is "working" only after it passes arc-gen on ALL tasks the previous version solved, plus any new tasks it claims to add. Pad node conversion must be global, not limited to the new helper functions; the tensor-based form is sketched after this entry.
+
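For reference: from opset 11 onward, Pad takes `pads` as an int64 input tensor instead of an attribute, so under opset 17 every Pad node has to be built roughly like this (shapes here are illustrative):

```python
import onnx
from onnx import helper, TensorProto

# pads = [dim0_begin, dim1_begin, dim2_begin, dim3_begin,
#         dim0_end,   dim1_end,   dim2_end,   dim3_end]
pads = helper.make_tensor("pads", TensorProto.INT64, [8],
                          [0, 0, 1, 1, 0, 0, 1, 1])
pad = helper.make_node("Pad", inputs=["x", "pads"], outputs=["y"],
                       mode="constant")
graph = helper.make_graph(
    [pad], "pad_opset17",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 10, 30, 30])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 10, 32, 32])],
    initializer=[pads],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)  # an attribute-style `pads` would be rejected
```
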
+ ### 2026-04-25: Agent delivered untested code and asked the user to validate it
+ - **What**: Wrote and uploaded the 1919-line solver, then asked the user "Want me to run the full 400 now?"
+ - **Result**: The user discovered the code was untested through their own questioning, and the agent had to admit: "I have NOT actually run the full v5 solver." This wasted the user's time and trust.
+ - **Root cause**: Reversed the responsibility: the agent should validate BEFORE delivering, not deliver and then offer to validate.
+ - **Rule**: VALIDATE FIRST, DELIVER SECOND. The submission pipeline must be run end-to-end before any code is committed to the repo. A solver that hasn't been run is not a solver; it's a draft.
+
 ### 2026-04-24: PyTorch 2-layer conv — fits training but doesn't generalize to arc-gen
 - **What**: Trained Conv→ReLU→Conv (hidden=32, ks=5,1) on train+test for task 12 (3 examples, 12×12); the architecture is sketched below.
 - **Result**: Train loss 8.65e-8 (perfect), train+test 3/3 pass, arc-gen 0/30 pass
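For the record, a sketch of that architecture (the 10 in/out channels assume the usual one-hot ARC color encoding; the log only pins hidden=32 and kernel sizes 5 and 1):

```python
import torch.nn as nn

# Conv -> ReLU -> Conv: hidden=32, kernel sizes 5 then 1.
model = nn.Sequential(
    nn.Conv2d(10, 32, kernel_size=5, padding=2),  # padding keeps the 12x12 grid
    nn.ReLU(),
    nn.Conv2d(32, 10, kernel_size=1),
)
```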