v4.3: Update TODO.md with experiment queue, research loop, status key, explicit blending exclusion

# NeuroGolf Solver – Roadmap

> Current: v4.3 · 50 arc-gen validated · ~670 LB · Target: 3000+
> Philosophy: **Research → Design → Experiment → Analyze → Research** loop until a confirmed score increase.
> Rule: **NEVER claim a feature works without full arc-gen validation on representative tasks.**

## Phase 1: Cheap Wins (est +400 pts → ~1100)

### 1a: Opset 17 Slice-Based Analytical Solvers (~0 cost)
- [ ] **Convert ALL analytical solvers to opset 17** – not just the new ones
  - Rotation: `Crop → Transpose → Slice(step=-1)` = ~0 cost (was ~165K)
  - Flip: `Crop → Slice(step=-1)` = ~0 cost (was ~165K)
  - Transpose: `Crop → Transpose(perm)` = ~0 cost (was ~36K)
  - Pad nodes must all use the opset 17 tensor-based `pads` input (not the attribute)
  - Affected solvers: s_tile, s_upscale, s_concat, s_concat_enhanced, s_kronecker, s_diagonal_tile, s_shift, s_mirror_h, s_mirror_v, s_quad_mirror, s_fixed_crop, s_spatial_gather, s_varshape_spatial_gather
- [ ] **Validate**: full 400-task arc-gen run. Compare the analytical task count against the v4 baseline.
  - Target: ~25 analytical tasks scoring ~25 pts each (was ~15)
  - Accept only if the analytical category's total score improves by >10%.
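
The `Crop → Transpose → Slice(step=-1)` decomposition above is just the standard rot90 identity. A minimal NumPy sketch of the equivalence the rotation solver relies on (the ONNX graph emits the same three ops; the crop bounds here are illustrative):

```python
import numpy as np

def rot90_via_slice(grid, crop_h, crop_w):
    """Rotate the cropped region 90 degrees CCW using only the three ops
    the ONNX graph uses: crop, transpose, and a step=-1 slice."""
    cropped = grid[:crop_h, :crop_w]   # ONNX Slice with positive steps (crop)
    transposed = cropped.T             # ONNX Transpose(perm=[1, 0])
    return transposed[::-1, :]         # ONNX Slice(step=-1) along axis 0

g = np.arange(12).reshape(3, 4)
assert np.array_equal(rot90_via_slice(g, 3, 4), np.rot90(g))
```

Because every op is a data movement with no arithmetic, the profiled cost is essentially zero, which is the point of the conversion.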

### 1b: Composition Detectors
- [ ] **Identify the actual tasks** that are rotation+recolor, flip+recolor, or transpose+recolor
  - Scan all 400 tasks: apply the rotation, then check whether a color_map solves the remainder, etc.
  - Only implement solvers for combinations that actually occur in the dataset
- [ ] **Build a composition solver** – chain the analytical transform and color_map as a single ONNX graph
- [ ] **Validate**: full 400-task arc-gen run. Count the new tasks solved. Accept only if >0 new tasks.

### 1c: Channel Reduction Wrapper
- [ ] **Design for Gather compatibility** – the current Reshape hardcodes [1,10,900]
  - Option A: add Conv1x1(10→N) before and Conv1x1(N→10) after, for conv-based models
  - Option B: use Slice to extract the active channels plus Gather remapping, for pure spatial transforms
- [ ] **Validate**: pick 5 tasks with <5 colors and compare scores with and without the wrapper.
  - Accept only if the per-task score improves by >5% AND arc-gen still passes.

---

## Phase 2: Fix Arc-Gen Survival (est +100-150 tasks → ~2000-2500)

> **This is the #1 blocker.** We solve 307 tasks locally, but only 50 survive arc-gen.
> Research (Bartlett et al., Belkin et al., arXiv:2306.13185) shows:
>
> - Our patch covariance has LOW effective rank (~10-40) vs n ≈ 600 patches
> - This is the CATASTROPHIC overfitting regime, NOT benign
> - Ridge/LOOCV λ tuning CANNOT fix this – theory predicts failure

### 2a: Skip Interpolation-Threshold Kernels
- [ ] **Remove ks=5,7,9 from conv fitting** – these sit at or near the double-descent peak
  - Try the ks list [1, 3, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]
  - Rationale: ks=7 (p=490 vs n ≈ 600) sits closest to the p ≈ n peak; ks=1 (p=10) is safe; ks=29 (p=8410) is overparameterized but at least past the peak.
- [ ] **Validate**: full 400-task arc-gen run. Compare the arc-gen survival rate against the v4 baseline.
  - Accept only if the survival rate improves by >10% (5+ more tasks).
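
With one-hot inputs the per-kernel parameter count is p = 10·ks² (consistent with the p=490 and p=8410 figures above), so the kernels to skip can be computed rather than hand-picked. A small sketch; the factor-of-2 danger band is an assumption:

```python
def risky_kernel_sizes(n_patches=600, channels=10,
                       ks_candidates=range(1, 31, 2), band=2.0):
    """Kernel sizes whose parameter count p = channels * ks^2 falls
    within a factor of `band` of n_patches, i.e. near the p = n
    interpolation peak where least-squares fits generalize worst."""
    return [ks for ks in ks_candidates
            if n_patches / band <= channels * ks * ks <= n_patches * band]

print(risky_kernel_sizes())   # -> [7, 9]
```

With this band the flagged list is [7, 9]; a slightly wider band also catches ks=5 (p=250), matching the ks=5,7,9 exclusion above.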

### 2b: PCA Dimensionality Reduction Before lstsq
- [ ] **PCA pre-processing**: project the patch matrix P onto its top-k components (k=15-25, matching the effective rank)
  - Fit PCA on the training patches, transform both P and the test patches, then run lstsq in the reduced space
  - Ensures p_reduced << n, avoiding the interpolation regime entirely
- [ ] **Validate**: test on 20 tasks that currently fail arc-gen at ks=7,9.
  - Compare raw lstsq against PCA+lstsq and measure the arc-gen pass rate.
  - Accept only if >20% of the previously failing tasks now pass.
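
The 2b pipeline is a few lines with scikit-learn. A sketch on synthetic data, with shapes chosen to mirror the n ≈ 600, p = 490 regime above (the k=20 default is an assumption inside the stated 15-25 range):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_lstsq(P_train, Y_train, P_test, k=20):
    """Project patches onto k principal components, solve least squares
    in the reduced space, and predict for the test patches."""
    pca = PCA(n_components=k)
    Z_train = pca.fit_transform(P_train)          # (n, k), with k << n
    Z_test = pca.transform(P_test)
    W, *_ = np.linalg.lstsq(Z_train, Y_train, rcond=None)
    return Z_test @ W

rng = np.random.default_rng(0)
P = rng.normal(size=(600, 490))                   # n=600 patches, p=490 features
Y = P @ rng.normal(size=(490, 10))
pred = fit_pca_lstsq(P, Y, P[:5], k=20)           # pred.shape == (5, 10)
```

The lstsq call now sees a 600×20 system instead of 600×490, so it never reaches the interpolation threshold.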

### 2c: Gradient Descent with Early Stopping (Alternative to lstsq)
- [ ] **Iterative solver**: Adam on the conv weights, stopped early at ~95% train accuracy (don't interpolate)
  - Implicit ℓ2-like regularization – theory predicts better generalization than explicit Ridge
  - Use a small model: ks=3 single-layer or ks=(3,1) two-layer
- [ ] **Validate**: the same 20 failing tasks. Compare lstsq against early-stopping GD.
  - Accept only if arc-gen survival improves by >15%.
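
The early-stopping idea in miniature. This sketch uses plain gradient descent on a linear model rather than Adam on conv weights (an intentional simplification); the stopping rule is the part that matters:

```python
import numpy as np

def gd_early_stop(P, Y, lr=1e-2, target_acc=0.95, max_steps=5000):
    """Minimize ||P W - Y||^2 by gradient descent, but stop once training
    accuracy reaches ~95% instead of interpolating the training set."""
    W = np.zeros((P.shape[1], Y.shape[1]))
    for _ in range(max_steps):
        W -= lr * (2.0 / len(P)) * (P.T @ (P @ W - Y))   # gradient step
        acc = (np.argmax(P @ W, axis=1) == np.argmax(Y, axis=1)).mean()
        if acc >= target_acc:
            break                                        # early stop
    return W

rng = np.random.default_rng(1)
P = rng.normal(size=(200, 30))
Y = P @ rng.normal(size=(30, 10))
W = gd_early_stop(P, Y)
```

Unlike lstsq, the solution never drives the training residual to zero, which is exactly the behavior the double-descent analysis above asks for.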

### 2d: Lasso / Sparse Regression
- [ ] **Replace np.linalg.lstsq with sklearn.linear_model.Lasso**
  - Tune α via cross-validation on the training data
  - Matches the sparse signal structure of one-hot patches
- [ ] **Validate**: the same 20 failing tasks. Compare lstsq against Lasso.
  - Accept only if >15% improvement.
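
A minimal sketch of the 2d swap on synthetic sparse data; `alpha=0.01` is a placeholder for the cross-validated value (scikit-learn's `LassoCV` would handle that tuning):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
P = rng.normal(size=(600, 490))          # n = 600 patches, p = 490 features
w_true = np.zeros(490)
w_true[:5] = 1.0                         # sparse ground truth
y = P @ w_true

# lstsq would fit all 490 coefficients; Lasso zeroes most of them out
model = Lasso(alpha=0.01, max_iter=10000).fit(P, y)
nonzero = int((np.abs(model.coef_) > 1e-3).sum())
```

On this noiseless example the fit stays sparse, recovering roughly the handful of true coefficients instead of a dense 490-dimensional solution.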

### 2e: PyTorch Multi-Seed with Arc-Gen Training (GPU Required)
- [ ] **Train Conv→ReLU→Conv on train+test+arc-gen** (all available examples that match the grid size)
  - Multi-seed (0, 7, 42), 3000 steps, lr=0.03, early stopping on the arc-gen loss
  - ks=(3,1) or (5,1) two-layer
  - **Ternary snap**: after training, snap the weights to {-1,0,1} and re-validate on arc-gen
- [ ] **Validate**: run on 50 tasks. Compare arc-gen survival against the lstsq baseline.
  - Needs a GPU (T4 minimum); CPU is too slow for 400 tasks × 3 seeds.
  - Accept only if >10% improvement AND total runtime stays under the 12-hour Kaggle limit.
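
The ternary-snap step is framework-independent. A NumPy sketch; the 0.5 dead-zone threshold is an assumption (in practice it would be tuned, and the snapped model re-validated on arc-gen as stated above):

```python
import numpy as np

def ternary_snap(w, thresh=0.5):
    """Snap trained weights to {-1, 0, 1}: magnitudes below `thresh`
    (an assumed cutoff) become 0, the rest keep only their sign."""
    return (np.sign(w) * (np.abs(w) > thresh)).astype(np.int8)

w = np.array([[0.9, -0.2], [-1.3, 0.4]])
assert (ternary_snap(w) == [[1, 0], [-1, 0]]).all()
```

Ternary weights compress well and tend to profile cheaply, but accuracy must be re-checked after snapping, since the snap is lossy.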

### 2f: Generate More ARC-GEN Data
- [ ] **Use the ARC-GEN generator** (github.com/google/ARC-GEN) to produce 1000+ examples per task
  - More fitting data means more constraints, but it ONLY helps if we avoid the interpolation regime
  - Combine it with PCA or GD – lstsq with more rows still overfits if p > n
- [ ] **Validate**: test 20 tasks with 1000 vs 250 arc-gen examples.
  - Compare arc-gen survival. Accept only if >10% improvement.

---

## Phase 3: Hard Tasks – Hash Matchers & Pattern Detectors (est +20-50 tasks → ~2500-3000)

### 3a: Hash-Based Matcher Builder
- [ ] **Generic hash matcher**: flatten the input → MatMul(hash_weights) → match → apply the stored delta
  - Requires opset 17 (ScatterND)
  - Works for ANY task where all the examples fit in a 1.44MB model
  - Build a generic `build_hash_matcher(task_data) → onnx_bytes` function
- [ ] **Validate**: identify 10 tasks that no solver handles and test the hash matcher on them.
  - Accept if it solves ≥2 currently unsolved tasks.
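
The flatten → MatMul → match idea, sketched in NumPy before committing to an ONNX graph. `build_matcher` is an illustrative name, grids are assumed to share one shape, and the real builder would emit the equivalent MatMul and lookup nodes:

```python
import numpy as np

def build_matcher(examples, seed=0):
    """Hash each flattened training input with one random projection
    (a single MatMul in ONNX); at inference, an exact hash match returns
    the stored output, assuming no collisions among the examples."""
    rng = np.random.default_rng(seed)
    keys = np.stack([np.asarray(x, float).ravel() for x, _ in examples])
    proj = rng.normal(size=keys.shape[1])          # the hash_weights vector
    hashes = keys @ proj
    outputs = [y for _, y in examples]

    def predict(grid):
        h = np.asarray(grid, float).ravel() @ proj
        idx = int(np.argmin(np.abs(hashes - h)))   # nearest stored hash
        return outputs[idx] if abs(hashes[idx] - h) < 1e-9 else None
    return predict
```

Returning `None` on a non-match is the safety valve: the matcher only fires on inputs it has literally seen, which is what makes it viable for arc-gen.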

### 3b: Run-Length / Gap Pattern Detector
- [ ] **Depthwise conv to detect runs of N and gap patterns** – like task096 in the public notebooks
  - Template for the "count and classify" class of tasks
- [ ] **Validate**: find tasks with run-length structure and test the detector on them.
  - Accept if it solves ≥2 new tasks.
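
In 1-D the depthwise-conv trick is a box filter over a color mask: a window whose sum equals n marks a run of at least n. A minimal sketch:

```python
import numpy as np

def detect_runs(row, n, color):
    """Start indices of runs of at least n consecutive `color` cells,
    via a length-n box filter (the 1-D analogue of the depthwise conv)."""
    mask = (row == color).astype(float)
    window_sums = np.convolve(mask, np.ones(n), mode="valid")
    return np.flatnonzero(window_sums == n)

row = np.array([3, 3, 3, 0, 3, 3, 0])
assert list(detect_runs(row, 3, 3)) == [0]   # one run of three 3s, at index 0
```

In the ONNX model the same windowed sum is a depthwise Conv with an all-ones kernel, followed by an equality threshold.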

### 3c: Per-Task LLM Rescue
- [ ] **For the ~20 hardest tasks**: feed the task JSON plus a Python solution to an LLM → get back an ONNX builder function
  - Priority tasks: gravity, flood fill, outline extraction, pattern counting
- [ ] **Validate**: build 5 rescue models and arc-gen validate them. Accept if ≥3 pass.

---

## Phase 4: Score Optimization (est +200-500 pts on existing tasks)

### 4a: ONNX Optimizer Pass
- [ ] **`onnxoptimizer.optimize()`** with dead-code elimination and identity removal
  - Top notebooks do this; it can shrink models by 5-20%
- [ ] **Validate**: run it on all 400 models and compare the total score before/after.
  - Accept if the total score improves by >2%.

### 4b: Best-of-N Model Selection
- [ ] **For each task**: generate multiple candidates (different ks, bias/no-bias, PCA vs raw, etc.)
  - Keep the cheapest valid one
- [ ] **Validate**: full 400-task run. Compare the total score against single-candidate selection.
  - Accept if the total score improves by >3%.
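
Selection itself is trivial once candidates exist. A sketch; byte size stands in for the official cost (the real pipeline would rank with the competition scorer), and the tuple format is an assumption:

```python
def pick_cheapest(candidates):
    """candidates: list of (onnx_bytes, passes_arcgen) tuples.
    Return the smallest model that still validates, or None."""
    valid = [model for model, ok in candidates if ok]
    return min(valid, key=len) if valid else None

best = pick_cheapest([(b"x" * 900, True), (b"x" * 300, False), (b"x" * 500, True)])
assert len(best) == 500   # the smallest candidate that passes validation
```

Filtering on arc-gen validity before minimizing cost keeps this consistent with the "never claim a feature works without arc-gen validation" rule at the top.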

### 4c: Official Scoring Alignment
- [ ] **Use `neurogolf_utils.score_network()`** → `onnx_tool` for exact cost matching
  - Our static profiler may diverge on edge cases
- [ ] **Validate**: compare the static profiler against onnx_tool on 50 random models.
  - If the divergence exceeds 5%, fix the profiler.

---

## BLENDING – EXPLICITLY EXCLUDED

> **User's competitive philosophy**: "I am writing my own models, no blending. This loophole is a major flaw in the competition."

- [ ] ~~Blend pipeline~~ – **NOT DONE. Not our strategy.**
- [ ] ~~Upload submission.zip as a Kaggle dataset~~ – **NOT DONE.**
- [ ] ~~Attach public datasets (24 sources)~~ – **NOT DONE.**

Competitive intelligence on blending stays in the "What Others Do" section of LEARNING.md only.

---

## Experiment Log

| Date | Experiment | Tasks Tested | Result | Decision |
|------|------------|--------------|--------|----------|
| 2026-04-24 | v4.2 baseline | 400 | 50 arc-gen, ~670 LB | Keep |
| 2026-04-25 | v5 untested code | 10 | 3/10 FAILED arc-gen | **REVERTED** |
| 2026-04-25 | LOOCV Ridge theory | 0 | Never tested – theory predicts failure | **NOT IMPLEMENTED** |

---

## Status Key

| Symbol | Meaning |
|--------|---------|
| `[ ]` | Not started – needs research/design first |
| `[~]` | In progress – experiment running |
| `[x]` | Done – validated with arc-gen on ≥20 tasks, confirmed score increase |
| `[!]` | Blocked – needs a prerequisite or resource (e.g., a GPU) |
| `[-]` | Rejected – tested, did not improve arc-gen survival or score |

## Research Queue (Next 3 Papers to Read)

1. **arXiv:2302.00257** – "Benign overfitting in ridge regression..." (Lasso vs Ridge in sparse regimes)
2. **Belkin et al. (2019), PNAS** – "Reconciling modern machine-learning practice..." (double descent, the interpolation threshold)
3. **CITE NEEDED** – ARC-AGI solver papers from the NeurIPS 2024 / ICML 2024 workshops

> Loop: Research → Design → Experiment → Analyze → Research → ... until the score increases.
|