## Lessons Learned (Critical Notes)

### Core lessons
- Log the candidate arrays early. Without the actual candidate field, debugging accepted candidates is guesswork. Always include a small, quantized snapshot in logs for reproducibility.
- Avoid side effects inside the solver run. External logging (W&B) must be optional and executed outside the core run to avoid recursion and hidden failures.
- Transforms must produce visible edits. If transforms are no‑ops for typical inputs, the beam cannot reduce residue. Unit‑test transforms with small inputs (a minimal test sketch follows this list).
- Coerce logged gate values to booleans. Strings like `"True"` break automated gate analysis and mislead debugging.
- Quantize and resize consistently. Residue must be computed on the same quantized and tiled array the beam evaluates (use `np.rint` and `tile_transform`).
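To make the "transforms must produce visible edits" rule concrete, here is a minimal pytest-style sketch; `rotate90` and `reflect_h` are hypothetical stand-ins, not the library's actual transform names.

```python
import numpy as np

# Hypothetical stand-ins; in the real suite, import transforms from the library.
def rotate90(p):
    return np.rot90(p)

def reflect_h(p):
    return np.fliplr(p)

def test_transforms_are_not_noops():
    # A 2x2 grid with no symmetry: every genuine transform must edit it.
    p = np.array([[1, 2],
                  [3, 0]])
    for name, t in {"rotate90": rotate90, "reflect_h": reflect_h}.items():
        assert not np.array_equal(t(p), p), f"{name} is a no-op on {p!r}"
```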
### Lessons from the σ=0 breakthrough

#### Verify your ground truth first
The example1 target had 4 wrong cells in row 7. Every experiment showed σ=98 not because the solver was broken, but because the target was wrong. Cross‑reference task data against the canonical source (e.g. fchollet/ARC-AGI on GitHub). A wrong target makes every downstream analysis meaningless.
#### Shape‑changing transforms need the original input

The beam was resizing `phi_in` (3×3) to the target shape (9×9) before applying transforms. This meant Kronecker (3×3 → 9×9) was applied to a 9×9 field, producing 81×81, which was then tiled back — garbage. The fix: a dual‑strategy beam that tries each transform on both the resized field AND the original input. Any solver that composes shape‑changing primitives needs this.
#### The Kronecker self‑similar pattern

ARC task 007bbfb7 has one of the cleanest rules in the dataset: `output = np.kron((input != 0), input)`. The input's own nonzero mask becomes the meta‑layout for placing copies of itself. This is a single depth‑1 transform — but it was unreachable because (a) the transform didn't exist in the library, and (b) the beam couldn't feed it the right input shape.
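A minimal demonstration of the rule as stated above; the grid here is illustrative, not the actual task input.

```python
import numpy as np

inp = np.array([[0, 7, 7],
                [7, 7, 0],
                [0, 7, 7]])  # illustrative 3x3 grid

mask = (inp != 0).astype(inp.dtype)  # the input's own nonzero mask = meta-layout
out = np.kron(mask, inp)             # each 1-cell receives a full copy of inp
assert out.shape == (9, 9)
```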
#### Lambda closures capture by reference, not value

`lambda p: tile_transform(p, (task['target_shape'][0], ...))` captures `task` by reference. In a sweep loop where `task` changes, all previously‑created lambdas silently point to the new task. Fix: bind with default args — `lambda p, _h=h, _w=w: ...`.
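A self-contained demonstration of the pitfall and the default-arg fix:

```python
broken, fixed = [], []
for h, w in [(3, 3), (9, 9)]:
    broken.append(lambda p: (h, w))               # BUG: h, w resolved at call time
    fixed.append(lambda p, _h=h, _w=w: (_h, _w))  # FIX: frozen at definition time

print([f(None) for f in broken])  # [(9, 9), (9, 9)] - both see the last values
print([f(None) for f in fixed])   # [(3, 3), (9, 9)] - each keeps its own
```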
#### Idempotent transforms are invisible to the beam

`tile_to_target ∘ tile_to_target = tile_to_target`. When the only accepted transform is idempotent, σ flatlines across all depths. The symptom is a constant sigma trace like `[98, 98, 98, 98]`. When you see this, the library is too limited — not the search.
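A quick diagnostic sketch for spotting this before a full beam run:

```python
import numpy as np

def is_idempotent(transform, p):
    """True if applying transform twice equals applying it once on this input."""
    once = transform(p)
    return np.array_equal(transform(once), once)

p = np.arange(4).reshape(2, 2)
print(is_idempotent(lambda x: np.clip(x, 0, 9), p))  # True  - clip is idempotent
print(is_idempotent(np.rot90, p))                    # False - rotation is not
```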
#### Duplicate class definitions cause subtle bugs

`solver_core.py` defined `Transform` with `.compose()` and `params`. `transforms.py` defined its own `Transform` without those. Both had `.apply()`, so things ran — but composed transforms could silently lose parameters. Rule: one canonical class, import everywhere else.

#### Don't commit `.pyc` files

They cause stale‑import bugs when the source changes. Add `__pycache__/` and `*.pyc` to `.gitignore` from day one.
### Lessons from the 10% evaluation (object layer + greedy stacker)

#### Brute-force DSL works but hits a ceiling fast
Going from 19 → 33 transforms only gained 9 tasks (31 → 40). Each new transform has diminishing returns because most unsolved tasks need task-level reasoning (analyzing all training pairs together), not just more primitives. The DSL beam search treats each pair independently — it can't learn "frame size determines fill color" because it never compares pairs.
#### The greedy stacker unlocks overlay composition

Sequential composition `T2(T1(x))` fails when the output is two independent views overlaid. The greedy stacker tries `overlay(T1(x), T2(x))` — two transforms applied independently to the input, then combined. Task ded97339 was solved this way: `overlay(ConnectSameColorV, ConnectSameColorH)`. Without the stacker, no amount of beam depth would have found this.
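A minimal sketch of the stacker's combination step, assuming the ARC convention that 0 is background (the solver's exact overlay semantics may differ):

```python
import numpy as np

def overlay(a, b):
    # Nonzero cells of the first view win; background falls through to the second.
    return np.where(a != 0, a, b)

def stack(t1, t2):
    # Unlike sequential composition t2(t1(x)), both transforms see the
    # ORIGINAL input; their outputs are then combined.
    return lambda x: overlay(t1(x), t2(x))
```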
#### Object extraction is necessary but not sufficient

Adding `ExtractLargestObject`, `ExtractSmallestObject`, etc. solved tasks like 1f85a75f and 23b5c85d. But most object-centric tasks (40% of ARC) need more than extraction — they need object movement, recoloring by attribute, and conditional placement. Extraction without manipulation is like having eyes without hands.
#### Every new transform should be tested on the full 400
When we added 14 transforms, 6 contributed to new solves. The other 8 didn't hurt (zero regressions) but didn't help either. Running the full 400-task eval after each change is cheap (51 seconds) and tells you immediately whether a transform matters.
### Lessons from the original ITT source repo

#### Read the source repo first

The original ITT implementation (Sensei-Intent-Tensor/0.0_ARC_AGI) contains architectural decisions we should have adopted from the start. We built a DSL brute-force solver when the PEMF framework is actually a physics-first solver with derived operations. The source has `ITT_PURE_SOLVER.py` (1339 lines), `ITT_ARC_FOUNDATION.py` (635 lines), `itt_primitives.py` (546 lines), and three solver versions (v4, v5B, v5C).
#### The dual-field split is not optional

The original uses two fields simultaneously:

- `Φ_q` = integer grid (ARC colors 0–9) for semantic decisions (what color? which object?)
- `Φ̃` = smoothed float field (2-step discrete diffusion of `Φ_q`) for operator computation (gradient, Laplacian, boundary charge)

Rule: read `Φ_q` for semantics, compute on `Φ̃` for math. We've been using a single float array for everything, which conflates semantic truth with numerical stability. The smooth field makes operators (gradient, Laplacian, boundary charge) stable and continuous; the quantized field keeps color identity exact.
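A minimal sketch of the split, assuming a simple explicit diffusion step; the original's exact smoothing kernel and coefficient may differ.

```python
import numpy as np

def phi_fields(grid, steps=2, alpha=0.25):
    """Sketch of the dual-field split; the diffusion kernel is an assumption."""
    phi_q = np.rint(grid).astype(int)   # exact semantics: colors 0-9
    phi_s = phi_q.astype(float)         # smooth copy for operator math
    for _ in range(steps):              # 2-step discrete diffusion
        p = np.pad(phi_s, 1, mode="edge")
        lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
               p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * phi_s)
        phi_s = phi_s + alpha * lap
    return phi_q, phi_s
```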
#### ρ_q is third-order, not FFT imaginary

Our `layer_minus_one.py` uses FFT imaginary components to find edit zones. The original defines boundary charge as ρ_q = |∇(∇²Φ̃)| — the gradient of the Laplacian of the smooth field. This is a third-order derivative that detects where curvature changes, not just where values change. The threshold is physics-derived: μ + 1.5σ on nonzero ρ_q values (statistical outlier detection), not an arbitrary percentile.
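A sketch of the boundary charge and its threshold as described above; the 5-point Laplacian stencil is an assumption.

```python
import numpy as np

def boundary_charge(phi_s):
    """rho_q = |grad(laplacian(phi_s))|: a third-order edge detector (sketch)."""
    p = np.pad(phi_s, 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
           p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * phi_s)
    gy, gx = np.gradient(lap)
    return np.hypot(gy, gx)

def edit_zone(rho):
    """Threshold at mu + 1.5*sigma over NONZERO charge values."""
    nz = rho[rho > 0]
    if nz.size == 0:
        return np.zeros(rho.shape, dtype=bool)
    return rho > nz.mean() + 1.5 * nz.std()
```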
#### Fan Signatures route tasks before search
The original classifies each task with a 6-bit signature [Δ₁..Δ₆]:
- Δ₁ (∇Φ): gradient/boundary active
- Δ₂ (∇×F): curl/rotation/reflection active
- Δ₃ (+∇²Φ): expansion/tiling
- Δ₄ (−∇²Φ): compression/interior detection
- Δ₅ (∂Φ/∂t): temporal/period detection
- Δ₆ (Φ₀): scalar anchor/color remapping
This routes the task to the correct solver family before any transform enumeration. Only 2⁶ = 64 possible signatures, and ~15–20 correspond to real ARC patterns. We brute-force all 33 transforms on every task — the Fan Signature would eliminate 80%+ of irrelevant transforms per task.
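An illustrative sketch of signature computation for a single pair; every per-bit test below is a simplified stand-in for the original's field-operator detectors, not its actual logic.

```python
import numpy as np

def fan_signature(inp, out):
    """Illustrative 6-bit signature for ONE pair; all tests are assumptions."""
    same_shape = inp.shape == out.shape
    d1 = bool(np.any(inp != 0) and np.any(inp == 0))   # boundaries present
    d2 = same_shape and (np.array_equal(out, np.rot90(inp, 2)) or
                         np.array_equal(out, np.fliplr(inp)))
    d3 = out.size > inp.size                           # expansion/tiling
    d4 = out.size < inp.size                           # compression
    d5 = False  # temporal/period detection needs all training pairs
    d6 = same_shape and not np.array_equal(inp, out)   # recolor-like edit
    return tuple(int(b) for b in (d1, d2, d3, d4, d5, d6))
```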
#### SigmaResidue types the transformation, not just measures it
The original doesn't just compute σ = Σ|Φ'(p) − Φ(p)|. It classifies the residue into: fill/enclosed, expansion/size_increase, compression/size_decrease, recolor/substitution, erase/removal, identity/none, mixed/complex. This classification determines which rule to try. We skip this and try everything — which works for 33 transforms but won't scale.
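A sketch of how such typing might be computed from one pair; the category names follow the list above, but the decision tests are assumptions.

```python
import numpy as np

def classify_residue(inp, out):
    """Sketch of residue typing for one pair (decision tests are assumptions)."""
    if out.size > inp.size:
        return "expansion"
    if out.size < inp.size:
        return "compression"
    if inp.shape != out.shape:
        return "mixed"              # same cell count, different layout
    if np.array_equal(inp, out):
        return "identity"
    changed = inp != out
    if np.all(out[changed] == 0):
        return "erase"              # colored cells became background
    if np.all(inp[changed] == 0):
        return "fill"               # background cells gained color
    if np.all((inp[changed] != 0) & (out[changed] != 0)):
        return "recolor"            # color -> color substitutions only
    return "mixed"
```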
#### `TransformationRule.learn()` replaces brute-force with targeted learning

The original learns rules from training pairs: `tile_pattern` (which reflection per block), `size_to_color` (frame size → fill color), `frame_to_fill` (fallback mapping), `color_map`, and `shape_to_color` (Laplacian eigenspectrum → output color). This handles multi-region fill, shape-indicator, and periodic-extension tasks that our beam search can't reach, because the rule depends on comparing across training pairs, not just minimizing σ on one pair.
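A sketch of one rule family, a `color_map` learned across all training pairs, to show why this beats per-pair search; the real `TransformationRule.learn()` covers several more families.

```python
import numpy as np

def learn_color_map(train_pairs):
    """Sketch: learn a consistent color -> color mapping across ALL
    training pairs; return None if any pair contradicts it."""
    mapping = {}
    for inp, out in train_pairs:
        inp, out = np.asarray(inp), np.asarray(out)
        if inp.shape != out.shape:
            return None  # not a pure recolor task
        for a, b in zip(inp.ravel(), out.ravel()):
            if mapping.setdefault(int(a), int(b)) != int(b):
                return None  # inconsistent across pairs -> reject the rule
    return mapping
```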
#### Shape matching via Laplacian eigenspectrum

The eigenvalues of the Laplacian restricted to an object's cell positions form a translation- and rotation-invariant shape fingerprint. The original uses this to solve "shape indicator" tasks where the shape of a small object determines the output color. Our BFS-based object extraction can find objects but can't compare their shapes this way.
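A sketch of the fingerprint on a cell set with 4-connectivity; the eigenvalue count `k` and the rounding are assumptions.

```python
import numpy as np

def shape_fingerprint(cells, k=4):
    """Sketch: eigenvalues of the graph Laplacian built on an object's cells
    (4-connectivity); invariant to translation and 90-degree rotation."""
    cells = [tuple(c) for c in cells]
    index = {c: i for i, c in enumerate(cells)}
    L = np.zeros((len(cells), len(cells)))
    for (r, c), i in index.items():
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            j = index.get(nb)
            if j is not None:
                L[i, i] += 1   # degree on the diagonal
                L[i, j] -= 1   # -1 for each adjacent cell
    return np.round(np.linalg.eigvalsh(L)[:k], 6)  # smallest k eigenvalues
```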
#### Spectral region separation — no BFS
The original partitions connected components via eigendecomposition of the graph Laplacian (Fiedler vector), not BFS flood-fill. The number of near-zero eigenvalues equals the number of components. This is derived from the field operators, not an algorithm bolted on.
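Reusing the Laplacian construction from the previous sketch, component counting reduces to an eigenvalue multiplicity check:

```python
import numpy as np

def count_components(L, tol=1e-9):
    """The multiplicity of the (near-)zero eigenvalue of a graph Laplacian
    equals the number of connected components."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))
```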
The "smuggling" self-audit is a model of intellectual honesty
docs/ITT_DERIVATION_GAPS.md explicitly lists which concepts are properly derived from ITT axioms and which are "smuggled" (borrowed from external algorithms). v4 claims to close most gaps: explicit Φ_q (not np.round), spectral cuts (not adjacency growth), level sets (not flood-fill), physics-derived thresholds (not arbitrary). This audit discipline is worth adopting.
### Architecture direction: ITT-first, DSL-fallback

The plan going forward is to integrate the ITT physics as a first-pass solver before the DSL beam search:

1. Compute `PhiField` (Φ_q + Φ̃) for each task.
2. Compute the Fan Signature → route to a pattern class.
3. Run `TransformationRule.learn()` on training pairs → typed rule.
4. If the learned rule achieves σ=0 on ALL train pairs → use it (high confidence).
5. Otherwise → fall through to DSL beam search + greedy stacker (our existing 33-transform pipeline).
This means ITT handles tasks it can derive from the field (fill, tile, recolor, shape indicator, period), and DSL handles the rest (object extraction, gravity, mirror, compress). No regression risk — the DSL path is unchanged, ITT only adds capability.
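A sketch of the dispatch logic; every callable named here (`fan_signature_for_task`, `learn_rule`, `sigma`, `dsl_beam_search`) is a placeholder for a planned module, not an existing API.

```python
def solve(task):
    """Sketch of the planned ITT-first, DSL-fallback dispatch."""
    train = task["train"]

    signature = fan_signature_for_task(train)  # route to a pattern class
    rule = learn_rule(train, signature)        # typed rule from ALL pairs

    if rule is not None and all(
        sigma(rule(p["input"]), p["output"]) == 0 for p in train
    ):
        return rule  # ITT rule generalizes: sigma = 0 on every training pair

    # Fall through: the existing DSL beam search + greedy stacker is unchanged.
    return dsl_beam_search(task)
```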
The 10-phase implementation plan is in `TODO.md`.
### Debugging checklist

- Confirm `experiments/` contains `*_phi_best.npy` and `*_logs.json`.
- Run `scripts/fix_and_inspect_logs.py` to coerce gates and attach the candidate snapshot.
- Inspect `logs[0]` for `candidate_array` and the recomputed residue.
- Run `itt_solver.tests.run_atomic_effects()` to verify transforms change the input.
- Run `tests/test_transforms.py` to verify all transform unit tests pass.
- Run a relaxed smoke beam (`lock_coeff=0`, `max_fraction=1.0`, `beam_width≥6`) and inspect `sigmas`.
- If σ flatlines: check whether transforms are idempotent on the input the beam feeds them. Print the shape of the field each transform receives.
- If σ is nonzero but close: diff `phi_best` against the target cell by cell. The pattern of mismatches usually reveals the missing transform.
- If ITT rule-learning fails: check Fan Signature routing — is the task classified correctly? Print `SigmaResidue.change_type` for each training pair to verify the transformation is typed correctly.
### Practical tips

- After editing package files, clear the Python module cache or restart the interpreter.
- Keep `candidate_array` optional in logs to avoid huge files; include it for debugging runs.
- Use small, deterministic transforms for initial debugging (Rotate, Reflect, ShiftedTile).
- Always cross‑check targets against the original ARC dataset before concluding the solver is wrong.
- When adding a new transform, write a unit test immediately — transforms that silently return the input waste hours of debugging.
- The `default_atomic_factory` is just a starting point. For new ARC task families, inspect the input→output mapping manually (decompose into sub‑blocks, check symmetries, test Kronecker/mirror/upscale) and add targeted transforms.
- Run the full 400-task eval after every change. It takes 51 seconds and catches regressions instantly.
- Read the original ITT source before implementing anything. The physics derivation often suggests a cleaner approach than ad-hoc DSL heuristics.
### Key references

- Original ITT ARC solver: Sensei-Intent-Tensor/0.0_ARC_AGI
- ITT physics textbook: Sensei-Intent-Tensor/0.0._Executable_Physics
- Zenodo record: https://zenodo.org/records/18077258
- ARC dataset: fchollet/ARC-AGI
- Icecuber DSL analysis: arXiv:2402.03507
- SOAR (52% on ARC-1): arXiv:2507.14172
- arc-dsl primitives: michaelhodel/arc-dsl