## Lessons Learned (Critical Notes)

### Core lessons
- Log the candidate arrays early. Without the actual candidate field, debugging accepted candidates is guesswork. Always include a small, quantized snapshot in logs for reproducibility.
- Avoid side effects inside the solver run. External logging (W&B) must be optional and executed outside the core run to avoid recursion and hidden failures.
- Transforms must produce visible edits. If transforms are no‑ops for typical inputs, the beam cannot reduce residue. Unit‑test transforms with small inputs (a minimal test sketch follows this list).
- Coerce logged gate values to booleans. Strings like `"True"` break automated gate analysis and mislead debugging.
- Quantize and resize consistently. Residue must be computed on the same quantized and tiled array the beam evaluates (use `np.rint` and `tile_transform`).
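To make the "transforms must produce visible edits" rule concrete, here is a minimal pytest-style sketch; `rotate90` and `reflect_h` are hypothetical stand-ins, not the library's actual transform names.

```python
import numpy as np

# Hypothetical stand-ins; in the real suite, import transforms from the library.
def rotate90(p):
    return np.rot90(p)

def reflect_h(p):
    return np.fliplr(p)

def test_transforms_are_not_noops():
    # A 2x2 grid with no symmetry: every genuine transform must edit it.
    p = np.array([[1, 2],
                  [3, 0]])
    for name, t in {"rotate90": rotate90, "reflect_h": reflect_h}.items():
        assert not np.array_equal(t(p), p), f"{name} is a no-op on {p!r}"
```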
### Lessons from the σ=0 breakthrough

#### Verify your ground truth first
The example1 target had 4 wrong cells in row 7. Every experiment showed σ=98 not because the solver was broken, but because the target was wrong. Cross‑reference task data against the canonical source (e.g. fchollet/ARC-AGI on GitHub). A wrong target makes every downstream analysis meaningless.
#### Shape‑changing transforms need the original input

The beam was resizing `phi_in` (3×3) to the target shape (9×9) before applying transforms. This meant Kronecker (3×3 → 9×9) was applied to a 9×9 field, producing 81×81, which was then tiled back — garbage. The fix: a dual‑strategy beam that tries each transform on both the resized field AND the original input. Any solver that composes shape‑changing primitives needs this.
#### The Kronecker self‑similar pattern

ARC task 007bbfb7 has one of the cleanest rules in the dataset: `output = np.kron((input != 0), input)`. The input's own nonzero mask becomes the meta‑layout for placing copies of itself. This is a single depth‑1 transform — but it was unreachable because (a) the transform didn't exist in the library, and (b) the beam couldn't feed it the right input shape.
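A minimal demonstration of the rule as stated above; the grid here is illustrative, not the actual task input.

```python
import numpy as np

inp = np.array([[0, 7, 7],
                [7, 7, 0],
                [0, 7, 7]])  # illustrative 3x3 grid

mask = (inp != 0).astype(inp.dtype)  # the input's own nonzero mask = meta-layout
out = np.kron(mask, inp)             # each 1-cell receives a full copy of inp
assert out.shape == (9, 9)
```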
#### Lambda closures capture by reference, not value

`lambda p: tile_transform(p, (task['target_shape'][0], ...))` captures `task` by reference. In a sweep loop where `task` changes, all previously‑created lambdas silently point to the new task. Fix: bind with default args — `lambda p, _h=h, _w=w: ...`.
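A self-contained demonstration of the pitfall and the default-arg fix:

```python
broken, fixed = [], []
for h, w in [(3, 3), (9, 9)]:
    broken.append(lambda p: (h, w))               # BUG: h, w resolved at call time
    fixed.append(lambda p, _h=h, _w=w: (_h, _w))  # FIX: frozen at definition time

print([f(None) for f in broken])  # [(9, 9), (9, 9)] - both see the last values
print([f(None) for f in fixed])   # [(3, 3), (9, 9)] - each keeps its own
```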
#### Idempotent transforms are invisible to the beam

`tile_to_target ∘ tile_to_target = tile_to_target`. When the only accepted transform is idempotent, σ flatlines across all depths. The symptom is a constant sigma trace like `[98, 98, 98, 98]`. When you see this, the library is too limited — not the search.
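A quick diagnostic sketch for spotting this before a full beam run:

```python
import numpy as np

def is_idempotent(transform, p):
    """True if applying transform twice equals applying it once on this input."""
    once = transform(p)
    return np.array_equal(transform(once), once)

p = np.arange(4).reshape(2, 2)
print(is_idempotent(lambda x: np.clip(x, 0, 9), p))  # True  - clip is idempotent
print(is_idempotent(np.rot90, p))                    # False - rotation is not
```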
#### Duplicate class definitions cause subtle bugs

`solver_core.py` defined `Transform` with `.compose()` and `params`. `transforms.py` defined its own `Transform` without those. Both had `.apply()`, so things ran — but composed transforms could silently lose parameters. Rule: one canonical class, import everywhere else.

#### Don't commit `.pyc` files

They cause stale‑import bugs when the source changes. Add `__pycache__/` and `*.pyc` to `.gitignore` from day one.
### Lessons from the 10% evaluation (object layer + greedy stacker)

#### Brute-force DSL works but hits a ceiling fast
Going from 19 → 33 transforms only gained 9 tasks (31 → 40). Each new transform has diminishing returns because most unsolved tasks need task-level reasoning (analyzing all training pairs together), not just more primitives. The DSL beam search treats each pair independently — it can't learn "frame size determines fill color" because it never compares pairs.
#### The greedy stacker unlocks overlay composition

Sequential composition `T2(T1(x))` fails when the output is two independent views overlaid. The greedy stacker tries `overlay(T1(x), T2(x))` — two transforms applied independently to the input, then combined. Task ded97339 was solved this way: `overlay(ConnectSameColorV, ConnectSameColorH)`. Without the stacker, no amount of beam depth would have found this.
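A minimal sketch of the stacker's combination step, assuming the ARC convention that 0 is background (the solver's exact overlay semantics may differ):

```python
import numpy as np

def overlay(a, b):
    # Nonzero cells of the first view win; background falls through to the second.
    return np.where(a != 0, a, b)

def stack(t1, t2):
    # Unlike sequential composition t2(t1(x)), both transforms see the
    # ORIGINAL input; their outputs are then combined.
    return lambda x: overlay(t1(x), t2(x))
```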
#### Object extraction is necessary but not sufficient

Adding `ExtractLargestObject`, `ExtractSmallestObject`, etc. solved tasks like 1f85a75f and 23b5c85d. But most object-centric tasks (40% of ARC) need more than extraction — they need object movement, recoloring by attribute, and conditional placement. Extraction without manipulation is like having eyes without hands.
#### Every new transform should be tested on the full 400
When we added 14 transforms, 6 contributed to new solves. The other 8 didn't hurt (zero regressions) but didn't help either. Running the full 400-task eval after each change is cheap (51 seconds) and tells you immediately whether a transform matters.
### Lessons from the original ITT source repo

#### Read the source repo first

The original ITT implementation (Sensei-Intent-Tensor/0.0_ARC_AGI) contains architectural decisions we should have adopted from the start. We built a DSL brute-force solver when the PEMF framework is actually a physics-first solver with derived operations. The source has `ITT_PURE_SOLVER.py` (1339 lines), `ITT_ARC_FOUNDATION.py` (635 lines), `itt_primitives.py` (546 lines), and three solver versions (v4, v5B, v5C).
#### The dual-field split is not optional

The original uses two fields simultaneously:

- `Φ_q` = integer grid (ARC colors 0–9) for semantic decisions (what color? which object?)
- `Φ̃` = smoothed float field (2-step discrete diffusion of `Φ_q`) for operator computation (gradient, Laplacian, boundary charge)

Rule: read `Φ_q` for semantics, compute on `Φ̃` for math. We've been using a single float array for everything, which conflates semantic truth with numerical stability. The smooth field makes operators (gradient, Laplacian, boundary charge) stable and continuous; the quantized field keeps color identity exact.
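A minimal sketch of the split, assuming a simple explicit diffusion step; the original's exact smoothing kernel and coefficient may differ.

```python
import numpy as np

def phi_fields(grid, steps=2, alpha=0.25):
    """Sketch of the dual-field split; the diffusion kernel is an assumption."""
    phi_q = np.rint(grid).astype(int)   # exact semantics: colors 0-9
    phi_s = phi_q.astype(float)         # smooth copy for operator math
    for _ in range(steps):              # 2-step discrete diffusion
        p = np.pad(phi_s, 1, mode="edge")
        lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
               p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * phi_s)
        phi_s = phi_s + alpha * lap
    return phi_q, phi_s
```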
#### ρ_q is third-order, not FFT imaginary

Our `layer_minus_one.py` uses FFT imaginary components to find edit zones. The original defines boundary charge as ρ_q = |∇(∇²Φ̃)| — the gradient of the Laplacian of the smooth field. This is a third-order derivative that detects where curvature changes, not just where values change. The threshold is physics-derived: μ + 1.5σ on nonzero ρ_q values (statistical outlier detection), not an arbitrary percentile.
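A sketch of the boundary charge and its threshold as described above; the 5-point Laplacian stencil is an assumption.

```python
import numpy as np

def boundary_charge(phi_s):
    """rho_q = |grad(laplacian(phi_s))|: a third-order edge detector (sketch)."""
    p = np.pad(phi_s, 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
           p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * phi_s)
    gy, gx = np.gradient(lap)
    return np.hypot(gy, gx)

def edit_zone(rho):
    """Threshold at mu + 1.5*sigma over NONZERO charge values."""
    nz = rho[rho > 0]
    if nz.size == 0:
        return np.zeros(rho.shape, dtype=bool)
    return rho > nz.mean() + 1.5 * nz.std()
```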
#### Fan Signatures route tasks before search
The original classifies each task with a 6-bit signature [Δ₁..Δ₆]:
- Δ₁ (∇Φ): gradient/boundary active
- Δ₂ (∇×F): curl/rotation/reflection active
- Δ₃ (+∇²Φ): expansion/tiling
- Δ₄ (−∇²Φ): compression/interior detection
- Δ₅ (∂Φ/∂t): temporal/period detection
- Δ₆ (Φ₀): scalar anchor/color remapping
This routes the task to the correct solver family before any transform enumeration. Only 2⁶ = 64 possible signatures, and ~15–20 correspond to real ARC patterns. We brute-force all 33 transforms on every task — the Fan Signature would eliminate 80%+ of irrelevant transforms per task.
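An illustrative sketch of signature computation for a single pair; every per-bit test below is a simplified stand-in for the original's field-operator detectors, not its actual logic.

```python
import numpy as np

def fan_signature(inp, out):
    """Illustrative 6-bit signature for ONE pair; all tests are assumptions."""
    same_shape = inp.shape == out.shape
    d1 = bool(np.any(inp != 0) and np.any(inp == 0))   # boundaries present
    d2 = same_shape and (np.array_equal(out, np.rot90(inp, 2)) or
                         np.array_equal(out, np.fliplr(inp)))
    d3 = out.size > inp.size                           # expansion/tiling
    d4 = out.size < inp.size                           # compression
    d5 = False  # temporal/period detection needs all training pairs
    d6 = same_shape and not np.array_equal(inp, out)   # recolor-like edit
    return tuple(int(b) for b in (d1, d2, d3, d4, d5, d6))
```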
#### SigmaResidue types the transformation, not just measures it
The original doesn't just compute σ = Σ|Φ'(p) − Φ(p)|. It classifies the residue into: fill/enclosed, expansion/size_increase, compression/size_decrease, recolor/substitution, erase/removal, identity/none, mixed/complex. This classification determines which rule to try. We skip this and try everything — which works for 33 transforms but won't scale.
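A sketch of how such typing might be computed from one pair; the category names follow the list above, but the decision tests are assumptions.

```python
import numpy as np

def classify_residue(inp, out):
    """Sketch of residue typing for one pair (decision tests are assumptions)."""
    if out.size > inp.size:
        return "expansion"
    if out.size < inp.size:
        return "compression"
    if inp.shape != out.shape:
        return "mixed"              # same cell count, different layout
    if np.array_equal(inp, out):
        return "identity"
    changed = inp != out
    if np.all(out[changed] == 0):
        return "erase"              # colored cells became background
    if np.all(inp[changed] == 0):
        return "fill"               # background cells gained color
    if np.all((inp[changed] != 0) & (out[changed] != 0)):
        return "recolor"            # color -> color substitutions only
    return "mixed"
```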
#### `TransformationRule.learn()` replaces brute-force with targeted learning

The original learns rules from training pairs: `tile_pattern` (which reflection per block), `size_to_color` (frame size → fill color), `frame_to_fill` (fallback mapping), `color_map`, and `shape_to_color` (Laplacian eigenspectrum → output color). This handles multi-region fill, shape-indicator, and periodic-extension tasks that our beam search can't reach, because the rule depends on comparing across training pairs, not just minimizing σ on one pair.
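A sketch of one rule family, a `color_map` learned across all training pairs, to show why this beats per-pair search; the real `TransformationRule.learn()` covers several more families.

```python
import numpy as np

def learn_color_map(train_pairs):
    """Sketch: learn a consistent color -> color mapping across ALL
    training pairs; return None if any pair contradicts it."""
    mapping = {}
    for inp, out in train_pairs:
        inp, out = np.asarray(inp), np.asarray(out)
        if inp.shape != out.shape:
            return None  # not a pure recolor task
        for a, b in zip(inp.ravel(), out.ravel()):
            if mapping.setdefault(int(a), int(b)) != int(b):
                return None  # inconsistent across pairs -> reject the rule
    return mapping
```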
#### Shape matching via Laplacian eigenspectrum

The eigenvalues of the Laplacian restricted to an object's cell positions form a translation- and rotation-invariant shape fingerprint. The original uses this to solve "shape indicator" tasks where the shape of a small object determines the output color. Our BFS-based object extraction can find objects but can't compare their shapes this way.
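A sketch of the fingerprint on a cell set with 4-connectivity; the eigenvalue count `k` and the rounding are assumptions.

```python
import numpy as np

def shape_fingerprint(cells, k=4):
    """Sketch: eigenvalues of the graph Laplacian built on an object's cells
    (4-connectivity); invariant to translation and 90-degree rotation."""
    cells = [tuple(c) for c in cells]
    index = {c: i for i, c in enumerate(cells)}
    L = np.zeros((len(cells), len(cells)))
    for (r, c), i in index.items():
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            j = index.get(nb)
            if j is not None:
                L[i, i] += 1   # degree on the diagonal
                L[i, j] -= 1   # -1 for each adjacent cell
    return np.round(np.linalg.eigvalsh(L)[:k], 6)  # smallest k eigenvalues
```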
#### Spectral region separation — no BFS
The original partitions connected components via eigendecomposition of the graph Laplacian (Fiedler vector), not BFS flood-fill. The number of near-zero eigenvalues equals the number of components. This is derived from the field operators, not an algorithm bolted on.
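Reusing the Laplacian construction from the previous sketch, component counting reduces to an eigenvalue multiplicity check:

```python
import numpy as np

def count_components(L, tol=1e-9):
    """The multiplicity of the (near-)zero eigenvalue of a graph Laplacian
    equals the number of connected components."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))
```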
The "smuggling" self-audit is a model of intellectual honesty
docs/ITT_DERIVATION_GAPS.md explicitly lists which concepts are properly derived from ITT axioms and which are "smuggled" (borrowed from external algorithms). v4 claims to close most gaps: explicit Φ_q (not np.round), spectral cuts (not adjacency growth), level sets (not flood-fill), physics-derived thresholds (not arbitrary). This audit discipline is worth adopting.
### Architecture direction: ITT-first, DSL-fallback

The plan going forward is to integrate the ITT physics as a first-pass solver before the DSL beam search:

1. Compute `PhiField` (Φ_q + Φ̃) for each task.
2. Compute the Fan Signature → route to a pattern class.
3. Run `TransformationRule.learn()` on training pairs → typed rule.
4. If the learned rule achieves σ=0 on ALL train pairs → use it (high confidence).
5. Otherwise → fall through to DSL beam search + greedy stacker (our existing 33-transform pipeline).
This means ITT handles tasks it can derive from the field (fill, tile, recolor, shape indicator, period), and DSL handles the rest (object extraction, gravity, mirror, compress). No regression risk — the DSL path is unchanged, ITT only adds capability.
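A sketch of the dispatch logic; every callable named here (`fan_signature_for_task`, `learn_rule`, `sigma`, `dsl_beam_search`) is a placeholder for a planned module, not an existing API.

```python
def solve(task):
    """Sketch of the planned ITT-first, DSL-fallback dispatch."""
    train = task["train"]

    signature = fan_signature_for_task(train)  # route to a pattern class
    rule = learn_rule(train, signature)        # typed rule from ALL pairs

    if rule is not None and all(
        sigma(rule(p["input"]), p["output"]) == 0 for p in train
    ):
        return rule  # ITT rule generalizes: sigma = 0 on every training pair

    # Fall through: the existing DSL beam search + greedy stacker is unchanged.
    return dsl_beam_search(task)
```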
The 10-phase implementation plan is in `TODO.md`.
### Debugging checklist

- Confirm `experiments/` contains `*_phi_best.npy` and `*_logs.json`.
- Run `scripts/fix_and_inspect_logs.py` to coerce gates and attach the candidate snapshot.
- Inspect `logs[0]` for `candidate_array` and the recomputed residue.
- Run `itt_solver.tests.run_atomic_effects()` to verify transforms change the input.
- Run `tests/test_transforms.py` to verify all transform unit tests pass.
- Run a relaxed smoke beam (`lock_coeff=0`, `max_fraction=1.0`, `beam_width≥6`) and inspect `sigmas`.
- If σ flatlines: check whether transforms are idempotent on the input the beam feeds them. Print the shape of the field each transform receives.
- If σ is nonzero but close: diff `phi_best` against the target cell by cell. The pattern of mismatches usually reveals the missing transform.
- If ITT rule-learning fails: check Fan Signature routing — is the task classified correctly? Print `SigmaResidue.change_type` for each training pair to verify the transformation is typed correctly.
### Practical tips

- After editing package files, clear the Python module cache or restart the interpreter.
- Keep `candidate_array` optional in logs to avoid huge files; include it for debugging runs.
- Use small, deterministic transforms for initial debugging (Rotate, Reflect, ShiftedTile).
- Always cross‑check targets against the original ARC dataset before concluding the solver is wrong.
- When adding a new transform, write a unit test immediately — transforms that silently return the input waste hours of debugging.
- The `default_atomic_factory` is just a starting point. For new ARC task families, inspect the input→output mapping manually (decompose into sub‑blocks, check symmetries, test Kronecker/mirror/upscale) and add targeted transforms.
- Run the full 400-task eval after every change. It takes 51 seconds and catches regressions instantly.
- Read the original ITT source before implementing anything. The physics derivation often suggests a cleaner approach than ad-hoc DSL heuristics.
### Key references

- Original ITT ARC solver: Sensei-Intent-Tensor/0.0_ARC_AGI
- ITT physics textbook: Sensei-Intent-Tensor/0.0._Executable_Physics
- Zenodo record: https://zenodo.org/records/18077258
- ARC dataset: fchollet/ARC-AGI
- Icecuber DSL analysis: arXiv:2402.03507
- SOAR (52% on ARC-1): arXiv:2507.14172
- arc-dsl primitives: michaelhodel/arc-dsl