# LeWorld Memory Architecture vs. LeCun's World Models: Benchmark Comparison
## ⚠️ Honest Status: This Is a Theoretical Comparison
**Our architecture has NOT been benchmarked on the standard world-model evaluation suites yet.** What follows is:
1. A factual catalog of every LeCun-family world model, with its exact published numbers and benchmarks
2. An architectural comparison showing where our design is fundamentally different
3. A concrete plan for which benchmarks to run and what we'd need to demonstrate
We are not claiming results we don't have. Here's what exists and what's needed.
---
## 1. The LeCun World Model Family: Published Numbers
### Model Lineup (by publication date)
| Model | Paper | Params | Training | Key Innovation |
|-------|-------|--------|----------|----------------|
| **I-JEPA** | arxiv:2301.08243 | 632M (ViT-H/14) | ImageNet, 16×A100 | Predicts masked image patch representations |
| **V-JEPA** | arxiv:2404.08471 | 632M (ViT-H/16) | VideoMix2M (2M videos) | Predicts masked video region features |
| **DINO-WM** | arxiv:2411.04983 | ~300M (frozen DINOv2 + predictor) | Offline trajectories | Plans in DINOv2 patch-feature space |
| **LeWM** | arxiv:2603.19312 | **~15M** | Single GPU, few hours | End-to-end from pixels, 2 loss terms only |
| **V-JEPA 2.1** | arxiv:2603.14482 | 1.1–1.9B (ViT-g/G) | Massive video corpus | Dense features, multi-layer loss |
**LeWM is our primary comparison target (both ~15M params).**
---
## 2. Published Benchmark Results: Robotic Planning
### Push-T (Tabletop Block Pushing) → Success Rate ↑
| Model | Params | Success Rate | Planning Time | Notes |
|-------|--------|-------------|---------------|-------|
| **DINO-WM** | ~300M+ (frozen DINOv2) | **0.90** | ~48 seconds | Uses pretrained DINOv2 encoder |
| **LeWM** | ~15M | **0.88** (beats pixel-only DINO-WM) | **<1 second** | End-to-end from pixels, 48× faster planning |
| **PLDM** | ~15M | 0.70 | <1 second | End-to-end, 7 loss terms (VICReg) |
| DreamerV3 | 12–400M | 0.30 | – | Model-based RL, needs rewards |
| IRIS | – | 0.32 | – | – |
| TD-MPC2 | – | 0.00 | – | Fails without reward signal |
| **Ours (LeWorld Memory)** | **13.5M** | **❌ Not yet tested** | – | – |
### PointMaze / Wall Navigation → Success Rate ↑
| Model | Maze SR | Wall SR | Notes |
|-------|---------|---------|-------|
| **DINO-WM** | 0.98 | 0.96 | Near-perfect |
| DreamerV3 | **1.00** | **1.00** | Perfect on simple navigation |
| LeWM | Lower than DINO-WM | Lower | Struggles on very simple envs (SIGReg limitation) |
| **Ours** | **❌ Not yet tested** | **❌ Not yet tested** | – |
### Reach (Robotic Arm) → Success Rate ↑
| Model | Success Rate |
|-------|-------------|
| **DINO-WM** | **0.92** |
| DreamerV3 | 0.64 |
| IRIS | 0.18 |
| **Ours** | **❌ Not yet tested** |
### Rope & Granular Manipulation → Chamfer Distance ↓
| Model | Rope CD ↓ | Granular CD ↓ |
|-------|-----------|---------------|
| **DINO-WM** | **0.41** | **0.26** |
| DreamerV3 | 2.49 | 1.05 |
| IRIS | 1.11 | 0.37 |
| **Ours** | **❌ Not yet tested** | **❌ Not yet tested** |
---
## 3. Published Benchmark Results: Physical Understanding
### Physical Latent Probing on Push-T → Pearson r ↑
| Property | DINO-WM (MLP) | LeWM (MLP) | PLDM (MLP) |
|----------|--------------|------------|------------|
| Agent Location | **r = 0.999** | r = 0.998 | r = 0.993 |
| Block Location | **r = 0.999** | **r = 0.999** | r = 0.994 |
| Block Angle | **r = 0.995** | r = 0.990 | r = 0.972 |
LeWM at 15M achieves near-parity with DINO-WM (300M+ pretrained) on physical probing.
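This protocol is cheap to reproduce on our side. Below is a minimal sketch of MLP probing, assuming a frozen latent matrix and a scalar physical target (e.g. block angle); the probe width, train/test split, and training budget are placeholders, not any paper's exact recipe:

```python
import torch
import torch.nn as nn

def probe_pearson_r(latents, targets, epochs=200, lr=1e-3):
    """Fit a small MLP probe from frozen latents to a scalar physical
    property, then report Pearson r on a held-out split."""
    n = latents.shape[0]
    split = int(0.8 * n)  # 80/20 split, illustrative
    probe = nn.Sequential(
        nn.Linear(latents.shape[1], 128), nn.ReLU(), nn.Linear(128, 1)
    )
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(
            probe(latents[:split]).squeeze(-1), targets[:split]
        )
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = probe(latents[split:]).squeeze(-1)
        x = pred - pred.mean()
        y = targets[split:] - targets[split:].mean()
        return (x * y).sum() / (x.norm() * y.norm() + 1e-8)  # Pearson r
```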
### Violation-of-Expectation (VoE): Physics Anomaly Detection
| Perturbation | LeWM | PLDM |
|-------------|------|------|
| Teleportation (physically impossible) | **Detects (p<0.01)** | Detects |
| Color change (visual only) | Does NOT flag | Does NOT flag |
| Correct distinction? | ✅ Yes | ✅ Yes |
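The test itself is easy to operationalize. A minimal sketch, assuming surprise is simply the world model's next-state prediction error, with a z-score threshold standing in for the p<0.01 significance test above (function names are illustrative):

```python
import torch

def voe_surprise(pred_next: torch.Tensor, observed_next: torch.Tensor) -> float:
    """Surprise = prediction error on the observed next state. A physically
    impossible event (teleportation) should score far outside the model's
    error distribution on normal rollouts; a visual-only change should not."""
    return torch.linalg.vector_norm(pred_next - observed_next).item()

def flags_violation(surprise: float, normal_errors: torch.Tensor, z: float = 3.0) -> bool:
    # Flag if surprise exceeds mean + z * std of errors on unperturbed data.
    mu, sd = normal_errors.mean().item(), normal_errors.std().item()
    return surprise > mu + z * sd
```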
### IntPhys 2 (Intuitive Physics) → Accuracy ↑
| Model | Accuracy | Notes |
|-------|----------|-------|
| Human | **96.4%** | Ceiling |
| V-JEPA 2 (1B+ params) | 57.5% | Best JEPA model |
| Gemini 2.5 Flash | 55.6% | Best commercial LLM |
| GPT-4o | ~50% | Near random |
| **Ours** | **❌ Not yet tested** | **Potential differentiator** (see below) |
**The IntPhys gap (57.5% vs 96.4% human) is the biggest open problem in world models.**
---
## 4. Published Benchmark Results: Video Understanding
### Kinetics-400 (Video Action Recognition) → Top-1 Accuracy ↑
| Model | Params | Accuracy | Probe Type |
|-------|--------|----------|------------|
| V-JEPA 2.1 ViT-G | 1.9B | **~88%** | Attentive |
| V-JEPA ViT-H/16 | 632M | 81.9% | Frozen, attentive |
| VideoMAEv2 ViT-H | ~632M | 87% | Fine-tuned |
| **Ours** | 13.5M | **N/A** | Different modality (state vectors, not video) |
### Something-Something-v2 (Temporal Reasoning) → Top-1 Accuracy ↑
| Model | Accuracy |
|-------|----------|
| V-JEPA 2.1 ViT-G | **77.7%** |
| V-JEPA ViT-H/16 | 72.2% |
| **Ours** | **N/A** |
> **Note**: Our architecture operates on state vectors + bit-level memory, not pixels/video. The video benchmarks above are not directly applicable without adding a visual encoder front-end.
---
## 5. Architectural Comparison: Where We're Different
| Dimension | LeWM / JEPA Family | Our LeWorld Memory Architecture |
|-----------|-------------------|--------------------------------|
| **Memory** | Implicit (in network weights + latent state) | **Explicit bit-level RAM** (64K × 32-bit words, address-range access) |
| **State Prediction** | Single model predicts next embedding | **Hierarchical**: 3 SLMs find memory, 1 BLM aggregates + predicts |
| **Information Retrieval** | All in one forward pass | **Active retrieval**: BLM asks "what do I need?", SLMs fetch from memory |
| **Model Selection** | N/A (single model) | **Binary routing** [1,0,1]: BLM selects which SLMs to trust |
| **Collapse Prevention** | SIGReg (LeWM), EMA (V-JEPA), frozen encoder (DINO-WM) | **Diversity loss** + load-balance loss + temperature annealing |
| **Training** | Single loss (LeWM: 2 terms) | **3-phase**: pre-train → joint → info-request refinement |
| **Params** | 15M (LeWM), 632M (V-JEPA), 1.9B (V-JEPA 2.1) | **13.5M** (3×745K SLMs + 11.2M BLM) |
| **Input Modality** | Pixels / video frames | State vectors + characteristics (extensible to pixels with encoder) |
| **Planning** | CEM in latent space | BLM next-state prediction + info-request loop |
| **Gradient through discrete** | N/A (continuous latent space) | **ST-Sigmoid** for routing, **product-key CE** for addressing |
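Two discrete-training pieces from the table are standard enough to sketch: a straight-through sigmoid for the binary routing mask, and one common form of load-balance loss from the collapse-prevention row. The 0.5 threshold and the squared-error loss form are generic choices, not necessarily this repo's exact ones:

```python
import torch

def st_sigmoid(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Straight-through sigmoid: hard {0,1} routing mask in the forward pass,
    ordinary sigmoid gradients in the backward pass."""
    soft = torch.sigmoid(logits)
    hard = (soft > threshold).float()
    return hard + soft - soft.detach()  # forward == hard; grad flows via soft

def load_balance_loss(soft_gates: torch.Tensor) -> torch.Tensor:
    """Penalize uneven SLM usage across a batch. soft_gates: (batch, n_slm)."""
    usage = soft_gates.mean(dim=0)
    return ((usage - 1.0 / usage.numel()) ** 2).sum()
```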
### What Our Architecture Adds That JEPA Doesn't Have
1. **Explicit addressable memory**: JEPA has no equivalent; all "memory" in JEPA is implicit in weights. Our architecture has a literal 256KB RAM that models can read and write by address.
2. **Multi-agent retrieval**: 3 independent SLMs each search different memory regions. This is like having 3 specialized "attention heads" that look at different parts of a knowledge base, with a gating mechanism that selects the most useful ones.
3. **Active information request**: The BLM generates "what do I need next?" queries that steer what the SLMs look for. JEPA models have no equivalent; they receive all information passively. (A minimal sketch of this read-route-request loop follows this list.)
4. **CPU-inspired structure**: The address bus → RAM → data bus → processor pipeline mirrors actual computer architecture. This structural prior could help with the systematic, compositional reasoning that neural networks typically struggle with.
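A minimal, self-contained sketch of the loop described above. Everything here is illustrative: a dense softmax read stands in for the product-key addressing, the memory is a float tensor rather than true bit-level words, and the 64-dim latent width is invented for the example:

```python
import torch
import torch.nn as nn

WORDS, D = 65_536, 64  # 64K memory words; D is an illustrative latent width

class SLM(nn.Module):
    """Small model: scores memory words against the BLM's info-request query
    and returns a soft read (stand-in for product-key addressing)."""
    def __init__(self):
        super().__init__()
        self.key = nn.Linear(D, D)

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        scores = memory @ self.key(query)             # (WORDS,) relevance
        return torch.softmax(scores, dim=0) @ memory  # (D,) pooled read

class BLM(nn.Module):
    """Big model: gates SLM reads with a hard binary mask (e.g. [1,0,1]),
    predicts the next state, and emits the next info-request query."""
    def __init__(self, n_slm: int = 3):
        super().__init__()
        self.router = nn.Linear(D, n_slm)
        self.predictor = nn.Linear(D, D)
        self.query_head = nn.Linear(D, D)

    def forward(self, state: torch.Tensor, reads: torch.Tensor):
        soft = torch.sigmoid(self.router(state))
        mask = (soft > 0.5).float() + soft - soft.detach()  # ST-sigmoid routing
        agg = (mask.unsqueeze(-1) * reads).sum(dim=0)       # aggregate reads
        next_state = self.predictor(state + agg)
        return next_state, self.query_head(next_state)     # "what do I need?"

# One step: query -> SLM reads -> routed aggregate -> prediction -> new query.
memory = torch.randn(WORDS, D)  # float stand-in for the bit-level RAM contents
slms, blm = [SLM() for _ in range(3)], BLM()
state, query = torch.randn(D), torch.zeros(D)
reads = torch.stack([slm(query, memory) for slm in slms])
next_state, query = blm(state, reads)
```

The returned query feeds the next round of SLM reads, which is what closes the info-request loop; the routing mask is also directly inspectable, which is the interpretability claim in Section 7.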
---
## 6. What Benchmarks We Need to Run (Roadmap)
### Tier 1: Must Run (direct comparison with LeWM/DINO-WM)
| Benchmark | What's Needed | Expected Difficulty |
|-----------|--------------|-------------------|
| **Push-T** | Add pixel encoder to our architecture, train on Push-T trajectories | Medium: needs ~18K trajectories and a visual encoder front-end |
| **PointMaze/Wall** | Same as above | Easy: simple navigation |
| **OGBench-Cube** | Same + 3D rendering | Medium-Hard |
| **Physical Probing** | Train linear/MLP probes on our latent space | Easy: we already have latent representations |
| **VoE (Violation of Expectation)** | Inject anomalies, measure surprise | Easy: our architecture naturally computes prediction error |
### Tier 2: High-Impact Differentiators
| Benchmark | Why It Matters | Our Advantage |
|-----------|---------------|---------------|
| **IntPhys 2** | ALL JEPA models fail (≤57.5% vs 96.4% human) | Our explicit memory could help with object permanence |
| **Long-horizon planning** | JEPA models degrade over long rollouts | Our info-request loop provides feedback for multi-step prediction |
| **Memory-dependent tasks** | Tasks requiring recall of past observations | **Direct advantage**: our architecture has literal memory |
### Tier 3: Efficiency Benchmarks
| Metric | LeWM | Our Target |
|--------|------|-----------|
| Planning time | <1 second | Should be comparable (similar param count) |
| Training time | Single GPU, few hours | Same: 13.5M params |
| Training data efficiency | Scales with dataset size | To be measured |
---
## 7. Honest Assessment: Strengths and Weaknesses
### Where Our Architecture Should Excel (Hypotheses)
1. **Memory-dependent tasks**: Any task where the agent must remember and recall past observations to make current decisions. JEPA has no explicit memory; it's all in the latent state. Our 64K-word memory is persistent.
2. **Compositional state tracking**: Tasks with multiple objects where different "aspects" of the state need different information sources. Our 3 SLMs can specialize (one tracks the agent, one tracks the object, one tracks the environment).
3. **Anomaly detection / physics violation**: Our explicit memory + multi-step prediction error should catch "impossible" events better than implicit models. The info-request loop acts as an active hypothesis tester.
4. **Interpretability**: You can literally inspect which memory addresses were read, which SLMs were selected, what the info-request query was. JEPA is a black box.
### Where Our Architecture Will Likely Struggle
1. **Raw pixel processing**: Our current architecture works on state vectors. Adding a visual encoder is engineering work, whereas JEPA models are built pixel-first.
2. **Large-scale visual representation**: V-JEPA 2.1 at 1.9B params has seen millions of videos. Our 13.5M model can't compete on raw representation quality for visual tasks.
3. **Simple tasks**: LeWM already struggles on trivial environments (TwoRoom). Our more complex architecture might face similar issues; the overhead of memory + routing may not help when the task is simple.
4. **Training stability**: 3-phase training is more complex than LeWM's elegant 2-loss setup. More things can go wrong. (A skeleton of the schedule follows this list.)
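For reference, the 3-phase schedule is mechanically simple even if tuning it is not. A skeleton with toy stand-in modules and placeholder losses; the real SLM/BLM modules and objectives are not specified in this document:

```python
import torch
import torch.nn as nn

# Toy stand-ins with hypothetical shapes so the schedule runs end to end.
slms = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])
blm = nn.Linear(64, 64)
states, next_states = torch.randn(256, 64), torch.randn(256, 64)

def slm_loss():    # placeholder for the phase-1 retrieval objective
    return torch.stack([((s(states) - next_states) ** 2).mean() for s in slms]).sum()

def joint_loss():  # placeholder for next-state prediction through the BLM
    reads = torch.stack([s(states) for s in slms]).mean(dim=0)
    return ((blm(reads) - next_states) ** 2).mean()

# Phase 1: pre-train SLMs alone; BLM untouched.
opt = torch.optim.Adam(slms.parameters())
for _ in range(200):
    opt.zero_grad(); slm_loss().backward(); opt.step()

# Phase 2: joint BLM + SLM training.
opt = torch.optim.Adam([*slms.parameters(), *blm.parameters()])
for _ in range(200):
    opt.zero_grad(); joint_loss().backward(); opt.step()

# Phase 3: info-request refinement would reuse the phase-2 loop with the
# BLM's query in the gradient path; omitted here since its loss is unstated.
```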
---
## 8. Comparison Summary Table
| | I-JEPA | V-JEPA | DINO-WM | LeWM | V-JEPA 2.1 | **Ours** |
|---|---|---|---|---|---|---|
| **Params** | 632M | 632M | ~300M | 15M | 1.9B | **13.5M** |
| **Input** | Images | Video | Pixels+Act | Pixels+Act | Video | State vec |
| **Memory** | None | None | None | None | None | **Explicit 256KB** |
| **Multi-model routing** | No | No | No | No | No | **Yes** |
| **Active info request** | No | No | No | No | No | **Yes** |
| **Push-T SR** | – | – | 0.90 | 0.88 | – | ❌ TBD |
| **Maze SR** | – | – | 0.98 | – | – | ❌ TBD |
| **Reach SR** | – | – | 0.92 | – | – | ❌ TBD |
| **IntPhys 2** | – | – | – | – | 57.5% | ❌ TBD |
| **K400 Acc** | – | 81.9% | – | – | ~88% | N/A |
| **Planning Speed** | – | – | ~48s | **<1s** | – | <1s (est.) |
| **Training** | 16×A100 | Cluster | Offline | **1 GPU** | Cluster | **1 GPU** |
| **Interpretable** | Low | Low | Low | Low | Low | **High** |
---
## 9. Next Steps to Get Real Numbers
1. **Add visual encoder** to our architecture (small CNN or ViT-Tiny) for pixel observations → enables the Push-T, Maze, and Reach benchmarks (a minimal sketch follows this list)
2. **Integrate with `stable-worldmodel` evaluation suite** (arxiv:2602.08968) for standardized comparison
3. **Run Push-T first**: the most-used benchmark, with open code, where our architecture could show SLM specialization
4. **Design a memory-dependent benchmark**: a custom task where agents MUST recall past observations to solve current goals. This is where we should clearly beat all JEPA models.
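A minimal sketch of the encoder front-end from step 1, assuming 64×64 RGB observations and an illustrative output width matching our state vectors; the layer sizes are placeholders, not a tuned design:

```python
import torch
import torch.nn as nn

class TinyPixelEncoder(nn.Module):
    """Small CNN mapping 64x64 RGB frames to the state-vector width that the
    memory architecture expects downstream."""
    def __init__(self, state_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, state_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, 64, 64) in [0, 1] -> (B, state_dim) state vectors
        return self.net(frames)
```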
---
## 10. References
| Paper | ArXiv ID | Key Numbers |
|-------|----------|-------------|
| I-JEPA | 2301.08243 | ImageNet linear: 80%+, 632M params |
| V-JEPA | 2404.08471 | K400: 81.9%, SSv2: 72.2%, 632M params |
| DINO-WM | 2411.04983 | Push-T: 0.90 SR, Reach: 0.92 SR |
| LeWM | 2603.19312 | Push-T: 0.88 SR, 15M params, <1s planning, 48× faster |
| V-JEPA 2.1 | 2603.14482 | Ego4D: 7.71 mAP, SSv2: 77.7%, 1.9B params |
| IntPhys 2 | 2506.09849 | V-JEPA 2: 57.5%, Human: 96.4% |
| JEPA-WMs study | 2512.24497 | CEM best planner, proprioception critical |
| DreamerV3 | 2301.04104 | Atari SOTA, Push-T: 0.30 SR |
| LeCun position paper | 2306.02572 | Theoretical H-JEPA architecture |