LeWorld Memory Architecture vs. LeCun's World Models – Benchmark Comparison
⚠️ Honest Status: This Is a Theoretical Comparison
Our architecture has NOT been benchmarked on the standard world-model evaluation suites yet. What follows is:
- A factual catalog of every LeCun-family world model, their exact published numbers, and benchmarks
- An architectural comparison showing where our design is fundamentally different
- A concrete plan for which benchmarks to run and what we'd need to demonstrate
We are not claiming results we don't have. Here's what exists and what's needed.
1. The LeCun World Model Family – Published Numbers
Model Lineup (by publication date)
| Model | Paper | Params | Training | Key Innovation |
|---|---|---|---|---|
| I-JEPA | arxiv:2301.08243 | 632M (ViT-H/14) | ImageNet, 16×A100 | Predicts masked image patch representations |
| V-JEPA | arxiv:2404.08471 | 632M (ViT-H/16) | VideoMix2M (2M videos) | Predicts masked video region features |
| DINO-WM | arxiv:2411.04983 | ~300M (frozen DINOv2 + predictor) | Offline trajectories | Plans in DINOv2 patch-feature space |
| LeWM | arxiv:2603.19312 | ~15M | Single GPU, few hours | End-to-end from pixels, 2 loss terms only |
| V-JEPA 2.1 | arxiv:2603.14482 | 1.1–1.9B (ViT-g/G) | Massive video corpus | Dense features, multi-layer loss |
LeWM is our primary comparison target (both ~15M params).
2. Published Benchmark Results – Robotic Planning
Push-T (Tabletop Block Pushing) – Success Rate ↑
| Model | Params | Success Rate | Planning Time | Notes |
|---|---|---|---|---|
| DINO-WM | ~300M+ (frozen DINOv2) | 0.90 | ~48 seconds | Uses pretrained DINOv2 encoder |
| LeWM | ~15M | 0.88 (beats DINO-WM pixel-only) | <1 second | End-to-end from pixels, 48× faster |
| PLDM | ~15M | 0.70 | <1 second | End-to-end, 7 loss terms (VICReg) |
| DreamerV3 | 12–400M | 0.30 | – | Model-based RL, needs rewards |
| IRIS | – | 0.32 | – | – |
| TD-MPC2 | – | 0.00 | – | Fails without reward signal |
| Ours (LeWorld Memory) | 13.5M | Not yet tested | – | – |
PointMaze / Wall Navigation – Success Rate ↑
| Model | Maze SR | Wall SR | Notes |
|---|---|---|---|
| DINO-WM | 0.98 | 0.96 | Near-perfect |
| DreamerV3 | 1.00 | 1.00 | Perfect on simple navigation |
| LeWM | Lower than DINO-WM | Lower | Struggles on very simple envs (SIGReg limitation) |
| Ours | Not yet tested | Not yet tested | – |
Reach (Robotic Arm) – Success Rate ↑
| Model | Success Rate |
|---|---|
| DINO-WM | 0.92 |
| DreamerV3 | 0.64 |
| IRIS | 0.18 |
| Ours | Not yet tested |
Rope & Granular Manipulation – Chamfer Distance ↓
| Model | Rope CD ↓ | Granular CD ↓ |
|---|---|---|
| DINO-WM | 0.41 | 0.26 |
| DreamerV3 | 2.49 | 1.05 |
| IRIS | 1.11 | 0.37 |
| Ours | Not yet tested | Not yet tested |
3. Published Benchmark Results – Physical Understanding
Physical Latent Probing on Push-T (Pearson r ↑, higher is better)
| Property | DINO-WM (MLP) | LeWM (MLP) | PLDM (MLP) |
|---|---|---|---|
| Agent Location | r = 0.999 | r = 0.998 | r = 0.993 |
| Block Location | r = 0.999 | r = 0.999 | r = 0.994 |
| Block Angle | r = 0.995 | r = 0.990 | r = 0.972 |
LeWM at 15M achieves near-parity with DINO-WM (300M+ pretrained) on physical probing.
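As a concrete reference for the probing protocol above, here is a minimal sketch of fitting an MLP probe and scoring Pearson r. It assumes a matrix of frozen world-model latents and a ground-truth property vector; the function name and hyperparameters are illustrative, not the published evaluation code, and a real evaluation would score on a held-out split.

```python
# Sketch of a physical-property probe (e.g. block angle) on frozen latents.
import torch
import torch.nn as nn

def train_probe(latents: torch.Tensor, targets: torch.Tensor,
                epochs: int = 200, lr: float = 1e-3) -> float:
    """Fit a small MLP probe on (N, D) latents and return Pearson r."""
    probe = nn.Sequential(nn.Linear(latents.shape[1], 256), nn.ReLU(),
                          nn.Linear(256, 1))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = probe(latents).squeeze(-1)
        nn.functional.mse_loss(pred, targets).backward()
        opt.step()
    with torch.no_grad():
        pred = probe(latents).squeeze(-1)
        # Pearson correlation between probe prediction and ground truth.
        px, tx = pred - pred.mean(), targets - targets.mean()
        r = (px * tx).sum() / (px.norm() * tx.norm() + 1e-8)
    return r.item()
```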
Violation-of-Expectation (VoE) – Physics Anomaly Detection
| Perturbation | LeWM | PLDM |
|---|---|---|
| Teleportation (physically impossible) | Detects (p<0.01) | Detects |
| Color change (visual only) | Does NOT flag | Does NOT flag |
| Correct distinction? | ✅ Yes | ✅ Yes |
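The VoE protocol above reduces to scoring prediction error and flagging outliers. A minimal sketch, assuming a one-step latent predictor `predict_next`; the z-score rule is illustrative (the published results report significance, e.g. p<0.01, rather than this exact threshold):

```python
# Sketch of Violation-of-Expectation scoring via one-step prediction error.
import numpy as np

def voe_scores(latents: np.ndarray, predict_next) -> np.ndarray:
    """Per-step surprise: distance between predicted and observed latent."""
    preds = np.stack([predict_next(z) for z in latents[:-1]])
    return np.linalg.norm(preds - latents[1:], axis=-1)

def flags_violation(scores: np.ndarray, z_thresh: float = 3.0) -> bool:
    """Flag a clip if any step's surprise is an outlier vs. the clip's own
    error distribution: teleportation spikes, a color change should not."""
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    return bool((z > z_thresh).any())
```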
IntPhys 2 (Intuitive Physics) – Accuracy ↑
| Model | Accuracy | Notes |
|---|---|---|
| Human | 96.4% | Ceiling |
| V-JEPA 2 (1B+ params) | 57.5% | Best JEPA model |
| Gemini 2.5 Flash | 55.6% | Best commercial LLM |
| GPT-4o | ~50% | Near random |
| Ours | Not yet tested | Potential differentiator (see below) |
The IntPhys gap (57.5% vs 96.4% human) is the biggest open problem in world models.
4. Published Benchmark Results – Video Understanding
Kinetics-400 (Video Action Recognition) – Top-1 Accuracy ↑
| Model | Params | Accuracy | Probe Type |
|---|---|---|---|
| V-JEPA 2.1 ViT-G | 1.9B | ~88% | Attentive |
| V-JEPA ViT-H/16 | 632M | 81.9% | Frozen, attentive |
| VideoMAEv2 ViT-H | ~632M | 87% | Fine-tuned |
| Ours | 13.5M | N/A | Different modality (state vectors, not video) |
Something-Something-v2 (Temporal Reasoning) – Top-1 Accuracy ↑
| Model | Accuracy |
|---|---|
| V-JEPA 2.1 ViT-G | 77.7% |
| V-JEPA ViT-H/16 | 72.2% |
| Ours | N/A |
Note: Our architecture operates on state vectors + bit-level memory, not pixels/video. The video benchmarks above are not directly applicable without adding a visual encoder front-end.
5. Architectural Comparison – Where We're Different
| Dimension | LeWM / JEPA Family | Our LeWorld Memory Architecture |
|---|---|---|
| Memory | Implicit (in network weights + latent state) | Explicit bit-level RAM (64K × 32-bit words, address-range access) |
| State Prediction | Single model predicts next embedding | Hierarchical: 3 SLMs find memory, 1 BLM aggregates + predicts |
| Information Retrieval | All in one forward pass | Active retrieval: BLM asks "what do I need?", SLMs fetch from memory |
| Model Selection | N/A (single model) | Binary routing [1,0,1]: BLM selects which SLMs to trust |
| Collapse Prevention | SIGReg (LeWM), EMA (V-JEPA), frozen encoder (DINO-WM) | Diversity loss + load-balance loss + temperature annealing |
| Training | Single loss (LeWM: 2 terms) | 3-phase: pre-train → joint → info-request refinement |
| Params | 15M (LeWM), 632M (V-JEPA), 1.9B (V-JEPA 2.1) | 13.5M (3 × 745K SLMs + 11.2M BLM) |
| Input Modality | Pixels / video frames | State vectors + characteristics (extensible to pixels with encoder) |
| Planning | CEM in latent space | BLM next-state prediction + info-request loop |
| Gradient through discrete | N/A (continuous latent space) | ST-Sigmoid for routing, product-key CE for addressing |
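The last row of the table deserves a concrete illustration. A minimal sketch of ST-Sigmoid routing, assuming one gate logit per SLM: the forward pass uses hard 0/1 gates like [1, 0, 1], while the backward pass uses sigmoid gradients.

```python
# Sketch of straight-through sigmoid gating for binary SLM routing.
import torch

def st_sigmoid(logits: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: forward = hard 0/1, backward = sigmoid."""
    soft = torch.sigmoid(logits)
    hard = (soft > 0.5).float()
    # `hard` in the forward pass; gradient flows through `soft`.
    return hard + (soft - soft.detach())

# Example: BLM emits one logit per SLM; gates select whose output to trust.
logits = torch.tensor([2.1, -0.7, 1.3], requires_grad=True)
gates = st_sigmoid(logits)           # e.g. tensor([1., 0., 1.])
slm_outputs = torch.randn(3, 16)     # one feature vector per SLM
fused = (gates.unsqueeze(-1) * slm_outputs).sum(dim=0)
```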
What Our Architecture Adds That JEPA Doesn't Have
Explicit addressable memory – JEPA has no equivalent. All "memory" in JEPA is implicit in weights. Our architecture has a literal 256KB RAM that models can read and write by address.
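A minimal sketch of such a memory, sized as described (64K × 32-bit words = 256 KB), with address-range reads and writes; this illustrates the interface, not our training-time implementation.

```python
# Minimal sketch of the explicit word-addressed memory described above.
import numpy as np

class WordMemory:
    WORDS, WORD_BITS = 65536, 32   # 64K x 32-bit words = 256 KB

    def __init__(self):
        self.mem = np.zeros(self.WORDS, dtype=np.uint32)

    def read(self, addr: int, length: int) -> np.ndarray:
        """Read `length` consecutive 32-bit words starting at `addr`."""
        assert 0 <= addr and addr + length <= self.WORDS
        return self.mem[addr:addr + length].copy()

    def write(self, addr: int, words: np.ndarray) -> None:
        """Write a block of 32-bit words starting at `addr`."""
        assert 0 <= addr and addr + len(words) <= self.WORDS
        self.mem[addr:addr + len(words)] = words.astype(np.uint32)

ram = WordMemory()
ram.write(0x1000, np.arange(8))
chunk = ram.read(0x1000, 8)   # address-range access, CPU-style
```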
Multi-agent retrieval – 3 independent SLMs each search different memory regions. This is like having 3 specialized "attention heads" that look at different parts of a knowledge base, with a gating mechanism that selects the most useful ones.
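A sketch of that retrieval pattern, assuming memory rows are feature vectors and each SLM scores its own region against a query; the region boundaries, dot-product relevance score, and max-similarity gate are all illustrative choices.

```python
# Sketch of three-SLM retrieval over disjoint memory regions.
import numpy as np

def retrieve(memory: np.ndarray, query: np.ndarray,
             regions=((0, 21845), (21845, 43690), (43690, 65536)),
             k: int = 4):
    """Each 'SLM' scores its region against the query; return per-region
    top-k addresses plus a gate score (max similarity) for routing."""
    results = []
    for lo, hi in regions:
        region = memory[lo:hi]                 # (R, D) feature rows
        sims = region @ query                  # dot-product relevance
        top = np.argsort(sims)[-k:][::-1]      # best addresses in region
        results.append({"addrs": top + lo, "gate": float(sims[top[0]])})
    return results

memory = np.random.randn(65536, 16).astype(np.float32)
out = retrieve(memory, np.random.randn(16).astype(np.float32))
```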
Active information request – The BLM generates "what do I need next?" queries that influence what SLMs look for. JEPA models have no equivalent; they receive all information passively.
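A minimal sketch of the loop, with assumed module interfaces: linear heads for the next-state prediction and the info-request query, and a `fetch` callable standing in for SLM retrieval.

```python
# Sketch of the active information-request loop on the BLM side.
import torch
import torch.nn as nn

class InfoRequestLoop(nn.Module):
    """Predict the next state, then ask for what is still missing."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.predict = nn.Linear(2 * d, d)   # next-state head
        self.request = nn.Linear(d, d)       # "what do I need?" query head

    def forward(self, state, fetch, steps: int = 3):
        query = torch.zeros_like(state)
        for _ in range(steps):
            retrieved = fetch(query)          # SLMs search memory given query
            state = self.predict(torch.cat([state, retrieved], dim=-1))
            query = self.request(state)       # refine what to look for next
        return state

loop = InfoRequestLoop()
# Stand-in for SLM retrieval: any query -> feature mapping works here.
state = loop(torch.randn(1, 64), fetch=lambda q: torch.tanh(q))
```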
CPU-inspired structure – The address bus → RAM → data bus → processor pipeline mirrors actual computer architecture. This structural prior could help with the systematic, compositional reasoning that neural networks typically struggle with.
6. What Benchmarks We Need to Run (Roadmap)
Tier 1 – Must Run (direct comparison with LeWM/DINO-WM)
| Benchmark | What's Needed | Expected Difficulty |
|---|---|---|
| Push-T | Add pixel encoder to our architecture, train on Push-T trajectories | Medium – need ~18K trajectories, visual encoder front-end |
| PointMaze/Wall | Same as above | Easy – simple navigation |
| OGBench-Cube | Same + 3D rendering | Medium-Hard |
| Physical Probing | Train linear/MLP probes on our latent space | Easy – we already have latent representations |
| VoE (Violation of Expectation) | Inject anomalies, measure surprise | Easy – our architecture naturally computes prediction error |
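For context on what the Push-T comparison involves: per the architectural table in Section 5, the JEPA-family baselines plan with CEM in latent space (the JEPA-WMs study in Section 10 also found CEM the strongest planner), and DINO-WM's ~48 second planning time reflects that search. A minimal sketch of latent-space CEM with a stand-in dynamics model; all hyperparameters are illustrative.

```python
# Sketch of cross-entropy-method (CEM) planning in latent space.
import numpy as np

def cem_plan(z0, z_goal, dynamics, horizon=10, pop=64, elites=8,
             iters=5, act_dim=2):
    """Sample action sequences, keep the elites, refit the sampling
    distribution, and return the best first action."""
    mu, std = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(iters):
        acts = mu + std * np.random.randn(pop, horizon, act_dim)
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for a in acts[i]:
                z = dynamics(z, a)            # roll out in latent space
            costs[i] = np.linalg.norm(z - z_goal)
        elite = acts[np.argsort(costs)[:elites]]
        mu, std = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]

# Toy check with linear latent dynamics (purely illustrative).
dyn = lambda z, a: z + 0.1 * np.pad(a, (0, 14))
a0 = cem_plan(np.zeros(16), np.ones(16), dyn)
```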
Tier 2 – High-Impact Differentiators
| Benchmark | Why It Matters | Our Advantage |
|---|---|---|
| IntPhys 2 | ALL JEPA models fail (≤57.5% vs 96.4% human) | Our explicit memory could help with object permanence |
| Long-horizon planning | JEPA models degrade over long rollouts | Our info-request loop provides feedback for multi-step prediction |
| Memory-dependent tasks | Tasks requiring recall of past observations | Direct advantage β our architecture has literal memory |
Tier 3 – Efficiency Benchmarks
| Metric | LeWM | Our Target |
|---|---|---|
| Planning time | <1 second | Should be comparable (similar param count) |
| Training time | Single GPU, few hours | Same – 13.5M params |
| Training data efficiency | Scales with dataset size | To be measured |
7. Honest Assessment – Strengths and Weaknesses
Where Our Architecture Should Excel (Hypotheses)
Memory-dependent tasks: Any task where the agent must remember and recall past observations to make current decisions. JEPA has no explicit memory; it is all in the latent state. Our 64K-word memory is persistent.
Compositional state tracking: Tasks with multiple objects where different "aspects" of the state need different information sources. Our 3 SLMs can specialize (one tracks the agent, one tracks the object, one tracks the environment).
Anomaly detection / physics violation: Our explicit memory + multi-step prediction error should catch "impossible" events better than implicit models. The info-request loop acts as an active hypothesis tester.
Interpretability: You can literally inspect which memory addresses were read, which SLMs were selected, what the info-request query was. JEPA is a black box.
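For illustration, the kind of per-step trace this enables might look like the following; the record layout is an assumption for this sketch, not a fixed format.

```python
# Sketch of an interpretability trace: what was read, who was trusted.
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    slm_reads: list      # e.g. [[0x1A00, 0x1A04], [], [0x9F20]] per SLM
    routing: list        # e.g. [1, 0, 1] - which SLMs the BLM trusted
    query: list = field(default_factory=list)  # the info-request vector

trace = StepTrace(slm_reads=[[0x1A00, 0x1A04], [], [0x9F20]],
                  routing=[1, 0, 1], query=[0.2, -0.7])
print(f"SLMs used: {[i for i, g in enumerate(trace.routing) if g]}")
```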
Where Our Architecture Will Likely Struggle
Raw pixel processing: Our current architecture works on state vectors. Adding a visual encoder is engineering work, but JEPA models are built pixel-first.
Large-scale visual representation: V-JEPA 2.1 at 1.9B params has seen millions of videos. Our 13.5M model can't compete on raw representation quality for visual tasks.
Simple tasks: LeWM already struggles on trivial environments (TwoRoom). Our more complex architecture might face similar issues; the overhead of memory + routing may not help when the task is simple.
Training stability: 3-phase training is more complex than LeWM's elegant 2-loss setup. More things can go wrong.
8. Comparison Summary Table
| | I-JEPA | V-JEPA | DINO-WM | LeWM | V-JEPA 2.1 | Ours |
|---|---|---|---|---|---|---|
| Params | 632M | 632M | ~300M | 15M | 1.9B | 13.5M |
| Input | Images | Video | Pixels+Act | Pixels+Act | Video | State vec |
| Memory | None | None | None | None | None | Explicit 256KB |
| Multi-model routing | No | No | No | No | No | Yes |
| Active info request | No | No | No | No | No | Yes |
| Push-T SR | – | – | 0.90 | 0.88 | – | TBD |
| Maze SR | – | – | 0.98 | – | – | TBD |
| Reach SR | – | – | 0.92 | – | – | TBD |
| IntPhys 2 | – | – | – | – | 57.5% | TBD |
| K400 Acc | – | 81.9% | – | – | ~88% | N/A |
| Planning Speed | – | – | ~48s | <1s | – | <1s (est.) |
| Training | 16×A100 | Cluster | Offline | 1 GPU | Cluster | 1 GPU |
| Interpretable | Low | Low | Low | Low | Low | High |
9. Next Steps to Get Real Numbers
- Add a visual encoder to our architecture (small CNN or ViT-Tiny) for pixel observations; enables the Push-T, Maze, and Reach benchmarks (see the encoder sketch after this list)
- Integrate with the stable-worldmodel evaluation suite (arxiv:2602.08968) for standardized comparison
- Run Push-T first – the most-used benchmark, with open code, where our architecture could show SLM specialization
- Design a memory-dependent benchmark – a custom task where agents MUST recall past observations to solve current goals. This is where we should clearly beat all JEPA models.
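A minimal sketch of the step-1 encoder, assuming 64×64 RGB input and illustrative layer sizes; it maps frames to the state-vector dimension the rest of the architecture already consumes.

```python
# Sketch of a small CNN front-end for pixel observations.
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    def __init__(self, d_state: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.AdaptiveAvgPool2d(1),                               # (B,128,1,1)
        )
        self.proj = nn.Linear(128, d_state)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (B, 3, 64, 64) -> state vectors (B, d_state)."""
        return self.proj(self.conv(frames).flatten(1))

enc = PixelEncoder()
z = enc(torch.rand(2, 3, 64, 64))   # torch.Size([2, 64])
```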
10. References
| Paper | ArXiv ID | Key Numbers |
|---|---|---|
| I-JEPA | 2301.08243 | ImageNet linear: 80%+, 632M params |
| V-JEPA | 2404.08471 | K400: 81.9%, SSv2: 72.2%, 632M params |
| DINO-WM | 2411.04983 | Push-T: 0.90 SR, Reach: 0.92 SR |
| LeWM | 2603.19312 | Push-T: 0.88 SR, 15M params, <1s planning, 48× faster |
| V-JEPA 2.1 | 2603.14482 | Ego4D: 7.71 mAP, SSv2: 77.7%, 1.9B params |
| IntPhys 2 | 2506.09849 | V-JEPA 2: 57.5%, Human: 96.4% |
| JEPA-WMs study | 2512.24497 | CEM best planner, proprioception critical |
| DreamerV3 | 2301.04104 | Atari SOTA, Push-T: 0.30 SR |
| LeCun position paper | 2306.02572 | Theoretical H-JEPA architecture |