
# LeWorld Memory Architecture vs. LeCun's World Models: Benchmark Comparison

## ⚠️ Honest Status: This Is a Theoretical Comparison

Our architecture has NOT been benchmarked on the standard world-model evaluation suites yet. What follows is:

  1. A factual catalog of every LeCun-family world model, their exact published numbers, and benchmarks
  2. An architectural comparison showing where our design is fundamentally different
  3. A concrete plan for which benchmarks to run and what we'd need to demonstrate

We are not claiming results we don't have. Here's what exists and what's needed.


## 1. The LeCun World Model Family: Published Numbers

### Model Lineup (by publication date)

| Model | Paper | Params | Training | Key Innovation |
|---|---|---|---|---|
| I-JEPA | arxiv:2301.08243 | 632M (ViT-H/14) | ImageNet, 16×A100 | Predicts masked image patch representations |
| V-JEPA | arxiv:2404.08471 | 632M (ViT-H/16) | VideoMix2M (2M videos) | Predicts masked video region features |
| DINO-WM | arxiv:2411.04983 | ~300M (frozen DINOv2 + predictor) | Offline trajectories | Plans in DINOv2 patch-feature space |
| LeWM | arxiv:2603.19312 | ~15M | Single GPU, few hours | End-to-end from pixels, 2 loss terms only |
| V-JEPA 2.1 | arxiv:2603.14482 | 1.1–1.9B (ViT-g/G) | Massive video corpus | Dense features, multi-layer loss |

LeWM is our primary comparison target (both ~15M params).


## 2. Published Benchmark Results: Robotic Planning

### Push-T (Tabletop Block Pushing) → Success Rate ↑

| Model | Params | Success Rate | Planning Time | Notes |
|---|---|---|---|---|
| DINO-WM | ~300M+ (frozen DINOv2) | 0.90 | ~48 s | Uses pretrained DINOv2 encoder |
| LeWM | ~15M | 0.88 | <1 s | End-to-end from pixels; beats the pixel-only DINO-WM variant; 48× faster planning |
| PLDM | ~15M | 0.70 | <1 s | End-to-end, 7 loss terms (VICReg) |
| DreamerV3 | 12–400M | 0.30 | – | Model-based RL, needs rewards |
| IRIS | – | 0.32 | – | – |
| TD-MPC2 | – | 0.00 | – | Fails without reward signal |
| Ours (LeWorld Memory) | 13.5M | ❌ Not yet tested | – | – |
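
The planning-time gap above is largely a property of the planner rather than model size: DINO-WM plans with the cross-entropy method (CEM), which re-rolls the world model many times per decision. A generic CEM sketch for intuition (the `rollout_cost` callable and every hyperparameter below are illustrative assumptions, not DINO-WM's published settings):

```python
import numpy as np

def cem_plan(rollout_cost, horizon=10, act_dim=2, iters=6, pop=64, elite=8):
    """Cross-entropy method planning: sample action sequences, keep the
    cheapest, refit the sampling distribution, repeat. `rollout_cost`
    maps a (horizon, act_dim) action sequence to a scalar cost by
    rolling the world model forward in latent space."""
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        cand = mu + sigma * np.random.randn(pop, horizon, act_dim)
        costs = np.array([rollout_cost(a) for a in cand])
        elites = cand[np.argsort(costs)[:elite]]        # keep best sequences
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # planned action sequence (typically only its first step is executed)

# Total cost: iters * pop world-model rollouts per planned action, which is
# why planning time scales with the model's forward-pass cost.
```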

### PointMaze / Wall Navigation → Success Rate ↑

| Model | Maze SR | Wall SR | Notes |
|---|---|---|---|
| DINO-WM | 0.98 | 0.96 | Near-perfect |
| DreamerV3 | 1.00 | 1.00 | Perfect on simple navigation |
| LeWM | Lower than DINO-WM | Lower | Struggles on very simple environments (SIGReg limitation) |
| Ours | ❌ Not yet tested | ❌ Not yet tested | – |

### Reach (Robotic Arm) → Success Rate ↑

| Model | Success Rate |
|---|---|
| DINO-WM | 0.92 |
| DreamerV3 | 0.64 |
| IRIS | 0.18 |
| Ours | ❌ Not yet tested |

### Rope & Granular Manipulation → Chamfer Distance ↓

| Model | Rope CD ↓ | Granular CD ↓ |
|---|---|---|
| DINO-WM | 0.41 | 0.26 |
| DreamerV3 | 2.49 | 1.05 |
| IRIS | 1.11 | 0.37 |
| Ours | ❌ Not yet tested | ❌ Not yet tested |

## 3. Published Benchmark Results: Physical Understanding

### Physical Latent Probing on Push-T (Pearson r ↑, higher = better)

| Property | DINO-WM (MLP) | LeWM (MLP) | PLDM (MLP) |
|---|---|---|---|
| Agent Location | r = 0.999 | r = 0.998 | r = 0.993 |
| Block Location | r = 0.999 | r = 0.999 | r = 0.994 |
| Block Angle | r = 0.995 | r = 0.990 | r = 0.972 |

LeWM at 15M achieves near-parity with DINO-WM (300M+ pretrained) on physical probing.
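
The probing protocol is cheap to replicate on any latent space, so this is likely the first number we can produce for our own architecture. A minimal sketch, assuming frozen latents and ground-truth physical properties have already been extracted as arrays (the published rows above use MLP probes; the linear ridge probe shown here is the simplest variant, and the regularizer strength is an assumption):

```python
import numpy as np

def linear_probe_pearson(z_train, y_train, z_test, y_test, lam=1e-3):
    """Fit a ridge-regularized linear probe on frozen latents and report
    Pearson r between probe predictions and ground truth on held-out data.
    z_*: (N, d) latent vectors; y_*: (N,) scalar property (e.g., block angle)."""
    Z = np.hstack([z_train, np.ones((len(z_train), 1))])  # append bias column
    # Closed-form ridge regression: w = (Z^T Z + lam I)^-1 Z^T y
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y_train)
    Zt = np.hstack([z_test, np.ones((len(z_test), 1))])
    pred = Zt @ w
    return np.corrcoef(pred, y_test)[0, 1]
```

For vector-valued properties (e.g., 2D block location), run the probe per dimension and average the resulting r values.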

### Violation-of-Expectation (VoE): Physics Anomaly Detection

| Perturbation | LeWM | PLDM |
|---|---|---|
| Teleportation (physically impossible) | Detects (p < 0.01) | Detects |
| Color change (visual only) | Does NOT flag | Does NOT flag |
| Correct distinction? | ✅ Yes | ✅ Yes |

### IntPhys 2 (Intuitive Physics) → Accuracy ↑

| Model | Accuracy | Notes |
|---|---|---|
| Human | 96.4% | Ceiling |
| V-JEPA 2 (1B+ params) | 57.5% | Best JEPA model |
| Gemini 2.5 Flash | 55.6% | Best commercial LLM |
| GPT-4o | ~50% | Near random |
| Ours | ❌ Not yet tested | Potential differentiator (see below) |

The IntPhys gap (57.5% vs 96.4% human) is one of the biggest open problems in world models.


## 4. Published Benchmark Results: Video Understanding

### Kinetics-400 (Video Action Recognition) → Top-1 Accuracy ↑

| Model | Params | Accuracy | Probe Type |
|---|---|---|---|
| V-JEPA 2.1 ViT-G | 1.9B | ~88% | Attentive |
| V-JEPA ViT-H/16 | 632M | 81.9% | Frozen, attentive |
| VideoMAEv2 ViT-H | ~632M | 87% | Fine-tuned |
| Ours | 13.5M | N/A | Different modality (state vectors, not video) |

### Something-Something-v2 (Temporal Reasoning) → Top-1 Accuracy ↑

| Model | Accuracy |
|---|---|
| V-JEPA 2.1 ViT-G | 77.7% |
| V-JEPA ViT-H/16 | 72.2% |
| Ours | N/A |

Note: Our architecture operates on state vectors + bit-level memory, not pixels/video. The video benchmarks above are not directly applicable without adding a visual encoder front-end.


## 5. Architectural Comparison: Where We're Different

| Dimension | LeWM / JEPA Family | Our LeWorld Memory Architecture |
|---|---|---|
| Memory | Implicit (in network weights + latent state) | Explicit bit-level RAM (64K × 32-bit words, address-range access) |
| State Prediction | Single model predicts next embedding | Hierarchical: 3 SLMs find memory, 1 BLM aggregates + predicts |
| Information Retrieval | All in one forward pass | Active retrieval: BLM asks "what do I need?", SLMs fetch from memory |
| Model Selection | N/A (single model) | Binary routing [1,0,1]: BLM selects which SLMs to trust |
| Collapse Prevention | SIGReg (LeWM), EMA (V-JEPA), frozen encoder (DINO-WM) | Diversity loss + load-balance loss + temperature annealing |
| Training | Single loss (LeWM: 2 terms) | 3-phase: pre-train → joint → info-request refinement |
| Params | 15M (LeWM), 632M (V-JEPA), 1.9B (V-JEPA 2.1) | 13.5M (3×745K SLMs + 11.2M BLM) |
| Input Modality | Pixels / video frames | State vectors + characteristics (extensible to pixels with encoder) |
| Planning | CEM in latent space | BLM next-state prediction + info-request loop |
| Gradients through discrete choices | N/A (continuous latent space) | ST-Sigmoid for routing, product-key CE for addressing |
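
To make the last two rows concrete, here is a minimal sketch of straight-through sigmoid routing plus plausible forms of the two anti-collapse terms. Only the ST estimator itself is standard; the diversity and load-balance losses are simplified here to gate-entropy and usage-variance terms (one plausible instantiation, not necessarily our exact training losses), and the product-key addressing path is omitted:

```python
import torch

def st_sigmoid_route(logits, temperature=1.0):
    """Straight-through sigmoid: emits a hard 0/1 routing mask in the
    forward pass while gradients flow through the soft sigmoid.
    Temperature is annealed downward during training so routing hardens."""
    probs = torch.sigmoid(logits / temperature)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach(), probs  # forward: hard; backward: probs

def anti_collapse_terms(probs):
    """probs: (batch, num_slms) soft routing probabilities.
    Returns (load_balance, gate_entropy). The training loss adds
    load_balance (keeps all SLMs in use) and subtracts gate_entropy
    (keeps gates from saturating identically early in training)."""
    usage = probs.mean(dim=0)                           # per-SLM selection rate
    load_balance = ((usage - usage.mean()) ** 2).sum()  # penalize uneven usage
    p = probs.clamp(1e-6, 1 - 1e-6)
    gate_entropy = -(p * p.log() + (1 - p) * (1 - p).log()).mean()
    return load_balance, gate_entropy
```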

### What Our Architecture Adds That JEPA Doesn't Have

  1. Explicit addressable memory: JEPA has no equivalent. All "memory" in JEPA is implicit in weights. Our architecture has a literal 256KB RAM that models can read and write by address (a minimal sketch follows this list).

  2. Multi-agent retrieval: 3 independent SLMs each search different memory regions. This is like having 3 specialized "attention heads" that look at different parts of a knowledge base, with a gating mechanism that selects the most useful ones.

  3. Active information request: The BLM generates "what do I need next?" queries that influence what the SLMs look for. JEPA models have no equivalent; they receive all information passively.

  4. CPU-inspired structure: The address bus → RAM → data bus → processor pipeline mirrors actual computer architecture. This structural prior could help with the systematic, compositional reasoning that neural networks typically struggle with.
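
A minimal sketch of what point 1 means in practice (class and method names are hypothetical, and the differentiable addressing path from the comparison table is omitted):

```python
import numpy as np

class WordMemory:
    """64K addressable 32-bit words = 256 KB of explicit state that
    persists across prediction steps. Sketch only; names hypothetical."""

    def __init__(self, num_words=65_536):
        self.words = np.zeros(num_words, dtype=np.uint32)

    def write(self, addr, values):
        values = np.asarray(values, dtype=np.uint32)
        self.words[addr:addr + len(values)] = values

    def read(self, addr, length):
        # Address-range access: an SLM fetches a contiguous block of words.
        return self.words[addr:addr + length].copy()

    def read_bits(self, addr, length):
        # Bit-level view (length x 32) for models that consume raw bits.
        # Byte order within each word follows the platform's endianness.
        block = self.read(addr, length)
        return np.unpackbits(block.view(np.uint8)).reshape(length, 32)

# Example: mem = WordMemory(); mem.write(0x1000, [42, 7]); mem.read(0x1000, 2)
```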


## 6. What Benchmarks We Need to Run (Roadmap)

### Tier 1: Must Run (direct comparison with LeWM/DINO-WM)

| Benchmark | What's Needed | Expected Difficulty |
|---|---|---|
| Push-T | Add pixel encoder to our architecture, train on Push-T trajectories | Medium: need ~18K trajectories, visual encoder front-end |
| PointMaze/Wall | Same as above | Easy: simple navigation |
| OGBench-Cube | Same + 3D rendering | Medium-Hard |
| Physical Probing | Train linear/MLP probes on our latent space | Easy: we already have latent representations |
| VoE (Violation of Expectation) | Inject anomalies, measure surprise | Easy: our architecture naturally computes prediction error |
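
The VoE row is cheap because the architecture already computes a next-state prediction error at every step. A minimal sketch of the detection test (the one-sided Welch t-test here is our assumption; the published LeWM protocol may differ in detail):

```python
from scipy import stats

def voe_detects(err_normal, err_perturbed, alpha=0.01):
    """Flag a perturbation as 'detected' if prediction error on perturbed
    rollouts is significantly higher than on matched normal rollouts.
    err_*: 1-D arrays of per-rollout prediction errors."""
    t, p_two = stats.ttest_ind(err_perturbed, err_normal, equal_var=False)
    p = p_two / 2 if t > 0 else 1 - p_two / 2  # one-sided: perturbed > normal
    return p < alpha, p
```

A teleportation event should trip this test while a pure color change should not, reproducing the distinction reported in Section 3.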

### Tier 2: High-Impact Differentiators

| Benchmark | Why It Matters | Our Advantage |
|---|---|---|
| IntPhys 2 | All JEPA models fail (≤57.5% vs 96.4% human) | Our explicit memory could help with object permanence |
| Long-horizon planning | JEPA models degrade over long rollouts | Our info-request loop provides feedback for multi-step prediction |
| Memory-dependent tasks | Tasks requiring recall of past observations | Direct advantage: our architecture has literal memory |

### Tier 3: Efficiency Benchmarks

| Metric | LeWM | Our Target |
|---|---|---|
| Planning time | <1 second | Should be comparable (similar param count) |
| Training time | Single GPU, a few hours | Same (13.5M params) |
| Training-data efficiency | Scales with dataset size | To be measured |

## 7. Honest Assessment: Strengths and Weaknesses

### Where Our Architecture Should Excel (Hypotheses)

  1. Memory-dependent tasks: Any task where the agent must remember and recall past observations to make current decisions. JEPA has no explicit memory; it's all in the latent state. Our 64K-word memory is persistent.

  2. Compositional state tracking: Tasks with multiple objects where different "aspects" of the state need different information sources. Our 3 SLMs can specialize (one tracks the agent, one tracks the object, one tracks the environment).

  3. Anomaly detection / physics violation: Our explicit memory + multi-step prediction error should catch "impossible" events better than implicit models. The info-request loop acts as an active hypothesis tester.

  4. Interpretability: You can literally inspect which memory addresses were read, which SLMs were selected, what the info-request query was. JEPA is a black box.
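
Hypothesis 4 is directly testable with lightweight instrumentation. A sketch of the per-step trace such an analysis would log (all names hypothetical):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    """One prediction step's interpretability record."""
    routing_mask: tuple                        # e.g. (1, 0, 1): SLMs the BLM trusted
    reads: list = field(default_factory=list)  # (slm_id, addr, length) per fetch
    info_request: object = None                # the BLM's "what do I need?" query

def slm_address_histogram(traces, slm_id, bucket=1024):
    """Which memory region does one SLM favor? Bucket its read addresses
    over an episode into 1K-word ranges and count."""
    return Counter((addr // bucket) * bucket
                   for t in traces
                   for sid, addr, _ in t.reads if sid == slm_id)
```

Nothing comparable exists for JEPA-style models, where the only inspectable object is a dense latent vector.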

### Where Our Architecture Will Likely Struggle

  1. Raw pixel processing: Our current architecture works on state vectors. Adding a visual encoder is engineering work, but JEPA models are built pixel-first.

  2. Large-scale visual representation: V-JEPA 2.1 at 1.9B params has seen millions of videos. Our 13.5M model can't compete on raw representation quality for visual tasks.

  3. Simple tasks: LeWM already struggles on trivial environments (TwoRoom). Our more complex architecture might face similar issues; the overhead of memory + routing may not help when the task is simple.

  4. Training stability: 3-phase training is more complex than LeWM's elegant 2-loss setup. More things can go wrong.


## 8. Comparison Summary Table

| | I-JEPA | V-JEPA | DINO-WM | LeWM | V-JEPA 2.1 | Ours |
|---|---|---|---|---|---|---|
| Params | 632M | 632M | ~300M | 15M | 1.9B | 13.5M |
| Input | Images | Video | Pixels+Act | Pixels+Act | Video | State vec |
| Memory | None | None | None | None | None | Explicit 256KB |
| Multi-model routing | No | No | No | No | No | Yes |
| Active info request | No | No | No | No | No | Yes |
| Push-T SR | – | – | 0.90 | 0.88 | – | ❌ TBD |
| Maze SR | – | – | 0.98 | – | – | ❌ TBD |
| Reach SR | – | – | 0.92 | – | – | ❌ TBD |
| IntPhys 2 | – | – | – | – | 57.5% | ❌ TBD |
| K400 Acc | – | 81.9% | – | – | ~88% | N/A |
| Planning Speed | – | – | ~48 s | <1 s | – | <1 s (est.) |
| Training | 16×A100 | Cluster | Offline | 1 GPU | Cluster | 1 GPU |
| Interpretable | Low | Low | Low | Low | Low | High |

## 9. Next Steps to Get Real Numbers

  1. Add a visual encoder to our architecture (small CNN or ViT-Tiny) for pixel observations; this enables the Push-T, Maze, and Reach benchmarks
  2. Integrate with the stable-worldmodel evaluation suite (arxiv:2602.08968) for standardized comparison
  3. Run Push-T first: it is the most-used benchmark, it has open code, and our architecture could show SLM specialization
  4. Design a memory-dependent benchmark: a custom task where agents MUST recall past observations to solve current goals. This is where we should clearly beat all JEPA models (a minimal sketch follows).
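
For step 4, a minimal version of such a task could look like the following (a sketch of the task design only; `DelayedRecallEnv` and all parameters are our own hypothetical choices). The goal cue appears once and then disappears, so any model without persistent memory is reduced to chance at decision time:

```python
import numpy as np

class DelayedRecallEnv:
    """Memory-dependent task sketch: a goal cue is observable only on the
    first step; the agent must recall it `delay` steps later to pick the
    correct target."""

    def __init__(self, num_targets=4, delay=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.num_targets, self.delay = num_targets, delay

    def reset(self):
        self.goal = int(self.rng.integers(self.num_targets))
        self.t = 0
        obs = np.zeros(self.num_targets + 1)
        obs[self.goal] = 1.0               # cue visible only at t = 0
        return obs

    def step(self, action):
        self.t += 1
        obs = np.zeros(self.num_targets + 1)
        obs[-1] = self.t / self.delay      # afterwards, only a clock signal
        done = self.t >= self.delay
        reward = float(done and action == self.goal)
        return obs, reward, done
```

With 4 targets and a 50-step delay, chance is 25%; an agent that writes the cue into explicit memory at t = 0 and reads it back at t = 50 should solve the task exactly.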

## 10. References

| Paper | ArXiv ID | Key Numbers |
|---|---|---|
| I-JEPA | 2301.08243 | ImageNet linear: 80%+, 632M params |
| V-JEPA | 2404.08471 | K400: 81.9%, SSv2: 72.2%, 632M params |
| DINO-WM | 2411.04983 | Push-T: 0.90 SR, Reach: 0.92 SR |
| LeWM | 2603.19312 | Push-T: 0.88 SR, 15M params, <1 s planning, 48× faster |
| V-JEPA 2.1 | 2603.14482 | Ego4D: 7.71 mAP, SSv2: 77.7%, 1.9B params |
| IntPhys 2 | 2506.09849 | V-JEPA 2: 57.5%, Human: 96.4% |
| JEPA-WMs study | 2512.24497 | CEM best planner, proprioception critical |
| DreamerV3 | 2301.04104 | Atari SOTA, Push-T: 0.30 SR |
| LeCun position paper | 2306.02572 | Theoretical H-JEPA architecture |