
# LeWorld Memory Architecture vs. LeCun's World Models: Benchmark Comparison

## ⚠️ Honest Status: This Is a Theoretical Comparison

Our architecture has NOT been benchmarked on the standard world-model evaluation suites yet. What follows is:

  1. A factual catalog of every LeCun-family world model, their exact published numbers, and benchmarks
  2. An architectural comparison showing where our design is fundamentally different
  3. A concrete plan for which benchmarks to run and what we'd need to demonstrate

We are not claiming results we don't have. Here's what exists and what's needed.


## 1. The LeCun World Model Family: Published Numbers

### Model Lineup (by publication date)

| Model | Paper | Params | Training | Key Innovation |
|---|---|---|---|---|
| I-JEPA | arxiv:2301.08243 | 632M (ViT-H/14) | ImageNet, 16×A100 | Predicts masked image patch representations |
| V-JEPA | arxiv:2404.08471 | 632M (ViT-H/16) | VideoMix2M (2M videos) | Predicts masked video region features |
| DINO-WM | arxiv:2411.04983 | ~300M (frozen DINOv2 + predictor) | Offline trajectories | Plans in DINOv2 patch-feature space |
| LeWM | arxiv:2603.19312 | ~15M | Single GPU, few hours | End-to-end from pixels, 2 loss terms only |
| V-JEPA 2.1 | arxiv:2603.14482 | 1.1–1.9B (ViT-g/G) | Massive video corpus | Dense features, multi-layer loss |

LeWM is our primary comparison target (both ~15M params).


## 2. Published Benchmark Results: Robotic Planning

### Push-T (Tabletop Block Pushing) → Success Rate ↑

| Model | Params | Success Rate | Planning Time | Notes |
|---|---|---|---|---|
| DINO-WM | ~300M+ (frozen DINOv2) | 0.90 | ~48 s | Uses pretrained DINOv2 encoder |
| LeWM | ~15M | 0.88 | <1 s | End-to-end from pixels; beats the pixel-only DINO-WM variant; 48× faster planning |
| PLDM | ~15M | 0.70 | <1 s | End-to-end, 7 loss terms (VICReg) |
| DreamerV3 | 12–400M | 0.30 | – | Model-based RL, needs rewards |
| IRIS | – | 0.32 | – | – |
| TD-MPC2 | – | 0.00 | – | Fails without reward signal |
| Ours (LeWorld Memory) | 13.5M | ❌ Not yet tested | – | – |
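
The planning-time gap above is largely a property of the planner rather than model size: DINO-WM plans with the cross-entropy method (CEM), which re-rolls the world model many times per decision. A generic CEM sketch for intuition (the `rollout_cost` callable and every hyperparameter below are illustrative assumptions, not DINO-WM's published settings):

```python
import numpy as np

def cem_plan(rollout_cost, horizon=10, act_dim=2, iters=6, pop=64, elite=8):
    """Cross-entropy method planning: sample action sequences, keep the
    cheapest, refit the sampling distribution, repeat. `rollout_cost`
    maps a (horizon, act_dim) action sequence to a scalar cost by
    rolling the world model forward in latent space."""
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        cand = mu + sigma * np.random.randn(pop, horizon, act_dim)
        costs = np.array([rollout_cost(a) for a in cand])
        elites = cand[np.argsort(costs)[:elite]]        # keep best sequences
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # planned action sequence (typically only its first step is executed)

# Total cost: iters * pop world-model rollouts per planned action, which is
# why planning time scales with the model's forward-pass cost.
```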

### PointMaze / Wall Navigation → Success Rate ↑

| Model | Maze SR | Wall SR | Notes |
|---|---|---|---|
| DINO-WM | 0.98 | 0.96 | Near-perfect |
| DreamerV3 | 1.00 | 1.00 | Perfect on simple navigation |
| LeWM | Lower than DINO-WM | Lower | Struggles on very simple environments (SIGReg limitation) |
| Ours | ❌ Not yet tested | ❌ Not yet tested | – |

### Reach (Robotic Arm) → Success Rate ↑

| Model | Success Rate |
|---|---|
| DINO-WM | 0.92 |
| DreamerV3 | 0.64 |
| IRIS | 0.18 |
| Ours | ❌ Not yet tested |

### Rope & Granular Manipulation → Chamfer Distance ↓

| Model | Rope CD ↓ | Granular CD ↓ |
|---|---|---|
| DINO-WM | 0.41 | 0.26 |
| DreamerV3 | 2.49 | 1.05 |
| IRIS | 1.11 | 0.37 |
| Ours | ❌ Not yet tested | ❌ Not yet tested |

## 3. Published Benchmark Results: Physical Understanding

### Physical Latent Probing on Push-T (Pearson r ↑, higher = better)

| Property | DINO-WM (MLP) | LeWM (MLP) | PLDM (MLP) |
|---|---|---|---|
| Agent Location | r = 0.999 | r = 0.998 | r = 0.993 |
| Block Location | r = 0.999 | r = 0.999 | r = 0.994 |
| Block Angle | r = 0.995 | r = 0.990 | r = 0.972 |

LeWM at 15M achieves near-parity with DINO-WM (300M+ pretrained) on physical probing.
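
The probing protocol is cheap to replicate on any latent space, so this is likely the first number we can produce for our own architecture. A minimal sketch, assuming frozen latents and ground-truth physical properties have already been extracted as arrays (the published rows above use MLP probes; the linear ridge probe shown here is the simplest variant, and the regularizer strength is an assumption):

```python
import numpy as np

def linear_probe_pearson(z_train, y_train, z_test, y_test, lam=1e-3):
    """Fit a ridge-regularized linear probe on frozen latents and report
    Pearson r between probe predictions and ground truth on held-out data.
    z_*: (N, d) latent vectors; y_*: (N,) scalar property (e.g., block angle)."""
    Z = np.hstack([z_train, np.ones((len(z_train), 1))])  # append bias column
    # Closed-form ridge regression: w = (Z^T Z + lam I)^-1 Z^T y
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y_train)
    Zt = np.hstack([z_test, np.ones((len(z_test), 1))])
    pred = Zt @ w
    return np.corrcoef(pred, y_test)[0, 1]
```

For vector-valued properties (e.g., 2D block location), run the probe per dimension and average the resulting r values.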

### Violation-of-Expectation (VoE): Physics Anomaly Detection

| Perturbation | LeWM | PLDM |
|---|---|---|
| Teleportation (physically impossible) | Detects (p < 0.01) | Detects |
| Color change (visual only) | Does NOT flag | Does NOT flag |
| Correct distinction? | ✅ Yes | ✅ Yes |

### IntPhys 2 (Intuitive Physics) → Accuracy ↑

| Model | Accuracy | Notes |
|---|---|---|
| Human | 96.4% | Ceiling |
| V-JEPA 2 (1B+ params) | 57.5% | Best JEPA model |
| Gemini 2.5 Flash | 55.6% | Best commercial LLM |
| GPT-4o | ~50% | Near random |
| Ours | ❌ Not yet tested | Potential differentiator (see below) |

The IntPhys gap (57.5% vs 96.4% human) is one of the biggest open problems in world models.


## 4. Published Benchmark Results: Video Understanding

### Kinetics-400 (Video Action Recognition) → Top-1 Accuracy ↑

| Model | Params | Accuracy | Probe Type |
|---|---|---|---|
| V-JEPA 2.1 ViT-G | 1.9B | ~88% | Attentive |
| V-JEPA ViT-H/16 | 632M | 81.9% | Frozen, attentive |
| VideoMAEv2 ViT-H | ~632M | 87% | Fine-tuned |
| Ours | 13.5M | N/A | Different modality (state vectors, not video) |

### Something-Something-v2 (Temporal Reasoning) → Top-1 Accuracy ↑

| Model | Accuracy |
|---|---|
| V-JEPA 2.1 ViT-G | 77.7% |
| V-JEPA ViT-H/16 | 72.2% |
| Ours | N/A |

Note: Our architecture operates on state vectors + bit-level memory, not pixels/video. The video benchmarks above are not directly applicable without adding a visual encoder front-end.


## 5. Architectural Comparison: Where We're Different

| Dimension | LeWM / JEPA Family | Our LeWorld Memory Architecture |
|---|---|---|
| Memory | Implicit (in network weights + latent state) | Explicit bit-level RAM (64K × 32-bit words, address-range access) |
| State Prediction | Single model predicts next embedding | Hierarchical: 3 SLMs find memory, 1 BLM aggregates + predicts |
| Information Retrieval | All in one forward pass | Active retrieval: BLM asks "what do I need?", SLMs fetch from memory |
| Model Selection | N/A (single model) | Binary routing [1,0,1]: BLM selects which SLMs to trust |
| Collapse Prevention | SIGReg (LeWM), EMA (V-JEPA), frozen encoder (DINO-WM) | Diversity loss + load-balance loss + temperature annealing |
| Training | Single loss (LeWM: 2 terms) | 3-phase: pre-train → joint → info-request refinement |
| Params | 15M (LeWM), 632M (V-JEPA), 1.9B (V-JEPA 2.1) | 13.5M (3×745K SLMs + 11.2M BLM) |
| Input Modality | Pixels / video frames | State vectors + characteristics (extensible to pixels with encoder) |
| Planning | CEM in latent space | BLM next-state prediction + info-request loop |
| Gradients through discrete choices | N/A (continuous latent space) | ST-Sigmoid for routing, product-key CE for addressing |
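
To make the last two rows concrete, here is a minimal sketch of straight-through sigmoid routing plus plausible forms of the two anti-collapse terms. Only the ST estimator itself is standard; the diversity and load-balance losses are simplified here to gate-entropy and usage-variance terms (one plausible instantiation, not necessarily our exact training losses), and the product-key addressing path is omitted:

```python
import torch

def st_sigmoid_route(logits, temperature=1.0):
    """Straight-through sigmoid: emits a hard 0/1 routing mask in the
    forward pass while gradients flow through the soft sigmoid.
    Temperature is annealed downward during training so routing hardens."""
    probs = torch.sigmoid(logits / temperature)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach(), probs  # forward: hard; backward: probs

def anti_collapse_terms(probs):
    """probs: (batch, num_slms) soft routing probabilities.
    Returns (load_balance, gate_entropy). The training loss adds
    load_balance (keeps all SLMs in use) and subtracts gate_entropy
    (keeps gates from saturating identically early in training)."""
    usage = probs.mean(dim=0)                           # per-SLM selection rate
    load_balance = ((usage - usage.mean()) ** 2).sum()  # penalize uneven usage
    p = probs.clamp(1e-6, 1 - 1e-6)
    gate_entropy = -(p * p.log() + (1 - p) * (1 - p).log()).mean()
    return load_balance, gate_entropy
```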

### What Our Architecture Adds That JEPA Doesn't Have

  1. Explicit addressable memory: JEPA has no equivalent. All "memory" in JEPA is implicit in weights. Our architecture has a literal 256KB RAM that models can read and write by address (a minimal sketch follows this list).

  2. Multi-agent retrieval: 3 independent SLMs each search different memory regions. This is like having 3 specialized "attention heads" that look at different parts of a knowledge base, with a gating mechanism that selects the most useful ones.

  3. Active information request: The BLM generates "what do I need next?" queries that influence what the SLMs look for. JEPA models have no equivalent; they receive all information passively.

  4. CPU-inspired structure: The address bus → RAM → data bus → processor pipeline mirrors actual computer architecture. This structural prior could help with the systematic, compositional reasoning that neural networks typically struggle with.
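
A minimal sketch of what point 1 means in practice (class and method names are hypothetical, and the differentiable addressing path from the comparison table is omitted):

```python
import numpy as np

class WordMemory:
    """64K addressable 32-bit words = 256 KB of explicit state that
    persists across prediction steps. Sketch only; names hypothetical."""

    def __init__(self, num_words=65_536):
        self.words = np.zeros(num_words, dtype=np.uint32)

    def write(self, addr, values):
        values = np.asarray(values, dtype=np.uint32)
        self.words[addr:addr + len(values)] = values

    def read(self, addr, length):
        # Address-range access: an SLM fetches a contiguous block of words.
        return self.words[addr:addr + length].copy()

    def read_bits(self, addr, length):
        # Bit-level view (length x 32) for models that consume raw bits.
        # Byte order within each word follows the platform's endianness.
        block = self.read(addr, length)
        return np.unpackbits(block.view(np.uint8)).reshape(length, 32)

# Example: mem = WordMemory(); mem.write(0x1000, [42, 7]); mem.read(0x1000, 2)
```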


## 6. What Benchmarks We Need to Run (Roadmap)

### Tier 1: Must Run (direct comparison with LeWM/DINO-WM)

| Benchmark | What's Needed | Expected Difficulty |
|---|---|---|
| Push-T | Add pixel encoder to our architecture, train on Push-T trajectories | Medium: need ~18K trajectories, visual encoder front-end |
| PointMaze/Wall | Same as above | Easy: simple navigation |
| OGBench-Cube | Same + 3D rendering | Medium-Hard |
| Physical Probing | Train linear/MLP probes on our latent space | Easy: we already have latent representations |
| VoE (Violation of Expectation) | Inject anomalies, measure surprise | Easy: our architecture naturally computes prediction error |
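
The VoE row is cheap because the architecture already computes a next-state prediction error at every step. A minimal sketch of the detection test (the one-sided Welch t-test here is our assumption; the published LeWM protocol may differ in detail):

```python
from scipy import stats

def voe_detects(err_normal, err_perturbed, alpha=0.01):
    """Flag a perturbation as 'detected' if prediction error on perturbed
    rollouts is significantly higher than on matched normal rollouts.
    err_*: 1-D arrays of per-rollout prediction errors."""
    t, p_two = stats.ttest_ind(err_perturbed, err_normal, equal_var=False)
    p = p_two / 2 if t > 0 else 1 - p_two / 2  # one-sided: perturbed > normal
    return p < alpha, p
```

A teleportation event should trip this test while a pure color change should not, reproducing the distinction reported in Section 3.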

### Tier 2: High-Impact Differentiators

| Benchmark | Why It Matters | Our Advantage |
|---|---|---|
| IntPhys 2 | All JEPA models fail (≤57.5% vs 96.4% human) | Our explicit memory could help with object permanence |
| Long-horizon planning | JEPA models degrade over long rollouts | Our info-request loop provides feedback for multi-step prediction |
| Memory-dependent tasks | Tasks requiring recall of past observations | Direct advantage: our architecture has literal memory |

### Tier 3: Efficiency Benchmarks

| Metric | LeWM | Our Target |
|---|---|---|
| Planning time | <1 second | Should be comparable (similar param count) |
| Training time | Single GPU, a few hours | Same (13.5M params) |
| Training-data efficiency | Scales with dataset size | To be measured |

## 7. Honest Assessment: Strengths and Weaknesses

### Where Our Architecture Should Excel (Hypotheses)

  1. Memory-dependent tasks: Any task where the agent must remember and recall past observations to make current decisions. JEPA has no explicit memory; it's all in the latent state. Our 64K-word memory is persistent.

  2. Compositional state tracking: Tasks with multiple objects where different "aspects" of the state need different information sources. Our 3 SLMs can specialize (one tracks the agent, one tracks the object, one tracks the environment).

  3. Anomaly detection / physics violation: Our explicit memory + multi-step prediction error should catch "impossible" events better than implicit models. The info-request loop acts as an active hypothesis tester.

  4. Interpretability: You can literally inspect which memory addresses were read, which SLMs were selected, what the info-request query was. JEPA is a black box.
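
Hypothesis 4 is directly testable with lightweight instrumentation. A sketch of the per-step trace such an analysis would log (all names hypothetical):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    """One prediction step's interpretability record."""
    routing_mask: tuple                        # e.g. (1, 0, 1): SLMs the BLM trusted
    reads: list = field(default_factory=list)  # (slm_id, addr, length) per fetch
    info_request: object = None                # the BLM's "what do I need?" query

def slm_address_histogram(traces, slm_id, bucket=1024):
    """Which memory region does one SLM favor? Bucket its read addresses
    over an episode into 1K-word ranges and count."""
    return Counter((addr // bucket) * bucket
                   for t in traces
                   for sid, addr, _ in t.reads if sid == slm_id)
```

Nothing comparable exists for JEPA-style models, where the only inspectable object is a dense latent vector.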

### Where Our Architecture Will Likely Struggle

  1. Raw pixel processing: Our current architecture works on state vectors. Adding a visual encoder is engineering work, but JEPA models are built pixel-first.

  2. Large-scale visual representation: V-JEPA 2.1 at 1.9B params has seen millions of videos. Our 13.5M model can't compete on raw representation quality for visual tasks.

  3. Simple tasks: LeWM already struggles on trivial environments (TwoRoom). Our more complex architecture might face similar issues; the overhead of memory + routing may not help when the task is simple.

  4. Training stability: 3-phase training is more complex than LeWM's elegant 2-loss setup. More things can go wrong.


## 8. Comparison Summary Table

| | I-JEPA | V-JEPA | DINO-WM | LeWM | V-JEPA 2.1 | Ours |
|---|---|---|---|---|---|---|
| Params | 632M | 632M | ~300M | 15M | 1.9B | 13.5M |
| Input | Images | Video | Pixels+Act | Pixels+Act | Video | State vec |
| Memory | None | None | None | None | None | Explicit 256KB |
| Multi-model routing | No | No | No | No | No | Yes |
| Active info request | No | No | No | No | No | Yes |
| Push-T SR | – | – | 0.90 | 0.88 | – | ❌ TBD |
| Maze SR | – | – | 0.98 | – | – | ❌ TBD |
| Reach SR | – | – | 0.92 | – | – | ❌ TBD |
| IntPhys 2 | – | – | – | – | 57.5% | ❌ TBD |
| K400 Acc | – | 81.9% | – | – | ~88% | N/A |
| Planning Speed | – | – | ~48 s | <1 s | – | <1 s (est.) |
| Training | 16×A100 | Cluster | Offline | 1 GPU | Cluster | 1 GPU |
| Interpretable | Low | Low | Low | Low | Low | High |

## 9. Next Steps to Get Real Numbers

  1. Add a visual encoder to our architecture (small CNN or ViT-Tiny) for pixel observations; this enables the Push-T, Maze, and Reach benchmarks
  2. Integrate with the stable-worldmodel evaluation suite (arxiv:2602.08968) for standardized comparison
  3. Run Push-T first: it is the most-used benchmark, it has open code, and our architecture could show SLM specialization
  4. Design a memory-dependent benchmark: a custom task where agents MUST recall past observations to solve current goals. This is where we should clearly beat all JEPA models (a minimal sketch follows).
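
For step 4, a minimal version of such a task could look like the following (a sketch of the task design only; `DelayedRecallEnv` and all parameters are our own hypothetical choices). The goal cue appears once and then disappears, so any model without persistent memory is reduced to chance at decision time:

```python
import numpy as np

class DelayedRecallEnv:
    """Memory-dependent task sketch: a goal cue is observable only on the
    first step; the agent must recall it `delay` steps later to pick the
    correct target."""

    def __init__(self, num_targets=4, delay=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.num_targets, self.delay = num_targets, delay

    def reset(self):
        self.goal = int(self.rng.integers(self.num_targets))
        self.t = 0
        obs = np.zeros(self.num_targets + 1)
        obs[self.goal] = 1.0               # cue visible only at t = 0
        return obs

    def step(self, action):
        self.t += 1
        obs = np.zeros(self.num_targets + 1)
        obs[-1] = self.t / self.delay      # afterwards, only a clock signal
        done = self.t >= self.delay
        reward = float(done and action == self.goal)
        return obs, reward, done
```

With 4 targets and a 50-step delay, chance is 25%; an agent that writes the cue into explicit memory at t = 0 and reads it back at t = 50 should solve the task exactly.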

## 10. References

| Paper | ArXiv ID | Key Numbers |
|---|---|---|
| I-JEPA | 2301.08243 | ImageNet linear: 80%+, 632M params |
| V-JEPA | 2404.08471 | K400: 81.9%, SSv2: 72.2%, 632M params |
| DINO-WM | 2411.04983 | Push-T: 0.90 SR, Reach: 0.92 SR |
| LeWM | 2603.19312 | Push-T: 0.88 SR, 15M params, <1 s planning, 48× faster |
| V-JEPA 2.1 | 2603.14482 | Ego4D: 7.71 mAP, SSv2: 77.7%, 1.9B params |
| IntPhys 2 | 2506.09849 | V-JEPA 2: 57.5%, Human: 96.4% |
| JEPA-WMs study | 2512.24497 | CEM best planner, proprioception critical |
| DreamerV3 | 2301.04104 | Atari SOTA, Push-T: 0.30 SR |
| LeCun position paper | 2306.02572 | Theoretical H-JEPA architecture |