# V-JEPA 2 ViT-L/16 ONNX Export (Encoder Only, 2-Frame)
Encoder-only ONNX export of facebook/vjepa2-vitl-fpc64-256 for single-image latent representation analysis with latent-inspector and ONNX Runtime.
## Model

V-JEPA 2 is a self-supervised video encoder from Meta FAIR that learns spatiotemporal representations by predicting future frame representations from past frames. Trained on internet-scale video data, its encoder carries an implicit prior about how the visual world moves, even when processing a static image.
This ONNX artifact contains only the encoder (predictor head stripped). The input is fixed to 2 identical frames (the minimum for tubelet_size=2), collapsing the temporal dimension and producing pure spatial patch embeddings comparable to image-only models.
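The token-count arithmetic behind this temporal collapse can be checked directly; a trivial sketch using the patch and tubelet sizes listed in the table below:

```python
# Token count for a [1, 2, 3, 256, 256] input with 16 px patches and tubelet_size=2
frames, height, width = 2, 256, 256
patch_size, tubelet_size = 16, 2

spatial_patches = (height // patch_size) * (width // patch_size)  # 16 * 16 = 256
temporal_steps = frames // tubelet_size                           # 2 // 2 = 1
num_tokens = spatial_patches * temporal_steps
print(num_tokens)  # 256
```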
| Property | Value |
|---|---|
| Architecture | ViT-L/16 |
| Parameters | 304M |
| Embedding dimension | 1024 |
| Layers / Heads | 24 / 16 |
| Patch size | 16 px |
| Input size | 256 x 256 |
| Input format | Video: [1, 2, 3, 256, 256] (2 duplicated frames) |
| Output tokens | 256 spatial patches |
| CLS token | No |
| Training data | Internet-scale video |
| Paper | Bardes et al. 2024 |
| Original repo | facebookresearch/vjepa2 |
| License | CC-BY-NC-4.0 |
## Why encoder-only with 2 frames?
V-JEPA 2 is a video model. Its full architecture has an encoder (processes input frames) and a predictor (predicts masked future frames in latent space). For latent representation analysis on static images:
- We strip the predictor: it is only needed for the self-supervised training objective.
- We use 2 identical frames (the minimum for `tubelet_size=2`): the model's spatiotemporal patch embedding requires at least 2 frames to form one temporal tubelet.
- This collapses the temporal dimension: 256 spatial patches × 1 temporal step = 256 output tokens of dimension 1024.
The output shape [1, 256, 1024] is identical to DINOv2's, enabling direct cross-model comparison via CKA and k-NN overlap.
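As an illustration of how such a comparison works, here is a minimal linear-CKA sketch in NumPy. This is a standard formulation, not necessarily the exact implementation latent-inspector uses (which may differ in kernel choice or centering details):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two token matrices of shape [n_tokens, dim]."""
    x = x - x.mean(axis=0)  # center features
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return hsic / (norm_x * norm_y)

# Identical representations score 1.0; unrelated ones score near 0.
tokens = np.random.default_rng(0).normal(size=(256, 1024))
print(round(linear_cka(tokens, tokens), 3))  # 1.0
```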
## ONNX Export Process
Exported from facebook/vjepa2-vitl-fpc64-256:
- Extract encoder: `model.encoder.embeddings` + `model.encoder.layer` + `model.encoder.layernorm` (predictor stripped)
- Export with the PyTorch TorchScript ONNX exporter at opset 14
- Simplify with onnxsim: 10920 → 3133 nodes
- Save with external data: `model.onnx` (graph) + `model.onnx_data` (weights)
- Verify against the full PyTorch model output: max diff = 0.003 (encoder path only)
## Files
| File | Size | Description |
|---|---|---|
| `model.onnx` | 739 KB | ONNX graph (opset 14, 3133 nodes) |
| `model.onnx_data` | 1.16 GB | External weight data |
## ONNX I/O
| Direction | Name | Shape | Type |
|---|---|---|---|
| Input | `pixel_values_videos` | [1, 2, 3, 256, 256] | float32 |
| Output | `last_hidden_state` | [1, 256, 1024] | float32 |
Input: batch of 1 video with 2 frames, 3 channels, 256x256 pixels. For single images, duplicate the preprocessed frame along dimension 1.
Output: 256 spatial patch tokens of dimension 1024. No CLS token.
## Usage

### With latent-inspector (Rust)
```bash
# Auto-downloads on first use (~1.2 GB). Frame duplication is automatic.
latent-inspector inspect photo.jpg --model vjepa2-vitl-fpc2-256
latent-inspector compare photo.jpg --models dinov2-vit-l14,vjepa2-vitl-fpc2-256
```
### With ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.LANCZOS),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
frame = transform(image).numpy()  # [3, 256, 256]

# Duplicate frame to create 2-frame video input
video = np.stack([frame, frame])[np.newaxis]  # [1, 2, 3, 256, 256]

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"pixel_values_videos": video.astype(np.float32)})[0]
# output shape: [1, 256, 1024]

patch_tokens = output[0]                     # [256, 1024] spatial patch embeddings
image_embedding = patch_tokens.mean(axis=0)  # [1024] mean-pooled global embedding
```
### With ONNX Runtime (Rust)
```rust
let session = ort::session::Session::builder()?
    .with_intra_threads(4)?
    .commit_from_file("model.onnx")?;

// Build the [1, 2, 3, 256, 256] input from two identical frames.
// Note: the input is 5-dimensional, so this is an Array5, not Array4.
let frame = ndarray::Array5::<f32>::zeros((1, 1, 3, 256, 256));
// ... fill with preprocessed image ...
let video = ndarray::concatenate(ndarray::Axis(1), &[frame.view(), frame.view()])?;

let outputs = session.run(ort::inputs!["pixel_values_videos" => video])?;
let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
// shape: [1, 256, 1024]
```
## Representation Fingerprint
Compared against other SSL models on the same image (real ONNX inference via latent-inspector):
| Metric | V-JEPA 2 ViT-L/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | EUPE ViT-B/16 |
|---|---|---|---|---|
| Effective rank | 64/1024 | 60/1024 | 44/1280 | 17/768 |
| Top-10 variance | 58.1% | 66.8% | 72.7% | 88.8% |
| Patch isotropy | 0.678 | 0.796 | 0.788 | 0.026 |
| CKA vs DINOv2 | 0.358 | 1.000 | 0.329 | 0.044 |
| k-NN vs DINOv2 | 0.205 | 1.000 | 0.278 | 0.132 |
V-JEPA 2 has the highest effective rank (64) and the most spread-out variance (58.1% in the top 10 components) among the models compared: it uses more of its representational capacity. Its CKA with DINOv2 (0.358) is the highest of any non-DINOv2 model, likely because they share the same ViT-L architecture (24 layers, 16 heads, 1024-dim).
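As a sketch of how two of these metrics can be computed from the [256, 1024] patch-token matrix. Note that "effective rank" has several common definitions; this version counts components up to a cumulative-variance threshold (an assumption), so latent-inspector's exact numbers may follow a different convention:

```python
import numpy as np

def fingerprint(tokens, var_threshold=0.99):
    """Effective rank (cumulative-variance version) and top-10 variance share."""
    centered = tokens - tokens.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s**2 / np.sum(s**2)                  # per-component variance share
    eff_rank = int(np.searchsorted(np.cumsum(var), var_threshold) + 1)
    top10 = float(var[:10].sum())              # fraction of variance in top 10
    return eff_rank, top10

# Example on random tokens shaped like the model output [256, 1024]
tokens = np.random.default_rng(0).normal(size=(256, 1024))
eff_rank, top10 = fingerprint(tokens)
```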
## Citation

```bibtex
@article{bardes2024vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding of Complex Real-World Interactions},
  author={Bardes, Adrien and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2024}
}
```
## Acknowledgments
Original weights by Meta FAIR under CC-BY-NC-4.0. Encoder-only ONNX export and hosting by @AbdelStark for the latent-inspector project.