# V-JEPA 2 ViT-L/16 ONNX Export (Encoder Only, 2-Frame)
Encoder-only ONNX export of facebook/vjepa2-vitl-fpc64-256 for single-image latent representation analysis with latent-inspector and ONNX Runtime.
## Model

V-JEPA 2 is a self-supervised video encoder from Meta FAIR that learns spatiotemporal representations by predicting future frame representations from past frames. Trained on internet-scale video data, its encoder carries an implicit prior about how the visual world moves, even when processing a static image.
This ONNX artifact contains only the encoder (predictor head stripped). The input is fixed to 2 identical frames (the minimum for tubelet_size=2), collapsing the temporal dimension and producing pure spatial patch embeddings comparable to image-only models.
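The token-count arithmetic behind this temporal collapse can be checked directly; a trivial sketch using the patch and tubelet sizes listed in the table below:

```python
# Token count for a [1, 2, 3, 256, 256] input with 16 px patches and tubelet_size=2
frames, height, width = 2, 256, 256
patch_size, tubelet_size = 16, 2

spatial_patches = (height // patch_size) * (width // patch_size)  # 16 * 16 = 256
temporal_steps = frames // tubelet_size                           # 2 // 2 = 1
num_tokens = spatial_patches * temporal_steps
print(num_tokens)  # 256
```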
| Property | Value |
|---|---|
| Architecture | ViT-L/16 |
| Parameters | 304M |
| Embedding dimension | 1024 |
| Layers / Heads | 24 / 16 |
| Patch size | 16 px |
| Input size | 256 x 256 |
| Input format | Video: [1, 2, 3, 256, 256] (2 duplicated frames) |
| Output tokens | 256 spatial patches |
| CLS token | No |
| Training data | Internet-scale video |
| Paper | Bardes et al. 2024 |
| Original repo | facebookresearch/vjepa2 |
| License | CC-BY-NC-4.0 |
## Why encoder-only with 2 frames?
V-JEPA 2 is a video model. Its full architecture has an encoder (processes input frames) and a predictor (predicts masked future frames in latent space). For latent representation analysis on static images:
- We strip the predictor: it is only needed for the self-supervised training objective.
- We use 2 identical frames (the minimum for `tubelet_size=2`): the model's spatiotemporal patch embedding requires at least 2 frames to form one temporal tubelet.
- This collapses the temporal dimension: 256 spatial patches × 1 temporal step = 256 output tokens of dimension 1024.
The output shape [1, 256, 1024] is identical to DINOv2's, enabling direct cross-model comparison via CKA and k-NN overlap.
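As an illustration of how such a comparison works, here is a minimal linear-CKA sketch in NumPy. This is a standard formulation, not necessarily the exact implementation latent-inspector uses (which may differ in kernel choice or centering details):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two token matrices of shape [n_tokens, dim]."""
    x = x - x.mean(axis=0)  # center features
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return hsic / (norm_x * norm_y)

# Identical representations score 1.0; unrelated ones score near 0.
tokens = np.random.default_rng(0).normal(size=(256, 1024))
print(round(linear_cka(tokens, tokens), 3))  # 1.0
```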
## ONNX Export Process
Exported from facebook/vjepa2-vitl-fpc64-256:
- Extract encoder: `model.encoder.embeddings` + `model.encoder.layer` + `model.encoder.layernorm` (predictor stripped)
- Export with the PyTorch TorchScript ONNX exporter at opset 14
- Simplify with onnxsim: 10920 → 3133 nodes
- Save with external data: `model.onnx` (graph) + `model.onnx_data` (weights)
- Verify against the full PyTorch model output: max diff = 0.003 (encoder path only)
## Files
| File | Size | Description |
|---|---|---|
| `model.onnx` | 739 KB | ONNX graph (opset 14, 3133 nodes) |
| `model.onnx_data` | 1.16 GB | External weight data |
## ONNX I/O
| Direction | Name | Shape | Type |
|---|---|---|---|
| Input | `pixel_values_videos` | [1, 2, 3, 256, 256] | float32 |
| Output | `last_hidden_state` | [1, 256, 1024] | float32 |
Input: batch of 1 video with 2 frames, 3 channels, 256x256 pixels. For single images, duplicate the preprocessed frame along dimension 1.
Output: 256 spatial patch tokens of dimension 1024. No CLS token.
## Usage

### With latent-inspector (Rust)
```bash
# Auto-downloads on first use (~1.2 GB). Frame duplication is automatic.
latent-inspector inspect photo.jpg --model vjepa2-vitl-fpc2-256
latent-inspector compare photo.jpg --models dinov2-vit-l14,vjepa2-vitl-fpc2-256
```
### With ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.LANCZOS),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
frame = transform(image).numpy()  # [3, 256, 256]

# Duplicate frame to create 2-frame video input
video = np.stack([frame, frame])[np.newaxis]  # [1, 2, 3, 256, 256]

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"pixel_values_videos": video.astype(np.float32)})[0]
# output shape: [1, 256, 1024]

patch_tokens = output[0]                     # [256, 1024] spatial patch embeddings
image_embedding = patch_tokens.mean(axis=0)  # [1024] mean-pooled global embedding
```
### With ONNX Runtime (Rust)
```rust
let session = ort::session::Session::builder()?
    .with_intra_threads(4)?
    .commit_from_file("model.onnx")?;

// Build the [1, 2, 3, 256, 256] input from two identical frames.
// Note: the input is 5-dimensional, so this is an Array5, not Array4.
let frame = ndarray::Array5::<f32>::zeros((1, 1, 3, 256, 256));
// ... fill with preprocessed image ...
let video = ndarray::concatenate(ndarray::Axis(1), &[frame.view(), frame.view()])?;

let outputs = session.run(ort::inputs!["pixel_values_videos" => video])?;
let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
// shape: [1, 256, 1024]
```
## Representation Fingerprint
Compared against other SSL models on the same image (real ONNX inference via latent-inspector):
| Metric | V-JEPA 2 ViT-L/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | EUPE ViT-B/16 |
|---|---|---|---|---|
| Effective rank | 64/1024 | 60/1024 | 44/1280 | 17/768 |
| Top-10 variance | 58.1% | 66.8% | 72.7% | 88.8% |
| Patch isotropy | 0.678 | 0.796 | 0.788 | 0.026 |
| CKA vs DINOv2 | 0.358 | 1.000 | 0.329 | 0.044 |
| k-NN vs DINOv2 | 0.205 | 1.000 | 0.278 | 0.132 |
V-JEPA 2 has the highest effective rank (64) and the most spread-out variance (58.1% in the top 10 components) among the models compared: it uses more of its representational capacity. Its CKA with DINOv2 (0.358) is the highest of any non-DINOv2 model, likely because they share the same ViT-L architecture (24 layers, 16 heads, 1024-dim).
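As a sketch of how two of these metrics can be computed from the [256, 1024] patch-token matrix. Note that "effective rank" has several common definitions; this version counts components up to a cumulative-variance threshold (an assumption), so latent-inspector's exact numbers may follow a different convention:

```python
import numpy as np

def fingerprint(tokens, var_threshold=0.99):
    """Effective rank (cumulative-variance version) and top-10 variance share."""
    centered = tokens - tokens.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s**2 / np.sum(s**2)                  # per-component variance share
    eff_rank = int(np.searchsorted(np.cumsum(var), var_threshold) + 1)
    top10 = float(var[:10].sum())              # fraction of variance in top 10
    return eff_rank, top10

# Example on random tokens shaped like the model output [256, 1024]
tokens = np.random.default_rng(0).normal(size=(256, 1024))
eff_rank, top10 = fingerprint(tokens)
```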
## Citation

```bibtex
@article{bardes2024vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding of Complex Real-World Interactions},
  author={Bardes, Adrien and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2024}
}
```
## Acknowledgments
Original weights by Meta FAIR under CC-BY-NC-4.0. Encoder-only ONNX export and hosting by @AbdelStark for the latent-inspector project.