V-JEPA 2 ViT-L/16 β€” ONNX Export (Encoder Only, 2-Frame)

Encoder-only ONNX export of facebook/vjepa2-vitl-fpc64-256 for single-image latent representation analysis with latent-inspector and ONNX Runtime.

Model

V-JEPA 2 is a self-supervised video encoder from Meta FAIR that learns spatiotemporal representations by predicting future frame representations from past frames. Trained on internet-scale video data, its encoder carries an implicit prior about how the visual world moves, even when processing a static image.

This ONNX artifact contains only the encoder (predictor head stripped). The input is fixed to 2 identical frames (the minimum for tubelet_size=2), collapsing the temporal dimension and producing pure spatial patch embeddings comparable to image-only models.

| Property | Value |
|---|---|
| Architecture | ViT-L/16 |
| Parameters | 304M |
| Embedding dimension | 1024 |
| Layers / Heads | 24 / 16 |
| Patch size | 16 px |
| Input size | 256 x 256 |
| Input format | Video: [1, 2, 3, 256, 256] (2 duplicated frames) |
| Output tokens | 256 spatial patches |
| CLS token | No |
| Training data | Internet-scale video |
| Paper | Bardes et al. 2025 |
| Original repo | facebookresearch/vjepa2 |
| License | CC-BY-NC-4.0 |

Why encoder-only with 2 frames?

V-JEPA 2 is a video model. Its full architecture has an encoder (processes input frames) and a predictor (predicts masked future frames in latent space). For latent representation analysis on static images:

  • We strip the predictor; it's only needed for the self-supervised training objective
  • We use 2 identical frames (the minimum for tubelet_size=2); the model's spatiotemporal patch embedding requires at least 2 frames to form one temporal tubelet
  • This collapses the temporal dimension: 256 spatial patches x 1 temporal step = 256 output tokens of dimension 1024

The output shape [1, 256, 1024] is identical to DINOv2's, enabling direct cross-model comparison via CKA and k-NN overlap.
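The token count follows directly from the patch and tubelet geometry; a quick arithmetic check:

```python
# 256 px / 16 px patch size -> 16 patches per side
patches_per_side = 256 // 16
spatial_patches = patches_per_side ** 2   # 16 * 16 = 256

# 2 frames / tubelet_size 2 -> a single temporal step
temporal_steps = 2 // 2

total_tokens = spatial_patches * temporal_steps
print(total_tokens)  # 256, matching the [1, 256, 1024] output shape
```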

ONNX Export Process

Exported from facebook/vjepa2-vitl-fpc64-256:

  1. Extract encoder: model.encoder.embeddings + model.encoder.layer + model.encoder.layernorm (predictor stripped)
  2. Export with PyTorch TorchScript ONNX exporter at opset 14
  3. Simplify with onnxsim: 10920 → 3133 nodes
  4. Save with external data: model.onnx (graph) + model.onnx_data (weights)
  5. Verify against full model PyTorch output: max diff = 0.003 (encoder path only)

Files

| File | Size | Description |
|---|---|---|
| model.onnx | 739 KB | ONNX graph (opset 14, 3133 nodes) |
| model.onnx_data | 1.16 GB | External weight data |

ONNX I/O

| Direction | Name | Shape | Type |
|---|---|---|---|
| Input | pixel_values_videos | [1, 2, 3, 256, 256] | float32 |
| Output | last_hidden_state | [1, 256, 1024] | float32 |

Input: batch of 1 video with 2 frames, 3 channels, 256x256 pixels. For single images, duplicate the preprocessed frame along dimension 1.

Output: 256 spatial patch tokens of dimension 1024. No CLS token.
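The frame duplication for single images can be checked shape-wise with NumPy alone, no model download needed:

```python
import numpy as np

frame = np.zeros((3, 256, 256), dtype=np.float32)  # a preprocessed image

# Add batch and frame axes, then duplicate along dimension 1 (the frame axis)
video = np.repeat(frame[np.newaxis, np.newaxis], 2, axis=1)

print(video.shape)  # (1, 2, 3, 256, 256)
```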

Usage

With latent-inspector (Rust)

```shell
# Auto-downloads on first use (~1.2 GB). Frame duplication is automatic.
latent-inspector inspect photo.jpg --model vjepa2-vitl-fpc2-256
latent-inspector compare photo.jpg --models dinov2-vit-l14,vjepa2-vitl-fpc2-256
```

With ONNX Runtime (Python)

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.LANCZOS),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
frame = transform(image).numpy()  # [3, 256, 256]

# Duplicate frame to create 2-frame video input
video = np.stack([frame, frame])[np.newaxis]  # [1, 2, 3, 256, 256]

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"pixel_values_videos": video.astype(np.float32)})[0]
# output shape: [1, 256, 1024]

patch_tokens = output[0]  # [256, 1024] spatial patch embeddings
image_embedding = patch_tokens.mean(axis=0)  # [1024] mean-pooled global embedding
```
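Because the 256 tokens correspond to a 16 x 16 grid of image patches, they can also be reshaped for per-patch analysis. A sketch, with random data standing in for the real `patch_tokens` above, assuming the usual row-major ViT patch ordering:

```python
import numpy as np

# Stand-in for the encoder output; swap in the real patch_tokens from the run above.
patch_tokens = np.random.randn(256, 1024).astype(np.float32)

grid = patch_tokens.reshape(16, 16, 1024)    # [rows, cols, dim] spatial layout
patch_norms = np.linalg.norm(grid, axis=-1)  # [16, 16] token-magnitude map
```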

With ONNX Runtime (Rust)

```rust
let session = ort::session::Session::builder()?
    .with_intra_threads(4)?
    .commit_from_file("model.onnx")?;

// Build the [1, 2, 3, 256, 256] input: two identical frames
let frame = ndarray::Array5::<f32>::zeros((1, 1, 3, 256, 256));
// ... fill with preprocessed image ...
let video = ndarray::concatenate(ndarray::Axis(1), &[frame.view(), frame.view()])?;

let outputs = session.run(ort::inputs!["pixel_values_videos" => video])?;
let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
// shape: [1, 256, 1024]
```

Representation Fingerprint

Compared against other SSL models on the same image (real ONNX inference via latent-inspector):

| Metric | V-JEPA 2 ViT-L/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | EUPE ViT-B/16 |
|---|---|---|---|---|
| Effective rank | 64/1024 | 60/1024 | 44/1280 | 17/768 |
| Top-10 variance | 58.1% | 66.8% | 72.7% | 88.8% |
| Patch isotropy | 0.678 | 0.796 | 0.788 | 0.026 |
| CKA vs DINOv2 | 0.358 | 1.000 | 0.329 | 0.044 |
| k-NN vs DINOv2 | 0.205 | 1.000 | 0.278 | 0.132 |

V-JEPA 2 has the highest effective rank (64) and the most spread-out variance (only 58.1% in the top 10 components) among the models compared, meaning it uses more of its representational capacity. Its CKA with DINOv2 (0.358) is the highest of any non-DINOv2 model, likely because the two share the same ViT-L architecture (24 layers, 16 heads, 1024-dim).
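For reference, the CKA values above compare patch-token matrices across models. A standard linear CKA (not necessarily latent-inspector's exact implementation) can be written as:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between token matrices X [n, d1] and Y [n, d2]."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Sanity check: a representation compared with itself gives CKA = 1.0
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 1024))
print(round(linear_cka(X, X), 3))  # 1.0
```

Because CKA is invariant to orthogonal transforms and isotropic scaling, it can compare token matrices with different embedding dimensions, which is what makes the cross-model column above meaningful.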

Citation

```bibtex
@article{bardes2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding
         of Complex Real-World Interactions},
  author={Bardes, Adrien and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}
```

Acknowledgments

Original weights by Meta FAIR under CC-BY-NC-4.0. Encoder-only ONNX export and hosting by @AbdelStark for the latent-inspector project.
