SO-100 JEPA-2 AC (Action-Conditioned) Model

This repository contains a V-JEPA 2-AC model trained on the SO-100 robotics dataset for the Ball-Cup task.

Model Overview

  • Architecture: V-JEPA 2 with Action-Conditioned Predictor
  • Vision Foundation: ViT-Large (1024 embed dim)
  • Task: Robotics control / world modeling (predicting future latents based on current context and actions)
  • Dataset: SO-100 Ball-Cup (Robotics interaction)

Directory Structure

jepa-model/
├── config.json             # Model architecture and data configuration
├── pytorch_model.bin       # Best predictor weights (state dict)
├── vision_encoder.pt       # ViT-Large vision encoder weights (~5 GB)
├── README.md               # Model documentation
├── train.py                # Source code for model and training
├── training_log.csv        # Training history
├── ckpt_ep*.pt             # Training checkpoints with optimizer state
└── latest.pt               # Latest training checkpoint

How to Use

Prerequisites

pip install torch numpy torchvision

Loading the Model

This model requires both the Vision Encoder and the Action-Conditioned Predictor.

import torch
import json

from train import ActionConditionedVJepa, VisionEncoder  # classes defined in train.py

# 1. Load the vision encoder (ViT-L); map_location avoids requiring a GPU at load time
encoder = VisionEncoder(model_name="vit_large")
encoder.load_state_dict(torch.load("vision_encoder.pt", map_location="cpu"))
encoder.eval()

# 2. Load the action-conditioned predictor from the saved state dict
with open("config.json", "r") as f:
    config = json.load(f)
predictor = ActionConditionedVJepa(config["architecture"]["predictor"])
predictor.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
predictor.eval()
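Once both modules are loaded, inference consumes a window of context frames plus the matching actions and returns predicted future latents. The real call signatures live in train.py; the toy stand-ins below are assumptions that only illustrate the tensor shapes described on this card (6 context steps, 2 predicted steps, 1024-dim ViT-Large latents, 256x256 frames).

```python
import numpy as np

# Toy stand-ins for the real encoder/predictor, used only to show shapes.
# The actual modules come from train.py; their interfaces are assumptions here.
EMBED_DIM = 1024          # ViT-Large embedding dimension
T_CTX, T_PRED = 6, 2      # 6 context steps -> 2 predicted future latents
ACTION_DIM = 6            # 6-DoF action vector

def toy_encoder(frames):
    """Map (B, T, 3, 256, 256) frames to (B, T, EMBED_DIM) latents."""
    b, t = frames.shape[:2]
    return frames.reshape(b, t, -1)[:, :, :EMBED_DIM]

def toy_predictor(latents, actions):
    """Map context latents + actions to (B, T_PRED, EMBED_DIM) future latents."""
    b = latents.shape[0]
    return np.zeros((b, T_PRED, EMBED_DIM))

frames = np.zeros((1, T_CTX, 3, 256, 256), dtype=np.float32)   # context frames
actions = np.zeros((1, T_CTX, ACTION_DIM), dtype=np.float32)   # context actions

z_ctx = toy_encoder(frames)             # (1, 6, 1024) context latents
z_pred = toy_predictor(z_ctx, actions)  # (1, 2, 1024) predicted latents
```

With the real model, `frames` would hold normalized camera observations and `actions` the logged robot commands; the predicted latents can then be compared against encoded future frames for planning or evaluation.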

Dataset Information

The model was trained on the SO-100 Ball-Cup Robotics Dataset.

  • Type: Video-based robotics interaction demos.
  • Task: Tracking and predicting the motion of a ball being caught in a cup by a robotic arm.
  • Observations: Multi-view camera setup (primarily observation.images.phone).
  • Actions: 6-DoF end-effector control or joint targets (delta control).
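Delta control means each action encodes the change in the commanded 6-DoF target between consecutive steps rather than an absolute pose. A minimal sketch, assuming a flat (x, y, z, roll, pitch, yaw) layout with illustrative values (the dataset's actual action encoding is defined by SO-100):

```python
import numpy as np

# Absolute 6-DoF targets over 3 timesteps: (x, y, z, roll, pitch, yaw).
# Values are illustrative, not taken from the dataset.
targets = np.array([
    [0.10, 0.00, 0.20, 0.0, 0.0, 0.0],
    [0.12, 0.01, 0.19, 0.0, 0.0, 0.1],
    [0.15, 0.01, 0.18, 0.0, 0.0, 0.2],
])

# Delta actions: difference between consecutive absolute targets.
deltas = np.diff(targets, axis=0)      # shape (2, 6)

# Replaying the deltas from the initial pose recovers the trajectory.
replayed = targets[0] + np.cumsum(deltas, axis=0)
```

Note this elementwise differencing is only exact for small rotations; large orientation changes would need a proper rotation representation.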

Training Progress (Loss Evolution)

The training process was fully tracked and logged using Weights & Biases (WandB). You can view the live dashboard and detailed metrics here:

WandB Training Dashboard: vjepa2-ac

The evolution of the L1 loss in latent space over the training period is shown below:

[Figure: Loss Evolution (training and validation L1 loss over training)]

The plot shows a consistent decrease in both training and validation loss, indicating successful learning of the world model dynamics without significant overfitting.

Training Details

  • Loss: L1 distance (Mean Absolute Error) in latent space between predicted and target embeddings.
  • Optimizer: AdamW
  • Scheduler: Cosine with warmup
  • Input: 6 context frames + 6 context actions → 2 predicted future latents.
  • Resolution: 256x256
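The objective above is simply the mean absolute error between predicted and target latent vectors. A minimal sketch of the computation, with shapes following this card (2 predicted steps, 1024-dim latents; the batch size is illustrative):

```python
import numpy as np

def latent_l1_loss(pred, target):
    """Mean absolute error between predicted and target latents."""
    return np.abs(pred - target).mean()

rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 2, 1024))  # (batch, predicted steps, embed dim)
target = pred + 0.1                       # targets offset by a constant 0.1

loss = latent_l1_loss(pred, target)       # ~0.1 for this constant offset
```

Computing the loss in latent space rather than pixel space is what makes this a world model objective: the network is penalized for mispredicting the encoder's representation of the future, not its raw pixels.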

Citation

If you use this model, please cite the original V-JEPA work and the SO-100 dataset.

@article{vjepa2,
  title={V-JEPA 2: Action-Conditioned World Models for Robotics},
  author={...},
  journal={...},
  year={2024}
}