# SO-100 V-JEPA 2-AC (Action-Conditioned) Model
This repository contains a V-JEPA 2-AC model trained on the SO-100 robotics dataset for the Ball-Cup task.
## Model Overview
- Architecture: V-JEPA 2 with Action-Conditioned Predictor
- Vision Foundation: ViT-Large (1024 embed dim)
- Task: Robotics control / world modeling (predicting future latents based on current context and actions)
- Dataset: SO-100 Ball-Cup (Robotics interaction)
## Directory Structure

```
jepa-model/
├── config.json         # Model architecture and data configuration
├── pytorch_model.bin   # Best predictor weights (state dict)
├── vision_encoder.pt   # ViT-Large vision encoder weights (~5 GB)
├── README.md           # Model documentation
├── train.py            # Source code for model and training
├── training_log.csv    # Training history
├── ckpt_ep*.pt         # Training checkpoints with optimizer state
└── latest.pt           # Latest training checkpoint
```
## How to Use

### Prerequisites

```bash
pip install torch numpy torchvision
```

### Loading the Model
This model requires both the Vision Encoder and the Action-Conditioned Predictor.
```python
import json

import torch

from train import ActionConditionedVJepa, VisionEncoder  # Classes from train.py

# 1. Load the vision encoder (ViT-L)
encoder = VisionEncoder(model_name="vit_large")
encoder.load_state_dict(torch.load("vision_encoder.pt", map_location="cpu"))
encoder.eval()

# 2. Load the action-conditioned predictor
with open("config.json", "r") as f:
    config = json.load(f)

predictor = ActionConditionedVJepa(config["architecture"]["predictor"])
predictor.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
predictor.eval()
```
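To illustrate how the loaded pieces fit together, here is a minimal, self-contained sketch of the action-conditioned prediction interface. The `ToyActionConditionedPredictor` below is a stand-in, not the transformer predictor defined in `train.py`; the tensor shapes follow this README (6 context latents and actions in, 2 future latents out, 1024-dim embeddings), while `action_dim=6` is an assumption based on the 6-DoF action description.

```python
import torch
import torch.nn as nn


class ToyActionConditionedPredictor(nn.Module):
    """Stand-in for the real predictor: given per-frame context latents and
    the matching actions, predict future latents. Only the input/output
    shapes are meant to be representative, not the architecture."""

    def __init__(self, embed_dim=1024, action_dim=6, n_future=2):
        super().__init__()
        self.embed_dim = embed_dim
        self.n_future = n_future
        self.net = nn.Sequential(
            nn.Linear(embed_dim + action_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, n_future * embed_dim),
        )

    def forward(self, latents, actions):
        # latents: (B, T, D), actions: (B, T, A) -> future latents: (B, n_future, D)
        x = torch.cat([latents, actions], dim=-1).mean(dim=1)  # pool over context
        return self.net(x).view(-1, self.n_future, self.embed_dim)


toy = ToyActionConditionedPredictor()
context_latents = torch.randn(2, 6, 1024)  # e.g. encoder output for 6 frames
context_actions = torch.randn(2, 6, 6)     # one action per context frame
future = toy(context_latents, context_actions)
assert future.shape == (2, 2, 1024)
```

With the real modules, the encoder would first map the 6 context frames to latents, and the predictor would consume those latents together with the action sequence.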
## Dataset Information
The model was trained on the SO-100 Ball-Cup Robotics Dataset.
- Type: Video-based robotics interaction demos.
- Task: Tracking and predicting the motion of a ball being caught in a cup by a robotic arm.
- Observations: Multi-view camera setup (primarily `observation.images.phone`).
- Actions: 6-DoF end-effector control or joint targets (delta control).
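Delta control means each action encodes the change in pose (or joint targets) between consecutive timesteps rather than an absolute target. A small sketch of that convention, using made-up 6-DoF values:

```python
import numpy as np

# Two consecutive absolute 6-DoF targets (x, y, z, roll, pitch, yaw);
# the values here are illustrative, not taken from the dataset.
targets = np.array([
    [0.10, 0.20, 0.30, 0.0, 0.0, 0.0],
    [0.12, 0.18, 0.31, 0.0, 0.0, 0.1],
])

# Delta actions are the per-step differences between absolute targets.
deltas = np.diff(targets, axis=0)  # shape (T-1, 6)
```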
## Training Progress (Loss Evolution)
The training process was fully tracked and logged using Weights & Biases (WandB). You can view the live dashboard and detailed metrics here:
WandB Training Dashboard: vjepa2-ac
The evolution of the L1 loss in latent space over the training period is shown below:
The plot shows a consistent decrease in both training and validation loss, indicating successful learning of the world model dynamics without significant overfitting.
## Training Details
- Loss: L1 distance (Mean Absolute Error) in latent space between predicted and target embeddings.
- Optimizer: AdamW
- Scheduler: Cosine with warmup
- Input: 6 context frames + 6 context actions → 2 predicted future latents.
- Resolution: 256x256
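The objective above can be sketched directly: the loss is the mean absolute error between predicted and target latents. The shapes follow this README (2 future tokens, 1024-dim embeddings); the batch size and the zero/one tensors are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Predictor output and target encoder embeddings: (B, n_future, embed_dim)
predicted = torch.zeros(4, 2, 1024)  # stand-in for predictor output
target = torch.ones(4, 2, 1024)      # stand-in for embeddings of true future frames

# L1 distance in latent space, averaged over all elements
loss = F.l1_loss(predicted, target)
```

In training, `target` would come from encoding the actual future frames with the (frozen or EMA) vision encoder, so the predictor learns dynamics in latent space rather than pixel space.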
## Citation
If you use this model, please cite the original V-JEPA work and the SO-100 dataset.
```bibtex
@article{vjepa2,
  title={V-JEPA 2: Action-Conditioned World Models for Robotics},
  author={...},
  journal={...},
  year={2024}
}
```