# VL-JEPA: Simplified Video-Language Alignment

A simplified implementation of the Video-Language Joint Embedding Predictive Architecture (VL-JEPA) for **Temporal Moment Retrieval** (Temporal Grounding). This project uses **V-JEPA 2** for video understanding and **Qwen 2.5 0.5B** as a predictor to align video features with language queries in a high-dimensional embedding space.

## 🚀 Architecture

The model follows the JEPA framework by aligning video features (X) and text descriptions (Y) through a predictor (P):

- **X-Encoder (Video)**: Frozen **V-JEPA 2** (ViT-L). High-fidelity hierarchical video features.
- **Y-Encoder (Text)**: Frozen **MiniLM** (all-MiniLM-L6-v2). Compact and efficient semantic text embeddings.
- **Predictor (Alignment)**: **Qwen 2.5 0.5B** with **LoRA** (Low-Rank Adaptation). Learns to predict the target text embedding from the joint video+query representation.

## 🛠️ Installation

This project uses `uv` for lightning-fast dependency management.

```bash
# Clone the repository
git clone https://github.com/max044/vl-jepa.git
cd vl-jepa

# Create environment and install dependencies
uv sync
```

## 📊 Data Preparation

The model is trained on the **Charades-STA** dataset for temporal grounding.

1. **Videos**: Download [Charades v1](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip) and place the clips in `data/Charades_v1_480`.
2. **Annotations**: Use `download_annotations.py` to download the annotations.

Structure:

```text
data/
├── Charades_v1_480/        # Video files (.mp4)
├── charades_sta_train.txt
└── charades_sta_test.txt
```

## 🏋️ Training

Start training with default hyperparameters:

```bash
# Regular training (local, MPS/CPU)
uv run train.py

# Debug mode (small subset, only 2 epochs)
uv run train.py --debug --device mps
```

### Key Training Features

- **Bidirectional InfoNCE Loss**: Maximizes mutual information between predicted and target embeddings (see the loss sketch below).
- **LoRA Tuning**: Only 0.2% of the predictor parameters (Qwen) are trained, making training extremely memory-efficient (see the LoRA sketch below).
- **MPS Support**: Optimized for Mac M1/M2/M3 chips.
- **W&B Integration**: Full experiment tracking with model versioning.
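The bidirectional InfoNCE objective fits in a few lines of PyTorch. This is a minimal sketch rather than the repository's exact loss code: the function name, the in-batch pairing convention, and the temperature of 0.07 are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(pred, target, temperature=0.07):
    """Symmetric InfoNCE over a batch of predicted and target embeddings.

    pred:   (B, D) embeddings produced by the predictor (video + query)
    target: (B, D) embeddings from the frozen text encoder
    Matching pairs share the same batch index; all other pairs act as negatives.
    """
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)

    logits = pred @ target.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)

    loss_p2t = F.cross_entropy(logits, labels)            # predicted -> target direction
    loss_t2p = F.cross_entropy(logits.t(), labels)        # target -> predicted direction
    return 0.5 * (loss_p2t + loss_t2p)
```

Because negatives come from the rest of the batch, larger batch sizes give the contrastive objective more negatives per step.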
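The LoRA setup for the Qwen predictor typically reduces to a small `peft` configuration. The sketch below is illustrative only: the rank, alpha, dropout, and `target_modules` are assumptions, not the values used in `train.py`.

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the Qwen 2.5 0.5B backbone used as the predictor.
base = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B")

# Wrap it with LoRA adapters; only the low-rank matrices become trainable.
# Hyperparameters here are illustrative guesses, not the repo's actual config.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="FEATURE_EXTRACTION",
)
predictor = get_peft_model(base, config)

# Reports the trainable-parameter fraction, on the order of the ~0.2% quoted above.
predictor.print_trainable_parameters()
```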
## ☁️ Cloud GPU Training

Train on a GPU with [Vast.ai](https://vast.ai/) (~$0.50–2/h for an A100/H100).

### Quick Start

```bash
# 1. On the cloud instance — bootstrap
curl -sSL https://raw.githubusercontent.com/max044/vl-jepa/main/scripts/bootstrap.sh | bash

# 2. Configure W&B
cd ~/vl-jepa
cp .env.example .env
nano .env  # Set WANDB_API_KEY (get it at https://wandb.ai/authorize)

# 3. Download videos
wget -P data/ https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip
unzip data/Charades_v1_480.zip -d data/
# Alternatively, download from Hugging Face:
uv run hf download max044/Charades_v1_480 --local-dir data/Charades_v1_480 --repo-type dataset

# 4. Launch training
bash scripts/train_cloud.sh
```

### W&B Experiment Tracking

All training runs are tracked on [Weights & Biases](https://wandb.ai/):

- **Metrics**: loss, InfoNCE, learning rate (per step + per epoch)
- **System**: GPU utilization, memory usage (automatic)
- **Model versioning**: checkpoints uploaded as W&B Artifacts (`vl-jepa-best`, `vl-jepa-last`) — every version is preserved and downloadable

```bash
# Train with W&B (default)
uv run train.py --device cuda --wandb-project vl-jepa

# Train without W&B
uv run train.py --device cuda --no-wandb

# Custom W&B run name
uv run train.py --device cuda --wandb-run-name "exp-lr3e4-bs16"
```

### Environment Variables

| Variable        | Description                                           | Required     |
| --------------- | ----------------------------------------------------- | ------------ |
| `WANDB_API_KEY` | W&B API key ([get here](https://wandb.ai/authorize))  | For tracking |
| `WANDB_PROJECT` | W&B project name (default: `vl-jepa`)                 | No           |
| `WANDB_ENTITY`  | W&B team/organization                                 | No           |
| `EPOCHS`        | Override epoch count                                  | No           |
| `BATCH_SIZE`    | Override batch size                                   | No           |

## 🔍 Inference (Moment Retrieval)

Once trained, the model can be used to find specific moments in a video based on a text query. The script uses a sliding-window approach with NMS (non-maximum suppression) to find the best matching segments.

```bash
# Example: Local inference
uv run infer.py \
  --video data/Charades_v1_480/3MSZA.mp4 \
  --query "person turns on the light" \
  --checkpoint checkpoints/best.pth \
  --device mps
```

## 🔍 Implementation Details

Unlike standard VLMs (Vision-Language Models) that use generative heads, this VL-JEPA implementation focuses on **embedding alignment**. This makes it an order of magnitude faster for retrieval tasks (search), since embeddings can be pre-computed and indexed with vector databases (Faiss, Milvus, Chroma).

## 📚 References

This implementation is based on the official VL-JEPA paper:

```bibtex
@misc{chen2026vljepajointembeddingpredictive,
  title={VL-JEPA: Joint Embedding Predictive Architecture for Vision-language},
  author={Delong Chen and Mustafa Shukor and Theo Moutakanni and Willy Chung and Jade Yu and Tejaswi Kasarla and Yejin Bang and Allen Bolourchi and Yann LeCun and Pascale Fung},
  year={2026},
  eprint={2512.10942},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10942},
}
```

## 📄 License

MIT