Upload folder using huggingface_hub
Browse files- README.md +94 -0
- best.pth +3 -0
- original_readme.md +172 -0
- sample_inference.py +64 -0
- vljepa/__init__.py +7 -0
- vljepa/__pycache__/__init__.cpython-313.pyc +0 -0
- vljepa/__pycache__/config.cpython-313.pyc +0 -0
- vljepa/__pycache__/dataset.cpython-313.pyc +0 -0
- vljepa/__pycache__/losses.cpython-313.pyc +0 -0
- vljepa/__pycache__/models.cpython-313.pyc +0 -0
- vljepa/__pycache__/utils.cpython-313.pyc +0 -0
- vljepa/config.py +87 -0
- vljepa/dataset.py +185 -0
- vljepa/losses.py +88 -0
- vljepa/models.py +240 -0
- vljepa/utils.py +158 -0
README.md
ADDED
|
@@ -0,0 +1,94 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
- fr
|
| 5 |
+
license: apache-2.0
|
| 6 |
+
library_name: transformers
|
| 7 |
+
tags:
|
| 8 |
+
- video-search
|
| 9 |
+
- v-jepa
|
| 10 |
+
- multi-modal
|
| 11 |
+
- temporal-grounding
|
| 12 |
+
- action-retrieval
|
| 13 |
+
datasets:
|
| 14 |
+
- max044/Charades_v1_480
|
| 15 |
+
metrics:
|
| 16 |
+
- loss
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
# VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)
|
| 20 |
+
|
| 21 |
+
## English Description
|
| 22 |
+
|
| 23 |
+
This model is a custom implementation of the **VL-JEPA** (Video-Language Joint
|
| 24 |
+
Embedding Predictive Architecture) inspired by Meta AI's research. It is
|
| 25 |
+
designed for **Temporal Moment Retrieval** (finding specific actions in videos).
|
| 26 |
+
|
| 27 |
+
### Architecture
|
| 28 |
+
|
| 29 |
+
- **X-Encoder (Video)**: Frozen
|
| 30 |
+
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
|
| 31 |
+
- **Predictor (Refinement)**:
|
| 32 |
+
[Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned using
|
| 33 |
+
**LoRA** (Low-Rank Adaptation).
|
| 34 |
+
- **Y-Encoder (Text Target)**: Frozen
|
| 35 |
+
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
|
| 36 |
+
|
| 37 |
+
### Training Details
|
| 38 |
+
|
| 39 |
+
- **Dataset**:
|
| 40 |
+
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480)
|
| 41 |
+
(Academic dataset for video action localization).
|
| 42 |
+
- **Optimization**: LoRA with $r=64$ and $\alpha=128$, targeting `q_proj` and
|
| 43 |
+
`v_proj` in Qwen.
|
| 44 |
+
- **Learning Rate**: 3e-4 with Cosine Warmup.
|
| 45 |
+
- **Outcome**: Only 0.2% of parameters are trainable, making it extremely
|
| 46 |
+
lightweight to train and run.
|
| 47 |
+
|
| 48 |
+
---
|
| 49 |
+
|
| 50 |
+
## Description en Français
|
| 51 |
+
|
| 52 |
+
Ce modèle est une implémentation personnalisée de **VL-JEPA**, inspirée des
|
| 53 |
+
travaux de Meta AI. Il est optimisé pour la recherche d'actions temporelles dans
|
| 54 |
+
les vidéos (**Temporal Moment Retrieval**).
|
| 55 |
+
|
| 56 |
+
### Architecture
|
| 57 |
+
|
| 58 |
+
- **Encodeur Vidéo (X)** :
|
| 59 |
+
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256)
|
| 60 |
+
gelé.
|
| 61 |
+
- **Prédicteur** : [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
|
| 62 |
+
adapté avec **LoRA**.
|
| 63 |
+
- **Encodeur Texte (Y)** :
|
| 64 |
+
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
|
| 65 |
+
gelé.
|
| 66 |
+
|
| 67 |
+
### Détails d'Entraînement
|
| 68 |
+
|
| 69 |
+
- **Dataset** :
|
| 70 |
+
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480).
|
| 71 |
+
- **Méthode** : Entraînement via LoRA ($r=64$, $\alpha=128$).
|
| 72 |
+
- **Coût** : Approche très économique, entraînée pour environ 5$ sur Vast.ai.
|
| 73 |
+
|
| 74 |
+
## Usage / Utilisation
|
| 75 |
+
|
| 76 |
+
```python
|
| 77 |
+
import torch
|
| 78 |
+
from vljepa.config import Config
|
| 79 |
+
from vljepa.models import VLJepa
|
| 80 |
+
|
| 81 |
+
# Load model
|
| 82 |
+
config = Config()
|
| 83 |
+
model = VLJepa(config)
|
| 84 |
+
checkpoint = torch.load("best.pth", map_location="cpu")
|
| 85 |
+
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
|
| 86 |
+
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
|
| 87 |
+
model.eval()
|
| 88 |
+
|
| 89 |
+
# Localizing an action
|
| 90 |
+
# (Requires preprocessing frames and tokenizing query)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
Refer to the source code for full inference pipeline with sliding window and
|
| 94 |
+
NMS.
|
best.pth
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6393f56b7528ad91a3281ebcd0bb368b44dc041a5b50bc7569d466e91e992750
|
| 3 |
+
size 2045205003
|
original_readme.md
ADDED
|
@@ -0,0 +1,172 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# VL-JEPA: Simplified Video-Language Alignment
|
| 2 |
+
|
| 3 |
+
A simplified implementation of the Video-Language Joint Embedding Predictive
|
| 4 |
+
Architecture (VL-JEPA) for **Temporal Moment Retrieval** (Temporal Grounding).
|
| 5 |
+
|
| 6 |
+
This project uses **V-JEPA 2** for video understanding and **Qwen 2.5 0.5B** as
|
| 7 |
+
a predictor to align video features with language queries in a high-dimensional
|
| 8 |
+
embedding space.
|
| 9 |
+
|
| 10 |
+
## 🚀 Architecture
|
| 11 |
+
|
| 12 |
+
The model follows the JEPA framework by aligning video features (X) and text
|
| 13 |
+
descriptions (Y) through a predictor (P):
|
| 14 |
+
|
| 15 |
+
- **X-Encoder (Video)**: Frozen **V-JEPA 2** (ViT-L). High-fidelity hierarchical
|
| 16 |
+
video features.
|
| 17 |
+
- **Y-Encoder (Text)**: Frozen **MiniLM** (all-MiniLM-L6-v2). Compact and
|
| 18 |
+
efficient semantic text embeddings.
|
| 19 |
+
- **Predictor (Alignment)**: **Qwen 2.5 0.5B** with **LoRA** (Low-Rank
|
| 20 |
+
Adaptation). Learns to predict the target text embedding from the joint
|
| 21 |
+
video+query representation.
|
| 22 |
+
|
| 23 |
+
## 🛠️ Installation
|
| 24 |
+
|
| 25 |
+
This project uses `uv` for lightning-fast dependency management.
|
| 26 |
+
|
| 27 |
+
```bash
|
| 28 |
+
# Clone the repository
|
| 29 |
+
git clone https://github.com/max044/vl-jepa.git
|
| 30 |
+
cd vl-jepa
|
| 31 |
+
|
| 32 |
+
# Create environment and install dependencies
|
| 33 |
+
uv sync
|
| 34 |
+
```
|
| 35 |
+
|
| 36 |
+
## 📊 Data Preparation
|
| 37 |
+
|
| 38 |
+
The model is trained on the **Charades-STA** dataset for temporal grounding.
|
| 39 |
+
|
| 40 |
+
1. **Videos**: Download
|
| 41 |
+
[Charades v1](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip)
|
| 42 |
+
and place them in `data/Charades_v1_480`.
|
| 43 |
+
2. **Annotations**: Use `download_annotations.py` to download the annotations.
|
| 44 |
+
|
| 45 |
+
Structure:
|
| 46 |
+
|
| 47 |
+
```text
|
| 48 |
+
data/
|
| 49 |
+
├── Charades_v1_480/ # Video files (.mp4)
|
| 50 |
+
├── charades_sta_train.txt
|
| 51 |
+
└── charades_sta_test.txt
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
## 🏋️ Training
|
| 55 |
+
|
| 56 |
+
Start training with default hyperparameters:
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
# Regular training (local, MPS/CPU)
|
| 60 |
+
uv run train.py
|
| 61 |
+
|
| 62 |
+
# Debug mode (small subset, only 2 epochs)
|
| 63 |
+
uv run train.py --debug --device mps
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
### Key Training Features:
|
| 67 |
+
|
| 68 |
+
- **Bidirectional InfoNCE Loss**: Maximizes mutual information between predicted
|
| 69 |
+
and target embeddings.
|
| 70 |
+
- **LoRA Tuning**: Only 0.2% of the predictor parameters (Qwen) are trained,
|
| 71 |
+
making it extremely memory-efficient.
|
| 72 |
+
- **MPS Support**: Optimized for Mac M1/M2/M3 chips.
|
| 73 |
+
- **W&B Integration**: Full experiment tracking with model versioning.
|
| 74 |
+
|
| 75 |
+
## ☁️ Cloud GPU Training
|
| 76 |
+
|
| 77 |
+
Train on GPU with [Vast.ai](https://vast.ai/) (~$0.50–2/h for A100/H100).
|
| 78 |
+
|
| 79 |
+
### Quick Start
|
| 80 |
+
|
| 81 |
+
```bash
|
| 82 |
+
# 1. On the cloud instance — bootstrap
|
| 83 |
+
curl -sSL https://raw.githubusercontent.com/max044/vl-jepa/main/scripts/bootstrap.sh | bash
|
| 84 |
+
|
| 85 |
+
# 2. Configure W&B
|
| 86 |
+
cd ~/vl-jepa
|
| 87 |
+
cp .env.example .env
|
| 88 |
+
nano .env # Set WANDB_API_KEY (get it at https://wandb.ai/authorize)
|
| 89 |
+
|
| 90 |
+
# 3. Download videos
|
| 91 |
+
wget -P data/ https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip
|
| 92 |
+
unzip data/Charades_v1_480.zip -d data/
|
| 93 |
+
|
| 94 |
+
# or, alternatively, download from the Hugging Face mirror:
|
| 95 |
+
|
| 96 |
+
uv run hf download max044/Charades_v1_480 --local-dir data/Charades_v1_480 --repo-type dataset
|
| 97 |
+
|
| 98 |
+
# 4. Launch training
|
| 99 |
+
bash scripts/train_cloud.sh
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
### W&B Experiment Tracking
|
| 103 |
+
|
| 104 |
+
All training runs are tracked on [Weights & Biases](https://wandb.ai/):
|
| 105 |
+
|
| 106 |
+
- **Metrics**: loss, InfoNCE, learning rate (per step + per epoch)
|
| 107 |
+
- **System**: GPU utilization, memory usage (automatic)
|
| 108 |
+
- **Model versioning**: checkpoints uploaded as W&B Artifacts (`vl-jepa-best`,
|
| 109 |
+
`vl-jepa-last`) — every version is preserved and downloadable
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
# Train with W&B (default)
|
| 113 |
+
uv run train.py --device cuda --wandb-project vl-jepa
|
| 114 |
+
|
| 115 |
+
# Train without W&B
|
| 116 |
+
uv run train.py --device cuda --no-wandb
|
| 117 |
+
|
| 118 |
+
# Custom W&B run name
|
| 119 |
+
uv run train.py --device cuda --wandb-run-name "exp-lr3e4-bs16"
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
### Environment Variables
|
| 123 |
+
|
| 124 |
+
| Variable | Description | Required |
|
| 125 |
+
| --------------- | ---------------------------------------------------- | ------------ |
|
| 126 |
+
| `WANDB_API_KEY` | W&B API key ([get here](https://wandb.ai/authorize)) | For tracking |
|
| 127 |
+
| `WANDB_PROJECT` | W&B project name (default: `vl-jepa`) | No |
|
| 128 |
+
| `WANDB_ENTITY` | W&B team/organization | No |
|
| 129 |
+
| `EPOCHS` | Override epoch count | No |
|
| 130 |
+
| `BATCH_SIZE` | Override batch size | No |
|
| 131 |
+
|
| 132 |
+
## 🔍 Inference (Moment Retrieval)
|
| 133 |
+
|
| 134 |
+
Once trained, you can use the model to find specific moments in a video based on
|
| 135 |
+
a text query. The script uses a sliding window approach with NMS to find the
|
| 136 |
+
best matching segments.
|
| 137 |
+
|
| 138 |
+
```bash
|
| 139 |
+
# Example: Local inference
|
| 140 |
+
uv run infer.py \
|
| 141 |
+
--video data/Charades_v1_480/3MSZA.mp4 \
|
| 142 |
+
--query "person turns on the light" \
|
| 143 |
+
--checkpoint checkpoints/best.pth \
|
| 144 |
+
--device mps
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
## 🔍 Implementation Details
|
| 148 |
+
|
| 149 |
+
Unlike standard VLM (Visual-Language Models) that use generative heads, this
|
| 150 |
+
VL-JEPA implementation focuses on **embedding alignment**. This makes it an
|
| 151 |
+
order of magnitude faster for retrieval tasks (search) as embeddings can be
|
| 152 |
+
pre-computed and indexed using vector databases (Faiss, Milvus, Chroma).
|
| 153 |
+
|
| 154 |
+
## 📚 References
|
| 155 |
+
|
| 156 |
+
This implementation is based on the official VL-JEPA paper:
|
| 157 |
+
|
| 158 |
+
```bibtex
|
| 159 |
+
@misc{chen2026vljepajointembeddingpredictive,
|
| 160 |
+
title={VL-JEPA: Joint Embedding Predictive Architecture for Vision-language},
|
| 161 |
+
author={Delong Chen and Mustafa Shukor and Theo Moutakanni and Willy Chung and Jade Yu and Tejaswi Kasarla and Yejin Bang and Allen Bolourchi and Yann LeCun and Pascale Fung},
|
| 162 |
+
year={2026},
|
| 163 |
+
eprint={2512.10942},
|
| 164 |
+
archivePrefix={arXiv},
|
| 165 |
+
primaryClass={cs.CV},
|
| 166 |
+
url={https://arxiv.org/abs/2512.10942},
|
| 167 |
+
}
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
## 📄 License
|
| 171 |
+
|
| 172 |
+
MIT
|
sample_inference.py
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import torch
|
| 2 |
+
import cv2
|
| 3 |
+
import numpy as np
|
| 4 |
+
from PIL import Image
|
| 5 |
+
from vljepa.config import Config
|
| 6 |
+
from vljepa.models import VLJepa
|
| 7 |
+
from vljepa.utils import nms
|
| 8 |
+
|
| 9 |
+
def load_model(checkpoint_path, device="cpu"):
    """Build a VLJepa model and restore its trained weights.

    Only the LoRA predictor and the text-projection head were trained, so
    just those two state dicts are restored; the frozen encoders keep their
    pretrained weights.

    Args:
        checkpoint_path: path to the .pth checkpoint file.
        device: torch device string ("cpu", "cuda", "mps").

    Returns:
        Tuple of (model in eval mode, the Config used to build it).
    """
    cfg = Config()
    cfg.device = device
    net = VLJepa(cfg)

    print(f"Loading weights from {checkpoint_path}...")
    state = torch.load(checkpoint_path, map_location=device, weights_only=True)
    net.predictor.load_state_dict(state["predictor_state_dict"])
    net.y_encoder.projection.load_state_dict(state["y_projection_state_dict"])

    net.eval()
    return net, cfg
| 21 |
+
|
| 22 |
+
def extract_frames(video_path, num_frames=16):
    """Sample ``num_frames`` evenly spaced RGB frames from a video.

    Args:
        video_path: path to a video file readable by OpenCV.
        num_frames: number of frames to sample uniformly over the clip.

    Returns:
        List of RGB numpy arrays of shape (H, W, 3); empty if the video
        cannot be opened or reports no frames.
    """
    cap = cv2.VideoCapture(video_path)
    try:
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total_frames <= 0:
            # Fix: the original returned here without releasing the capture
            # handle, leaking the underlying file/codec resources.
            return []

        indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                # OpenCV decodes to BGR; the model pipeline expects RGB.
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        return frames
    finally:
        # Always release the capture, on both success and early-return paths.
        cap.release()
|
| 38 |
+
|
| 39 |
+
def main():
    """Smoke-test entry point: load the checkpoint and encode one query.

    This only demonstrates tokenization and text encoding; full temporal
    localization (sliding window + NMS) lives in infer.py.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    checkpoint_path = "best.pth"
    video_path = "sample_video.mp4"  # Replace with a real video path
    query = "a person is opening a door"

    model, config = load_model(checkpoint_path, device)

    # Simplified demonstration only; a real run slides a window over the
    # video as done in infer.py.
    print(f"Ready for inference on {device}.")
    print(f"Model architecture: {config.clip_model} + {config.predictor_model} (LoRA) + {config.text_model}")

    # Tokenize the query for the predictor.
    tokens = model.query_encoder.tokenize([query], device=device)

    # Encode the query with the frozen Y-encoder (no gradients needed).
    with torch.no_grad():
        embedding = model.encode_text([query], device=device)

    print(f"Query: '{query}'")
    print(f"Text embedding shape: {embedding.shape}")
    print("\nTo perform full temporal localization, use the infer.py script which implements sliding window and NMS.")


if __name__ == "__main__":
    main()
|
vljepa/__init__.py
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""VL-JEPA: Simplified Video-Language Joint Embedding Predictive Architecture."""
|
| 2 |
+
|
| 3 |
+
from vljepa.config import Config
|
| 4 |
+
from vljepa.models import VLJepa
|
| 5 |
+
from vljepa.losses import vl_jepa_loss
|
| 6 |
+
|
| 7 |
+
__all__ = ["Config", "VLJepa", "vl_jepa_loss"]
|
vljepa/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (435 Bytes). View file
|
|
|
vljepa/__pycache__/config.cpython-313.pyc
ADDED
|
Binary file (3.85 kB). View file
|
|
|
vljepa/__pycache__/dataset.cpython-313.pyc
ADDED
|
Binary file (6.86 kB). View file
|
|
|
vljepa/__pycache__/losses.cpython-313.pyc
ADDED
|
Binary file (3.71 kB). View file
|
|
|
vljepa/__pycache__/models.cpython-313.pyc
ADDED
|
Binary file (14.7 kB). View file
|
|
|
vljepa/__pycache__/utils.cpython-313.pyc
ADDED
|
Binary file (5.8 kB). View file
|
|
|
vljepa/config.py
ADDED
|
@@ -0,0 +1,87 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Configuration for VL-JEPA training and inference."""
|
| 2 |
+
|
| 3 |
+
from dataclasses import dataclass, field
|
| 4 |
+
from pathlib import Path
|
| 5 |
+
import torch
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
@dataclass
class Config:
    """Central hyperparameter and path registry for VL-JEPA.

    Instantiating a Config auto-detects the compute device (when left
    empty) and creates the checkpoint/data directories.
    """

    # Compute device; resolved in __post_init__ when left blank.
    device: str = ""

    # ── Models ───────────────────────────────────────────────────────
    # X-encoder: frozen V-JEPA 2 ViT-L (~300M params).
    clip_model: str = "facebook/vjepa2-vitl-fpc64-256"

    # Predictor: Qwen 2.5 0.5B adapted with LoRA.
    predictor_model: str = "Qwen/Qwen2.5-0.5B"
    use_lora: bool = True
    lora_r: int = 64
    lora_alpha: int = 128
    lora_dropout: float = 0.05
    lora_target_modules: list[str] = field(default_factory=lambda: ["q_proj", "v_proj"])

    # Y-encoder: frozen MiniLM sentence embedder (~22M params).
    text_model: str = "sentence-transformers/all-MiniLM-L6-v2"

    # Embedding dimensionality of each component.
    x_dim: int = 1024         # V-JEPA ViT-L output dim
    predictor_dim: int = 896  # Qwen 2.5 0.5B hidden dim
    text_dim: int = 384       # MiniLM-L6-v2 output dim
    embed_dim: int = 384      # Shared projection target

    # ── Video sampling ───────────────────────────────────────────────
    num_frames: int = 16
    frame_size: int = 224  # V-JEPA input resolution

    # ── Optimization ─────────────────────────────────────────────────
    batch_size: int = 4  # Start small (increase if GPU RAM allows)
    lr: float = 3e-4
    weight_decay: float = 0.01
    epochs: int = 20
    warmup_steps: int = 200
    grad_clip: float = 1.0

    # Loss hyperparameters.
    temperature: float = 0.07
    sigreg_weight: float = 0.1

    # ── Data locations ───────────────────────────────────────────────
    data_dir: str = "./data"
    videos_dir: str = "./data/Charades_v1_480"
    anno_train: str = "./data/charades_sta_train.txt"
    anno_test: str = "./data/charades_sta_test.txt"
    hf_dataset_id: str = "max044/Charades_v1_480"

    # ── Checkpointing / validation cadence ───────────────────────────
    checkpoint_dir: str = "./checkpoints"
    save_every: int = 2   # save checkpoint every N epochs
    val_every: int = 2    # run validation every N epochs
    val_samples: int = 500  # cap validation samples for speed

    # ── Inference (sliding-window moment retrieval) ──────────────────
    window_sizes: list[float] = field(default_factory=lambda: [2.0, 4.0, 8.0, 16.0])
    window_stride: float = 1.0
    nms_threshold: float = 0.5
    top_k: int = 5

    # ── Debugging ────────────────────────────────────────────────────
    debug: bool = False
    debug_samples: int = 100
    num_workers: int = 0  # 0 keeps DataLoader MPS-compatible

    def __post_init__(self):
        # Auto-select the best available backend when none was supplied.
        if not self.device:
            if torch.cuda.is_available():
                self.device = "cuda"
            elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
                self.device = "mps"
            else:
                self.device = "cpu"

        # Make sure the output/input directories exist up front.
        for directory in (self.checkpoint_dir, self.data_dir):
            Path(directory).mkdir(parents=True, exist_ok=True)
|
vljepa/dataset.py
ADDED
|
@@ -0,0 +1,185 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Charades-STA dataset for VL-JEPA training."""
|
| 2 |
+
|
| 3 |
+
import os
|
| 4 |
+
import numpy as np
|
| 5 |
+
import torch
|
| 6 |
+
from torch.utils.data import Dataset
|
| 7 |
+
|
| 8 |
+
from vljepa.config import Config
|
| 9 |
+
from vljepa.utils import load_video_frames
|
| 10 |
+
|
| 11 |
+
try:
|
| 12 |
+
from huggingface_hub import hf_hub_download
|
| 13 |
+
HAS_HF_HUB = True
|
| 14 |
+
except ImportError:
|
| 15 |
+
HAS_HF_HUB = False
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class CharadesSTADataset(Dataset):
    """Dataset for Charades-STA temporal grounding.

    Annotation format: video_id start end##sentence
    Example: 3MSZA 24.3 30.4##person turn a light on

    For training, the query is a neutral prompt ("What is happening in this video?")
    and the target is the ground-truth caption.
    """

    # Content-free prompts cycled per sample (by index) so the predictor
    # sees varied queries while the caption remains the learning target.
    NEUTRAL_QUERIES = [
        "What is happening in this video?",
        "Describe this video clip.",
        "What action is being performed?",
    ]

    def __init__(
        self,
        anno_file: str,
        videos_dir: str,
        config: Config,
        split: str = "train",
    ):
        """Load annotations and optionally truncate for debug runs.

        Args:
            anno_file: path to a Charades-STA annotation text file; if it
                does not exist, falls back to the HuggingFace dataset.
            videos_dir: directory containing (or caching) the .mp4 files.
            config: global Config (debug flags, frame count, HF repo id).
            split: label used only for the load-report print.
        """
        self.videos_dir = videos_dir
        self.config = config
        self.split = split
        self.samples = []

        self._load_annotations(anno_file)

        # Debug mode trims the sample list for fast iteration.
        if config.debug:
            self.samples = self.samples[: config.debug_samples]

        print(f"[{split}] Loaded {len(self.samples)} samples")

    def _load_annotations(self, anno_file: str):
        """Parse Charades-STA annotation file.

        Malformed lines (missing '##' separator or fewer than three
        metadata fields) are silently skipped.
        """
        if not os.path.exists(anno_file):
            # Try loading from HuggingFace datasets
            self._load_from_hf()
            return

        with open(anno_file, "r") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue

                # Format: video_id start end##sentence
                parts = line.split("##")
                if len(parts) < 2:
                    continue

                meta = parts[0].strip().split()
                sentence = parts[1].strip()

                if len(meta) < 3:
                    continue

                video_id = meta[0]
                start = float(meta[1])
                end = float(meta[2])

                video_path = os.path.join(self.videos_dir, f"{video_id}.mp4")

                # If streaming/lazy loading is enabled, we add even if not local
                # (the file is fetched on demand in __getitem__).
                if os.path.exists(video_path) or self.config.hf_dataset_id:
                    self.samples.append({
                        "video_path": video_path,
                        "video_id": video_id,
                        "start": start,
                        "end": end,
                        "caption": sentence,
                    })

    def _load_from_hf(self):
        """Fallback: load annotations from HuggingFace datasets.

        NOTE(review): only the "test" split is fetched here — confirm this
        is intended when the local train annotation file is missing.
        Failures are reported but leave self.samples empty rather than
        raising.
        """
        try:
            from datasets import load_dataset

            print("Loading annotations from HuggingFace (lmms-lab/charades_sta)...")
            ds = load_dataset("lmms-lab/charades_sta", split="test")

            for item in ds:
                # Field names differ between dataset versions; try both.
                video_id = item.get("video_id") or item.get("video", "")
                start = float(item.get("start", 0))
                end = float(item.get("end", 10))
                caption = item.get("query", "") or item.get("description", "")

                video_path = os.path.join(self.videos_dir, f"{video_id}.mp4")
                if os.path.exists(video_path) and caption:
                    self.samples.append({
                        "video_path": video_path,
                        "video_id": video_id,
                        "start": start,
                        "end": end,
                        "caption": caption,
                    })

        except Exception as e:
            print(f"Failed to load from HuggingFace: {e}")
            print("Please download annotations manually. See download_annotations.py")

    def __len__(self):
        """Return the number of (video segment, caption) samples."""
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict | None:
        """Load one sample; returns None on any failure so the collate
        function can drop it instead of crashing the DataLoader worker.
        """
        sample = self.samples[idx]
        video_path = sample["video_path"]

        # ── Lazy Loading from HF ────────────────────────────
        if not os.path.exists(video_path) and self.config.hf_dataset_id:
            if HAS_HF_HUB:
                try:
                    # Download only the specific file needed
                    video_path = hf_hub_download(
                        repo_id=self.config.hf_dataset_id,
                        filename=f"{sample['video_id']}.mp4",
                        repo_type="dataset",
                        local_dir=self.videos_dir,  # Cache it in the videos dir
                    )
                except Exception as e:
                    print(f"Error downloading {sample['video_id']}: {e}")
                    return None
            else:
                print("Error: huggingface_hub not installed, cannot lazy load.")
                return None

        # Load frames from the annotated temporal segment
        frames = load_video_frames(
            video_path,
            start_sec=sample["start"],
            end_sec=sample["end"],
            num_frames=self.config.num_frames,
        )

        if frames is None or len(frames) == 0:
            return None

        # Use a neutral query for training
        # (VL-JEPA learns to predict the target caption embedding from video + query)
        query_idx = idx % len(self.NEUTRAL_QUERIES)
        query = self.NEUTRAL_QUERIES[query_idx]

        return {
            "frames": frames,  # list of numpy arrays (H, W, 3)
            "query": query,  # neutral text query
            "caption": sample["caption"],  # target caption
            "video_id": sample["video_id"],
            "start": sample["start"],
            "end": sample["end"],
        }
|
| 170 |
+
|
| 171 |
+
|
| 172 |
+
def collate_fn(batch: list[dict | None]) -> dict | None:
|
| 173 |
+
"""Custom collate that filters out None samples."""
|
| 174 |
+
batch = [b for b in batch if b is not None]
|
| 175 |
+
if len(batch) == 0:
|
| 176 |
+
return None
|
| 177 |
+
|
| 178 |
+
return {
|
| 179 |
+
"frames": [b["frames"] for b in batch],
|
| 180 |
+
"queries": [b["query"] for b in batch],
|
| 181 |
+
"captions": [b["caption"] for b in batch],
|
| 182 |
+
"video_ids": [b["video_id"] for b in batch],
|
| 183 |
+
"starts": [b["start"] for b in batch],
|
| 184 |
+
"ends": [b["end"] for b in batch],
|
| 185 |
+
}
|
vljepa/losses.py
ADDED
|
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Loss functions for VL-JEPA: bidirectional InfoNCE + SIGReg regularization."""
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn.functional as F
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
def infonce_bidirectional(
    pred: torch.Tensor,
    target: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE between predicted and target embeddings.

    Both inputs are L2-normalized here, so callers may pass raw embeddings.
    The positive pair for row i is column i; the loss averages the
    pred->target and target->pred directions.

    Args:
        pred: (B, D) predicted embeddings.
        target: (B, D) target embeddings.
        temperature: logit scaling factor (lower = sharper distribution).

    Returns:
        Scalar loss tensor.
    """
    p = F.normalize(pred, dim=-1)
    t = F.normalize(target, dim=-1)

    # (B, B) cosine-similarity logits; diagonal entries are the positives.
    sims = torch.matmul(p, t.T) / temperature
    positives = torch.arange(p.size(0), device=p.device)

    forward = F.cross_entropy(sims, positives)
    backward = F.cross_entropy(sims.T, positives)
    return (forward + backward) / 2
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def sigreg_loss(
    embeddings: torch.Tensor,
    lambda_reg: float = 0.1,
) -> torch.Tensor:
    """Simplified SIGReg: push the batch covariance toward the identity.

    Penalizes (a) per-dimension variance falling below 1 and (b) nonzero
    off-diagonal covariance, discouraging embedding collapse.

    Args:
        embeddings: (B, D) batch of embeddings.
        lambda_reg: weight applied to the combined penalty.

    Returns:
        Scalar loss; zero when the batch has fewer than 2 samples
        (covariance is undefined for a single sample).
    """
    batch = embeddings.size(0)
    if batch < 2:
        return torch.tensor(0.0, device=embeddings.device)

    centered = embeddings - embeddings.mean(dim=0, keepdim=True)

    # Unbiased sample covariance, shape (D, D).
    cov = centered.T @ centered / (batch - 1)
    diag = cov.diagonal()

    # Hinge on variance: only under-dispersed dimensions are penalized.
    variance_term = F.relu(1.0 - diag).mean()

    # Squared off-diagonal entries encourage decorrelated dimensions.
    covariance_term = ((cov - torch.diag(diag)) ** 2).mean()

    return lambda_reg * (variance_term + covariance_term)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def vl_jepa_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    temperature: float = 0.07,
    sigreg_weight: float = 0.1,
) -> tuple[torch.Tensor, dict[str, float]]:
    """Combined VL-JEPA training loss.

    Sums a bidirectional InfoNCE alignment term with SIGReg regularization
    applied independently to the predicted and target embedding batches.

    Args:
        pred: predicted embeddings, shape (B, D).
        target: target embeddings, shape (B, D).
        temperature: InfoNCE logit temperature.
        sigreg_weight: weight passed to each SIGReg term.

    Returns:
        total_loss: scalar tensor for backprop.
        metrics: dict with breakdown of loss components.
    """
    # Contrastive alignment between predicted and target embeddings.
    alignment = infonce_bidirectional(pred, target, temperature)

    # Regularize both embedding sets independently.
    reg_p = sigreg_loss(pred, sigreg_weight)
    reg_t = sigreg_loss(target, sigreg_weight)

    total_loss = alignment + reg_p + reg_t
    breakdown = {
        "loss/total": total_loss.item(),
        "loss/infonce": alignment.item(),
        "loss/sigreg_pred": reg_p.item(),
        "loss/sigreg_target": reg_t.item(),
    }
    return total_loss, breakdown
|
vljepa/models.py
ADDED
|
@@ -0,0 +1,240 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""VL-JEPA model components: V-JEPA 2 (X-Encoder), Qwen 2.5 (Predictor), MiniLM (Y-Encoder)."""
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn as nn
|
| 5 |
+
import torch.nn.functional as F
|
| 6 |
+
from transformers import AutoModel, AutoTokenizer
|
| 7 |
+
from peft import get_peft_model, LoraConfig, TaskType
|
| 8 |
+
from sentence_transformers import SentenceTransformer
|
| 9 |
+
import numpy as np
|
| 10 |
+
|
| 11 |
+
from vljepa.config import Config
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class XEncoder(nn.Module):
    """Frozen V-JEPA 2 Video Encoder.

    Extracts hierarchical video features. The backbone is loaded from
    HuggingFace, frozen, and used purely as a feature extractor;
    ``config.x_dim`` is overwritten with the backbone's actual hidden size.
    """

    def __init__(self, config: Config):
        super().__init__()
        # Load V-JEPA 2 model
        try:
            self.model = AutoModel.from_pretrained(config.clip_model, trust_remote_code=True)
        except Exception:
            # Best-effort fallback to a known public checkpoint.
            print(f"Warning: Failed to load {config.clip_model}. Trying fallback 'facebook/vjepa-vit-h-14-224'.")
            self.model = AutoModel.from_pretrained("facebook/vjepa-vit-h-14-224", trust_remote_code=True)
        # NOTE: mutates the shared config so downstream modules (e.g. the
        # Predictor's visual projection) size themselves to the real width.
        config.x_dim = self.model.config.hidden_size

        # Freeze
        for p in self.model.parameters():
            p.requires_grad = False
        self.model.eval()

        # Move to device if needed
        self.model.to(config.device)

        self.hidden_size = config.x_dim

    @torch.no_grad()
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Encode video frames.

        Args:
            pixel_values: (B, C, T, H, W) preprocessed frames (0-1 float, normalized)

        Returns:
            Mean-pooled clip features of shape (B, hidden_size).
        """
        # Heuristic layout fix: treat (B, 3, T, H, W) input as channel-first
        # and convert to (B, T, 3, H, W).
        # NOTE(review): ambiguous for 3-frame clips (T == 3) — confirm callers
        # never pass exactly 3 frames in channel-last layout.
        if pixel_values.shape[1] == 3 and pixel_values.shape[2] > 3:
            # (B, C, T, H, W) -> (B, T, C, H, W)
            pixel_values = pixel_values.permute(0, 2, 1, 3, 4)

        try:
            outputs = self.model(pixel_values_videos=pixel_values)
        except TypeError:
            # Fallback for backbones that take the generic argument name.
            outputs = self.model(pixel_values=pixel_values)

        last_hidden = outputs.last_hidden_state  # (B, seq_len, hidden)
        # Mean-pool over the token dimension: one vector per clip.
        sv = last_hidden.mean(dim=1)  # (B, hidden)
        return sv

    def preprocess_frames(self, frames_batch: list[list], device: str = "cpu") -> torch.Tensor:
        """Preprocess frames.

        Converts a batch of per-video frame lists (H, W, 3 arrays, 0-255) into
        a normalized (B, T, 3, 224, 224) float tensor. Shorter clips are padded
        by repeating their last frame; empty clips become 16 black frames.
        """
        # ImageNet normalization constants, shaped to broadcast over
        # (B, 3, T, H, W).
        mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1, 1)

        padded = []
        for frames in frames_batch:
            if len(frames) == 0:
                # Empty clip: substitute a 16-frame black placeholder.
                t = torch.zeros((16, 3, 224, 224), device=device)
                padded.append(t)
                continue

            # Stack to (T, H, W, 3)
            t = torch.tensor(np.stack(frames), dtype=torch.float32, device=device)

            # Permute to (T, 3, H, W)
            t = t.permute(0, 3, 1, 2) / 255.0

            # Resize
            t = F.interpolate(t, size=(224, 224), mode='bilinear', align_corners=False)

            padded.append(t)

        # Pad every clip to the longest one by repeating its final frame.
        max_t = max((t.size(0) for t in padded), default=16)
        final_padded = []
        for t in padded:
            if t.size(0) < max_t:
                pad = t[-1:].expand(max_t - t.size(0), -1, -1, -1)
                t = torch.cat([t, pad], dim=0)
            final_padded.append(t)

        # Stack -> (B, T, 3, H, W)
        pixel_values = torch.stack(final_padded, dim=0)

        # Input to V-JEPA 2 (via HF) usually expects (B, T, C, H, W)

        # Normalize (broadcasting T)
        # mean/std are (1, 3, 1, 1, 1). We need to align with (B, T, 3, H, W)
        # Permute to (B, 3, T, H, W) for normalization
        pixel_values = pixel_values.permute(0, 2, 1, 3, 4)
        pixel_values = (pixel_values - mean) / std

        # Permute back to (B, T, 3, H, W)
        pixel_values = pixel_values.permute(0, 2, 1, 3, 4)

        return pixel_values
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
class QueryEncoder(nn.Module):
    """Wraps the Qwen tokenizer used to encode text queries for the predictor."""

    def __init__(self, config: Config):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(config.predictor_model, trust_remote_code=True)
        # Some causal LMs ship without a pad token; reuse EOS so batch
        # padding works.
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def tokenize(self, texts: list[str], device: str = "cpu") -> dict:
        """Tokenize a batch of query strings and move the tensors to *device*."""
        encoded = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=64,
        )
        return encoded.to(device)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
class Predictor(nn.Module):
    """Qwen 2.5 0.5B Predictor with LoRA.

    Consumes a pooled video embedding plus a tokenized text query and predicts
    the target text embedding in the shared (Y-Encoder) space. The video
    feature is projected into the LM's embedding space and prepended as a
    single pseudo-token before the query tokens.
    """

    def __init__(self, config: Config):
        super().__init__()
        self.model = AutoModel.from_pretrained(
            config.predictor_model,
            # fp16 only on GPU; CPU fp16 kernels are slow or unsupported.
            torch_dtype=torch.float16 if config.device == "cuda" else torch.float32,
            trust_remote_code=True
        )
        if config.use_lora:
            peft_config = LoraConfig(
                task_type=TaskType.FEATURE_EXTRACTION,
                inference_mode=False,
                r=config.lora_r,
                lora_alpha=config.lora_alpha,
                lora_dropout=config.lora_dropout,
                target_modules=config.lora_target_modules
            )
            self.model = get_peft_model(self.model, peft_config)
            self.model.print_trainable_parameters()

        # Project video features into the LM's embedding space, and the LM's
        # final hidden state into the shared text-embedding space.
        self.visual_proj = nn.Linear(config.x_dim, config.predictor_dim)
        self.output_proj = nn.Linear(config.predictor_dim, config.embed_dim)

        # Move to device
        self.to(config.device)

    def forward(self, sv: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Predict a target embedding from video features and a query.

        Args:
            sv: pooled video features, shape (B, x_dim).
            input_ids: tokenized query, shape (B, L).
            attention_mask: query mask, shape (B, L); 1 = real token, 0 = pad.
                Assumes right padding (the HF default for this tokenizer) —
                TODO confirm against QueryEncoder's tokenizer settings.

        Returns:
            Predicted embeddings, shape (B, embed_dim).
        """
        B = sv.size(0)
        sv_embeds = self.visual_proj(sv).unsqueeze(1)  # (B, 1, predictor_dim)

        # Unwrap PEFT to reach the underlying transformer.
        if hasattr(self.model, "base_model"):
            base = self.model.base_model.model
        else:
            base = self.model

        # Qwen2 uses model.embed_tokens
        # We try to access it via property or direct module
        if hasattr(base, "model"):
            embed_layer = base.model.embed_tokens
        elif hasattr(base, "embed_tokens"):
            embed_layer = base.embed_tokens
        else:
            # General fallback for AutoModel
            embed_layer = base.get_input_embeddings()

        inputs_embeds = embed_layer(input_ids)
        # Prepend the video pseudo-token to the text sequence.
        combined_embeds = torch.cat([sv_embeds, inputs_embeds], dim=1)

        # The video token is always attended to.
        ones = torch.ones((B, 1), device=sv.device, dtype=attention_mask.dtype)
        combined_mask = torch.cat([ones, attention_mask], dim=1)

        outputs = self.model(inputs_embeds=combined_embeds, attention_mask=combined_mask)
        hidden = outputs.last_hidden_state  # (B, 1 + L, predictor_dim)

        # BUGFIX: position -1 is a PAD token for every sequence shorter than
        # the batch max when padding is on the right. Gather the last *valid*
        # position per sequence instead (identical to [:, -1, :] when the
        # batch is unpadded).
        last_index = combined_mask.long().sum(dim=1) - 1  # (B,) index of last real token
        batch_index = torch.arange(B, device=hidden.device)
        last_hidden = hidden[batch_index, last_index]

        return self.output_proj(last_hidden)
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
class YEncoder(nn.Module):
    """Frozen MiniLM Y-Encoder.

    Encodes target texts with a frozen SentenceTransformer, then maps them
    into the shared embedding space with a trainable linear projection — the
    projection is the only part of this module that receives gradients.
    """

    def __init__(self, config: Config):
        super().__init__()
        self.model = SentenceTransformer(config.text_model)
        # Trainable projection from the text encoder's width to embed_dim.
        self.projection = nn.Linear(config.text_dim, config.embed_dim)

        # Freeze the sentence encoder; only `projection` trains.
        for p in self.model.parameters():
            p.requires_grad = False
        self.model.eval()

    def forward(self, texts: list[str], device: str = "cpu") -> torch.Tensor:
        """Encode target texts into the shared embedding space.

        Args:
            texts: batch of target strings.
            device: device for the encoded tensor.

        Returns:
            Projected embeddings, shape (B, embed_dim); gradients flow only
            through the projection layer.
        """
        with torch.no_grad():
            embeddings = self.model.encode(texts, convert_to_tensor=True, device=device)
        # Clone to avoid "Inference tensors cannot be saved for backward" error
        return self.projection(embeddings.clone())
|
| 200 |
+
|
| 201 |
+
|
| 202 |
+
class VLJepa(nn.Module):
    """Full VL-JEPA pipeline: V-JEPA 2 (X) + Qwen 2.5 (Predictor) + MiniLM (Y)."""

    def __init__(self, config: Config):
        super().__init__()
        self.config = config
        self.x_encoder = XEncoder(config)
        self.query_encoder = QueryEncoder(config)
        self.predictor = Predictor(config)
        self.y_encoder = YEncoder(config)

    def forward(self, pixel_values, query_ids, query_mask, target_texts):
        """Return (predicted, target) embedding pairs for a training batch."""
        predicted = self.encode_video_query(pixel_values, query_ids, query_mask)
        targets = self.y_encoder(target_texts, device=str(pixel_values.device))
        return predicted, targets

    def encode_video_query(self, pixel_values, query_ids, query_mask):
        """Encode a (video, query) pair into the shared embedding space."""
        video_features = self.x_encoder(pixel_values)
        return self.predictor(video_features, query_ids, query_mask)

    def encode_text(self, texts, device="cpu"):
        """Encode target texts into the shared embedding space."""
        return self.y_encoder(texts, device=device)

    def trainable_parameters(self):
        """Parameters that receive gradients: the predictor + Y projection."""
        params = list(self.predictor.parameters())
        params.extend(self.y_encoder.projection.parameters())
        return params

    def count_parameters(self):
        """Per-component parameter counts (total and trainable)."""
        def _tally(module):
            total = sum(p.numel() for p in module.parameters())
            trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
            return {"total": total, "trainable": trainable}

        return {
            "x_encoder": _tally(self.x_encoder),
            "predictor": _tally(self.predictor),
            "y_encoder": _tally(self.y_encoder),
        }
|
vljepa/utils.py
ADDED
|
@@ -0,0 +1,158 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Utility functions: video I/O, temporal IoU, NMS, sliding windows."""
|
| 2 |
+
|
| 3 |
+
import cv2
|
| 4 |
+
import numpy as np
|
| 5 |
+
import torch
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def load_video_frames(
    video_path: str,
    start_sec: float = 0.0,
    end_sec: float | None = None,
    num_frames: int = 16,
) -> list[np.ndarray] | None:
    """Load uniformly sampled RGB frames from a video segment.

    Args:
        video_path: path to .mp4 file
        start_sec: start of segment in seconds
        end_sec: end of segment in seconds (None = end of video)
        num_frames: number of frames to sample

    Returns:
        List of RGB numpy arrays (H, W, 3), or None on failure.
    """
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        return None

    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Corrupt or unsupported container: metadata is unusable.
    if fps <= 0 or total_frames <= 0:
        cap.release()
        return None

    duration = total_frames / fps
    if end_sec is None:
        end_sec = duration

    # Clamp the requested segment to valid frame indices.
    start_frame = max(0, int(start_sec * fps))
    end_frame = min(total_frames - 1, int(end_sec * fps))

    # Empty or inverted segment after clamping.
    if end_frame <= start_frame:
        cap.release()
        return None

    # Sample at most `num_frames` evenly spaced indices within the segment.
    n_available = end_frame - start_frame + 1
    n_sample = min(num_frames, n_available)
    indices = np.linspace(start_frame, end_frame, n_sample, dtype=int)

    frames = []
    for idx in indices:
        # Random-access seek per frame: exact but slower than sequential decode.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes BGR; convert to RGB for downstream models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    cap.release()

    # Every read failed (e.g. truncated file).
    if len(frames) == 0:
        return None

    return frames
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def get_video_duration(video_path: str) -> float:
    """Return the duration of a video in seconds, or 0.0 if unreadable."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        return 0.0
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # Guard against containers that report a zero/negative frame rate.
    return frame_count / frame_rate if frame_rate > 0 else 0.0
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def temporal_iou(
    pred_start: float,
    pred_end: float,
    gt_start: float,
    gt_end: float,
) -> float:
    """Compute temporal Intersection over Union between two segments."""
    overlap = min(pred_end, gt_end) - max(pred_start, gt_start)
    if overlap < 0.0:
        overlap = 0.0
    total = (pred_end - pred_start) + (gt_end - gt_start) - overlap
    # Degenerate segments (zero or negative union) count as no overlap.
    return overlap / total if total > 0 else 0.0
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def nms(
    proposals: list[tuple[float, float]],
    scores: list[float],
    iou_threshold: float = 0.5,
) -> list[int]:
    """Non-maximum suppression for temporal proposals.

    Greedily keeps the highest-scoring proposal, discarding any later
    candidate whose temporal IoU with an already-kept proposal exceeds
    the threshold.

    Args:
        proposals: list of (start, end) tuples
        scores: corresponding scores
        iou_threshold: suppress proposals with IoU above this

    Returns:
        List of kept indices (sorted by score descending).
    """
    if not proposals:
        return []

    # Visit candidates from highest to lowest score.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    keep: list[int] = []

    for cand in order:
        c_start, c_end = proposals[cand]
        suppressed = False
        for kept_idx in keep:
            k_start, k_end = proposals[kept_idx]
            # Temporal IoU between candidate and kept proposal (inlined).
            inter = max(0.0, min(c_end, k_end) - max(c_start, k_start))
            union = (c_end - c_start) + (k_end - k_start) - inter
            overlap = inter / union if union > 0 else 0.0
            if overlap > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            keep.append(cand)

    return keep
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
def sliding_window_proposals(
    duration: float,
    window_sizes: list[float],
    stride: float = 1.0,
) -> list[tuple[float, float]]:
    """Generate candidate temporal proposals using sliding windows.

    Args:
        duration: total video duration in seconds
        window_sizes: list of window durations to use
        stride: step size in seconds

    Returns:
        List of (start, end) proposals.
    """
    EPSILON = 0.01  # tolerance for floating-point drift while stepping
    candidates: list[tuple[float, float]] = []
    for size in window_sizes:
        if size > duration:
            # Window longer than the clip: one proposal covering everything.
            candidates.append((0.0, duration))
        else:
            begin = 0.0
            while begin + size <= duration + EPSILON:
                candidates.append((begin, min(begin + size, duration)))
                begin += stride
    return candidates
|