---
tags:
- ml-intern
---

# Video Highlight Extractor

A pipeline for detecting highlights within videos based on events or conversations, then harvesting the clips nearest to the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Provides **Start** and **End** timestamps for each extracted clip.

---

## 🎯 What it does

| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |

---

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

---

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```

---

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256 M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3 B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7 B | Highest quality, needs GPU |

---

## 📋 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor \
= VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```

---

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```

---

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├───► sample up to 8 frames ──► VLM describes segment
            │
            ├───► (optional) Whisper transcribes audio
            │
            └───► LLM scores 0-10 relevance to query
                        │
                        ▼
  merge adjacent high-score segments ──► VideoClip objects
                        │
                        ▼
  JSON output with timestamps + metadata
```

---

## 📚 Research foundation

This implementation builds on ideas from:

- **[UniVTG](https://arxiv.org/abs/2307.16715)** — unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** — transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** — timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** — current practical video-language models
  on HuggingFace

---

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:

- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`)

---

## 📄 License

Apache-2.0

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
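## Appendix: merge logic sketch

The "merge adjacent high-scoring segments" step of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `Segment`, `merge_segments`, and `to_hms` are hypothetical names, and the defaults mirror the documented `score_threshold=0.3` and 1 s window overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    start_sec: float
    end_sec: float
    score: float  # relevance, normalized to [0, 1]

def merge_segments(segments: List[Segment], threshold: float = 0.3,
                   max_gap: float = 1.0) -> List[Tuple[float, float, float]]:
    """Fuse adjacent segments scoring >= threshold into (start, end, score) clips.

    Segments whose gap is at most `max_gap` seconds (the default window
    overlap) are merged; a merged clip keeps its best segment score.
    """
    keep = sorted((s for s in segments if s.score >= threshold),
                  key=lambda s: s.start_sec)
    clips: List[Tuple[float, float, float]] = []
    for seg in keep:
        if clips and seg.start_sec - clips[-1][1] <= max_gap:
            start, end, score = clips[-1]
            clips[-1] = (start, max(end, seg.end_sec), max(score, seg.score))
        else:
            clips.append((seg.start_sec, seg.end_sec, seg.score))
    return clips

def to_hms(sec: float) -> str:
    """Format seconds as HH:MM:SS.mmm (the format checked by the tests)."""
    h, rem = divmod(sec, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
```

Tying `max_gap` to the 1 s segment overlap means consecutive high-scoring windows always fuse into one continuous clip, so the output never contains two clips separated by less than the overlap.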