---
tags:
- ml-intern
---
# Video Highlight Extractor
A pipeline that detects highlights in videos based on visual events or conversations, then extracts the clips closest to the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Each extracted clip comes with **Start** and **End** timestamps.
---
## 🎯 What it does
| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s with 1 s overlap; see the sketch below) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query (scores are reported normalized to 0-1) |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
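The windowing in step 1 and the frame subsampling in step 2 are simple to state precisely. A minimal sketch (function names here are illustrative, not the actual `VideoHighlightExtractor` internals):

```python
import numpy as np

def make_windows(duration_sec: float, window_sec: float = 4.0, overlap_sec: float = 1.0):
    """Split [0, duration_sec] into overlapping (start, end) windows."""
    stride = window_sec - overlap_sec  # 3 s between consecutive window starts
    starts = np.arange(0.0, duration_sec, stride)
    return [(float(s), float(min(s + window_sec, duration_sec))) for s in starts]

def sample_frame_indices(n_frames: int, max_frames: int = 8):
    """Pick up to max_frames evenly spaced frame indices from a segment."""
    if n_frames <= max_frames:
        return list(range(n_frames))
    return np.linspace(0, n_frames - 1, max_frames).round().astype(int).tolist()
```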
---
## 📦 Files
| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |
---
## 🚀 Quick start
```bash
pip install -r requirements.txt
# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```
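`demo.py` writes the same schema documented under **Output format** below; a quick way to eyeball the result (standard-library tool, nothing extra to install):

```bash
python -m json.tool highlights.json
```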
---
## 🧠 Recommended models
| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256 M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3 B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7 B | Highest quality, needs GPU |
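To poke at the smallest model directly, outside the extractor, the usual SmolVLM2 chat-template flow looks roughly like this (it assumes a recent `transformers` release with SmolVLM2 video support; the extractor's own preprocessing may differ):

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float32)  # float32 for CPU

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "my_video.mp4"},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
)
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```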
---
## 📋 Python API
```python
from video_highlight_extractor import VideoHighlightExtractor
extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
# {
#   "start_sec": 12.0,
#   "end_sec": 20.0,
#   "start_hms": "00:00:12.000",
#   "end_hms": "00:00:20.000",
#   "duration_sec": 8.0,
#   "relevance_score": 0.85,
#   "description": "A chef is chopping vegetables...",
#   "category": "food"
# }
extractor.save_results(clips, "highlights.json")
```
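The extractor returns timestamps, not media. To actually cut the clips out, one option is handing the timestamps to `ffmpeg` (a post-processing step of our own, not part of the extractor API):

```python
import subprocess

for i, c in enumerate(clips):
    d = c.to_dict()
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "my_video.mp4",
        "-ss", str(d["start_sec"]),  # cut start (seconds)
        "-to", str(d["end_sec"]),    # cut end (seconds)
        "-c", "copy",                # stream copy: fast, but cuts snap to keyframes
        f"highlight_{i:02d}.mp4",
    ], check=True)
```

Drop `-c copy` to re-encode if you need frame-accurate cut points.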
---
## 📤 Output format (`highlights.json`)
```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:00:12.000",
      "end_hms": "00:00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
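Downstream tooling only needs this file. For example, a quick summary pass over the saved results:

```python
import json

with open("highlights.json") as f:
    data = json.load(f)

for clip in data["clips"]:
    print(f'{clip["start_hms"]} -> {clip["end_hms"]}  '
          f'[{clip["category"]}] score={clip["relevance_score"]:.2f}')
print(f'{data["total_clips"]} clip(s), '
      f'{data["total_highlight_duration"]:.1f} s of highlights total')
```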
---
## 🔧 Architecture
```
Video ──► segment windows (4 s, 1 s overlap)
              │
              ├───► sample up to 8 frames ──► VLM describes segment
              │
              ├───► (optional) Whisper transcribes audio
              │
              └───► LLM scores 0-10 relevance to query
                      │
                      ▼
            merge adjacent high-score segments ──► VideoClip objects
                      │
                      ▼
            JSON output with timestamps + metadata
```
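The merge step near the bottom of the diagram is plain interval merging over scored segments. A minimal sketch (illustrative; `max_gap_sec` is an assumed knob, not a documented parameter):

```python
def merge_segments(scored, threshold=0.3, max_gap_sec=0.5):
    """Merge adjacent high-scoring (start_sec, end_sec, score) segments.

    Segments below threshold are dropped; surviving segments that overlap
    or sit within max_gap_sec of each other are fused into one clip.
    """
    kept = sorted(s for s in scored if s[2] >= threshold)
    merged = []
    for start, end, score in kept:
        if merged and start <= merged[-1][1] + max_gap_sec:
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged
```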
---
## 📚 Research foundation
This implementation builds on ideas from:
- **[UniVTG](https://arxiv.org/abs/2307.16715)** — unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** — transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** — timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** — practical, currently available video-language models on the Hugging Face Hub
---
## 🧪 Tests
```bash
python test_video_pipeline.py
```
Validates:
- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`; see the sketch below)
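The formatting checked by that last test is easy to reproduce; a minimal reference sketch:

```python
def format_hms(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, e.g. 12.0 -> "00:00:12.000"."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```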
---
## 📄 License
Apache-2.0
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.