---
tags:
- ml-intern
---

# Video Highlight Extractor

A pipeline that detects highlights within a video based on events or conversations, then extracts the clips that best match the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Provides **Start** and **End** timestamps for each extracted clip.

---

## 🎯 What it does

| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
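
Step 1's windowing follows directly from those defaults: a 4 s window with 1 s overlap means a new window starts every 3 s. A minimal sketch of that scheme (the helper name is illustrative, not the exact code in `video_highlight_extractor.py`):

```python
def make_windows(duration_sec: float, window: float = 4.0, overlap: float = 1.0):
    """Yield (start_sec, end_sec) pairs for overlapping segment windows."""
    stride = window - overlap  # 4 s windows with 1 s overlap -> a new window every 3 s
    t = 0.0
    while t < duration_sec:
        yield t, min(t + window, duration_sec)
        t += stride


print(list(make_windows(10.0)))
# [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0), (9.0, 10.0)]
```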

---

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

---

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
    --video my_video.mp4 \
    --query "exciting food moments and travel scenery" \
    --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
    --output highlights.json
```

---

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7B | Highest quality, needs GPU |
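
For a quick standalone check of a model outside the pipeline, the smallest one can be driven directly with `transformers`. A rough sketch following the SmolVLM2 model card (assumes a recent `transformers` release with video chat-template support and a video decoding backend installed; the prompt and video path are placeholders):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float32)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "my_video.mp4"},  # placeholder video path
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```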

---

## 🐍 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```
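
The returned objects are easy to post-process with the fields shown in `to_dict()` above, e.g. keeping only the food clips:

```python
# Keep only the clips classified as food and report how much footage they cover.
food_clips = [c for c in clips if c.to_dict()["category"] == "food"]
total = sum(c.to_dict()["duration_sec"] for c in food_clips)
print(f"{len(food_clips)} food clip(s), {total:.1f} s total")
```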

---

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
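
These timestamps can be handed straight to ffmpeg to cut the highlight files out of the source video. A sketch, assuming `ffmpeg` is on the PATH and `my_video.mp4` is the original input (stream copy cuts on keyframes; drop `-c copy` to re-encode for frame-exact boundaries):

```python
import json
import subprocess

with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    # -ss before -i seeks the input; -t limits the output to the clip's duration.
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(clip["start_sec"]),
            "-i", "my_video.mp4",
            "-t", str(clip["duration_sec"]),
            "-c", "copy",
            f"highlight_{i:02d}_{clip['category']}.mp4",
        ],
        check=True,
    )
```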

---

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├───► sample up to 8 frames ──► VLM describes segment
            │                                       │
            ├───► (optional) Whisper transcribes audio
            │                                       │
            └───► LLM scores 0-10 relevance to query
                        │
                        ▼
            merge adjacent high-score segments ──► VideoClip objects
                        │
                        ▼
            JSON output with timestamps + metadata
```
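
The merge stage can be pictured as a single pass over score-filtered segments. A simplified sketch (the real `VideoHighlightExtractor` also carries descriptions, categories and the target total duration through this step); each segment is a `(start_sec, end_sec, score)` tuple sorted by start time:

```python
def merge_segments(segments, score_threshold=0.3):
    """Merge overlapping / adjacent segments whose score clears the threshold."""
    merged = []
    for start, end, score in segments:
        if score < score_threshold:
            continue  # drop low-relevance segments entirely
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous clip: extend it and keep the best score.
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged


print(merge_segments([(0, 4, 0.1), (3, 7, 0.8), (6, 10, 0.9), (12, 16, 0.7)]))
# [(3, 10, 0.9), (12, 16, 0.7)]
```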

---

## 📚 Research foundation

This implementation builds on ideas from:

- **[UniVTG](https://arxiv.org/abs/2307.16715)** – unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** – transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** – timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** – current practical video-language models on HuggingFace

---

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:
- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`), as sketched below
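
A minimal illustration of that `HH:MM:SS.mmm` formatting (hypothetical helper, not necessarily the repo's implementation):

```python
def to_hms(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


print(to_hms(72.5))  # 00:01:12.500
```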

---

## 📄 License

Apache-2.0

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
|