---
tags:
- ml-intern
---

# Video Highlight Extractor

A pipeline for detecting highlights within videos based on events or conversations, then harvesting the clips nearest to the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Provides **Start** and **End** timestamps for each extracted clip.

---

## 🎯 What it does

| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |

---

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

---

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```

---

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256 M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3 B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7 B | Highest quality, needs GPU |

---

## 📋 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor \
= VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```

---

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```

---

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├───► sample up to 8 frames ──► VLM describes segment
            │
            ├───► (optional) Whisper transcribes audio
            │
            └───► LLM scores 0-10 relevance to query
                        │
                        ▼
  merge adjacent high-score segments ──► VideoClip objects
                        │
                        ▼
  JSON output with timestamps + metadata
```

---

## 📚 Research foundation

This implementation builds on ideas from:

- **[UniVTG](https://arxiv.org/abs/2307.16715)** — unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** — transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** — timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** — current practical video-language models
  on HuggingFace

---

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:

- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`)

---

## 📄 License

Apache-2.0

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
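## Appendix: merge logic sketch

The "merge adjacent high-scoring segments" step of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `Segment`, `merge_segments`, and `to_hms` are hypothetical names, and the defaults mirror the documented `score_threshold=0.3` and 1 s window overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    start_sec: float
    end_sec: float
    score: float  # relevance, normalized to [0, 1]

def merge_segments(segments: List[Segment], threshold: float = 0.3,
                   max_gap: float = 1.0) -> List[Tuple[float, float, float]]:
    """Fuse adjacent segments scoring >= threshold into (start, end, score) clips.

    Segments whose gap is at most `max_gap` seconds (the default window
    overlap) are merged; a merged clip keeps its best segment score.
    """
    keep = sorted((s for s in segments if s.score >= threshold),
                  key=lambda s: s.start_sec)
    clips: List[Tuple[float, float, float]] = []
    for seg in keep:
        if clips and seg.start_sec - clips[-1][1] <= max_gap:
            start, end, score = clips[-1]
            clips[-1] = (start, max(end, seg.end_sec), max(score, seg.score))
        else:
            clips.append((seg.start_sec, seg.end_sec, seg.score))
    return clips

def to_hms(sec: float) -> str:
    """Format seconds as HH:MM:SS.mmm (the format checked by the tests)."""
    h, rem = divmod(sec, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
```

Tying `max_gap` to the 1 s segment overlap means consecutive high-scoring windows always fuse into one continuous clip, so the output never contains two clips separated by less than the overlap.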