metadata
tags:
- ml-intern
Video Highlight Extractor
A pipeline that detects highlights in a video from visual events or conversations, then extracts the clips that best match the user's intent (e.g. vlog, food, travel, tutorial). Each extracted clip comes with start and end timestamps.
What it does
| Step | Action |
|---|---|
| 1. Segment | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| 2. Understand | A video-language model describes what happens in each segment |
| 3. Score | Each segment is scored 0-10 for relevance to your natural-language query |
| 4. Merge | Adjacent high-scoring segments are merged into continuous clips |
| 5. Output | Clips with start_sec, end_sec, start_hms, end_hms, score, description & category |
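As a rough illustration of step 1, the sketch below splits a video duration into overlapping windows using the stated defaults (4 s windows, 1 s overlap). The function name `segment_windows` and its signature are hypothetical and are not taken from `video_highlight_extractor.py`; the real class may handle the final partial window differently.

```python
from typing import List, Tuple


def segment_windows(duration: float, window: float = 4.0,
                    overlap: float = 1.0) -> List[Tuple[float, float]]:
    """Split [0, duration] into windows of `window` seconds overlapping by `overlap`."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap  # each window starts `stride` seconds after the previous one
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)  # clamp the last window to the video length
        windows.append((start, end))
        if end >= duration:
            break
        start += stride
    return windows


# A 10 s video with the default settings yields three windows.
print(segment_windows(10.0))  # [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0)]
```

With a 1 s overlap the stride is 3 s, so an event that straddles a window boundary still appears whole in at least one window as long as it is shorter than the overlap.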
Files
| File | Purpose |
|---|---|
| video_highlight_extractor.py | Core VideoHighlightExtractor class |
| demo.py | Ready-to-run CLI demo |
| test_video_pipeline.py | Unit tests (creates a synthetic video & tests I/O + merging) |
| requirements.txt | Python dependencies |
Quick start
pip install -r requirements.txt
# Run on any video
python demo.py \
--video my_video.mp4 \
--query "exciting food moments and travel scenery" \
--model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
--output highlights.json
Recommended models
| Model | Size | Best for |
|---|---|---|
| HuggingFaceTB/SmolVLM2-256M-Video-Instruct | 256 M | Fast, CPU-friendly |
| Qwen/Qwen2.5-Omni-3B | 3 B | Strong video + audio understanding |
| OpenGVLab/VideoChat-R1_7B_caption | 7 B | Highest quality, needs GPU |
Python API
from video_highlight_extractor import VideoHighlightExtractor
extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)
clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)
for c in clips:
    print(c.to_dict())
# {
# "start_sec": 12.0,
# "end_sec": 20.0,
# "start_hms": "00:12.000",
# "end_hms": "00:20.000",
# "duration_sec": 8.0,
# "relevance_score": 0.85,
# "description": "A chef is chopping vegetables...",
# "category": "food"
# }
extractor.save_results(clips, "highlights.json")
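The README does not spell out how `score_threshold`, `top_k` and `target_duration_sec` interact, nor how the 0-10 segment score becomes the 0-1 `relevance_score` shown in the sample output. The sketch below shows one plausible reading: it divides the raw score by 10, filters by threshold, keeps the best `top_k`, then fills the duration budget greedily. The function `select_clips` and the normalization step are assumptions for illustration, not the library's documented behaviour.

```python
from typing import Dict, List


def select_clips(clips: List[Dict], score_threshold: float = 0.3,
                 top_k: int = 5, target_duration_sec: float = 30.0) -> List[Dict]:
    """Hypothetical selection: filter by score, rank, then fill a duration budget."""
    # Keep only clips at or above the threshold (scores assumed to lie in [0, 1]).
    kept = [c for c in clips if c["relevance_score"] >= score_threshold]
    # Rank best-first and cap the number of candidates at top_k.
    kept.sort(key=lambda c: c["relevance_score"], reverse=True)
    kept = kept[:top_k]
    # Greedily add clips while the total stays within the duration budget.
    chosen, total = [], 0.0
    for c in kept:
        length = c["end_sec"] - c["start_sec"]
        if total + length <= target_duration_sec:
            chosen.append(c)
            total += length
    # Return the selection in chronological order.
    return sorted(chosen, key=lambda c: c["start_sec"])


raw_scores = [8.5, 2.0, 6.0]  # hypothetical 0-10 scores from the scoring step
spans = [(12.0, 20.0), (30.0, 34.0), (40.0, 60.0)]
candidates = [
    {"start_sec": s, "end_sec": e, "relevance_score": raw / 10}  # assumed normalization
    for (s, e), raw in zip(spans, raw_scores)
]
for clip in select_clips(candidates):
    print(clip["start_sec"], clip["end_sec"], clip["relevance_score"])
```

Here the second candidate falls below the 0.3 threshold, and the remaining two clips (8 s and 20 s) fit inside the 30 s budget.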
Output format (highlights.json)
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
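The JSON only carries timestamps, so cutting the actual clips is left to a downstream tool. As a minimal sketch, the helper below turns a `highlights.json` document into `ffmpeg` argument lists, one per clip. The helper name `ffmpeg_commands` and the `clip_NN.mp4` output names are hypothetical; `-ss`, `-to`, `-i` and `-c copy` are standard ffmpeg options.

```python
import json
from typing import List


def ffmpeg_commands(highlights_json: str, video_path: str) -> List[List[str]]:
    """Build one ffmpeg argument list per clip in a highlights.json document."""
    data = json.loads(highlights_json)
    commands = []
    for i, clip in enumerate(data["clips"]):
        commands.append([
            "ffmpeg",
            "-ss", str(clip["start_sec"]),  # seek to the clip start
            "-to", str(clip["end_sec"]),    # stop at the clip end
            "-i", video_path,
            "-c", "copy",                   # stream copy: fast, but cuts snap to keyframes
            f"clip_{i:02d}.mp4",            # hypothetical output naming scheme
        ])
    return commands


sample = '{"clips": [{"start_sec": 12.0, "end_sec": 20.0}], "total_clips": 1}'
for cmd in ffmpeg_commands(sample, "my_video.mp4"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Stream copy avoids re-encoding; drop `-c copy` if frame-accurate cut points matter more than speed.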
Architecture
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├──► sample up to 8 frames ──► VLM describes segment
            │                                   │
            ├──► (optional) Whisper transcribes audio
            │
            └──► LLM scores 0-10 relevance to query
                        │
                        ▼
            merge adjacent high-score segments ──► VideoClip objects
                        │
                        ▼
            JSON output with timestamps + metadata
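The merge stage can be pictured as a standard interval merge over the segments that pass the score threshold. The sketch below is one way to do it, not the code in `video_highlight_extractor.py`: the name `merge_segments`, the `max_gap` parameter, and the choice to keep the maximum score of the merged segments are all assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float]  # (start_sec, end_sec, score)


def merge_segments(segments: List[Segment], threshold: float = 0.3,
                   max_gap: float = 0.0) -> List[Segment]:
    """Merge high-scoring segments that overlap or touch into continuous clips."""
    # Drop low-scoring segments, then process the rest in chronological order.
    high = sorted(s for s in segments if s[2] >= threshold)
    merged: List[Segment] = []
    for start, end, score in high:
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end, prev_score = merged[-1]
            # Extend the current clip; keeping the best score is one possible choice.
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged


scored = [(0.0, 4.0, 0.1), (3.0, 7.0, 0.8), (6.0, 10.0, 0.9),
          (9.0, 13.0, 0.2), (12.0, 16.0, 0.7)]
print(merge_segments(scored))  # [(3.0, 10.0, 0.9), (12.0, 16.0, 0.7)]
```

Because the windows overlap by 1 s, two consecutive high-scoring windows always intersect, so `max_gap=0.0` is enough to join them; a larger `max_gap` would also bridge a single low-scoring window.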
Research foundation
This implementation builds on ideas from:
- UniVTG: unified video-language temporal grounding
- QVHighlights / Moment-DETR: transformer encoder-decoder for joint moment retrieval & highlight detection
- VTG-LLM: timestamp-aware video LLMs for temporal grounding
- Qwen2.5-Omni & SmolVLM2: current practical video-language models on HuggingFace
Tests
python test_video_pipeline.py
Validates:
- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (HH:MM:SS.mmm)
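For reference, a formatter for the `HH:MM:SS.mmm` pattern named in the last item could look like the sketch below. The function name `format_hms` is hypothetical. Note that the sample output earlier in this README shows a shorter form (`"00:12.000"`), so the real implementation may omit the hours field for short videos.

```python
def format_hms(seconds: float) -> str:
    """Format a non-negative time in seconds as HH:MM:SS.mmm."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total_ms = round(seconds * 1000)  # work in integer milliseconds to avoid float drift
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


print(format_hms(12.0))    # 00:00:12.000
print(format_hms(3725.5))  # 01:02:05.500
```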
License
Apache-2.0
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.