A pipeline for detecting highlights in videos based on visual events or conversations, then extracting the clips that best match the user's intent (e.g. vlog, food, travel, tutorial). Provides start and end timestamps for each extracted clip.
| Step | Action |
|---|---|
| 1. Segment | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| 2. Understand | A video-language model describes what happens in each segment |
| 3. Score | Each segment is scored 0-10 for relevance to your natural-language query |
| 4. Merge | Adjacent high-scoring segments are merged into continuous clips |
| 5. Output | Clips with start_sec, end_sec, start_hms, end_hms, relevance_score, description & category |
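For illustration, here is a minimal sketch of how step 1 could generate the overlapping windows (a hypothetical helper, not the repository's actual code):

```python
# Hypothetical sketch of step 1: split a video's duration into
# overlapping windows (assumed defaults: 4 s windows, 1 s overlap).
def make_windows(duration_sec: float, window_sec: float = 4.0, overlap_sec: float = 1.0):
    """Return (start, end) pairs covering the whole video."""
    windows = []
    step = window_sec - overlap_sec  # 3 s stride with the default settings
    start = 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        windows.append((start, end))
        if end >= duration_sec:
            break
        start += step
    return windows

print(make_windows(10.0))  # [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0)]
```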
| File | Purpose |
|---|---|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |
```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```
| Model | Size | Best for |
|---|---|---|
| `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` | 256 M | Fast, CPU-friendly |
| `Qwen/Qwen2.5-Omni-3B` | 3 B | Strong video + audio understanding |
| `OpenGVLab/VideoChat-R1_7B_caption` | 7 B | Highest quality, needs GPU |
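One simple way to choose between these at runtime is to key the decision off GPU availability. The snippet below is a sketch (assuming `torch` is installed, which the heavier models need anyway), not a hard rule:

```python
import torch

# Heuristic: use the larger captioning model only when a GPU is present.
model_id = (
    "OpenGVLab/VideoChat-R1_7B_caption"
    if torch.cuda.is_available()
    else "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)
```

The chosen `model_id` can then be passed to the extractor exactly as in the usage example below.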
```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```
Example output (`highlights.json`):
```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
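To turn the JSON into actual video files, a small post-processing step can cut each clip out of the source video. The sketch below assumes ffmpeg is installed and on PATH and reuses the `my_video.mp4` path from the demo; it is not part of the repository:

```python
import json
import subprocess

# Cut each highlight out of the source video with ffmpeg (hypothetical
# post-processing step; requires ffmpeg on PATH, not part of this repo).
with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    duration = clip["end_sec"] - clip["start_sec"]
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(clip["start_sec"]),   # seek to the clip start
            "-i", "my_video.mp4",
            "-t", str(duration),             # keep only the clip duration
            "-c", "copy",                    # stream copy: fast, but cuts on keyframes
            f"highlight_{i:02d}_{clip['category']}.mp4",
        ],
        check=True,
    )
```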
```
Video ───► segment windows (4 s, 1 s overlap)
              │
              ├───► sample up to 8 frames ───► VLM describes segment
              │
              ├───► (optional) Whisper transcribes audio
              │
              └───► LLM scores 0-10 relevance to query
                        │
                        ▼
          merge adjacent high-score segments ───► VideoClip objects
                        │
                        ▼
          JSON output with timestamps + metadata
```
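The merge step is what turns scored windows into continuous highlights: adjacent segments that both clear the score threshold are fused into a single clip. The snippet below is a simplified sketch of that idea with illustrative names, not the repository's actual implementation:

```python
# Sketch of the merge step: fuse adjacent/overlapping segments whose
# relevance score clears the threshold (illustrative, not the repo's API).
def merge_segments(segments, score_threshold=0.3):
    """segments: dicts with 'start', 'end', 'score', sorted by 'start'."""
    clips = []
    current = None
    for seg in segments:
        if seg["score"] < score_threshold:
            current = None                      # a low-score segment breaks the run
            continue
        if current is not None and seg["start"] <= current["end"]:
            current["end"] = max(current["end"], seg["end"])        # extend the clip
            current["score"] = max(current["score"], seg["score"])
        else:
            current = dict(seg)                 # start a new clip
            clips.append(current)
    return clips
```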
This implementation builds on ideas from VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding (arXiv:2405.13382).
Run the unit tests with:

```bash
python test_video_pipeline.py
```

This validates video I/O, segment merging, and timestamp formatting (HH:MM:SS.mmm).
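For reference, a seconds-to-timestamp conversion in that format can look like this (a sketch of a hypothetical helper; the repository's own formatter may differ, e.g. dropping the hour field when it is zero, as in the example output above):

```python
# Sketch: seconds -> "HH:MM:SS.mmm" (hypothetical helper, not the repo's code).
def seconds_to_hms(seconds: float) -> str:
    total_ms = int(round(seconds * 1000))
    s, ms = divmod(total_ms, 1000)
    return f"{s // 3600:02d}:{(s // 60) % 60:02d}:{s % 60:02d}.{ms:03d}"

print(seconds_to_hms(75.25))  # "00:01:15.250"
```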
License: Apache-2.0
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
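For example, the generic class can be used as below (a sketch; whether this particular repository loads with a given auto class depends on its config, which is an assumption here):

```python
from transformers import AutoModel

# Generic loader for non-causal architectures (assumes the repo's config
# is compatible with AutoModel; adjust the auto class as needed).
model = AutoModel.from_pretrained("notjulietxd/video-highlight-extractor")
```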