# Video Highlight Extractor

A pipeline for detecting highlights within videos based on events or conversations, then extracting the clips that best match the user's intent (e.g. vlog, food, travel, tutorial). Provides start and end timestamps for each extracted clip.


## 🎯 What it does

| Step | Action |
|------|--------|
| 1. Segment | Video is split into short overlapping windows (default 4 s, 1 s overlap); see the sketch below |
| 2. Understand | A video-language model describes what happens in each segment |
| 3. Score | Each segment is scored 0-10 for relevance to your natural-language query |
| 4. Merge | Adjacent high-scoring segments are merged into continuous clips |
| 5. Output | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
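
To make step 1 concrete, the windowing can be computed with a fixed stride of window length minus overlap. A minimal sketch; the helper name and exact edge handling are illustrative, not taken from `video_highlight_extractor.py`:

```python
# Illustrative sketch of step 1: split a timeline into overlapping
# windows (default 4 s windows with 1 s overlap, i.e. a 3 s stride).
# The function name and edge handling are assumptions, not repo code.
def segment_windows(duration_sec, window_sec=4.0, overlap_sec=1.0):
    stride = window_sec - overlap_sec
    windows, start = [], 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        windows.append((start, end))
        if end >= duration_sec:
            break
        start += stride
    return windows

print(segment_windows(10.0))  # [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0)]
```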

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` | 256 M | Fast, CPU-friendly |
| `Qwen/Qwen2.5-Omni-3B` | 3 B | Strong video + audio understanding |
| `OpenGVLab/VideoChat-R1_7B_caption` | 7 B | Highest quality, needs GPU |
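
If you are unsure which to pick, a simple heuristic is to fall back to the small model on CPU-only machines. A minimal sketch, assuming PyTorch is installed; the selection helper is hypothetical and not part of this repository:

```python
import torch

# Hypothetical helper: choose a model tier from the table above based
# on available hardware. The selection logic is an assumption, not
# something this repository ships.
def pick_model_id() -> str:
    if torch.cuda.is_available():
        return "OpenGVLab/VideoChat-R1_7B_caption"  # highest quality, needs GPU
    return "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"  # CPU-friendly
```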

## 📋 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,        # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,        # classify into vlog / food / travel / tutorial / ...
    use_audio=False,               # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:00:12.000",
    #   "end_hms": "00:00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```
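
For conversation-based highlights, flip `use_audio` to `True` (which, per the comment above, additionally requires Whisper). A minimal variation of the call shown above, reusing the same `extractor`:

```python
# Conversation-based highlights: use_audio=True adds a Whisper
# transcription pass, so the query can match spoken content as well
# as visuals. Requires a whisper installation (see requirements.txt).
clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="the host explains the recipe steps",
    use_audio=True,
)
```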

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:00:12.000",
      "end_hms": "00:00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
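
These timestamps can be fed straight to a cutter such as ffmpeg. A minimal sketch, assuming `ffmpeg` is on your PATH and the source file is `my_video.mp4`; the loop itself is not part of this repository:

```python
import json
import subprocess

# Cut each highlight out of the source video with ffmpeg.
# Stream copy (-c copy) is fast but snaps to keyframes; drop it and
# re-encode if you need frame-accurate cuts.
with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "my_video.mp4",
        "-ss", str(clip["start_sec"]),
        "-to", str(clip["end_sec"]),
        "-c", "copy",
        f"highlight_{i:02d}.mp4",
    ], check=True)
```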

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
        │
        ├───► sample up to 8 frames ──► VLM describes segment
        │                                   │
        ├───► (optional) Whisper transcribes audio
        │
        └───► LLM scores 0-10 relevance to query
                  │
                  ▼
        merge adjacent high-score segments ──► VideoClip objects
                  │
                  ▼
        JSON output with timestamps + metadata
```
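
The merge stage can be pictured as keeping windows whose score passes the threshold and fusing any that touch or overlap in time. A simplified sketch; the actual logic in `video_highlight_extractor.py` may differ in details:

```python
# Simplified sketch of the merge step: filter windows by score, sort
# by start time, then fuse overlapping or touching neighbors into one
# clip carrying the maximum score of its members. An illustration,
# not the repository's exact implementation.
def merge_segments(segments, score_threshold=0.3):
    kept = sorted(
        (s for s in segments if s["score"] >= score_threshold),
        key=lambda s: s["start_sec"],
    )
    merged = []
    for seg in kept:
        if merged and seg["start_sec"] <= merged[-1]["end_sec"]:
            merged[-1]["end_sec"] = max(merged[-1]["end_sec"], seg["end_sec"])
            merged[-1]["score"] = max(merged[-1]["score"], seg["score"])
        else:
            merged.append(dict(seg))
    return merged
```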

## 📚 Research foundation

This implementation builds on ideas from:

- **UniVTG**: unified video-language temporal grounding
- **QVHighlights / Moment-DETR**: transformer encoder-decoder for joint moment retrieval & highlight detection
- **VTG-LLM**: timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni & SmolVLM2**: current practical video-language models on Hugging Face

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:

- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (HH:MM:SS.mmm); a matching formatter is sketched below
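
For reference, a formatter matching that HH:MM:SS.mmm convention could look like the following. A sketch only; the repo's own helper may be implemented differently:

```python
# Format seconds as HH:MM:SS.mmm, matching the convention the tests
# check (e.g. 12.0 -> "00:00:12.000"). An illustrative sketch, not
# the helper from video_highlight_extractor.py.
def to_hms(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

assert to_hms(12.0) == "00:00:12.000"
```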

## 📄 License

Apache-2.0

## Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
