A pipeline for detecting highlights in videos based on visual events or conversations, then extracting the clips that best match the user's intent (e.g. vlog, food, travel, tutorial). Provides start and end timestamps for each extracted clip.
| Step | Action |
|---|---|
| 1. Segment | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| 2. Understand | A video-language model describes what happens in each segment |
| 3. Score | Each segment is scored 0-10 for relevance to your natural-language query |
| 4. Merge | Adjacent high-scoring segments are merged into continuous clips |
| 5. Output | Clips with start_sec, end_sec, start_hms, end_hms, relevance_score, description & category |
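For illustration, here is a minimal sketch of how step 1 could generate the overlapping windows (a hypothetical helper, not the repository's actual code):

```python
# Hypothetical sketch of step 1: split a video's duration into
# overlapping windows (assumed defaults: 4 s windows, 1 s overlap).
def make_windows(duration_sec: float, window_sec: float = 4.0, overlap_sec: float = 1.0):
    """Return (start, end) pairs covering the whole video."""
    windows = []
    step = window_sec - overlap_sec  # 3 s stride with the default settings
    start = 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        windows.append((start, end))
        if end >= duration_sec:
            break
        start += step
    return windows

print(make_windows(10.0))  # [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0)]
```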
| File | Purpose |
|---|---|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |
```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
  --video my_video.mp4 \
  --query "exciting food moments and travel scenery" \
  --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
  --output highlights.json
```
| Model | Size | Best for |
|---|---|---|
| `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` | 256 M | Fast, CPU-friendly |
| `Qwen/Qwen2.5-Omni-3B` | 3 B | Strong video + audio understanding |
| `OpenGVLab/VideoChat-R1_7B_caption` | 7 B | Highest quality, needs GPU |
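One simple way to choose between these at runtime is to key the decision off GPU availability. The snippet below is a sketch (assuming `torch` is installed, which the heavier models need anyway), not a hard rule:

```python
import torch

# Heuristic: use the larger captioning model only when a GPU is present.
model_id = (
    "OpenGVLab/VideoChat-R1_7B_caption"
    if torch.cuda.is_available()
    else "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)
```

The chosen `model_id` can then be passed to the extractor exactly as in the usage example below.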
```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```
Example output (`highlights.json`):
```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
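To turn the JSON into actual video files, a small post-processing step can cut each clip out of the source video. The sketch below assumes ffmpeg is installed and on PATH and reuses the `my_video.mp4` path from the demo; it is not part of the repository:

```python
import json
import subprocess

# Cut each highlight out of the source video with ffmpeg (hypothetical
# post-processing step; requires ffmpeg on PATH, not part of this repo).
with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    duration = clip["end_sec"] - clip["start_sec"]
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(clip["start_sec"]),   # seek to the clip start
            "-i", "my_video.mp4",
            "-t", str(duration),             # keep only the clip duration
            "-c", "copy",                    # stream copy: fast, but cuts on keyframes
            f"highlight_{i:02d}_{clip['category']}.mp4",
        ],
        check=True,
    )
```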
```
Video ───► segment windows (4 s, 1 s overlap)
              │
              ├───► sample up to 8 frames ───► VLM describes segment
              │
              ├───► (optional) Whisper transcribes audio
              │
              └───► LLM scores 0-10 relevance to query
                        │
                        ▼
          merge adjacent high-score segments ───► VideoClip objects
                        │
                        ▼
          JSON output with timestamps + metadata
```
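The merge step is what turns scored windows into continuous highlights: adjacent segments that both clear the score threshold are fused into a single clip. The snippet below is a simplified sketch of that idea with illustrative names, not the repository's actual implementation:

```python
# Sketch of the merge step: fuse adjacent/overlapping segments whose
# relevance score clears the threshold (illustrative, not the repo's API).
def merge_segments(segments, score_threshold=0.3):
    """segments: dicts with 'start', 'end', 'score', sorted by 'start'."""
    clips = []
    current = None
    for seg in segments:
        if seg["score"] < score_threshold:
            current = None                      # a low-score segment breaks the run
            continue
        if current is not None and seg["start"] <= current["end"]:
            current["end"] = max(current["end"], seg["end"])        # extend the clip
            current["score"] = max(current["score"], seg["score"])
        else:
            current = dict(seg)                 # start a new clip
            clips.append(current)
    return clips
```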
This implementation builds on ideas from VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding (arXiv:2405.13382).
Run the unit tests with:

```bash
python test_video_pipeline.py
```

This validates video I/O, segment merging, and timestamp formatting (HH:MM:SS.mmm).
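For reference, a seconds-to-timestamp conversion in that format can look like this (a sketch of a hypothetical helper; the repository's own formatter may differ, e.g. dropping the hour field when it is zero, as in the example output above):

```python
# Sketch: seconds -> "HH:MM:SS.mmm" (hypothetical helper, not the repo's code).
def seconds_to_hms(seconds: float) -> str:
    total_ms = int(round(seconds * 1000))
    s, ms = divmod(total_ms, 1000)
    return f"{s // 3600:02d}:{(s // 60) % 60:02d}:{s % 60:02d}.{ms:03d}"

print(seconds_to_hms(75.25))  # "00:01:15.250"
```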
License: Apache-2.0
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
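For example, the generic class can be used as below (a sketch; whether this particular repository loads with a given auto class depends on its config, which is an assumption here):

```python
from transformers import AutoModel

# Generic loader for non-causal architectures (assumes the repo's config
# is compatible with AutoModel; adjust the auto class as needed).
model = AutoModel.from_pretrained("notjulietxd/video-highlight-extractor")
```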