---
tags:
- ml-intern
---

# Video Highlight Extractor

A pipeline that detects highlights within a video based on events or conversations, then extracts the clips that best match the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Provides **Start** and **End** timestamps for each extracted clip.

---

## 🎯 What it does

| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
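
Step 1's windowing follows directly from those defaults: a 4 s window with 1 s overlap means a new window starts every 3 s. A minimal sketch of that scheme (the helper name is illustrative, not the exact code in `video_highlight_extractor.py`):

```python
def make_windows(duration_sec: float, window: float = 4.0, overlap: float = 1.0):
    """Yield (start_sec, end_sec) pairs for overlapping segment windows."""
    stride = window - overlap  # 4 s windows with 1 s overlap -> a new window every 3 s
    t = 0.0
    while t < duration_sec:
        yield t, min(t + window, duration_sec)
        t += stride


print(list(make_windows(10.0)))
# [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0), (9.0, 10.0)]
```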

---

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

---

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
    --video my_video.mp4 \
    --query "exciting food moments and travel scenery" \
    --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
    --output highlights.json
```

---

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7B | Highest quality, needs GPU |
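
For a quick standalone check of a model outside the pipeline, the smallest one can be driven directly with `transformers`. A rough sketch following the SmolVLM2 model card (assumes a recent `transformers` release with video chat-template support and a video decoding backend installed; the prompt and video path are placeholders):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float32)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "my_video.mp4"},  # placeholder video path
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```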

---

## 🐍 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:12.000",
    #   "end_hms": "00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```
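
The returned objects are easy to post-process with the fields shown in `to_dict()` above, e.g. keeping only the food clips:

```python
# Keep only the clips classified as food and report how much footage they cover.
food_clips = [c for c in clips if c.to_dict()["category"] == "food"]
total = sum(c.to_dict()["duration_sec"] for c in food_clips)
print(f"{len(food_clips)} food clip(s), {total:.1f} s total")
```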

---

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:12.000",
      "end_hms": "00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
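
These timestamps can be handed straight to ffmpeg to cut the highlight files out of the source video. A sketch, assuming `ffmpeg` is on the PATH and `my_video.mp4` is the original input (stream copy cuts on keyframes; drop `-c copy` to re-encode for frame-exact boundaries):

```python
import json
import subprocess

with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    # -ss before -i seeks the input; -t limits the output to the clip's duration.
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(clip["start_sec"]),
            "-i", "my_video.mp4",
            "-t", str(clip["duration_sec"]),
            "-c", "copy",
            f"highlight_{i:02d}_{clip['category']}.mp4",
        ],
        check=True,
    )
```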

---

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├───► sample up to 8 frames ──► VLM describes segment
            │                                       │
            ├───► (optional) Whisper transcribes audio
            │                                       │
            └───► LLM scores 0-10 relevance to query
                        │
                        ▼
            merge adjacent high-score segments ──► VideoClip objects
                        │
                        ▼
            JSON output with timestamps + metadata
```
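
The merge stage can be pictured as a single pass over score-filtered segments. A simplified sketch (the real `VideoHighlightExtractor` also carries descriptions, categories and the target total duration through this step); each segment is a `(start_sec, end_sec, score)` tuple sorted by start time:

```python
def merge_segments(segments, score_threshold=0.3):
    """Merge overlapping / adjacent segments whose score clears the threshold."""
    merged = []
    for start, end, score in segments:
        if score < score_threshold:
            continue  # drop low-relevance segments entirely
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous clip: extend it and keep the best score.
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged


print(merge_segments([(0, 4, 0.1), (3, 7, 0.8), (6, 10, 0.9), (12, 16, 0.7)]))
# [(3, 10, 0.9), (12, 16, 0.7)]
```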

---

## 📚 Research foundation

This implementation builds on ideas from:

- **[UniVTG](https://arxiv.org/abs/2307.16715)** – unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** – transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** – timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** – current practical video-language models on HuggingFace

---

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:
- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`), as sketched below
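
A minimal illustration of that `HH:MM:SS.mmm` formatting (hypothetical helper, not necessarily the repo's implementation):

```python
def to_hms(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


print(to_hms(72.5))  # 00:01:12.500
```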

---

## 📄 License

Apache-2.0

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "notjulietxd/video-highlight-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
|