# Video Highlight Extractor

A pipeline for detecting highlights in videos based on visual events or conversations, then extracting the clips that best match the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Provides **start** and **end** timestamps for each extracted clip.

---

## 🎯 What it does

| Step | Action |
|------|--------|
| **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap; sketched below) |
| **2. Understand** | A video-language model describes what happens in each segment |
| **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query (reported normalized to 0-1) |
| **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
| **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
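
As a sketch of step 1, here is one way the overlapping windows can be generated (`segment_windows` is a hypothetical helper, not the repo's actual function):

```python
def segment_windows(duration_sec: float, window_sec: float = 4.0, overlap_sec: float = 1.0):
    """Yield (start, end) windows over the video; 4 s windows with 1 s overlap imply a 3 s stride."""
    stride = window_sec - overlap_sec
    start = 0.0
    while start < duration_sec:
        yield (start, min(start + window_sec, duration_sec))
        start += stride

# A 10 s video yields [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0), (9.0, 10.0)]
print(list(segment_windows(10.0)))
```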

---

## 📦 Files

| File | Purpose |
|------|---------|
| `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
| `demo.py` | Ready-to-run CLI demo |
| `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
| `requirements.txt` | Python dependencies |

---

## 🚀 Quick start

```bash
pip install -r requirements.txt

# Run on any video
python demo.py \
    --video my_video.mp4 \
    --query "exciting food moments and travel scenery" \
    --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
    --output highlights.json
```

---

## 🧠 Recommended models

| Model | Size | Best for |
|-------|------|----------|
| **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256 M | Fast, CPU-friendly |
| **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3 B | Strong video + audio understanding |
| **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7 B | Highest quality, needs GPU |
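
All three drop into the same constructor via `model_id`; for example (a sketch using the API shown in the next section):

```python
from video_highlight_extractor import VideoHighlightExtractor

# Swap in any model from the table above; larger models trade speed for quality
extractor = VideoHighlightExtractor(model_id="Qwen/Qwen2.5-Omni-3B")
```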

---

## 🐍 Python API

```python
from video_highlight_extractor import VideoHighlightExtractor

extractor = VideoHighlightExtractor(
    model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
)

clips = extractor.extract_highlights(
    video_path="my_video.mp4",
    query="delicious food preparation",
    target_duration_sec=30,   # target ~30 s of highlights total
    score_threshold=0.3,
    top_k=5,
    detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
    use_audio=False,          # set True (+ whisper) for conversation-based highlights
)

for c in clips:
    print(c.to_dict())
    # {
    #   "start_sec": 12.0,
    #   "end_sec": 20.0,
    #   "start_hms": "00:00:12.000",
    #   "end_hms": "00:00:20.000",
    #   "duration_sec": 8.0,
    #   "relevance_score": 0.85,
    #   "description": "A chef is chopping vegetables...",
    #   "category": "food"
    # }

extractor.save_results(clips, "highlights.json")
```

---

## 📤 Output format (`highlights.json`)

```json
{
  "clips": [
    {
      "start_sec": 12.0,
      "end_sec": 20.0,
      "start_hms": "00:00:12.000",
      "end_hms": "00:00:20.000",
      "duration_sec": 8.0,
      "relevance_score": 0.85,
      "description": "A chef is chopping vegetables in a kitchen...",
      "transcript": null,
      "category": "food"
    }
  ],
  "total_clips": 1,
  "total_highlight_duration": 8.0
}
```
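
The timestamps can be fed straight to a cutter. As an illustration (not part of this repo), a sketch that extracts each clip from the source video with `ffmpeg`, assuming `ffmpeg` is on your PATH and the source is `my_video.mp4`:

```python
import json
import subprocess

with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    # -ss/-to select the clip window; -c copy avoids re-encoding (cuts snap to keyframes)
    subprocess.run([
        "ffmpeg", "-y", "-i", "my_video.mp4",
        "-ss", str(clip["start_sec"]),
        "-to", str(clip["end_sec"]),
        "-c", "copy",
        f"clip_{i:02d}_{clip['category']}.mp4",
    ], check=True)
```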

---

## 🔧 Architecture

```
Video ──► segment windows (4 s, 1 s overlap)
            │
            ├──► sample up to 8 frames ──► VLM describes segment
            │                                      │
            ├──► (optional) Whisper transcribes audio
            │                                      │
            └──► LLM scores 0-10 relevance to query
                                                   │
                                                   ▼
          merge adjacent high-score segments ──► VideoClip objects
                                                   │
                                                   ▼
          JSON output with timestamps + metadata
```
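
To make the merge step concrete, a simplified sketch of folding adjacent or overlapping high-score windows into continuous clips (`merge_segments` is illustrative; the real logic lives in `video_highlight_extractor.py`):

```python
def merge_segments(segments, score_threshold=0.3, max_gap_sec=0.0):
    """Merge overlapping/adjacent (start, end, score) triples above the threshold."""
    keep = sorted(s for s in segments if s[2] >= score_threshold)
    clips = []
    for start, end, score in keep:
        if clips and start <= clips[-1][1] + max_gap_sec:
            prev_start, prev_end, prev_score = clips[-1]
            clips[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            clips.append((start, end, score))
    return clips

# Windows 3-7 s (0.8) and 6-10 s (0.9) overlap, so they merge into one clip
print(merge_segments([(0.0, 4.0, 0.1), (3.0, 7.0, 0.8), (6.0, 10.0, 0.9)]))
# -> [(3.0, 10.0, 0.9)]
```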

---

## 📚 Research foundation

This implementation builds on ideas from:

- **[UniVTG](https://arxiv.org/abs/2307.16715)** – unified video-language temporal grounding
- **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** – a transformer encoder-decoder for joint moment retrieval & highlight detection
- **[VTG-LLM](https://arxiv.org/abs/2405.13382)** – timestamp-aware video LLMs for temporal grounding
- **Qwen2.5-Omni** & **SmolVLM2** – current practical video-language models on Hugging Face

---

## 🧪 Tests

```bash
python test_video_pipeline.py
```

Validates:

- Synthetic MP4 creation & parsing
- Video info extraction (duration, FPS, resolution)
- Frame subsampling
- Segment merging logic
- Timestamp formatting (`HH:MM:SS.mmm`; see the sketch below)
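
For reference, a minimal stand-alone formatter matching that convention (a sketch; the repo's own helper may differ):

```python
def to_hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, e.g. 12.0 -> '00:00:12.000'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

assert to_hms(12.0) == "00:00:12.000"
assert to_hms(3723.5) == "01:02:03.500"
```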

---

## 📄 License

Apache-2.0