notjulietxd committed
Commit 4ced461 · verified · Parent: b69de93

Add README

Files changed (1):
  1. README.md +152 -14
README.md CHANGED
@@ -1,26 +1,164 @@
  ---
- tags:
- - ml-intern
  ---

- # notjulietxd/video-highlight-extractor

- <!-- ml-intern-provenance -->
- ## Generated by ML Intern

- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern

- ## Usage

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer

- model_id = "notjulietxd/video-highlight-extractor"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
  ```

- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
+ # Video Highlight Extractor
+
+ A pipeline for detecting highlights in videos based on visual events or conversations, then extracting the clips that best match the user's intent (e.g. **vlog**, **food**, **travel**, **tutorial**). Each extracted clip comes with **start** and **end** timestamps.
+
  ---
+
+ ## 🎯 What it does
+
+ | Step | Action |
+ |------|--------|
+ | **1. Segment** | Video is split into short overlapping windows (default 4 s, 1 s overlap) |
+ | **2. Understand** | A video-language model describes what happens in each segment |
+ | **3. Score** | Each segment is scored 0-10 for relevance to your natural-language query |
+ | **4. Merge** | Adjacent high-scoring segments are merged into continuous clips |
+ | **5. Output** | Clips with `start_sec`, `end_sec`, `start_hms`, `end_hms`, score, description & category |
+
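Step 1's windowing can be pictured as a simple sliding window. A minimal sketch, assuming only the defaults stated above; the helper name is illustrative, not part of the repository's API:

```python
def segment_windows(duration_sec: float, window_sec: float = 4.0, overlap_sec: float = 1.0):
    """Yield (start, end) windows over a video; illustrative helper, not the repo's API."""
    stride = window_sec - overlap_sec  # 4 s windows with 1 s overlap -> 3 s stride
    start = 0.0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        yield (start, end)
        if end >= duration_sec:
            break
        start += stride

# e.g. a 10 s video -> [(0.0, 4.0), (3.0, 7.0), (6.0, 10.0)]
print(list(segment_windows(10.0)))
```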
  ---
+
+ ## 📦 Files
+
+ | File | Purpose |
+ |------|---------|
+ | `video_highlight_extractor.py` | Core `VideoHighlightExtractor` class |
+ | `demo.py` | Ready-to-run CLI demo |
+ | `test_video_pipeline.py` | Unit tests (creates a synthetic video & tests I/O + merging) |
+ | `requirements.txt` | Python dependencies |
+
+ ---
+
+ ## 🚀 Quick start
+
+ ```bash
+ pip install -r requirements.txt
+
+ # Run on any video
+ python demo.py \
+     --video my_video.mp4 \
+     --query "exciting food moments and travel scenery" \
+     --model HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
+     --output highlights.json
+ ```
+
+ ---
+
+ ## 🧠 Recommended models
+
+ | Model | Size | Best for |
+ |-------|------|----------|
+ | **[`HuggingFaceTB/SmolVLM2-256M-Video-Instruct`](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)** | 256 M | Fast, CPU-friendly |
+ | **[`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)** | 3 B | Strong video + audio understanding |
+ | **[`OpenGVLab/VideoChat-R1_7B_caption`](https://huggingface.co/OpenGVLab/VideoChat-R1_7B_caption)** | 7 B | Highest quality, needs GPU |
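Any of these can be passed as `model_id`. As a standalone sanity check that the default SmolVLM2 checkpoint downloads and loads (illustrative only; assumes a recent `transformers` release, and the extractor performs its own loading internally):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Illustrative check that the default video-language model loads; not required
# for normal use, since VideoHighlightExtractor loads the model itself.
model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)
print(model.config.model_type)
```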
+
+ ---
+
+ ## 📋 Python API
+
  ```python
+ from video_highlight_extractor import VideoHighlightExtractor
+
+ extractor = VideoHighlightExtractor(
+     model_id="HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
+ )
+
+ clips = extractor.extract_highlights(
+     video_path="my_video.mp4",
+     query="delicious food preparation",
+     target_duration_sec=30,   # target ~30 s of highlights total
+     score_threshold=0.3,
+     top_k=5,
+     detect_categories=True,   # classify into vlog / food / travel / tutorial / ...
+     use_audio=False,          # set True (+ whisper) for conversation-based highlights
+ )
+
+ for c in clips:
+     print(c.to_dict())
+     # {
+     #   "start_sec": 12.0,
+     #   "end_sec": 20.0,
+     #   "start_hms": "00:12.000",
+     #   "end_hms": "00:20.000",
+     #   "duration_sec": 8.0,
+     #   "relevance_score": 0.85,
+     #   "description": "A chef is chopping vegetables...",
+     #   "category": "food"
+     # }
+
+ extractor.save_results(clips, "highlights.json")
  ```
+
+ ---
+
+ ## 📤 Output format (`highlights.json`)
+
+ ```json
+ {
+   "clips": [
+     {
+       "start_sec": 12.0,
+       "end_sec": 20.0,
+       "start_hms": "00:12.000",
+       "end_hms": "00:20.000",
+       "duration_sec": 8.0,
+       "relevance_score": 0.85,
+       "description": "A chef is chopping vegetables in a kitchen...",
+       "transcript": null,
+       "category": "food"
+     }
+   ],
+   "total_clips": 1,
+   "total_highlight_duration": 8.0
+ }
+ ```
+
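Because the timestamps are plain seconds, the JSON can be fed straight to downstream tools. A hedged example of cutting the highlights out of the source video, assuming `ffmpeg` is on PATH and using illustrative file names:

```python
import json
import subprocess

# Cut each highlight from the source video with ffmpeg.
# "-c copy" avoids re-encoding but cuts on keyframes; drop it for frame-accurate cuts.
with open("highlights.json") as f:
    results = json.load(f)

for i, clip in enumerate(results["clips"]):
    out_name = f"highlight_{i:02d}_{clip['category']}.mp4"
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", "my_video.mp4",
            "-ss", str(clip["start_sec"]),
            "-to", str(clip["end_sec"]),
            "-c", "copy",
            out_name,
        ],
        check=True,
    )
```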
+ ---
+
+ ## 🔧 Architecture
+
+ ```
+ Video ──► segment windows (4 s, 1 s overlap)
+    │
+    ├───► sample up to 8 frames ──► VLM describes segment
+    │                                        │
+    ├───► (optional) Whisper transcribes audio
+    │
+    └───► LLM scores 0-10 relevance to query
+                   │
+                   ▼
+    merge adjacent high-score segments ──► VideoClip objects
+                   │
+                   ▼
+    JSON output with timestamps + metadata
+ ```
+
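A minimal sketch of the final merge step in this diagram, assuming segments arrive as `(start_sec, end_sec, score)` tuples with scores already normalized to 0-1 (consistent with `score_threshold=0.3` and the `relevance_score` values above); the repository's `VideoHighlightExtractor` may implement this differently:

```python
def merge_segments(segments, threshold=0.3, gap_tol=0.5):
    """Merge adjacent high-scoring (start, end, score) segments into clips.

    Illustrative only: the threshold mirrors the README's `score_threshold`,
    not the internals of VideoHighlightExtractor.
    """
    kept = sorted((s for s in segments if s[2] >= threshold), key=lambda s: s[0])
    clips = []
    for start, end, score in kept:
        if clips and start <= clips[-1][1] + gap_tol:  # touches or overlaps the previous clip
            prev_start, prev_end, prev_score = clips[-1]
            clips[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            clips.append((start, end, score))
    return clips

# Three overlapping high-scoring windows collapse into one 12-20 s clip; the low-scoring one is dropped.
print(merge_segments([(12, 16, 0.8), (15, 19, 0.7), (18, 20, 0.9), (40, 44, 0.1)]))
# [(12, 20, 0.9)]
```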
+ ---
+
+ ## 📚 Research foundation
+
+ This implementation builds on ideas from:
+
+ - **[UniVTG](https://arxiv.org/abs/2307.16715)** – unified video-language temporal grounding
+ - **[QVHighlights / Moment-DETR](https://arxiv.org/abs/2107.09609)** – transformer encoder-decoder for joint moment retrieval & highlight detection
+ - **[VTG-LLM](https://arxiv.org/abs/2405.13382)** – timestamp-aware video LLMs for temporal grounding
+ - **Qwen2.5-Omni** & **SmolVLM2** – current practical video-language models on HuggingFace
+
+ ---
+
+ ## 🧪 Tests
+
+ ```bash
+ python test_video_pipeline.py
+ ```
+
+ Validates:
+ - Synthetic MP4 creation & parsing
+ - Video info extraction (duration, FPS, resolution)
+ - Frame subsampling
+ - Segment merging logic
+ - Timestamp formatting (`HH:MM:SS.mmm`)
+
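Timestamp formatting of the kind the last check exercises needs only the standard library. A sketch, not necessarily the repository's implementation (which may drop the hour field, as in the `00:12.000` examples above):

```python
def to_hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (sketch; the repo's formatter may differ)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

print(to_hms(12.0))      # 00:00:12.000
print(to_hms(3723.456))  # 01:02:03.456
```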
+ ---
+
+ ## 📄 License
+
+ Apache-2.0