JakgritB committed on
Commit f6ac864 · 1 Parent(s): df9eb37

docs: add hackathon-ready project README


Document the problem, solution, AMD ROCm and MI300X alignment, Hugging Face and Qwen integrations, architecture, local setup, AMD Cloud deployment path, benchmark plan, and lablab.ai submission checklist.

Files changed (1): README.md (+480, -0)
README.md ADDED
# ElevenClip.AI

ElevenClip.AI is an AI-powered clip studio for turning long-form videos into personalized short-form content for TikTok, YouTube Shorts, and Instagram Reels.

This project is built for the **AMD Developer Hackathon** on lablab.ai, targeting **Track 3: Vision & Multimodal AI**. The system is designed to run on **AMD Developer Cloud** with **ROCm** and **AMD Instinct MI300X** acceleration, while using **Hugging Face** as the model hub/deployment layer and **Qwen** models for profile-aware highlight reasoning.

## One-Sentence Pitch

ElevenClip.AI helps creators convert long videos into ready-to-edit short clips by combining Whisper transcription, Qwen highlight detection, optional Qwen-VL visual understanding, ffmpeg rendering, and a human-in-the-loop clip editor.

## Problem

Long-form creators, podcasters, educators, streamers, and marketing teams often publish hours of video but still need short clips for modern discovery platforms.

The manual workflow is painful:

- Watch the full video.
- Find high-retention moments.
- Trim each clip.
- Rewrite subtitles.
- Reframe to vertical 9:16.
- Export platform-ready MP4 files.

For a two-hour video, this can take several hours of editing time. The bottleneck is not just cutting video; it is understanding which moments match the creator's audience, channel style, language, and target platform.

## Solution

ElevenClip.AI automates the first pass of short-form production:

1. The creator sets up a reusable channel profile.
2. The creator provides a YouTube URL or uploads a video file.
3. Whisper Large V3 transcribes the video, including Thai and multilingual speech.
4. Qwen2.5 analyzes the transcript and scores candidate highlights based on engagement potential and the creator profile.
5. Optional Qwen2-VL analysis can enrich the scores with visual signals such as reactions, scene changes, and on-screen text.
6. ffmpeg renders vertical clips with subtitle files and burn-in support.
7. The React editor lets the human approve, delete, trim, regenerate, and edit subtitles before download.

The product is intentionally human-AI collaborative: AI finds and prepares the clips quickly, while the creator keeps editorial control.

## Hackathon Alignment

### Track

**Track 3: Vision & Multimodal AI**

ElevenClip.AI processes multiple media types:

- Audio: speech transcription with Whisper Large V3.
- Text: transcript reasoning and highlight ranking with Qwen2.5.
- Video: frame-aware multimodal analysis with Qwen2-VL as the next pipeline stage.
- Rendered media: ffmpeg exports platform-ready video clips.

### AMD Technology

The production target is AMD Developer Cloud:

- **AMD Instinct MI300X** for high-throughput model inference.
- **ROCm 6.x** as the GPU software stack.
- **PyTorch with ROCm support** for Whisper inference.
- **vLLM ROCm backend** for fast Qwen2.5 inference.
- **Optimum-AMD** as an optimization path for Hugging Face models on AMD hardware.
- **ffmpeg hardware acceleration hooks** for faster video encoding where available.

The app has a local `DEMO_MODE=true` path so judges and teammates can inspect the UI/API without downloading large models. On AMD Developer Cloud, set `DEMO_MODE=false` to activate the real model stack.

### Hugging Face Integration

Hugging Face is used as the model hub and deployment layer:

- `openai/whisper-large-v3` for transcription.
- `Qwen/Qwen2.5-7B-Instruct` for highlight analysis.
- `Qwen/Qwen2-VL-7B-Instruct` for multimodal video understanding.
- Planned public Hugging Face Space for the hackathon demo UI.

### Qwen Integration

Qwen is not used as a generic chatbot. It is part of the core product logic:

- Reads timestamped transcript segments.
- Considers creator profile settings.
- Scores engagement potential.
- Explains why a segment should become a clip.
- Returns structured JSON with timestamps, titles, scores, reasons, and subtitle text (illustrated below).

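The exact output schema lives in the backend's Pydantic models; the object below is a purely illustrative sketch of such a clip plan, with every field value invented:

```json
{
  "clips": [
    {
      "start": 734.2,
      "end": 781.0,
      "title": "Why most creators burn out",
      "score": 0.92,
      "reason": "Emotional peak followed by a direct question to the audience.",
      "subtitles": "So here is the part nobody talks about..."
    }
  ]
}
```
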
## Current MVP Features

- Channel profile onboarding:
  - niche
  - preferred clip style
  - preferred clip length
  - primary language
  - target platform
- YouTube URL ingestion through `yt-dlp`.
- Direct video upload endpoint.
- Whisper transcription service boundary.
- Qwen highlight detection service boundary.
- Optional Qwen2-VL multimodal analysis service boundary.
- ffmpeg clip generation with subtitle file creation.
- Vertical 9:16 export path for TikTok, Shorts, and Reels.
- Human-AI review UI:
  - trim start/end
  - edit subtitles inline
  - approve clips
  - delete clips
  - regenerate a clip
  - download MP4 output
- Timing logs for benchmark demos.
- Docker and AMD Cloud deployment notes.

## Architecture

```mermaid
flowchart LR
    A["Creator Profile"] --> D["Qwen2.5 Highlight Scoring"]
    B["YouTube URL"] --> C["yt-dlp / Video Input"]
    B2["Uploaded Video"] --> C
    C --> W["Whisper Large V3 Transcription"]
    W --> D
    C --> V["Qwen2-VL Visual Analysis (Optional)"]
    D --> R["Clip Plan JSON"]
    V --> R
    R --> F["ffmpeg Clip Rendering + Subtitles"]
    F --> E["React Human-AI Editor"]
    E --> O["Approved Short-Form Clips"]
```

## Repository Structure

```text
.
├── backend/
│   ├── app/
│   │   ├── core/        # configuration and timing instrumentation
│   │   ├── models/      # Pydantic request/response schemas
│   │   ├── services/    # ingest, transcription, Qwen scoring, subtitles, rendering
│   │   ├── utils/       # ROCm / accelerator detection
│   │   ├── workers/     # optional Celery wiring
│   │   ├── main.py      # FastAPI application
│   │   └── storage.py   # file-backed job storage for MVP
│   ├── Dockerfile
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── App.jsx      # creator workflow and clip editor
│   │   ├── main.jsx
│   │   └── styles.css
│   ├── Dockerfile
│   └── package.json
├── infra/
│   └── amd-cloud.md     # AMD Developer Cloud deployment guide
├── scripts/
│   └── benchmark.py     # end-to-end API benchmark helper
├── docker-compose.yml
└── README.md
```

## Processing Pipeline

### 1. Video Input

The backend accepts:

- YouTube URL through `POST /api/jobs/youtube`
- Uploaded video file through `POST /api/jobs/upload`

In production, YouTube videos are downloaded with `yt-dlp`. In demo mode, the app can generate a synthetic ffmpeg test video so the workflow can be tested without external downloads.

### 2. Transcription

The transcription service is implemented in `backend/app/services/transcription.py`.

Production target (a sketch follows the list):

- Model: `openai/whisper-large-v3`
- Runtime: Hugging Face Transformers
- Accelerator: PyTorch ROCm on AMD MI300X
- Language goal: Thai and multilingual support

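A minimal sketch of that target using the standard Transformers ASR pipeline; the real service in `transcription.py` may wire this differently, and the audio path here is a placeholder:

```python
import torch
from transformers import pipeline

# ROCm builds of PyTorch expose AMD GPUs through the CUDA device API,
# so "cuda" selects the MI300X when one is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device=device,
)

# Chunk-level timestamps let the highlight stage map text back to video time.
result = asr("audio_from_video.wav", return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```
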
### 3. Highlight Detection

The highlight detector is implemented in `backend/app/services/highlight.py`.

Production target:

- Model: `Qwen/Qwen2.5-7B-Instruct`
- Runtime: vLLM with ROCm backend
- Output: strict structured JSON

Highlight scoring considers (see the sketch after this list):

- questions
- punchlines
- emotional peaks
- key information
- channel niche
- preferred clip style
- target platform
- target clip length

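A minimal sketch of that inference path, assuming the model honors a JSON-only instruction; the prompt wording, profile fields, and output keys here are illustrative, and the real prompt and schema live in `highlight.py`:

```python
import json

from vllm import LLM, SamplingParams

# vLLM picks up its ROCm backend automatically on AMD GPUs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Hypothetical profile and transcript segments for illustration.
profile = {"niche": "tech podcast", "style": "educational",
           "platform": "tiktok", "clip_length_s": 45}
segments = [{"start": 120.0, "end": 165.5,
             "text": "So here's the part most people get wrong..."}]

prompt = (
    "You rank transcript segments for short-form clips.\n"
    f"Creator profile: {json.dumps(profile)}\n"
    f"Transcript segments: {json.dumps(segments)}\n"
    "Return only a JSON array of objects with keys: "
    "start, end, title, score, reason, subtitles."
)

outputs = llm.chat(
    [{"role": "user", "content": prompt}],
    SamplingParams(temperature=0.2, max_tokens=1024),
)
# Assumes the model returned bare JSON, per the instruction above.
clip_plan = json.loads(outputs[0].outputs[0].text)
```
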
### 4. Multimodal Analysis

The multimodal service boundary is implemented in `backend/app/services/multimodal.py`.

Planned production target:

- Model: `Qwen/Qwen2-VL-7B-Instruct`
- Inputs: sampled video frames, transcript context, and clip candidates
- Visual signals:
  - creator or guest reactions
  - scene changes
  - on-screen text
  - high-motion segments

This is isolated as a replaceable pipeline step so it can be enabled when AMD Cloud resources are available.

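Because Qwen2-VL consumes frames rather than raw video, this stage needs a sampling step. A hypothetical helper, with the one-frame-per-second rate and paths chosen only for illustration:

```python
import os
import subprocess

def sample_frames(video: str, start: float, end: float,
                  out_dir: str, fps: int = 1) -> None:
    """Extract `fps` frames per second from [start, end) as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-ss", str(start), "-i", video,
         "-t", str(end - start), "-vf", f"fps={fps}",
         os.path.join(out_dir, "frame_%04d.jpg")],
        check=True,
    )

# e.g. one frame per second across a 47-second candidate clip
sample_frames("input.mp4", 734.2, 781.0, "frames")
```
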
### 5. Clip Generation

Clip rendering is implemented in `backend/app/services/clips.py`.

The ffmpeg stage (an example command follows the list):

- cuts video by selected timestamps
- exports MP4
- creates `.srt` subtitle files
- supports subtitle burn-in
- reformats to 9:16 vertical output for short-form platforms
- includes AMD hardware encoder configuration hooks

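As a rough sketch of one render (the real flags live in `clips.py`; timestamps and filenames here are invented):

```bash
# Cut one candidate clip, center-crop to 9:16, burn in subtitles, export MP4.
ffmpeg -ss 734.2 -i input.mp4 -t 46.8 \
  -vf "crop=ih*9/16:ih,scale=1080:1920,subtitles=clip.srt" \
  -c:v libx264 -c:a aac -movflags +faststart clip.mp4
# Where supported, an AMD hardware encoder can replace libx264.
```
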
### 6. Human-AI Collaborative Editing

The frontend editor lets creators review AI-generated clips and make final decisions:

- adjust start and end timestamps
- edit subtitle text
- delete weak clips
- approve good clips
- regenerate a specific clip
- download the result

## API Overview

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/health` | Returns service health and accelerator detection. |
| `POST` | `/api/jobs/youtube` | Creates a processing job from a YouTube URL. |
| `POST` | `/api/jobs/upload` | Creates a processing job from an uploaded video. |
| `GET` | `/api/jobs/{job_id}` | Returns status, transcript, clips, timings, and errors. |
| `PATCH` | `/api/jobs/{job_id}/clips/{clip_id}` | Updates trim times, subtitles, approval, or deletion state. |
| `POST` | `/api/jobs/{job_id}/clips/{clip_id}/regenerate` | Re-renders one clip with updated parameters. |
| `GET` | `/api/jobs/{job_id}/clips/{clip_id}/download` | Downloads an exported clip. |

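A hypothetical walk-through of the job lifecycle; the `url` field name and the `abc123`/`clip-1` IDs are invented for illustration:

```bash
# Create a job from a YouTube URL.
curl -X POST http://localhost:8000/api/jobs/youtube \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtube.com/watch?v=..."}'

# Poll status, transcript, clips, and timings.
curl http://localhost:8000/api/jobs/abc123

# Download a finished clip.
curl -o clip.mp4 http://localhost:8000/api/jobs/abc123/clips/clip-1/download
```
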
## Local Development

### Requirements

- Python 3.11+
- Node.js 20+
- ffmpeg

### Backend

```bash
cd backend
python -m venv .venv
. .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

On Windows PowerShell:

```powershell
cd backend
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

### Frontend

```bash
cd frontend
npm install
npm run dev
```

Open:

```text
http://localhost:5173
```

### Demo Mode

By default, the project runs in demo mode:

```env
DEMO_MODE=true
```

Demo mode avoids downloading multi-GB AI models and returns deterministic mock transcript/highlight data while still exercising the API, UI, job state, timing logs, subtitle generation, and ffmpeg rendering path.

## AMD Developer Cloud Deployment

See [infra/amd-cloud.md](infra/amd-cloud.md) for a focused deployment guide.

High-level steps:

```bash
git clone https://github.com/JakgritB/ElevenClip.AI.git
cd ElevenClip.AI
cp .env.example .env
```

Edit `.env`:

```env
DEMO_MODE=false
HF_TOKEN=your_huggingface_token
WHISPER_MODEL_ID=openai/whisper-large-v3
QWEN_TEXT_MODEL_ID=Qwen/Qwen2.5-7B-Instruct
QWEN_VL_MODEL_ID=Qwen/Qwen2-VL-7B-Instruct
```

Install the AI/ROCm stack on the AMD instance:

```bash
cd backend
pip install -e ".[ai,rocm-inference]"
```

Start the API:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Validate accelerator detection:

```bash
curl http://localhost:8000/health
```

Expected on AMD Cloud (illustrative response below):

- `torch_available: true`
- `cuda_api_available: true`
- `rocm_hip_version` populated
- MI300X visible as the active device

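The keys beyond those listed above, and all values, are assumptions for illustration; the real payload is defined by the backend's health handler:

```json
{
  "status": "ok",
  "torch_available": true,
  "cuda_api_available": true,
  "rocm_hip_version": "6.1",
  "active_device": "AMD Instinct MI300X"
}
```
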
## Docker

```bash
docker compose up --build
```

For AMD Developer Cloud with ROCm extras (the extras spec is quoted so the shell does not glob the brackets):

```bash
docker compose build --build-arg INSTALL_EXTRAS=".[ai,rocm-inference]" backend
docker compose up
```

The compose file mounts AMD GPU devices (`/dev/kfd`, `/dev/dri`) and uses host IPC for large-model inference.

## Benchmark Plan

The hackathon judges care about technology application and real-world performance. ElevenClip.AI includes step-level timing logs so the demo can show why AMD acceleration matters.

Run a benchmark against a running API:

```bash
python scripts/benchmark.py \
  --api http://localhost:8000 \
  --youtube-url "https://youtube.com/watch?v=..."
```

Recommended benchmark comparison:

| Scenario | Hardware | Purpose |
| --- | --- | --- |
| CPU baseline | CPU-only runtime | Show the pain of long-form video processing without acceleration. |
| AMD GPU run | AMD Instinct MI300X + ROCm | Show high-throughput transcription and Qwen inference. |

Metrics captured:

- input/download time
- transcription time
- highlight detection time
- multimodal analysis time
- clip generation time
- total wall-clock time
- number of clips generated

Demo target:

- input: two-hour creator video
- output: 10 subtitle-ready clips
- goal: under 10 minutes on MI300X

## Submission Assets Checklist

The lablab.ai submission asks for:

- Project title: `ElevenClip.AI`
- Short description
- Long description
- Technology and category tags
- Cover image
- Video presentation
- Slide presentation
- Public GitHub repository
- Demo application platform
- Application URL

Recommended tags:

```text
AMD, ROCm, MI300X, AMD Developer Cloud, Vision AI, Multimodal AI, Video AI, Whisper, Qwen, Qwen-VL, Hugging Face, FastAPI, React
```

## Suggested Short Description

```text
ElevenClip.AI turns long-form videos into personalized short-form clips using Whisper, Qwen, Hugging Face, and AMD ROCm on MI300X.
```

## Suggested Long Description

```text
ElevenClip.AI is a human-AI collaborative clip studio for creators. It takes a YouTube URL or uploaded long-form video, transcribes it with Whisper Large V3, uses Qwen2.5 to identify high-engagement highlight moments based on a reusable channel profile, optionally enriches candidates with Qwen2-VL visual analysis, and renders short-form MP4 clips with subtitles using ffmpeg. The React editor lets creators trim, edit subtitles, approve, delete, regenerate, and download final clips. The project is designed for AMD Developer Cloud with ROCm and AMD Instinct MI300X acceleration, demonstrating how high-throughput multimodal AI can reduce hours of manual editing into a fast creator workflow.
```

## Judging Criteria Mapping

### Application of Technology

ElevenClip.AI integrates Whisper, Qwen2.5, Qwen2-VL, Hugging Face, ROCm, vLLM, and AMD Developer Cloud into an end-to-end video processing product.

### Presentation

The demo is designed to be visual and easy to understand: input a long video, watch AI create candidates, edit clips, and download platform-ready MP4 files.

### Business Value

The product targets a real creator economy workflow. Creators, agencies, podcasters, educators, and streamers all need short-form repurposing, and manual editing is expensive.

### Originality

The system goes beyond generic clipping by personalizing highlight selection to a creator's niche, style, language, clip length, and platform. It also preserves human editorial control instead of fully automating final publishing.

## Build-in-Public Plan

The hackathon includes a build-in-public challenge. Suggested updates:

1. Share the architecture and first local demo.
2. Share AMD Cloud/ROCm setup notes and benchmark results.
3. Publish meaningful feedback about ROCm, AMD Developer Cloud, or inference setup.

Suggested hashtags/topics:

```text
#AMDDeveloperHackathon #ROCm #MI300X #HuggingFace #Qwen #VideoAI #MultimodalAI
```

## Roadmap

- Real Whisper Large V3 run on AMD Developer Cloud.
- Real Qwen2.5 vLLM ROCm inference path.
- Qwen2-VL frame sampling and visual scoring.
- Batch export for 10+ clips.
- Subtitle styling presets per platform.
- Creator profile memory and reusable brand presets.
- Hugging Face Space deployment.
- CPU vs MI300X benchmark report.

## License

MIT. See [LICENSE](LICENSE).