---
title: ElevenClip AI
emoji: 🎬
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
---
# ElevenClip.AI
ElevenClip.AI is an AI-powered clip studio for turning long-form videos into personalized short-form content for TikTok, YouTube Shorts, and Instagram Reels.
This project is built for the **AMD Developer Hackathon** on lablab.ai, targeting **Track 3: Vision & Multimodal AI**. The system is designed to run on **AMD Developer Cloud** with **ROCm** and **AMD Instinct MI300X** acceleration, while using **Hugging Face** as the model hub/deployment layer and **Qwen** models for profile-aware highlight reasoning.
## One-Sentence Pitch
ElevenClip.AI helps creators convert long videos into ready-to-edit short clips by combining Whisper transcription, Qwen highlight detection, optional Qwen-VL visual understanding, ffmpeg rendering, and a human-in-the-loop clip editor.
## Problem
Long-form creators, podcasters, educators, streamers, and marketing teams often publish hours of video but still need short clips for modern discovery platforms.
The manual workflow is painful:
- Watch the full video.
- Find high-retention moments.
- Trim each clip.
- Rewrite subtitles.
- Reframe to vertical 9:16.
- Export platform-ready MP4 files.
For a two-hour video, this can take several hours of editing time. The bottleneck is not just cutting video; it is understanding which moments match the creator's audience, channel style, language, and target platform.
## Solution
ElevenClip.AI automates the first pass of short-form production:
1. The creator sets up a reusable channel profile.
2. The creator provides a YouTube URL or uploads a video file.
3. Whisper Large V3 transcribes the video, including Thai and multilingual speech.
4. Qwen2.5 analyzes the transcript and scores candidate highlights based on engagement potential and the creator profile.
5. Optional Qwen2-VL analysis can enrich the scores with visual signals such as reactions, scene changes, and on-screen text.
6. ffmpeg renders vertical clips with subtitle files and burn-in support.
7. The React editor lets the human approve, delete, trim, regenerate, and edit subtitles before download.
The product is intentionally human-AI collaborative: AI finds and prepares the clips quickly, while the creator keeps editorial control.
## Hackathon Alignment
### Track
**Track 3: Vision & Multimodal AI**
ElevenClip.AI processes multiple media types:
- Audio: speech transcription with Whisper Large V3.
- Text: transcript reasoning and highlight ranking with Qwen2.5.
- Video: frame-aware multimodal analysis with Qwen2-VL as the next pipeline stage.
- Rendered media: ffmpeg exports platform-ready video clips.
### AMD Technology
The production target is AMD Developer Cloud:
- **AMD Instinct MI300X** for high-throughput model inference.
- **ROCm 6.x** as the GPU software stack.
- **PyTorch with ROCm support** for Whisper inference.
- **vLLM ROCm backend** for fast Qwen2.5 inference.
- **Optimum-AMD** as an optimization path for Hugging Face models on AMD hardware.
- **ffmpeg hardware acceleration hooks** for faster video encoding where available.
The app has a local `DEMO_MODE=true` path so judges and teammates can inspect the UI/API without downloading large models. On AMD Developer Cloud, set `DEMO_MODE=false` to activate the real model stack.
### Hugging Face Integration
Hugging Face is used as the model hub and deployment layer:
- `openai/whisper-large-v3` for transcription.
- `Qwen/Qwen2.5-7B-Instruct` for highlight analysis.
- `Qwen/Qwen2-VL-7B-Instruct` for multimodal video understanding.
- Public Hugging Face Space for the hackathon demo page:
`https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ElevenClip-AI`
### Qwen Integration
Qwen is not used as a generic chatbot. It is part of the core product logic:
- Reads timestamped transcript segments.
- Considers creator profile settings.
- Scores engagement potential.
- Explains why a segment should become a clip.
- Returns structured JSON with timestamps, titles, scores, reasons, and subtitle text.
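For illustration, a returned clip plan might look like the JSON below. The exact field names are assumptions; the real request/response schemas live in `backend/app/models/`.
```json
{
  "clips": [
    {
      "start": 312.4,
      "end": 351.0,
      "title": "The moment nobody expected",
      "score": 0.87,
      "reason": "Emotional peak followed by a direct question to the audience.",
      "subtitles": "So here's the part nobody tells you about..."
    }
  ]
}
```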
## Current MVP Features
- Channel profile onboarding:
- niche
- preferred clip style
- preferred clip length
- primary language
- target platform
- YouTube URL ingestion through `yt-dlp`.
- Direct video upload endpoint.
- Whisper transcription service boundary.
- Qwen highlight detection service boundary.
- Optional Qwen2-VL multimodal analysis service boundary.
- ffmpeg clip generation with subtitle file creation.
- Vertical 9:16 export path for TikTok, Shorts, and Reels.
- Human-AI review UI:
- trim start/end
- edit subtitles inline
- approve clips
- delete clips
- regenerate a clip
- download MP4 output
- Timing logs for benchmark demos.
- Docker and AMD Cloud deployment notes.
## Architecture
```mermaid
flowchart LR
A["Creator Profile"] --> D["Qwen2.5 Highlight Scoring"]
B["YouTube URL"] --> C["yt-dlp / Video Input"]
B2["Uploaded Video"] --> C
C --> W["Whisper Large V3 Transcription"]
W --> D
C --> V["Qwen2-VL Visual Analysis (Optional)"]
D --> R["Clip Plan JSON"]
V --> R
R --> F["ffmpeg Clip Rendering + Subtitles"]
F --> E["React Human-AI Editor"]
E --> O["Approved Short-Form Clips"]
```
## Repository Structure
```text
.
├── backend/
│   ├── app/
│   │   ├── core/            # configuration and timing instrumentation
│   │   ├── models/          # Pydantic request/response schemas
│   │   ├── services/        # ingest, transcription, Qwen scoring, subtitles, rendering
│   │   ├── utils/           # ROCm / accelerator detection
│   │   ├── workers/         # optional Celery wiring
│   │   ├── main.py          # FastAPI application
│   │   └── storage.py       # file-backed job storage for MVP
│   ├── Dockerfile
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── App.jsx          # creator workflow and clip editor
│   │   ├── main.jsx
│   │   └── styles.css
│   ├── Dockerfile
│   └── package.json
├── infra/
│   └── amd-cloud.md         # AMD Developer Cloud deployment guide
├── scripts/
│   └── benchmark.py         # end-to-end API benchmark helper
├── docker-compose.yml
└── README.md
βββ README.md
```
## Processing Pipeline
### 1. Video Input
The backend accepts:
- YouTube URL through `POST /api/jobs/youtube`
- Uploaded video file through `POST /api/jobs/upload`
In production, YouTube videos are downloaded with `yt-dlp`. In demo mode, the app can generate a synthetic ffmpeg test video so the workflow can be tested without external downloads.
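A minimal ingest sketch, assuming `yt-dlp` is driven as a Python library; the actual service may shell out to the CLI or use different download options.
```python
# Illustrative yt-dlp ingest helper; options are assumptions.
import yt_dlp

def download_video(url: str, out_dir: str = "storage/inputs") -> str:
    opts = {
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "format": "best[ext=mp4]/best",  # prefer a single mp4 stream
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)  # local path of the downloaded file
```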
### 2. Transcription
The transcription service is implemented in `backend/app/services/transcription.py`.
Production target:
- Model: `openai/whisper-large-v3`
- Runtime: Hugging Face Transformers
- Accelerator: PyTorch ROCm on AMD MI300X
- Language goal: Thai and multilingual support
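A minimal transcription sketch under those assumptions; the actual service in `backend/app/services/transcription.py` may differ. Note that ROCm builds of PyTorch expose the GPU through the CUDA API, which is why a `torch.cuda` check works on MI300X.
```python
import torch
from transformers import pipeline

# ROCm builds of PyTorch expose the GPU through the CUDA API.
device = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# Chunking handles long-form audio; segment timestamps feed highlight scoring.
result = asr("input.wav", chunk_length_s=30, return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```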
### 3. Highlight Detection
The highlight detector is implemented in `backend/app/services/highlight.py`.
Production target:
- Model: `Qwen/Qwen2.5-7B-Instruct`
- Runtime: vLLM with ROCm backend
- Output: strict structured JSON
Highlight scoring considers:
- questions
- punchlines
- emotional peaks
- key information
- channel niche
- preferred clip style
- target platform
- target clip length
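A hedged sketch of the vLLM call: the prompt wording, profile/transcript payloads, and parsing below are illustrative, and the real logic lives in `backend/app/services/highlight.py`.
```python
import json

from vllm import LLM, SamplingParams

# Illustrative inputs; the real prompt builder lives in
# backend/app/services/highlight.py.
profile_json = '{"niche": "tech podcast", "platform": "tiktok"}'
transcript_json = '[{"start": 0.0, "end": 4.2, "text": "..."}]'

# vLLM selects the ROCm backend automatically on a correctly set-up MI300X box.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=2048)

prompt = (
    "You are a short-form clip planner. Using the creator profile and the "
    "timestamped transcript, return ONLY JSON with a `clips` array whose "
    "items have start, end, title, score, reason, and subtitles fields.\n"
    f"Profile: {profile_json}\nTranscript: {transcript_json}"
)

output = llm.generate([prompt], params)[0].outputs[0].text
clip_plan = json.loads(output)  # production code should validate and retry on bad JSON
```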
### 4. Multimodal Analysis
The multimodal service boundary is implemented in `backend/app/services/multimodal.py`.
Planned production target:
- Model: `Qwen/Qwen2-VL-7B-Instruct`
- Inputs: sampled video frames, transcript context, and clip candidates
- Visual signals:
- creator or guest reactions
- scene changes
- on-screen text
- high-motion segments
This is isolated as a replaceable pipeline step so it can be enabled when AMD Cloud resources are available.
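A frame-sampling sketch that could feed this stage; the sampling rate here is an assumption, not a tuned value.
```python
# Sample frames for Qwen2-VL input via ffmpeg.
import subprocess
from pathlib import Path

def sample_frames(video: str, out_dir: str, fps: float = 0.5) -> list[Path]:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video,
         "-vf", f"fps={fps}",          # fps=0.5 -> one frame every 2 seconds
         f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```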
### 5. Clip Generation
Clip rendering is implemented in `backend/app/services/clips.py`.
The ffmpeg stage:
- cuts video by selected timestamps
- exports MP4
- creates `.srt` subtitle files
- supports subtitle burn-in
- reformats to 9:16 vertical output for short-form platforms
- includes AMD hardware encoder configuration hooks
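An illustrative render sketch; the real implementation is `backend/app/services/clips.py`, and the filter values here are assumptions. It also assumes the `.srt` timestamps are aligned with the source video, so an output-side trim keeps burned subtitles in sync.
```python
import subprocess

def render_clip(src: str, start: float, end: float, srt: str, out: str) -> None:
    vf = (
        "crop=ih*9/16:ih,"   # center-crop to a 9:16 window
        "scale=1080:1920,"   # standard short-form resolution
        f"subtitles={srt}"   # burn in the generated .srt (timed to the source)
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ss", str(start), "-to", str(end),   # output-side trim keeps subtitle timing aligned
         "-vf", vf, "-c:v", "libx264", "-c:a", "aac", out],
        check=True,
    )

# Example: render_clip("input.mp4", 312.4, 351.0, "clip_01.srt", "clip_01.mp4")
```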
### 6. Human-AI Collaborative Editing
The frontend editor lets creators review AI-generated clips and make final decisions:
- adjust start and end timestamps
- edit subtitle text
- delete weak clips
- approve good clips
- regenerate a specific clip
- download the result
## API Overview
| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/health` | Returns service health and accelerator detection. |
| `POST` | `/api/jobs/youtube` | Creates a processing job from a YouTube URL. |
| `POST` | `/api/jobs/upload` | Creates a processing job from an uploaded video. |
| `GET` | `/api/jobs/{job_id}` | Returns status, transcript, clips, timings, and errors. |
| `PATCH` | `/api/jobs/{job_id}/clips/{clip_id}` | Updates trim times, subtitles, approval, or deletion state. |
| `POST` | `/api/jobs/{job_id}/clips/{clip_id}/regenerate` | Re-renders one clip with updated parameters. |
| `GET` | `/api/jobs/{job_id}/clips/{clip_id}/download` | Downloads an exported clip. |
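A minimal client walkthrough of these endpoints; the request payload and response field names below are assumptions, not the documented schema.
```python
import time

import requests

API = "http://localhost:8000"

# Create a job from a YouTube URL (payload field name is hypothetical).
job = requests.post(f"{API}/api/jobs/youtube",
                    json={"url": "https://youtube.com/watch?v=..."}).json()
job_id = job["id"]  # hypothetical field name

# Poll until the pipeline finishes or fails.
while True:
    state = requests.get(f"{API}/api/jobs/{job_id}").json()
    if state.get("status") in {"completed", "failed"}:
        break
    time.sleep(5)

for clip in state.get("clips", []):
    print(clip)
```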
## Local Development
### Requirements
- Python 3.11+
- Node.js 20+
- ffmpeg
### Backend
```bash
cd backend
python -m venv .venv
. .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
On Windows PowerShell:
```powershell
cd backend
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
### Frontend
```bash
cd frontend
npm install
npm run dev
```
Open:
```text
http://localhost:5173
```
### Demo Mode
By default, the project runs in demo mode:
```env
DEMO_MODE=true
```
Demo mode avoids downloading multi-GB AI models and returns deterministic mock transcript/highlight data while still exercising the API, UI, job state, timing logs, subtitle generation, and ffmpeg rendering path.
## AMD Developer Cloud Deployment
See [infra/amd-cloud.md](infra/amd-cloud.md) for a focused deployment guide.
High-level steps:
```bash
git clone https://github.com/JakgritB/ElevenClip.AI.git
cd ElevenClip.AI
cp .env.example .env
```
Edit `.env`:
```env
DEMO_MODE=false
HF_TOKEN=your_huggingface_token
WHISPER_MODEL_ID=openai/whisper-large-v3
QWEN_TEXT_MODEL_ID=Qwen/Qwen2.5-7B-Instruct
QWEN_VL_MODEL_ID=Qwen/Qwen2-VL-7B-Instruct
```
Install the AI/ROCm stack on the AMD instance:
```bash
cd backend
pip install -e ".[ai,rocm-inference]"
```
Start the API:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```
Validate accelerator detection:
```bash
curl http://localhost:8000/health
```
Expected on AMD Cloud:
- `torch_available: true`
- `cuda_api_available: true`
- `rocm_hip_version` populated
- MI300X visible as the active device
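An illustrative healthy response; only the fields named above come from the project, and the exact values are placeholders.
```json
{
  "torch_available": true,
  "cuda_api_available": true,
  "rocm_hip_version": "6.x",
  "device_name": "AMD Instinct MI300X"
}
```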
## Docker
```bash
docker compose up --build
```
For AMD Developer Cloud with ROCm extras:
```bash
docker compose build --build-arg INSTALL_EXTRAS=.[ai,rocm-inference] backend
docker compose up
```
The compose file mounts AMD GPU devices (`/dev/kfd`, `/dev/dri`) and uses host IPC for large-model inference.
## Benchmark Plan
The hackathon judges care about technology application and real-world performance. ElevenClip.AI includes step-level timing logs so the demo can show why AMD acceleration matters.
Run a benchmark against a running API:
```bash
python scripts/benchmark.py \
--api http://localhost:8000 \
--youtube-url "https://youtube.com/watch?v=..."
```
Recommended benchmark comparison:
| Scenario | Hardware | Expected Purpose |
| --- | --- | --- |
| CPU baseline | CPU-only runtime | Show the pain of long-form video processing without acceleration. |
| AMD GPU run | AMD Instinct MI300X + ROCm | Show high-throughput transcription and Qwen inference. |
Metrics captured:
- input/download time
- transcription time
- highlight detection time
- multimodal analysis time
- clip generation time
- total wall-clock time
- number of clips generated
Demo target:
- input: two-hour creator video
- output: 10 subtitle-ready clips
- goal: under 10 minutes on MI300X
## Submission Assets Checklist
The lablab.ai submission asks for:
- Project title: `ElevenClip.AI`
- Short description
- Long description
- Technology and category tags
- Cover image
- Video presentation
- Slide presentation
- Public GitHub repository
- Demo application platform
- Application URL
Prepared submission docs:
- `docs/SUBMISSION.md` - copy-ready project text for lablab.ai.
- `docs/DEMO_SCRIPT.md` - draft and final recording script.
- `docs/PITCH_DECK.md` - slide outline for the presentation deck.
- `docs/BUILD_IN_PUBLIC.md` - social post drafts and AMD feedback notes.
- `docs/AMD_CREDIT_RUNBOOK.md` - checklist for the first MI300X run.
Recommended tags:
```text
AMD, ROCm, MI300X, AMD Developer Cloud, Vision AI, Multimodal AI, Video AI, Whisper, Qwen, Qwen-VL, Hugging Face, FastAPI, React
```
## Suggested Short Description
```text
ElevenClip.AI turns long-form videos into personalized short-form clips using Whisper, Qwen, Hugging Face, and AMD ROCm on MI300X.
```
## Suggested Long Description
```text
ElevenClip.AI is a human-AI collaborative clip studio for creators. It takes a YouTube URL or uploaded long-form video, transcribes it with Whisper Large V3, uses Qwen2.5 to identify high-engagement highlight moments based on a reusable channel profile, optionally enriches candidates with Qwen2-VL visual analysis, and renders short-form MP4 clips with subtitles using ffmpeg. The React editor lets creators trim, edit subtitles, approve, delete, regenerate, and download final clips. The project is designed for AMD Developer Cloud with ROCm and AMD Instinct MI300X acceleration, demonstrating how high-throughput multimodal AI can reduce hours of manual editing into a fast creator workflow.
```
## Judging Criteria Mapping
### Application of Technology
ElevenClip.AI integrates Whisper, Qwen2.5, Qwen2-VL, Hugging Face, ROCm, vLLM, and AMD Developer Cloud into an end-to-end video processing product.
### Presentation
The demo is designed to be visual and easy to understand: input a long video, watch AI create candidates, edit clips, and download platform-ready MP4 files.
### Business Value
The product targets a real creator economy workflow. Creators, agencies, podcasters, educators, and streamers all need short-form repurposing, and manual editing is expensive.
### Originality
The system goes beyond generic clipping by personalizing highlight selection to a creator's niche, style, language, clip length, and platform. It also preserves human editorial control instead of fully automating final publishing.
## Build-in-Public Plan
The hackathon includes a build-in-public challenge. Suggested updates:
1. Share the architecture and first local demo.
2. Share AMD Cloud/ROCm setup notes and benchmark results.
3. Publish meaningful feedback about ROCm, AMD Developer Cloud, or inference setup.
Suggested hashtags/topics:
```text
#AMDDeveloperHackathon #ROCm #MI300X #HuggingFace #Qwen #VideoAI #MultimodalAI
```
## Roadmap
- Real Whisper Large V3 run on AMD Developer Cloud.
- Real Qwen2.5 vLLM ROCm inference path.
- Qwen2-VL frame sampling and visual scoring.
- Batch export for 10+ clips.
- Subtitle styling presets per platform.
- Creator profile memory and reusable brand presets.
- Hugging Face Space screenshot and richer project media.
- CPU vs MI300X benchmark report after AMD credits arrive.
## License
MIT. See [LICENSE](LICENSE).