---
title: ElevenClip AI
emoji: 🎬
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
---

ElevenClip.AI

ElevenClip.AI is an AI-powered clip studio for turning long-form videos into personalized short-form content for TikTok, YouTube Shorts, and Instagram Reels.

This project is built for the AMD Developer Hackathon on lablab.ai, targeting Track 3: Vision & Multimodal AI. The system is designed to run on AMD Developer Cloud with ROCm and AMD Instinct MI300X acceleration, while using Hugging Face as the model hub/deployment layer and Qwen models for profile-aware highlight reasoning.

One-Sentence Pitch

ElevenClip.AI helps creators convert long videos into ready-to-edit short clips by combining Whisper transcription, Qwen highlight detection, optional Qwen-VL visual understanding, ffmpeg rendering, and a human-in-the-loop clip editor.

Problem

Long-form creators, podcasters, educators, streamers, and marketing teams often publish hours of video but still need short clips for modern discovery platforms.

The manual workflow is painful:

  • Watch the full video.
  • Find high-retention moments.
  • Trim each clip.
  • Rewrite subtitles.
  • Reframe to vertical 9:16.
  • Export platform-ready MP4 files.

For a two-hour video, this can take several hours of editing time. The bottleneck is not just cutting video; it is understanding which moments match the creator's audience, channel style, language, and target platform.

Solution

ElevenClip.AI automates the first pass of short-form production:

  1. The creator sets up a reusable channel profile.
  2. The creator provides a YouTube URL or uploads a video file.
  3. Whisper Large V3 transcribes the video, including Thai and multilingual speech.
  4. Qwen2.5 analyzes the transcript and scores candidate highlights based on engagement potential and the creator profile.
  5. Optional Qwen2-VL analysis can enrich the scores with visual signals such as reactions, scene changes, and on-screen text.
  6. ffmpeg renders vertical clips with subtitle files and burn-in support.
  7. The React editor lets the creator approve, delete, trim, or regenerate each clip and edit its subtitles before download.

The product is intentionally human-AI collaborative: AI finds and prepares the clips quickly, while the creator keeps editorial control.

Hackathon Alignment

Track

Track 3: Vision & Multimodal AI

ElevenClip.AI processes multiple media types:

  • Audio: speech transcription with Whisper Large V3.
  • Text: transcript reasoning and highlight ranking with Qwen2.5.
  • Video: frame-aware multimodal analysis with Qwen2-VL as the next pipeline stage.
  • Rendered media: ffmpeg exports platform-ready video clips.

AMD Technology

The production target is AMD Developer Cloud:

  • AMD Instinct MI300X for high-throughput model inference.
  • ROCm 6.x as the GPU software stack.
  • PyTorch with ROCm support for Whisper inference.
  • vLLM ROCm backend for fast Qwen2.5 inference.
  • Optimum-AMD as an optimization path for Hugging Face models on AMD hardware.
  • ffmpeg hardware acceleration hooks for faster video encoding where available.

The app has a local DEMO_MODE=true path so judges and teammates can inspect the UI/API without downloading large models. On AMD Developer Cloud, set DEMO_MODE=false to activate the real model stack.

Hugging Face Integration

Hugging Face is used as the model hub and deployment layer:

  • openai/whisper-large-v3 for transcription.
  • Qwen/Qwen2.5-7B-Instruct for highlight analysis.
  • Qwen/Qwen2-VL-7B-Instruct for multimodal video understanding.
  • Public Hugging Face Space for the hackathon demo page: https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ElevenClip-AI

Qwen Integration

Qwen is not used as a generic chatbot. It is part of the core product logic:

  • Reads timestamped transcript segments.
  • Considers creator profile settings.
  • Scores engagement potential.
  • Explains why a segment should become a clip.
  • Returns structured JSON with timestamps, titles, scores, reasons, and subtitle text.
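
The exact contract lives in the Pydantic schemas under backend/app/models/; as an illustration only (field names here are assumptions, not copied from the code), one highlight entry could be modeled like this:

```python
# Illustrative schema for a single Qwen-scored highlight candidate.
# Field names are assumptions for this sketch; the real request/response
# schemas live in backend/app/models/.
from pydantic import BaseModel, Field

class HighlightCandidate(BaseModel):
    start: float = Field(description="Clip start time in seconds")
    end: float = Field(description="Clip end time in seconds")
    title: str = Field(description="Suggested short-form title")
    score: float = Field(ge=0.0, le=1.0, description="Engagement potential, 0 to 1")
    reason: str = Field(description="Why this segment should become a clip")
    subtitle_text: str = Field(description="Subtitle text for the clip window")
```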

Current MVP Features

  • Channel profile onboarding:
    • niche
    • preferred clip style
    • preferred clip length
    • primary language
    • target platform
  • YouTube URL ingestion through yt-dlp.
  • Direct video upload endpoint.
  • Whisper transcription service boundary.
  • Qwen highlight detection service boundary.
  • Optional Qwen2-VL multimodal analysis service boundary.
  • ffmpeg clip generation with subtitle file creation.
  • Vertical 9:16 export path for TikTok, Shorts, and Reels.
  • Human-AI review UI:
    • trim start/end
    • edit subtitles inline
    • approve clips
    • delete clips
    • regenerate a clip
    • download MP4 output
  • Timing logs for benchmark demos.
  • Docker and AMD Cloud deployment notes.

Architecture

```mermaid
flowchart LR
  A["Creator Profile"] --> D["Qwen2.5 Highlight Scoring"]
  B["YouTube URL"] --> C["yt-dlp / Video Input"]
  B2["Uploaded Video"] --> C
  C --> W["Whisper Large V3 Transcription"]
  W --> D
  C --> V["Qwen2-VL Visual Analysis (Optional)"]
  D --> R["Clip Plan JSON"]
  V --> R
  R --> F["ffmpeg Clip Rendering + Subtitles"]
  F --> E["React Human-AI Editor"]
  E --> O["Approved Short-Form Clips"]
```

Repository Structure

```
.
├── backend/
│   ├── app/
│   │   ├── core/          # configuration and timing instrumentation
│   │   ├── models/        # Pydantic request/response schemas
│   │   ├── services/      # ingest, transcription, Qwen scoring, subtitles, rendering
│   │   ├── utils/         # ROCm / accelerator detection
│   │   ├── workers/       # optional Celery wiring
│   │   ├── main.py        # FastAPI application
│   │   └── storage.py     # file-backed job storage for MVP
│   ├── Dockerfile
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── App.jsx        # creator workflow and clip editor
│   │   ├── main.jsx
│   │   └── styles.css
│   ├── Dockerfile
│   └── package.json
├── infra/
│   └── amd-cloud.md       # AMD Developer Cloud deployment guide
├── scripts/
│   └── benchmark.py       # end-to-end API benchmark helper
├── docker-compose.yml
└── README.md
```

Processing Pipeline

1. Video Input

The backend accepts:

  • YouTube URL through POST /api/jobs/youtube
  • Uploaded video file through POST /api/jobs/upload

In production, YouTube videos are downloaded with yt-dlp. In demo mode, the app can generate a synthetic ffmpeg test video so the workflow can be tested without external downloads.
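
For example, a job could be created and polled from a Python client like this (endpoint paths match the API overview below; the request and response field names are assumptions for this sketch):

```python
# Create a processing job from a YouTube URL, then poll it until it finishes.
# The "url" payload field and the "job_id"/"status" response fields are
# assumptions; check backend/app/models/ for the real contract.
import time
import requests

API = "http://localhost:8000"

job = requests.post(
    f"{API}/api/jobs/youtube",
    json={"url": "https://youtube.com/watch?v=..."},
).json()
job_id = job["job_id"]

while True:
    status = requests.get(f"{API}/api/jobs/{job_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

print(status.get("clips", []))
```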

2. Transcription

The transcription service is implemented in backend/app/services/transcription.py.

Production target:

  • Model: openai/whisper-large-v3
  • Runtime: Hugging Face Transformers
  • Accelerator: PyTorch ROCm on AMD MI300X
  • Language goal: Thai and multilingual support
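
A minimal sketch of that target using the Transformers ASR pipeline (the real service in backend/app/services/transcription.py may wire this differently):

```python
# Minimal Whisper Large V3 transcription sketch with the Transformers ASR
# pipeline. On ROCm builds of PyTorch, torch.cuda.* is the portability layer,
# so device 0 maps to the AMD GPU.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device == 0 else torch.float32,
    chunk_length_s=30,          # long-form audio is processed in chunks
    device=device,
)

result = asr(
    "input_video.wav",
    return_timestamps=True,                 # timestamps feed the clip planner
    generate_kwargs={"language": "thai"},   # or omit for automatic detection
)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```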

3. Highlight Detection

The highlight detector is implemented in backend/app/services/highlight.py.

Production target:

  • Model: Qwen/Qwen2.5-7B-Instruct
  • Runtime: vLLM with ROCm backend
  • Output: strict structured JSON

Highlight scoring considers:

  • questions
  • punchlines
  • emotional peaks
  • key information
  • channel niche
  • preferred clip style
  • target platform
  • target clip length
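
As a sketch of what that call could look like with vLLM's offline API (the prompt, profile fields, and JSON contract below are illustrative; backend/app/services/highlight.py owns the real logic):

```python
# Sketch of profile-aware highlight scoring with vLLM. Production code would
# apply the model's chat template and validate or repair the returned JSON;
# this is a bare-bones illustration.
import json
from vllm import LLM, SamplingParams

profile = {
    "niche": "tech podcast",
    "clip_style": "educational",
    "clip_length_seconds": 45,
    "language": "th",
    "platform": "tiktok",
}

prompt = (
    "You select short-form clip candidates from a transcript.\n"
    f"Creator profile: {json.dumps(profile)}\n"
    "Transcript segments (start, end, text):\n"
    "[12.0, 41.5] Why most creators burn out in their first year...\n"
    "Return a JSON list of objects with start, end, title, score (0-1), and reason."
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=1024)
text = llm.generate([prompt], params)[0].outputs[0].text
candidates = json.loads(text)  # real code should guard against malformed JSON
```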

4. Multimodal Analysis

The multimodal service boundary is implemented in backend/app/services/multimodal.py.

Planned production target:

  • Model: Qwen/Qwen2-VL-7B-Instruct
  • Inputs: sampled video frames, transcript context, and clip candidates
  • Visual signals:
    • creator or guest reactions
    • scene changes
    • on-screen text
    • high-motion segments

This is isolated as a replaceable pipeline step so it can be enabled when AMD Cloud resources are available.
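
The frame-sampling half of that step could look roughly like this (OpenCV-based sketch; the Qwen2-VL chat-template call itself is omitted and would live behind the same service boundary):

```python
# Sketch of frame sampling for Qwen2-VL visual analysis. Sampling rate and
# frame cap are illustrative; multimodal.py defines the real service boundary.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 2.0, max_frames: int = 16):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_seconds), 1)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames  # handed to the Qwen2-VL processor as image inputs
```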

5. Clip Generation

Clip rendering is implemented in backend/app/services/clips.py.

The ffmpeg stage:

  • cuts video by selected timestamps
  • exports MP4
  • creates .srt subtitle files
  • supports subtitle burn-in
  • reformats to 9:16 vertical output for short-form platforms
  • includes AMD hardware encoder configuration hooks
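
A hedged sketch of the kind of ffmpeg invocation this implies (filter values and encoder choice are illustrative; backend/app/services/clips.py owns the real command):

```python
# Sketch of one clip render: cut by timestamps, center-crop and scale to 9:16,
# and burn in the generated subtitles. The libx264 encoder can be swapped for
# an AMD hardware encoder where available.
import subprocess

def render_clip(src: str, start: float, end: float, srt: str, out: str) -> None:
    duration = max(end - start, 0.1)
    vf = (
        "crop=ih*9/16:ih,"   # center-crop to a 9:16 window
        "scale=1080:1920,"   # normalize to vertical 1080x1920
        f"subtitles={srt}"   # burn in the per-clip .srt
    )
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start), "-i", src, "-t", str(duration),
            "-vf", vf,
            "-c:v", "libx264", "-c:a", "aac",
            out,
        ],
        check=True,
    )
```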

6. Human-AI Collaborative Editing

The frontend editor lets creators review AI-generated clips and make final decisions:

  • adjust start and end timestamps
  • edit subtitle text
  • delete weak clips
  • approve good clips
  • regenerate a specific clip
  • download the result

API Overview

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Returns service health and accelerator detection. |
| POST | /api/jobs/youtube | Creates a processing job from a YouTube URL. |
| POST | /api/jobs/upload | Creates a processing job from an uploaded video. |
| GET | /api/jobs/{job_id} | Returns status, transcript, clips, timings, and errors. |
| PATCH | /api/jobs/{job_id}/clips/{clip_id} | Updates trim times, subtitles, approval, or deletion state. |
| POST | /api/jobs/{job_id}/clips/{clip_id}/regenerate | Re-renders one clip with updated parameters. |
| GET | /api/jobs/{job_id}/clips/{clip_id}/download | Downloads an exported clip. |
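
For example, the review loop could be driven from Python like this (the PATCH body field names are assumptions; the real contract is in the Pydantic schemas under backend/app/models/):

```python
# Sketch of the human-in-the-loop review calls: adjust and approve one clip,
# then download the exported MP4. IDs and field names are hypothetical.
import requests

API = "http://localhost:8000"
job_id, clip_id = "job-123", "clip-1"

requests.patch(
    f"{API}/api/jobs/{job_id}/clips/{clip_id}",
    json={"start": 62.0, "end": 98.5, "subtitle_text": "Edited line", "approved": True},
)

video = requests.get(f"{API}/api/jobs/{job_id}/clips/{clip_id}/download")
with open("clip-1.mp4", "wb") as f:
    f.write(video.content)
```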

Local Development

Requirements

  • Python 3.11+
  • Node.js 20+
  • ffmpeg

Backend

cd backend
python -m venv .venv
. .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

On Windows PowerShell:

cd backend
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend

cd frontend
npm install
npm run dev

Open:

http://localhost:5173

Demo Mode

By default, the project runs in demo mode:

DEMO_MODE=true

Demo mode avoids downloading multi-GB AI models and returns deterministic mock transcript/highlight data while still exercising the API, UI, job state, timing logs, subtitle generation, and ffmpeg rendering path.
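
Conceptually, the flag just selects mock or real service implementations; a minimal sketch, assuming hypothetical class names (the real settings live under backend/app/core/):

```python
# Sketch of a DEMO_MODE switch selecting a mock transcriber vs. the real
# Whisper-backed one. Class names are hypothetical.
import os

class MockTranscriber:
    def transcribe(self, path: str) -> list[dict]:
        # deterministic segments so the API/UI can be exercised offline
        return [{"start": 0.0, "end": 4.0, "text": "Welcome to the show."}]

class WhisperTranscriber:
    def __init__(self, model_id: str):
        self.model_id = model_id  # the real service would load the HF model here

    def transcribe(self, path: str) -> list[dict]:
        raise NotImplementedError("enabled only when DEMO_MODE=false")

def get_transcriber():
    demo = os.getenv("DEMO_MODE", "true").lower() == "true"
    return MockTranscriber() if demo else WhisperTranscriber("openai/whisper-large-v3")
```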

AMD Developer Cloud Deployment

See infra/amd-cloud.md for a focused deployment guide.

High-level steps:

git clone https://github.com/JakgritB/ElevenClip.AI.git
cd ElevenClip.AI
cp .env.example .env

Edit .env:

DEMO_MODE=false
HF_TOKEN=your_huggingface_token
WHISPER_MODEL_ID=openai/whisper-large-v3
QWEN_TEXT_MODEL_ID=Qwen/Qwen2.5-7B-Instruct
QWEN_VL_MODEL_ID=Qwen/Qwen2-VL-7B-Instruct

Install the AI/ROCm stack on the AMD instance:

cd backend
pip install -e ".[ai,rocm-inference]"

Start the API:

uvicorn app.main:app --host 0.0.0.0 --port 8000

Validate accelerator detection:

curl http://localhost:8000/health

Expected on AMD Cloud:

  • torch_available: true
  • cuda_api_available: true
  • rocm_hip_version populated
  • MI300X visible as the active device
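
The same checks can be reproduced directly with PyTorch (this mirrors what /health reports; the exact field names come from backend/app/utils/):

```python
# Minimal ROCm/accelerator probe. On ROCm builds of PyTorch the CUDA API is
# the portability layer, so torch.cuda.* works on MI300X and torch.version.hip
# is populated (it is None on CUDA or CPU builds).
import torch

print("torch_available:", True)
print("cuda_api_available:", torch.cuda.is_available())
print("rocm_hip_version:", torch.version.hip)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))  # expect an AMD Instinct MI300X
```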

Docker

docker compose up --build

For AMD Developer Cloud with ROCm extras:

docker compose build --build-arg INSTALL_EXTRAS=.[ai,rocm-inference] backend
docker compose up

The compose file mounts AMD GPU devices (/dev/kfd, /dev/dri) and uses host IPC for large-model inference.

Benchmark Plan

The hackathon judges care about technology application and real-world performance. ElevenClip.AI includes step-level timing logs so the demo can show why AMD acceleration matters.

Run a benchmark against a running API:

python scripts/benchmark.py \
  --api http://localhost:8000 \
  --youtube-url "https://youtube.com/watch?v=..."

Recommended benchmark comparison:

| Scenario | Hardware | Purpose |
| --- | --- | --- |
| CPU baseline | CPU-only runtime | Show the pain of long-form video processing without acceleration. |
| AMD GPU run | AMD Instinct MI300X + ROCm | Show high-throughput transcription and Qwen inference. |

Metrics captured:

  • input/download time
  • transcription time
  • highlight detection time
  • multimodal analysis time
  • clip generation time
  • total wall-clock time
  • number of clips generated
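
A sketch of the kind of step timer behind those logs (the real instrumentation lives in backend/app/core/; names here are illustrative):

```python
# Step timer sketch for per-stage benchmark logs. Each pipeline stage is
# wrapped, and the resulting dict is attached to the job record.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(step: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

with timed("transcription"):
    time.sleep(0.1)  # placeholder for the Whisper call
print(timings)
```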

Demo target:

  • input: two-hour creator video
  • output: 10 subtitle-ready clips
  • goal: under 10 minutes on MI300X

Submission Assets Checklist

The lablab.ai submission asks for:

  • Project title: ElevenClip.AI
  • Short description
  • Long description
  • Technology and category tags
  • Cover image
  • Video presentation
  • Slide presentation
  • Public GitHub repository
  • Demo application platform
  • Application URL

Prepared submission docs:

  • docs/SUBMISSION.md - copy-ready project text for lablab.ai.
  • docs/DEMO_SCRIPT.md - draft and final recording script.
  • docs/PITCH_DECK.md - slide outline for the presentation deck.
  • docs/BUILD_IN_PUBLIC.md - social post drafts and AMD feedback notes.
  • docs/AMD_CREDIT_RUNBOOK.md - checklist for the first MI300X run.

Recommended tags:

AMD, ROCm, MI300X, AMD Developer Cloud, Vision AI, Multimodal AI, Video AI, Whisper, Qwen, Qwen-VL, Hugging Face, FastAPI, React

Suggested Short Description

ElevenClip.AI turns long-form videos into personalized short-form clips using Whisper, Qwen, Hugging Face, and AMD ROCm on MI300X.

Suggested Long Description

ElevenClip.AI is a human-AI collaborative clip studio for creators. It takes a YouTube URL or uploaded long-form video, transcribes it with Whisper Large V3, uses Qwen2.5 to identify high-engagement highlight moments based on a reusable channel profile, optionally enriches candidates with Qwen2-VL visual analysis, and renders short-form MP4 clips with subtitles using ffmpeg. The React editor lets creators trim, edit subtitles, approve, delete, regenerate, and download final clips. The project is designed for AMD Developer Cloud with ROCm and AMD Instinct MI300X acceleration, demonstrating how high-throughput multimodal AI can reduce hours of manual editing into a fast creator workflow.

Judging Criteria Mapping

Application of Technology

ElevenClip.AI integrates Whisper, Qwen2.5, Qwen2-VL, Hugging Face, ROCm, vLLM, and AMD Developer Cloud into an end-to-end video processing product.

Presentation

The demo is designed to be visual and easy to understand: input a long video, watch AI create candidates, edit clips, and download platform-ready MP4 files.

Business Value

The product targets a real creator economy workflow. Creators, agencies, podcasters, educators, and streamers all need short-form repurposing, and manual editing is expensive.

Originality

The system goes beyond generic clipping by personalizing highlight selection to a creator's niche, style, language, clip length, and platform. It also preserves human editorial control instead of fully automating final publishing.

Build-in-Public Plan

The hackathon includes a build-in-public challenge. Suggested updates:

  1. Share the architecture and first local demo.
  2. Share AMD Cloud/ROCm setup notes and benchmark results.
  3. Publish meaningful feedback about ROCm, AMD Developer Cloud, or inference setup.

Suggested hashtags/topics:

#AMDDeveloperHackathon #ROCm #MI300X #HuggingFace #Qwen #VideoAI #MultimodalAI

Roadmap

  • Real Whisper Large V3 run on AMD Developer Cloud.
  • Real Qwen2.5 vLLM ROCm inference path.
  • Qwen2-VL frame sampling and visual scoring.
  • Batch export for 10+ clips.
  • Subtitle styling presets per platform.
  • Creator profile memory and reusable brand presets.
  • Hugging Face Space screenshot and richer project media.
  • CPU vs MI300X benchmark report after AMD credits arrive.

License

MIT. See LICENSE.