---
title: StudioMI300
emoji: 🎬
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: true
license: mit
short_description: One prompt → 30s cinematic reel on a single AMD MI300X
thumbnail: thumbnail.png
tags:
- amd
- amd-hackathon-2026
- mi300x
- rocm
- video-generation
- wan2.2
- flux
- qwen
- text-to-video
- text-to-film
- cinematic
- gradio
---
# StudioMI300
**One prompt → 30-second cinematic reel.** Built for the AMD Developer Hackathon 2026
on a single AMD Instinct MI300X (192 GB HBM3, ROCm 7.2).
## What it does
You write one sentence. The pipeline plans a six-shot story, paints character
keyframes, animates them, scores the music, narrates the voice-over, and stitches
everything into a 30-second `mp4`. No setup. No LoRA training. No per-shot prompting.
```
"A young woman walks through neon-lit Tokyo at night and meets two friends."
↓
[ ~45 minutes on a single MI300X ]
↓
30s cinematic reel.mp4 + audio
```
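As a rough illustration of what "plans a six-shot story" could look like in code, here is a minimal sketch of a plan container. The field names follow the Director Agent outputs listed under "How it works" (shots, character portraits, music brief, VO script, language tag); the class names, the `validate` helper, and the equal five-second shot length are assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass

SHOT_COUNT = 6        # from the README: six-shot story
REEL_SECONDS = 30.0   # from the README: 30-second reel


@dataclass
class Shot:
    index: int
    prompt: str  # hypothetical: per-shot description written by the planner
    seconds: float = REEL_SECONDS / SHOT_COUNT  # assumption: equal-length shots


@dataclass
class ReelPlan:
    logline: str                     # the single sentence the user wrote
    shots: list[Shot]
    character_portraits: list[str]   # hypothetical: one description per character
    music_brief: str
    vo_script: str
    language: str                    # language tag chosen by the planner

    def validate(self) -> None:
        """Raise ValueError if the plan cannot fill exactly one reel."""
        if len(self.shots) != SHOT_COUNT:
            raise ValueError(f"expected {SHOT_COUNT} shots, got {len(self.shots)}")
        total = sum(shot.seconds for shot in self.shots)
        if abs(total - REEL_SECONDS) > 1e-6:
            raise ValueError(f"shots add up to {total}s, not {REEL_SECONDS}s")
```

Validating the plan before any rendering starts is a cheap guard: a malformed plan is caught in milliseconds rather than after a long render.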
## How it works (single MI300X, sequential)
1. **Director Agent**: Qwen3.5-35B-A3B (BF16, vLLM, AITER MoE) plans 6 shots,
character portraits, music brief, VO script, language tag.
2. **Per-shot keyframes**: FLUX.2 [klein] 4B reference editing seeds each
shot from a single canonical character master, pinning identity.
3. **Animation**: Wan2.2-I2V-A14B with ParaAttention FBCache (2×, lossless)
and selective `torch.compile` on `transformer_2` (1.2× compile win).
4. **Vision Critic**: the same Qwen3.5 looks at four sampled frames per clip,
labels failure modes (`STYLIZED_AI_LOOK`, `CHARACTER_DRIFT`, `EXTRAS_INVADE_FRAME`...)
and triggers a re-render with a bumped seed if the score is below threshold.
5. **Music**: ACE-Step v1 3.5B generates a 30-second instrumental from the
Director's music brief.
6. **Voice-over**: Kokoro-82M narrates the Director's script in any of 9
languages (Director picks the language to match the setting).
7. **Mix**: `ffmpeg` concat-and-loudnorm into the final `mp4`.
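The Vision Critic step (render, score, re-render with a bumped seed) can be sketched as a small retry loop. This is an illustration only: `render_with_critic`, the `render` and `critique` callables, the `0.7` threshold, the three-attempt cap, and bumping the seed by one are all assumptions for the example, not values taken from the project.

```python
from typing import Callable, List, Tuple

Critique = Tuple[float, List[str]]  # (score, failure-mode labels)


def render_with_critic(
    render: Callable[[int], str],         # hypothetical: seed -> path of rendered clip
    critique: Callable[[str], Critique],  # hypothetical: clip path -> (score, labels)
    base_seed: int,
    threshold: float = 0.7,               # assumption: the README gives no value
    max_attempts: int = 3,                # assumption: the README gives no retry cap
) -> Tuple[str, float, List[str], int]:
    """Render a clip, re-rendering with a bumped seed while the critic rejects it.

    Returns (clip, score, labels, seed) for the best-scoring attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best = None
    for attempt in range(max_attempts):
        seed = base_seed + attempt  # "bumped seed": here simply incremented by one
        clip = render(seed)
        score, labels = critique(clip)
        if best is None or score > best[1]:
            best = (clip, score, labels, seed)
        if score >= threshold:
            break  # the critic accepted this clip
    return best
```

Capping the attempts keeps the worst-case render time bounded, and returning the best-scoring attempt means a shot is never dropped just because no render cleared the threshold.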
## The full open-source stack (Apache 2.0 / MIT throughout)
| Stage | Model | License |
|---|---|---|
| Planner / Critic | Qwen3.5-35B-A3B | Apache 2.0 |
| Image | FLUX.2 [klein] 4B | Apache 2.0 |
| Video | Wan2.2-I2V-A14B | Apache 2.0 |
| Music | ACE-Step v1 3.5B | Apache 2.0 |
| TTS | Kokoro-82M | Apache 2.0 |
| Serving | vLLM 0.17 | Apache 2.0 |
| Caching | ParaAttention FBCache | Apache 2.0 |
| AMD kernels | AITER 0.1.13 | MIT |
| Project code | StudioMI300 | MIT |

Every output you generate from this stack is yours to use commercially.
## Why a single MI300X
Most cinematic generation pipelines assume you have a multi-GPU cluster: one GPU
for the planner, one for the image model, one for the video model, etc. On 192 GB
HBM3 the pipeline runs them all sequentially on the same card. That's the project's central
constraint and also its main flex:
- Qwen3.5-35B planner loads / unloads cleanly between Director and Critic phases.
- Wan2.2-I2V-A14B (~80 GB BF16) leaves headroom for FLUX.2 [klein] 4B (~8 GB)
and ACE-Step (~12 GB) to live alongside in subprocess scope.
- AITER MoE kernels serve the planner. AITER FA / FP8 was evaluated for Wan2.2;
results are documented in `incidents.md` in the GitHub repo (the FP8 path crashes
mid-pipeline on ROCm 7.2, see AITER/issues#2187, so BF16 ships).
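The headroom claim above is simple arithmetic on the approximate footprints quoted in this section. The sketch below only adds them up; the dictionary keys and the `headroom_gb` helper are hypothetical names, and the figures cover the quoted model sizes only, not activations, caches, or allocator overhead.

```python
HBM_GB = 192  # MI300X capacity quoted in this README

# Approximate footprints quoted in this README, in GB.
FOOTPRINT_GB = {
    "wan2.2-i2v-a14b": 80,
    "flux.2-klein-4b": 8,
    "ace-step": 12,
}


def headroom_gb(resident: list[str], capacity_gb: int = HBM_GB) -> int:
    """Return the GB left on the card when the named models are co-resident."""
    used = sum(FOOTPRINT_GB[name] for name in resident)
    if used > capacity_gb:
        raise MemoryError(f"{used} GB requested, only {capacity_gb} GB available")
    return capacity_gb - used


if __name__ == "__main__":
    # Video, image, and music models resident at the same time.
    print(headroom_gb(["wan2.2-i2v-a14b", "flux.2-klein-4b", "ace-step"]))
```

With all three models resident the sum is 100 GB, which leaves 92 GB of the 192 GB unallocated under these estimates.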
## Live demo
This Space hosts the showcase. Live generation requires an MI300X (45 minutes
per reel is too long for a casual visitor anyway). The full pipeline is on
GitHub: clone it, point it at your MI300X, and it generates.
## Credits
AMD Developer Hackathon 2026 entry. Built solo over six days on one AMD
Developer Cloud MI300X droplet.
Made with the open-source ecosystem: Black Forest Labs, Wan-AI, Alibaba Qwen,
StepFun, hexgrad/Kokoro, vLLM, ParaAttention, diffusers, AMD ROCm + AITER.