| title: VideoVoice API | |
| sdk: gradio | |
| sdk_version: 6.12.0 | |
| app_file: app.py | |
| python_version: "3.10" | |
| <!-- | |
| ZeroGPU is enabled from the Space Settings UI (not via frontmatter). | |
| PRO account required. `app.py` mounts the FastAPI pipeline onto Gradio | |
| so the React client keeps calling `/api/*` over CORS unchanged. | |
| --> | |
| # VideoVoice | |
| **AI-powered short video translation with zero-shot voice cloning.** | |
| Translate any short video (≤60s) into 23+ languages while preserving the original speaker's voice. Paste an Instagram Reel, YouTube Short, or upload any video file. | |
| --- | |
| ## How It Works | |
| 1. **Upload or Paste URL**: Drop a video file or paste a social media link | |
| 2. **AI Translates & Clones**: Our 6-step pipeline transcribes, translates, and synthesizes new speech using a voice clone of the original speaker | |
| 3. **Preview & Download**: Watch your translated video and download it in full quality | |
| ### Pipeline Architecture | |
| ``` | |
| Video → Extract Audio → Whisper Transcription → LLM Translation | |
| → Chatterbox Voice Clone + TTS → Time-Sync → Final Merge | |
| ``` | |
| | Step | Component | Description | | |
| |------|-----------|-------------| | |
| | 1 | FFmpeg | Extract audio track from video | | |
| | 2 | Whisper Large V3 | Transcribe with word-level timestamps | | |
| | 3 | GPT-4o-mini | Context-aware subtitle translation | | |
| | 4 | Chatterbox Multilingual | Zero-shot voice cloning + TTS synthesis | | |
| | 5 | Dynamic Time-Stretch | Align translated audio to original timing | | |
| | 6 | FFmpeg | Merge new audio track back into video | | |
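Step 5 can be illustrated with a small calculation. The sketch below is not the project's implementation. It assumes the time-stretch is a single tempo ratio per segment, clamped so the speech stays intelligible. The function name, the clamp bounds (0.75 to 1.5), and the example durations are all hypothetical.

```python
def stretch_ratio(synth_seconds: float, target_seconds: float,
                  min_ratio: float = 0.75, max_ratio: float = 1.5) -> float:
    """Return the tempo factor that fits synthesized speech into a target slot.

    A ratio above 1.0 speeds the audio up; below 1.0 slows it down.
    The clamp bounds are illustrative assumptions, not project settings.
    """
    if synth_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    ratio = synth_seconds / target_seconds
    # Clamp so extreme mismatches do not produce unnatural speech.
    return max(min_ratio, min(max_ratio, ratio))


# Hypothetical segment: 3.0 s of translated speech for a 2.5 s original slot.
ratio = stretch_ratio(3.0, 2.5)
print(f"tempo factor: {ratio:.2f}")
```

When the ratio hits a clamp bound, the translated segment will not match the original timing exactly, so a real pipeline would need another strategy (padding, trimming, or re-translating more concisely) for those cases.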
| --- | |
| ## Running Locally | |
| ### Prerequisites | |
| - Python 3.10+ (`requires-python = ">=3.10,<3.13"`) | |
| - FFmpeg (`brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Ubuntu) | |
| - An OpenAI API key | |
| ### First-time setup | |
| ```bash | |
| # 1. Install uv (skip if you already have it) | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| # 2. Clone and enter the repo | |
| git clone https://github.com/Video-Voice/VideoVoice-be.git | |
| cd VideoVoice-be | |
| # 3. Install deps with the chatterbox TTS engine (default for local dev) | |
| # Use `--extra omnivoice` instead if you want OmniVoice. The two extras | |
| # are mutually exclusive; pick one. | |
| uv sync --extra chatterbox | |
| # 4. Configure env vars | |
| cp .env.example .env | |
| # Edit .env: at minimum set OPENAI_API_KEY and ARTIFACTS_ROOT=./data | |
| ``` | |
| ### One-time: hide the vendored chatterbox folder | |
| The repo ships a vendored `./chatterbox/` folder that the HF Chatterbox Space needs (it has ZeroGPU-specific tweaks). Locally we want Python to import the PyPI `chatterbox-tts` package instead, so tell git to ignore the working-tree state for that folder and delete it locally: | |
| ```bash | |
| git ls-files chatterbox/ | xargs git update-index --skip-worktree | |
| rm -rf chatterbox/ | |
| ``` | |
| HEAD still contains the folder, so HF deploys are unaffected. To reverse this, run `git update-index --no-skip-worktree` on the same files, then `git checkout HEAD -- chatterbox/`. | |
| ### Run the server | |
| ```bash | |
| uv run python server.py | |
| ``` | |
| Open [http://localhost:8000](http://localhost:8000). `/api/*` are the backend routes; `/` serves the legacy static UI in `frontend/`. If the port is in use, set `PORT=8001`. | |
| Per-job artifacts land in `$ARTIFACTS_ROOT/<job_id>/`. With `ARTIFACTS_ROOT=./data` (in `.env`) that's `./data/<job_id>/` next to the repo, the same layout the repo has always used. | |
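The path rule above can be sketched in a few lines. This is a hypothetical helper, not code from the repository: the function name and the `./data` fallback for a missing `ARTIFACTS_ROOT` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def job_artifacts_dir(job_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $ARTIFACTS_ROOT/<job_id> for a job (hypothetical helper)."""
    env = os.environ if env is None else env
    # Falling back to ./data is an assumption; the README only says to set
    # ARTIFACTS_ROOT in .env.
    root = Path(env.get("ARTIFACTS_ROOT", "./data"))
    return root / job_id


print(job_artifacts_dir("abc123_1", {"ARTIFACTS_ROOT": "./data"}))
```

A relative `ARTIFACTS_ROOT` resolves against the process's working directory, so starting the server from a different folder changes where artifacts land.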
| ### Run the pipeline headlessly | |
| ```bash | |
| uv run python pipeline.py --input data/my_video.mp4 --target-lang Spanish | |
| ``` | |
| --- | |
| ## API Reference | |
| The following endpoints are available on the backend (FastAPI/Gradio Server). When running on Hugging Face, replace `localhost:8000` with your Space's API URL (e.g., `https://rafii-videovoice.hf.space`). | |
| ### Core Endpoints | |
| #### `POST /api/jobs` | |
| Submit a video for translation. You can provide either a local file or a URL. | |
| **Form Data:** | |
| - `file`: (Optional) Video file upload (MP4, MOV, WebM, ≤90MB). | |
| - `url`: (Optional) Social media URL (Instagram, YouTube, TikTok). | |
| - `target_language`: (Required) Name of target language (e.g., "Spanish", "Hindi"). | |
| - `source_language`: (Optional) ISO code of source (default: "en"). | |
| - `voice_mode`: (Optional) `chatterbox` or `omnivoice` (must match Space engine). | |
| - `captions`: (Optional) "true" or "false" (default: "true"). | |
| - `preserve_music`: (Optional) "true" or "false" (default: "false"). | |
| **Example:** | |
| ```bash | |
| curl -X POST http://localhost:8000/api/jobs \ | |
|   -F "file=@my_video.mp4" \ | |
|   -F "target_language=French" | |
| ``` | |
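For Python clients, the form fields listed above could be assembled with a small helper like the one below. It is a sketch under stated assumptions: the helper name is hypothetical, and the defaults simply mirror the field list in this section. The file upload itself is not handled here.

```python
from typing import Dict, Optional


def build_job_form(target_language: str,
                   url: Optional[str] = None,
                   source_language: str = "en",
                   captions: bool = True,
                   preserve_music: bool = False,
                   voice_mode: Optional[str] = None) -> Dict[str, str]:
    """Build the text form fields for POST /api/jobs (hypothetical helper)."""
    if not target_language:
        raise ValueError("target_language is required")
    if voice_mode not in (None, "chatterbox", "omnivoice"):
        raise ValueError("voice_mode must be 'chatterbox' or 'omnivoice'")
    form = {
        "target_language": target_language,
        "source_language": source_language,
        # The API takes booleans as the strings "true" / "false".
        "captions": "true" if captions else "false",
        "preserve_music": "true" if preserve_music else "false",
    }
    if url:
        form["url"] = url
    if voice_mode:
        form["voice_mode"] = voice_mode
    return form


form = build_job_form("French", url="https://www.instagram.com/reels/XYZ/")
print(form)
# With the third-party `requests` package, an upload could be sent as multipart
# form data, similar to `curl -F`:
#   requests.post("http://localhost:8000/api/jobs", data=form,
#                 files={"file": open("my_video.mp4", "rb")})
```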
| #### `GET /api/jobs/{job_id}` | |
| Poll for the real-time status and progress messages of a specific job. | |
| **Query Parameters:** | |
| - `after`: (Optional) Index of the last message received to fetch only new ones. | |
| **Example:** | |
| ```bash | |
| curl "http://localhost:8000/api/jobs/abc123_1?after=5" | |
| ``` | |
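A polling loop built on the `after` cursor might look like the sketch below. The response shape is an assumption, since this README does not document it: the code expects a JSON object with a `status` field and a `messages` list. It also treats `after` as the count of messages already seen, which may differ from the server's exact indexing. The HTTP call is injected as a `fetch` function so the loop can be tried without a running server.

```python
import time
from typing import Callable, Dict, List
from urllib.parse import quote, urlencode


def status_url(base_url: str, job_id: str, after: int = 0) -> str:
    """Build the polling URL, passing the message cursor as ?after=N."""
    path = f"/api/jobs/{quote(job_id, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'after': after})}"


def poll_job(fetch: Callable[[str], Dict], base_url: str, job_id: str,
             interval: float = 2.0, max_polls: int = 300) -> List[str]:
    """Poll until the job finishes, collecting progress messages.

    ASSUMED response shape (not documented in this README):
    {"status": "running" | "done" | "error", "messages": [str, ...]}
    """
    seen: List[str] = []
    for _ in range(max_polls):
        payload = fetch(status_url(base_url, job_id, after=len(seen)))
        seen.extend(payload.get("messages", []))
        if payload.get("status") in ("done", "error"):
            return seen
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Stand-in for a real HTTP GET that returns parsed JSON.
_fake = iter([
    {"status": "running", "messages": ["extracting audio"]},
    {"status": "done", "messages": ["merging"]},
])
print(poll_job(lambda url: next(_fake), "http://localhost:8000", "abc123_1", interval=0))
```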
| #### `GET /api/jobs/{job_id}/result` | |
| Download the final translated video file. | |
| **Example:** | |
| ```bash | |
| curl -O -L http://localhost:8000/api/jobs/abc123_1/result | |
| ``` | |
| --- | |
| ### Utility & Configuration | |
| #### `GET /api/config` | |
| Fetch server configuration, including supported languages, max file size, and the active TTS engine. | |
| #### `GET /api/health` | |
| Check if the server is alive and see GPU availability/queue depth. | |
| #### `GET /api/showcase` | |
| Retrieve curated "before & after" demo entries defined in `data/showcase.json`. | |
| #### `GET /api/demo-videos` | |
| List all whitelisted demo videos available for streaming from the `outputs/` and `data/` folders. | |
| #### `GET /api/demo-videos/{video_id}/stream` | |
| Stream a specific demo video by its opaque ID. | |
| --- | |
| ### Interactive / Preview Endpoints | |
| #### `GET /api/jobs/{job_id}/preview/{model_name}` | |
| Retrieve a short audio snippet of the cloned voice for a specific TTS model before proceeding with full synthesis. | |
| #### `POST /api/jobs/{job_id}/select-model` | |
| Confirm which TTS model to use after listening to previews (used in multi-model workflows). | |
| --- | |
| ### ZeroGPU / Gradio Internal API | |
| #### `POST /run_pipeline` (Gradio API) | |
| Internal endpoint used by ZeroGPU to trigger the heavy ML processing logic. Recommended for use via `gradio_client`. | |
| **Example (Python):** | |
| ```python | |
| from gradio_client import Client | |
| client = Client("Rafii/videovoice") | |
| client.predict(job_id="abc123_1", api_name="/run_pipeline") | |
| ``` | |
| --- | |
| ## Testing the API (Hugging Face Spaces) | |
| When running on Hugging Face Spaces (using `app.py`), you can test the API using standard HTTP tools or the Gradio Client. Choose the Space corresponding to the desired TTS engine: | |
| | TTS Engine | Space URL | API Endpoint | | |
| |------------|-----------|--------------| | |
| | **Chatterbox** | `Rafii/videovoice` | `https://rafii-videovoice.hf.space` | | |
| | **OmniVoice** | `Rafii/videovoice-omni` | `https://rafii-videovoice-omni.hf.space` | | |
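The table above can be encoded as a small lookup so client code picks the right base URL for an engine. The helper name is hypothetical; the Space IDs and URLs are the ones listed in the table.

```python
# Engine name -> (Space ID, API base URL), copied from the table above.
SPACES = {
    "chatterbox": ("Rafii/videovoice", "https://rafii-videovoice.hf.space"),
    "omnivoice": ("Rafii/videovoice-omni", "https://rafii-videovoice-omni.hf.space"),
}


def api_url(engine: str, path: str) -> str:
    """Return the full URL for an API path on the Space that runs `engine`."""
    try:
        _space_id, base = SPACES[engine.lower()]
    except KeyError:
        raise ValueError(f"unknown TTS engine: {engine!r}") from None
    return f"{base}/{path.lstrip('/')}"


print(api_url("omnivoice", "/api/health"))
# prints: https://rafii-videovoice-omni.hf.space/api/health
```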
| ### 1. Using `curl` (FastAPI Routes) | |
| You can check the health of the API and verify that it's running: | |
| ```bash | |
| # Chatterbox Space | |
| curl https://rafii-videovoice.hf.space/api/health | |
| # OmniVoice Space | |
| curl https://rafii-videovoice-omni.hf.space/api/health | |
| ``` | |
| To submit a job via the standard API: | |
| ```bash | |
| curl -X POST https://rafii-videovoice.hf.space/api/jobs \ | |
|   -F "url=https://www.instagram.com/reels/XYZ/" \ | |
|   -F "target_language=Spanish" | |
| ``` | |
| ### 2. Using `gradio_client` (Gradio API Routes) | |
| The `gradio.Server` endpoints are optimized for ZeroGPU and can be accessed using the Python `gradio_client`: | |
| ```python | |
| from gradio_client import Client | |
| # Change to "Rafii/videovoice-omni" for OmniVoice | |
| client = Client("Rafii/videovoice") | |
| result = client.predict( | |
|     job_id="abc123", | |
|     api_name="/run_pipeline" | |
| ) | |
| print(result) | |
| ``` | |
| ### 3. Using JavaScript (Frontend) | |
| The new `gradio.Server` mode is designed for custom frontends. You can use the `@gradio/client` JS library: | |
| ```javascript | |
| import { Client } from "@gradio/client"; | |
| // Connect to the specific Space | |
| const client = await Client.connect("Rafii/videovoice"); | |
| const result = await client.predict("/run_pipeline", { | |
|   job_id: "abc123", | |
| }); | |
| ``` | |
| --- | |
| ## Supported Languages | |
| Spanish, French, German, Hindi, Portuguese, Italian, Japanese, Chinese, Arabic, Korean, and more. | |
| --- | |
| ## Project Structure | |
| ``` | |
| VideoVoice/ | |
| ├── server.py            # FastAPI backend | |
| ├── pipeline.py          # Core translation pipeline | |
| ├── steps/               # Pipeline step modules | |
| │   ├── s1_extract_audio.py | |
| │   ├── s2_transcribe.py | |
| │   ├── s3_translate.py | |
| │   ├── s4_tts.py | |
| │   ├── s5_sync.py | |
| │   └── s6_merge.py | |
| ├── frontend/            # Static web UI | |
| │   ├── index.html | |
| │   ├── style.css | |
| │   └── app.js | |
| ├── pyproject.toml       # Dependencies & project config | |
| ├── uv.lock              # Lockfile (reproducible installs) | |
| ├── .env.example | |
| └── README.md | |
| ``` | |
| --- | |
| ## Entrypoints | |
| Two entrypoints exist by design. They run in different contexts but **ship the same code**: | |
| | File | When it runs | What it does | | |
| |------|-------------|--------------| | |
| | `server.py` | Local dev (`uv run python server.py`) | Plain FastAPI app that defines every `/api/*` route. | | |
| | `app.py` | Hugging Face Spaces | Gradio Server that imports `server.py`'s router and wraps it with `@spaces.GPU` for ZeroGPU. | | |
| `app.py` depends on `server.py`, so `server.py` must ship to HF. Do not strip it from the deploy. | |
| ## Deployment | |
| ### Hugging Face Spaces (production) | |
| Push to `main` → GitHub Actions runs `.github/workflows/deploy-hf.yml` → both Spaces (`Rafii/videovoice` and `Rafii/videovoice-omni`) redeploy automatically. No manual step. | |
| One-time CI setup: | |
| 1. Create an HF access token with write access to both Spaces: https://huggingface.co/settings/tokens | |
| 2. Add it as `HF_TOKEN` under **Settings → Secrets and variables → Actions** in the GitHub repo. | |
| Manual fallback (from a local clean checkout with `space` and `space-omni` remotes configured): | |
| ```bash | |
| ./deploy.sh # skips if remote is already at HEAD | |
| ./deploy.sh --force # always redeploy | |
| ``` | |
| Files filtered out of every Space deploy are listed in `.gitattributes` (`export-ignore`). | |
| ### Branching | |
| `main` is canonical. Use short-lived `feat/<thing>` branches, open a PR, merge, delete. Never maintain a parallel deploy branch; every change on `main` reaches both Spaces via CI. | |
| ### AWS (alternative GPU host) | |
| ```bash | |
| # On a g4dn.xlarge instance | |
| sudo apt update && sudo apt install -y ffmpeg | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| uv sync | |
| uv run python server.py | |
| ``` | |
| Recommended: use a `systemd` service for auto-restart, CloudFront as the CDN, and S3 for video storage with a 24-hour auto-delete lifecycle policy. | |
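The auto-delete policy mentioned above could be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration dictionary. The rule ID, key prefix, and bucket name are placeholders, not values from this project. S3 lifecycle expiration is set in whole days, so "24h" maps to `Days: 1`, and objects may persist somewhat longer than 24 hours because S3 applies expiration on a daily schedule.

```python
import json


def expiry_lifecycle(prefix: str = "", days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires objects after `days` days."""
    if days < 1:
        raise ValueError("S3 lifecycle expiration is specified in whole days (minimum 1)")
    return {
        "Rules": [
            {
                "ID": "expire-job-videos",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = expiry_lifecycle(prefix="jobs/")  # "jobs/" is a made-up key prefix
print(json.dumps(config, indent=2))
# With boto3 installed and AWS credentials configured, it could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-videovoice-bucket",  # placeholder bucket name
#       LifecycleConfiguration=config,
#   )
```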
| --- | |
| ## License | |
| MIT License. See [LICENSE](LICENSE). | |