---
title: VideoVoice API
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
python_version: "3.10"
---

# VideoVoice

**AI-powered short video translation with zero-shot voice cloning.**

Translate any short video (≤60s) into 23+ languages while preserving the original speaker's voice. Paste an Instagram Reel, YouTube Short, or upload any video file.

---

## How It Works

1. **Upload or Paste URL**: Drop a video file or paste a social media link
2. **AI Translates & Clones**: Our 6-step pipeline transcribes, translates, and synthesizes new speech using a voice clone of the original speaker
3. **Preview & Download**: Watch your translated video and download in full quality

### Pipeline Architecture

```
Video → Extract Audio → Whisper Transcription → LLM Translation → Chatterbox Voice Clone + TTS → Time-Sync → Final Merge
```

| Step | Component | Description |
|------|-----------|-------------|
| 1 | FFmpeg | Extract audio track from video |
| 2 | Whisper Large V3 | Transcribe with word-level timestamps |
| 3 | GPT-4o-mini | Context-aware subtitle translation |
| 4 | Chatterbox Multilingual | Zero-shot voice cloning + TTS synthesis |
| 5 | Dynamic Time-Stretch | Align translated audio to original timing |
| 6 | FFmpeg | Merge new audio track back into video |

---

## Running Locally

### Prerequisites

- Python 3.10+ (`requires-python = ">=3.10,<3.13"`)
- FFmpeg (`brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Ubuntu)
- An OpenAI API key

### First-time setup

```bash
# 1. Install uv (skip if you already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter the repo
git clone https://github.com/Video-Voice/VideoVoice-be.git
cd VideoVoice-be

# 3. Install deps with the chatterbox TTS engine (default for local dev)
#    Use `--extra omnivoice` instead if you want OmniVoice. The two extras
#    are mutually exclusive, so pick one.
uv sync --extra chatterbox

# 4. Configure env vars
cp .env.example .env
# Edit .env: at minimum set OPENAI_API_KEY and ARTIFACTS_ROOT=./data
```

### One-time: hide the vendored chatterbox folder

The repo ships a vendored `./chatterbox/` folder that the HF Chatterbox Space needs (it has ZeroGPU-specific tweaks). Locally we want Python to import the PyPI `chatterbox-tts` package instead, so tell git to ignore the working-tree state for that folder and delete it locally:

```bash
git ls-files chatterbox/ | xargs git update-index --skip-worktree
rm -rf chatterbox/
```

HEAD still contains the folder, so HF deploys are unaffected. Reverse with `git update-index --no-skip-worktree` + `git checkout HEAD -- chatterbox/`.

### Run the server

```bash
uv run python server.py
```

Open [http://localhost:8000](http://localhost:8000). `/api/*` are the backend routes; `/` serves the legacy static UI in `frontend/`. If the port is in use, set `PORT=8001`.

Per-job artifacts land in `$ARTIFACTS_ROOT//`. With `ARTIFACTS_ROOT=./data` (in `.env`) that's `./data//` next to the repo, the same layout the repo has always used.

### Run the pipeline headlessly

```bash
uv run python pipeline.py --input data/my_video.mp4 --target-lang Spanish
```

---

## API Reference

The following endpoints are available on the backend (FastAPI/Gradio Server). When running on Hugging Face, replace `localhost:8000` with your Space's API URL (e.g., `https://rafii-videovoice.hf.space`).

### Core Endpoints

#### `POST /api/jobs`

Submit a video for translation. You can provide either a local file or a URL.

**Form Data:**

- `file`: (Optional) Video file upload (MP4, MOV, WebM, ≤90MB).
- `url`: (Optional) Social media URL (Instagram, YouTube, TikTok).
- `target_language`: (Required) Name of target language (e.g., "Spanish", "Hindi").
- `source_language`: (Optional) ISO code of source (default: "en").
- `voice_mode`: (Optional) `chatterbox` or `omnivoice` (must match Space engine).
- `captions`: (Optional) "true" or "false" (default: "true").
- `preserve_music`: (Optional) "true" or "false" (default: "false").

**Example:**

```bash
curl -X POST http://localhost:8000/api/jobs \
  -F "file=@my_video.mp4" \
  -F "target_language=French"
```

#### `GET /api/jobs/{job_id}`

Poll for the real-time status and progress messages of a specific job.

**Query Parameters:**

- `after`: (Optional) Index of the last message received to fetch only new ones.

**Example:**

```bash
# Quote the URL so the shell does not try to expand the "?"
curl "http://localhost:8000/api/jobs/abc123_1?after=5"
```

#### `GET /api/jobs/{job_id}/result`

Download the final translated video file.

**Example:**

```bash
curl -O -L http://localhost:8000/api/jobs/abc123_1/result
```

---

### Utility & Configuration

#### `GET /api/config`

Fetch server configuration, including supported languages, max file size, and the active TTS engine.

#### `GET /api/health`

Check if the server is alive and see GPU availability/queue depth.

#### `GET /api/showcase`

Retrieve curated "before & after" demo entries defined in `data/showcase.json`.

#### `GET /api/demo-videos`

List all whitelisted demo videos available for streaming from the `outputs/` and `data/` folders.

#### `GET /api/demo-videos/{video_id}/stream`

Stream a specific demo video by its opaque ID.

---

### Interactive / Preview Endpoints

#### `GET /api/jobs/{job_id}/preview/{model_name}`

Retrieve a short audio snippet of the cloned voice for a specific TTS model before proceeding with full synthesis.

#### `POST /api/jobs/{job_id}/select-model`

Confirm which TTS model to use after listening to previews (used in multi-model workflows).

---

### ZeroGPU / Gradio Internal API

#### `POST /run_pipeline` (Gradio API)

Internal endpoint used by ZeroGPU to trigger the heavy ML processing logic. Recommended for use via `gradio_client`.

**Example (Python):**

```python
from gradio_client import Client

client = Client("Rafii/videovoice")
client.predict(job_id="abc123_1", api_name="/run_pipeline")
```

---

## Testing the API (Hugging Face Spaces)

When running on Hugging Face Spaces (using `app.py`), you can test the API using standard HTTP tools or the Gradio Client. Choose the Space corresponding to the desired TTS engine:

| TTS Engine | Space URL | API Endpoint |
|------------|-----------|--------------|
| **Chatterbox** | `Rafii/videovoice` | `https://rafii-videovoice.hf.space` |
| **OmniVoice** | `Rafii/videovoice-omni` | `https://rafii-videovoice-omni.hf.space` |

### 1. Using `curl` (FastAPI Routes)

You can check the health of the API and verify that it's running:

```bash
# Chatterbox Space
curl https://rafii-videovoice.hf.space/api/health

# OmniVoice Space
curl https://rafii-videovoice-omni.hf.space/api/health
```

To submit a job via the standard API:

```bash
curl -X POST https://rafii-videovoice.hf.space/api/jobs \
  -F "url=https://www.instagram.com/reels/XYZ/" \
  -F "target_language=Spanish"
```

### 2. Using `gradio_client` (Gradio API Routes)

The `gradio.Server` endpoints are optimized for ZeroGPU and can be accessed using the Python `gradio_client`:

```python
from gradio_client import Client

# Change to "Rafii/videovoice-omni" for OmniVoice
client = Client("Rafii/videovoice")
result = client.predict(
    job_id="abc123",
    api_name="/run_pipeline"
)
print(result)
```

### 3. Using JavaScript (Frontend)

The new `gradio.Server` mode is designed for custom frontends. You can use the `@gradio/client` JS library:

```javascript
import { Client } from "@gradio/client";

// Connect to the specific Space
const client = await Client.connect("Rafii/videovoice");
const result = await client.predict("/run_pipeline", {
  job_id: "abc123",
});
```

---

## Supported Languages

Spanish, French, German, Hindi, Portuguese, Italian, Japanese, Chinese, Arabic, Korean, and more.
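
The list above is abbreviated. Since `GET /api/config` reports the supported languages for a given server, one way to see the full set is to query that route. The following is a minimal sketch using only the Python standard library; it assumes the route returns JSON and that a server is reachable at `BASE_URL`. The helper names `api_url` and `fetch_json` are illustrative and not part of the repo, and the response is printed verbatim because its schema is not documented here.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local server started with `uv run python server.py`.
# On Hugging Face, use the Space URL instead (e.g. https://rafii-videovoice.hf.space).
BASE_URL = "http://localhost:8000"


def api_url(base_url, path, **query):
    """Join a base URL and an /api/* path, appending any query parameters."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the body as JSON (assumes the route returns JSON)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def main():
    try:
        # The response schema is not documented here, so print it as-is
        # rather than guessing at field names.
        config = fetch_json(api_url(BASE_URL, "/api/config"))
        print(json.dumps(config, indent=2, ensure_ascii=False))
    except (OSError, ValueError) as exc:
        # Covers connection errors, HTTP errors, timeouts, and non-JSON bodies.
        print(f"Could not read {BASE_URL}/api/config: {exc}")


if __name__ == "__main__":
    main()
```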

---

## Project Structure

```
VideoVoice/
├── server.py              # FastAPI backend
├── pipeline.py            # Core translation pipeline
├── steps/                 # Pipeline step modules
│   ├── s1_extract_audio.py
│   ├── s2_transcribe.py
│   ├── s3_translate.py
│   ├── s4_tts.py
│   ├── s5_sync.py
│   └── s6_merge.py
├── frontend/              # Static web UI
│   ├── index.html
│   ├── style.css
│   └── app.js
├── pyproject.toml         # Dependencies & project config
├── uv.lock                # Lockfile (reproducible installs)
├── .env.example
└── README.md
```

---

## Entrypoints

Two entrypoint files exist intentionally. They run in different contexts but **ship the same code**:

| File | When it runs | What it does |
|------|-------------|--------------|
| `server.py` | Local dev (`uv run python server.py`) | Plain FastAPI app that defines every `/api/*` route. |
| `app.py` | Hugging Face Spaces | Gradio Server that imports `server.py`'s router and wraps it with `@spaces.GPU` for ZeroGPU. |

`app.py` depends on `server.py`, so `server.py` must ship to HF. Do not strip it.

## Deployment

### Hugging Face Spaces (production)

Push to `main` → GitHub Actions runs `.github/workflows/deploy-hf.yml` → both Spaces (`Rafii/videovoice` and `Rafii/videovoice-omni`) redeploy automatically. No manual step.

One-time CI setup:

1. Create an HF access token with write access to both Spaces: https://huggingface.co/settings/tokens
2. Add it as `HF_TOKEN` under **Settings → Secrets and variables → Actions** in the GitHub repo.

Manual fallback (from a clean local checkout with `space` and `space-omni` remotes configured):

```bash
./deploy.sh           # skips if remote is already at HEAD
./deploy.sh --force   # always redeploy
```

Files filtered out of every Space deploy are listed in `.gitattributes` (`export-ignore`).

### Branching

`main` is canonical. Use short-lived `feat/` branches, open a PR, merge, delete. Never maintain a parallel deploy branch: every change on `main` reaches both Spaces via CI.

### AWS (alternative GPU host)

```bash
# On a g4dn.xlarge instance
sudo apt update && sudo apt install -y ffmpeg
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run python server.py
```

Recommended: use a `systemd` service for auto-restart, CloudFront as the CDN, and S3 for video storage with a 24h auto-delete lifecycle policy.

---

## License

MIT License. See [LICENSE](LICENSE).