---
title: VideoVoice API
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
python_version: "3.10"
---
<!--
ZeroGPU is enabled from the Space Settings UI (not via frontmatter).
PRO account required. `app.py` mounts the FastAPI pipeline onto Gradio
so the React client keeps calling `/api/*` over CORS unchanged.
-->
# VideoVoice
**AI-powered short video translation with zero-shot voice cloning.**
Translate any short video (≤60s) into 23+ languages while preserving the original speaker's voice. Paste an Instagram Reel or YouTube Short link, or upload a video file.
---
## How It Works
1. **Upload or Paste URL**: Drop a video file or paste a social media link
2. **AI Translates & Clones**: Our 6-step pipeline transcribes, translates, and synthesizes new speech using a voice clone of the original speaker
3. **Preview & Download**: Watch your translated video and download it in full quality
### Pipeline Architecture
```
Video → Extract Audio → Whisper Transcription → LLM Translation
      → Chatterbox Voice Clone + TTS → Time-Sync → Final Merge
```
| Step | Component | Description |
|------|-----------|-------------|
| 1 | FFmpeg | Extract audio track from video |
| 2 | Whisper Large V3 | Transcribe with word-level timestamps |
| 3 | GPT-4o-mini | Context-aware subtitle translation |
| 4 | Chatterbox Multilingual | Zero-shot voice cloning + TTS synthesis |
| 5 | Dynamic Time-Stretch | Align translated audio to original timing |
| 6 | FFmpeg | Merge new audio track back into video |
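
Step 5 can be pictured as choosing a playback-speed factor per segment so the synthesized speech fits the slot the original speech occupied. The following is a minimal sketch under stated assumptions: the function name `tempo_factor` and the clamp bounds are hypothetical and are not taken from `steps/s5_sync.py`.

```python
def tempo_factor(original_s: float, synthesized_s: float,
                 min_factor: float = 0.5, max_factor: float = 2.0) -> float:
    """Return the speed factor that fits synthesized audio into the original slot.

    A factor above 1.0 speeds the audio up; below 1.0 slows it down.
    The clamp bounds are illustrative assumptions, not project settings.
    """
    if original_s <= 0 or synthesized_s <= 0:
        raise ValueError("durations must be positive")
    factor = synthesized_s / original_s
    # Clamp so extreme mismatches do not produce unnatural speech.
    return max(min_factor, min(max_factor, factor))
```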
---
## Running Locally
### Prerequisites
- Python 3.10+ (`requires-python = ">=3.10,<3.13"`)
- FFmpeg (`brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Ubuntu)
- An OpenAI API key
### First-time setup
```bash
# 1. Install uv (skip if you already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and enter the repo
git clone https://github.com/Video-Voice/VideoVoice-be.git
cd VideoVoice-be
# 3. Install deps with the chatterbox TTS engine (default for local dev)
# Use `--extra omnivoice` instead if you want OmniVoice. The two extras
# are mutually exclusive, so pick one.
uv sync --extra chatterbox
# 4. Configure env vars
cp .env.example .env
# Edit .env: at minimum set OPENAI_API_KEY and ARTIFACTS_ROOT=./data
```
### One-time: hide the vendored chatterbox folder
The repo ships a vendored `./chatterbox/` folder that the HF Chatterbox Space needs (it has ZeroGPU-specific tweaks). Locally we want Python to import the PyPI `chatterbox-tts` package instead, so tell git to ignore the working-tree state for that folder and delete it locally:
```bash
git ls-files chatterbox/ | xargs git update-index --skip-worktree
rm -rf chatterbox/
```
HEAD still contains the folder, so HF deploys are unaffected. Reverse with `git update-index --no-skip-worktree` + `git checkout HEAD -- chatterbox/`.
### Run the server
```bash
uv run python server.py
```
Open [http://localhost:8000](http://localhost:8000). `/api/*` are the backend routes; `/` serves the legacy static UI in `frontend/`. If the port is in use, set `PORT=8001`.
Per-job artifacts land in `$ARTIFACTS_ROOT/<job_id>/`. With `ARTIFACTS_ROOT=./data` (in `.env`) that is `./data/<job_id>/` inside the repo checkout, the same layout the repo has always used.
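
The artifact layout above can be sketched in a few lines. This is an illustration only: the helper name `job_artifacts_dir`, the `./data` fallback when the variable is unset, and the job-id check are assumptions, not the server's actual code.

```python
import os
from pathlib import Path


def job_artifacts_dir(job_id: str, env=None) -> Path:
    """Resolve $ARTIFACTS_ROOT/<job_id>/ for a job.

    `env` defaults to os.environ; the "./data" fallback is an assumption.
    """
    env = os.environ if env is None else env
    root = Path(env.get("ARTIFACTS_ROOT", "./data"))
    # Reject ids that could point outside the artifacts root.
    if not job_id or "/" in job_id or "\\" in job_id or job_id in {".", ".."}:
        raise ValueError(f"invalid job id: {job_id!r}")
    return root / job_id
```

For example, `job_artifacts_dir("abc123_1", {"ARTIFACTS_ROOT": "./data"})` resolves to `data/abc123_1`.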
### Run the pipeline headlessly
```bash
uv run python pipeline.py --input data/my_video.mp4 --target-lang Spanish
```
---
## API Reference
The following endpoints are available on the backend (FastAPI/Gradio Server). When running on Hugging Face, replace `localhost:8000` with your Space's API URL (e.g., `https://rafii-videovoice.hf.space`).
### Core Endpoints
#### `POST /api/jobs`
Submit a video for translation. You can provide either a local file or a URL.
**Form Data:**
- `file`: (Optional) Video file upload (MP4, MOV, WebM, ≤90MB).
- `url`: (Optional) Social media URL (Instagram, YouTube, TikTok).
- `target_language`: (Required) Name of target language (e.g., "Spanish", "Hindi").
- `source_language`: (Optional) ISO code of source (default: "en").
- `voice_mode`: (Optional) `chatterbox` or `omnivoice` (must match Space engine).
- `captions`: (Optional) "true" or "false" (default: "true").
- `preserve_music`: (Optional) "true" or "false" (default: "false").
**Example:**
```bash
curl -X POST http://localhost:8000/api/jobs \
  -F "file=@my_video.mp4" \
  -F "target_language=French"
```
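
For callers working from Python, the form fields listed above can be assembled before handing them to an HTTP client. This is a minimal sketch: the helper name `build_job_fields` is hypothetical, and it assumes a request should carry exactly one video source (a URL or a file), which the server may or may not enforce.

```python
from typing import Dict, Optional


def build_job_fields(
    target_language: str,
    url: Optional[str] = None,
    has_file: bool = False,
    source_language: str = "en",
    captions: bool = True,
    preserve_music: bool = False,
) -> Dict[str, str]:
    """Build the non-file form fields for POST /api/jobs."""
    if not target_language:
        raise ValueError("target_language is required")
    # Assumption: exactly one video source (URL or upload) per request.
    if (url is None) == (not has_file):
        raise ValueError("provide either a URL or a file upload, not both or neither")
    fields = {
        "target_language": target_language,
        "source_language": source_language,
        # Booleans are documented as the strings "true" / "false".
        "captions": "true" if captions else "false",
        "preserve_music": "true" if preserve_music else "false",
    }
    if url is not None:
        fields["url"] = url
    return fields
```

The returned dict can be sent as form data by any HTTP client; an uploaded video would go in the multipart `file` part alongside it.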
#### `GET /api/jobs/{job_id}`
Poll for the real-time status and progress messages of a specific job.
**Query Parameters:**
- `after`: (Optional) Index of the last message received to fetch only new ones.
**Example:**
```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl "http://localhost:8000/api/jobs/abc123_1?after=5"
```
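
A polling loop built on the `after` parameter might look like the sketch below. Several things here are assumptions rather than documented behavior: the response fields `status` and `messages`, the terminal status values `done` and `error`, and the choice to pass the number of messages received so far as `after`. The `fetch_json` callable is a hypothetical stand-in for whatever HTTP client you use.

```python
import time
from urllib.parse import quote, urlencode


def status_url(base_url, job_id, after=None):
    """Build the GET /api/jobs/{job_id} URL, optionally with ?after=N."""
    url = f"{base_url.rstrip('/')}/api/jobs/{quote(job_id, safe='')}"
    if after is not None:
        url += "?" + urlencode({"after": after})
    return url


def poll_job(fetch_json, base_url, job_id, interval_s=2.0, max_polls=300):
    """Poll until the job reports a terminal status; return (status, messages).

    fetch_json: callable taking a URL and returning the decoded JSON dict.
    """
    seen = 0
    messages = []
    for _ in range(max_polls):
        payload = fetch_json(status_url(base_url, job_id, after=seen or None))
        new = payload.get("messages", [])  # assumed field name
        messages.extend(new)
        seen += len(new)
        if payload.get("status") in {"done", "error"}:  # assumed values
            return payload["status"], messages
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Injecting `fetch_json` keeps the loop independent of any particular HTTP library and makes it easy to exercise without a running server.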
#### `GET /api/jobs/{job_id}/result`
Download the final translated video file.
**Example:**
```bash
curl -O -L http://localhost:8000/api/jobs/abc123_1/result
```
---
### Utility & Configuration
#### `GET /api/config`
Fetch server configuration, including supported languages, max file size, and the active TTS engine.
#### `GET /api/health`
Check if the server is alive and see GPU availability/queue depth.
#### `GET /api/showcase`
Retrieve curated "before & after" demo entries defined in `data/showcase.json`.
#### `GET /api/demo-videos`
List all whitelisted demo videos available for streaming from the `outputs/` and `data/` folders.
#### `GET /api/demo-videos/{video_id}/stream`
Stream a specific demo video by its opaque ID.
---
### Interactive / Preview Endpoints
#### `GET /api/jobs/{job_id}/preview/{model_name}`
Retrieve a short audio snippet of the cloned voice for a specific TTS model before proceeding with full synthesis.
#### `POST /api/jobs/{job_id}/select-model`
Confirm which TTS model to use after listening to previews (used in multi-model workflows).
---
### ZeroGPU / Gradio Internal API
#### `POST /run_pipeline` (Gradio API)
Internal endpoint used by ZeroGPU to trigger the heavy ML processing logic. Recommended for use via `gradio_client`.
**Example (Python):**
```python
from gradio_client import Client
client = Client("Rafii/videovoice")
client.predict(job_id="abc123_1", api_name="/run_pipeline")
```
---
## Testing the API (Hugging Face Spaces)
When running on Hugging Face Spaces (using `app.py`), you can test the API using standard HTTP tools or the Gradio Client. Choose the Space corresponding to the desired TTS engine:
| TTS Engine | Space URL | API Endpoint |
|------------|-----------|--------------|
| **Chatterbox** | `Rafii/videovoice` | `https://rafii-videovoice.hf.space` |
| **OmniVoice** | `Rafii/videovoice-omni` | `https://rafii-videovoice-omni.hf.space` |
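
The two rows above follow a visible pattern: the Space ID is lowercased and the slash becomes a hyphen. A small sketch of that mapping follows; the helper name `space_api_url` is hypothetical, and the rule is inferred from these two rows only, so IDs with other characters may map differently.

```python
import re


def space_api_url(space_id: str) -> str:
    """Map an "owner/name" Space ID to its hf.space URL (inferred pattern)."""
    owner, sep, name = space_id.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/name', got {space_id!r}")
    # Lowercase and collapse anything outside [a-z0-9] into single hyphens.
    subdomain = re.sub(r"[^a-z0-9]+", "-", f"{owner}-{name}".lower()).strip("-")
    return f"https://{subdomain}.hf.space"
```

With this sketch, `space_api_url("Rafii/videovoice-omni")` yields `https://rafii-videovoice-omni.hf.space`, matching the table.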
### 1. Using `curl` (FastAPI Routes)
You can check the health of the API and verify that it's running:
```bash
# Chatterbox Space
curl https://rafii-videovoice.hf.space/api/health
# OmniVoice Space
curl https://rafii-videovoice-omni.hf.space/api/health
```
To submit a job via the standard API:
```bash
curl -X POST https://rafii-videovoice.hf.space/api/jobs \
  -F "url=https://www.instagram.com/reels/XYZ/" \
  -F "target_language=Spanish"
```
### 2. Using `gradio_client` (Gradio API Routes)
The `gradio.Server` endpoints are optimized for ZeroGPU and can be accessed using the Python `gradio_client`:
```python
from gradio_client import Client
# Change to "Rafii/videovoice-omni" for OmniVoice
client = Client("Rafii/videovoice")
result = client.predict(
    job_id="abc123",
    api_name="/run_pipeline",
)
print(result)
```
### 3. Using JavaScript (Frontend)
The new `gradio.Server` mode is designed for custom frontends. You can use the `@gradio/client` JS library:
```javascript
import { Client } from "@gradio/client";
// Connect to the specific Space
const client = await Client.connect("Rafii/videovoice");
const result = await client.predict("/run_pipeline", {
  job_id: "abc123",
});
```
---
## Supported Languages
Spanish, French, German, Hindi, Portuguese, Italian, Japanese, Chinese, Arabic, Korean, and more.
---
## Project Structure
```
VideoVoice/
├── server.py              # FastAPI backend
├── pipeline.py            # Core translation pipeline
├── steps/                 # Pipeline step modules
│   ├── s1_extract_audio.py
│   ├── s2_transcribe.py
│   ├── s3_translate.py
│   ├── s4_tts.py
│   ├── s5_sync.py
│   └── s6_merge.py
├── frontend/              # Static web UI
│   ├── index.html
│   ├── style.css
│   └── app.js
├── pyproject.toml         # Dependencies & project config
├── uv.lock                # Lockfile (reproducible installs)
├── .env.example
└── README.md
```
---
## Entrypoints
Two entrypoint files exist on purpose. They run in different contexts but **ship the same code**:
| File | When it runs | What it does |
|------|-------------|--------------|
| `server.py` | Local dev (`uv run python server.py`) | Plain FastAPI app that defines every `/api/*` route. |
| `app.py` | Hugging Face Spaces | Gradio Server that imports `server.py`'s router and wraps it with `@spaces.GPU` for ZeroGPU. |
`app.py` depends on `server.py`, so `server.py` must ship to HF. Do not strip it from the deploy.
## Deployment
### Hugging Face Spaces (production)
Push to `main` → GitHub Actions runs `.github/workflows/deploy-hf.yml` → both Spaces (`Rafii/videovoice` and `Rafii/videovoice-omni`) redeploy automatically. No manual step.
One-time CI setup:
1. Create an HF access token with write access to both Spaces: https://huggingface.co/settings/tokens
2. Add it as `HF_TOKEN` under **Settings → Secrets and variables → Actions** in the GitHub repo.
Manual fallback (from a local clean checkout with `space` and `space-omni` remotes configured):
```bash
./deploy.sh # skips if remote is already at HEAD
./deploy.sh --force # always redeploy
```
Files filtered out of every Space deploy are listed in `.gitattributes` (`export-ignore`).
### Branching
`main` is canonical. Use short-lived `feat/<thing>` branches: open a PR, merge, delete. Never maintain a parallel deploy branch, because every change on `main` reaches both Spaces via CI.
### AWS (alternative GPU host)
```bash
# On a g4dn.xlarge instance
sudo apt update && sudo apt install -y ffmpeg
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run python server.py
```
Recommended: use a `systemd` service for auto-restart, CloudFront as the CDN, and S3 for video storage with a 24h auto-delete lifecycle policy.
---
## License
MIT License. See [LICENSE](LICENSE).