---
title: VideoVoice API
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
python_version: "3.10"
---

<!--
  ZeroGPU is enabled from the Space Settings UI (not via frontmatter).
  PRO account required. `app.py` mounts the FastAPI pipeline onto Gradio
  so the React client keeps calling `/api/*` over CORS unchanged.
-->


# VideoVoice

**AI-powered short video translation with zero-shot voice cloning.**

Translate any short video (≤60s) into 23+ languages while preserving the original speaker's voice. Paste an Instagram Reel, YouTube Short, or upload any video file.

---

## How It Works

1. **Upload or Paste URL** — Drop a video file or paste a social media link
2. **AI Translates & Clones** — Our 6-step pipeline transcribes, translates, and synthesizes new speech using a voice clone of the original speaker
3. **Preview & Download** — Watch your translated video and download in full quality

### Pipeline Architecture

```
Video → Extract Audio → Whisper Transcription → LLM Translation
      → Chatterbox Voice Clone + TTS → Time-Sync → Final Merge
```

| Step | Component | Description |
|------|-----------|-------------|
| 1 | FFmpeg | Extract audio track from video |
| 2 | Whisper Large V3 | Transcribe with word-level timestamps |
| 3 | GPT-4o-mini | Context-aware subtitle translation |
| 4 | Chatterbox Multilingual | Zero-shot voice cloning + TTS synthesis |
| 5 | Dynamic Time-Stretch | Align translated audio to original timing |
| 6 | FFmpeg | Merge new audio track back into video |
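
Step 5 can be pictured as choosing a playback-speed factor for each synthesized segment so that it fits the slot of the original speech. The sketch below is only an illustration of that idea, not the code in `steps/s5_sync.py`; the function name `tempo_factor` and the clamp bounds are hypothetical.

```python
def tempo_factor(synth_seconds: float, target_seconds: float,
                 min_factor: float = 0.75, max_factor: float = 1.5) -> float:
    """Return the speed factor that fits synthesized speech into the original slot.

    A factor above 1.0 speeds the audio up; below 1.0 slows it down.
    The clamp bounds are illustrative values, not project settings.
    """
    if synth_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    factor = synth_seconds / target_seconds
    # Clamp so extreme mismatches do not produce unnatural-sounding speech.
    return max(min_factor, min(max_factor, factor))


# A 2.5 s translated clip that must fit a 2.0 s slot needs a 1.25x speed-up.
print(tempo_factor(2.5, 2.0))
```

Clamping trades perfect alignment for intelligibility: a segment that would need a factor outside the bounds stays slightly out of sync rather than sounding rushed or dragged.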

---

## Running Locally

### Prerequisites

- Python 3.10, 3.11, or 3.12 (`requires-python = ">=3.10,<3.13"`)
- FFmpeg (`brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Ubuntu)
- An OpenAI API key
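
The prerequisites above can be checked up front with a small script. This is a hedged sketch rather than part of the repo: the `preflight` helper is hypothetical, and it reads `OPENAI_API_KEY` from the process environment, so export the key (or load `.env`) before running it.

```python
import os
import shutil
import sys


def preflight(version=sys.version_info[:2], which=shutil.which, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # Mirrors requires-python = ">=3.10,<3.13".
    if not ((3, 10) <= tuple(version) < (3, 13)):
        problems.append(f"Python {version[0]}.{version[1]} is outside >=3.10,<3.13")
    if not which("ffmpeg"):
        problems.append("ffmpeg not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in preflight():
    print("missing:", problem)
```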

### First-time setup

```bash
# 1. Install uv (skip if you already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter the repo
git clone https://github.com/Video-Voice/VideoVoice-be.git
cd VideoVoice-be

# 3. Install deps with the chatterbox TTS engine (default for local dev)
#    Use `--extra omnivoice` instead if you want OmniVoice. The two extras
#    are mutually exclusive — pick one.
uv sync --extra chatterbox

# 4. Configure env vars
cp .env.example .env
# Edit .env — at minimum set OPENAI_API_KEY and ARTIFACTS_ROOT=./data
```

### One-time: hide the vendored chatterbox folder

The repo ships a vendored `./chatterbox/` folder that the HF Chatterbox Space needs (it has ZeroGPU-specific tweaks). Locally we want Python to import the PyPI `chatterbox-tts` package instead, so tell git to ignore the working-tree state for that folder and delete it locally:

```bash
git ls-files chatterbox/ | xargs git update-index --skip-worktree
rm -rf chatterbox/
```

`HEAD` still contains the folder, so HF deploys are unaffected. To reverse, run `git ls-files chatterbox/ | xargs git update-index --no-skip-worktree`, then `git checkout HEAD -- chatterbox/`.

### Run the server

```bash
uv run python server.py
```

Open [http://localhost:8000](http://localhost:8000). The backend routes live under `/api/*`, and `/` serves the legacy static UI from `frontend/`. If port 8000 is already in use, set `PORT=8001` (or another free port) before starting the server.

Per-job artifacts land in `$ARTIFACTS_ROOT/<job_id>/`. With `ARTIFACTS_ROOT=./data` (in `.env`) that's `./data/<job_id>/` next to the repo — same layout the repo has always used.
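
As an illustration of that layout, the snippet below resolves a job's artifact directory from `ARTIFACTS_ROOT`. The helper name `job_dir` and the `./data` fallback when the variable is unset are assumptions made for this sketch, not confirmed server behavior.

```python
import os
from pathlib import Path


def job_dir(job_id: str, env=None) -> Path:
    """Resolve the artifact directory for a job as $ARTIFACTS_ROOT/<job_id>."""
    env = os.environ if env is None else env
    # Falling back to "./data" is an assumption for this sketch.
    root = Path(env.get("ARTIFACTS_ROOT", "./data"))
    return root / job_id


print(job_dir("abc123_1", env={"ARTIFACTS_ROOT": "./data"}))
```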

### Run the pipeline headlessly

```bash
uv run python pipeline.py --input data/my_video.mp4 --target-lang Spanish
```

---

## API Reference

The following endpoints are available on the backend (FastAPI/Gradio Server). When running on Hugging Face, replace `localhost:8000` with your Space's API URL (e.g., `https://rafii-videovoice.hf.space`).

### Core Endpoints

#### `POST /api/jobs`
Submit a video for translation. You can provide either a local file or a URL.

**Form Data:**
- `file`: (Optional) Video file upload (MP4, MOV, WebM, ≤90MB).
- `url`: (Optional) Social media URL (Instagram, YouTube, TikTok).
- `target_language`: (Required) Name of target language (e.g., "Spanish", "Hindi").
- `source_language`: (Optional) ISO code of source (default: "en").
- `voice_mode`: (Optional) `chatterbox` or `omnivoice` (must match Space engine).
- `captions`: (Optional) "true" or "false" (default: "true").
- `preserve_music`: (Optional) "true" or "false" (default: "false").

**Example:**
```bash
curl -X POST http://localhost:8000/api/jobs \
  -F "file=@my_video.mp4" \
  -F "target_language=French"
```
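
For a scripted client, it can help to validate the form fields before sending them. The following is a minimal sketch under stated assumptions: `build_job_form` is a hypothetical helper, it assumes exactly one of `file` or `url` must be supplied, and it reads "90MB" as 90 × 1024 × 1024 bytes.

```python
MAX_UPLOAD_BYTES = 90 * 1024 * 1024  # assumed interpretation of the 90MB limit
ALLOWED_OPTIONAL = {"source_language", "voice_mode", "captions", "preserve_music"}


def build_job_form(target_language, url=None, file_size=None, **options):
    """Return the non-file form fields for POST /api/jobs after basic checks."""
    # Assumption: the server expects exactly one video source.
    if (url is None) == (file_size is None):
        raise ValueError("provide exactly one of url or file")
    if file_size is not None and file_size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 90MB upload limit")
    unknown = set(options) - ALLOWED_OPTIONAL
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    fields = {"target_language": target_language}
    if url is not None:
        fields["url"] = url
    for key, value in options.items():
        # Boolean options travel as the strings "true" / "false".
        fields[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return fields


print(build_job_form("French", url="https://www.instagram.com/reels/XYZ/", captions=False))
```

The returned dict can be passed as ordinary form data by any HTTP client; for an upload, the video itself goes in the multipart `file` part alongside these fields.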

#### `GET /api/jobs/{job_id}`
Poll for the real-time status and progress messages of a specific job.

**Query Parameters:**
- `after`: (Optional) Index of the last message received to fetch only new ones.

**Example:**
```bash
# Quote the URL so the shell does not interpret the "?"
curl "http://localhost:8000/api/jobs/abc123_1?after=5"
```
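
A polling loop built on the `after` cursor might look like the sketch below. The response schema is not documented here, so the `status` and `messages` keys, the terminal values `"done"` and `"error"`, and zero-based message indexing are all assumptions; `http_fetch` and `poll_job` are hypothetical names.

```python
import json
import time
import urllib.request


def http_fetch(base_url):
    """Build a fetcher for GET /api/jobs/{job_id}, adding ?after=N when given."""
    def fetch(job_id, after):
        url = f"{base_url}/api/jobs/{job_id}"
        if after is not None:
            url += f"?after={after}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)
    return fetch


def poll_job(fetch, job_id, interval=2.0, max_polls=300):
    """Poll until a terminal status is seen; return (status, all_messages)."""
    messages = []
    for _ in range(max_polls):
        # Assumption: `after` is the zero-based index of the last message seen.
        after = len(messages) - 1 if messages else None
        body = fetch(job_id, after)
        messages.extend(body.get("messages", []))  # assumed key
        status = body.get("status")                # assumed key
        if status in ("done", "error"):            # assumed terminal values
            return status, messages
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Usage against a local server:
# status, log = poll_job(http_fetch("http://localhost:8000"), "abc123_1")
```

Passing the fetcher in as an argument keeps the loop testable without a running server.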

#### `GET /api/jobs/{job_id}/result`
Download the final translated video file.

**Example:**
```bash
# -o names the output file; with -O curl would save it as "result"
curl -L -o translated.mp4 http://localhost:8000/api/jobs/abc123_1/result
```
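
The same download can be scripted in Python. This sketch streams the body to disk in chunks so a large video is never held in memory; `download_result` is a hypothetical helper, and the `.mp4` output name in the usage comment is an assumption about the result format.

```python
import shutil
import urllib.request
from pathlib import Path


def download_result(url: str, dest: Path, chunk_size: int = 1 << 20) -> int:
    """Stream the response body at `url` to `dest`; return the bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen follows HTTP redirects by default, similar to curl -L.
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, chunk_size)
    return dest.stat().st_size


# Usage against a local server:
# download_result("http://localhost:8000/api/jobs/abc123_1/result", Path("translated.mp4"))
```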

---

### Utility & Configuration

#### `GET /api/config`
Fetch server configuration, including supported languages, max file size, and the active TTS engine.

#### `GET /api/health`
Check if the server is alive and see GPU availability/queue depth.

#### `GET /api/showcase`
Retrieve curated "before & after" demo entries defined in `data/showcase.json`.

#### `GET /api/demo-videos`
List all whitelisted demo videos available for streaming from the `outputs/` and `data/` folders.

#### `GET /api/demo-videos/{video_id}/stream`
Stream a specific demo video by its opaque ID.

---

### Interactive / Preview Endpoints

#### `GET /api/jobs/{job_id}/preview/{model_name}`
Retrieve a short audio snippet of the cloned voice for a specific TTS model before proceeding with full synthesis.

#### `POST /api/jobs/{job_id}/select-model`
Confirm which TTS model to use after listening to previews (used in multi-model workflows).

---

### ZeroGPU / Gradio Internal API

#### `POST /run_pipeline` (Gradio API)
Internal endpoint that ZeroGPU uses to trigger the heavy ML processing. Call it through `gradio_client` rather than raw HTTP.

**Example (Python):**
```python
from gradio_client import Client
client = Client("Rafii/videovoice")
client.predict(job_id="abc123_1", api_name="/run_pipeline")
```

---


## Testing the API (Hugging Face Spaces)

When running on Hugging Face Spaces (using `app.py`), you can test the API using standard HTTP tools or the Gradio Client. Choose the Space corresponding to the desired TTS engine:

| TTS Engine | Space URL | API Endpoint |
|------------|-----------|--------------|
| **Chatterbox** | `Rafii/videovoice` | `https://rafii-videovoice.hf.space` |
| **OmniVoice** | `Rafii/videovoice-omni` | `https://rafii-videovoice-omni.hf.space` |

### 1. Using `curl` (FastAPI Routes)

You can check the health of the API and verify that it's running:

```bash
# Chatterbox Space
curl https://rafii-videovoice.hf.space/api/health

# OmniVoice Space
curl https://rafii-videovoice-omni.hf.space/api/health
```

To submit a job via the standard API:

```bash
curl -X POST https://rafii-videovoice.hf.space/api/jobs \
  -F "url=https://www.instagram.com/reels/XYZ/" \
  -F "target_language=Spanish"
```

### 2. Using `gradio_client` (Gradio API Routes)

The `gradio.Server` endpoints are optimized for ZeroGPU and can be accessed using the Python `gradio_client`:

```python
from gradio_client import Client

# Change to "Rafii/videovoice-omni" for OmniVoice
client = Client("Rafii/videovoice")
result = client.predict(
    job_id="abc123_1",
    api_name="/run_pipeline"
)
print(result)
```

### 3. Using JavaScript (Frontend)

The new `gradio.Server` mode is designed for custom frontends. You can use the `@gradio/client` JS library:

```javascript
import { Client } from "@gradio/client";

// Connect to the specific Space
const client = await Client.connect("Rafii/videovoice");
const result = await client.predict("/run_pipeline", {
    job_id: "abc123_1",
});
```

---

## Supported Languages

Spanish, French, German, Hindi, Portuguese, Italian, Japanese, Chinese, Arabic, Korean, and more. `GET /api/config` returns the full list supported by the running server.

---

## Project Structure

```
VideoVoice/
├── server.py            # FastAPI backend (local entrypoint)
├── app.py               # Hugging Face Spaces entrypoint (Gradio Server)
├── pipeline.py          # Core translation pipeline
├── steps/               # Pipeline step modules
│   ├── s1_extract_audio.py
│   ├── s2_transcribe.py
│   ├── s3_translate.py
│   ├── s4_tts.py
│   ├── s5_sync.py
│   └── s6_merge.py
├── frontend/            # Static web UI
│   ├── index.html
│   ├── style.css
│   └── app.js
├── pyproject.toml       # Dependencies & project config
├── uv.lock              # Lockfile (reproducible installs)
├── .env.example
└── README.md
```

---

## Entrypoints

Two entrypoint files exist on purpose. They run in different contexts but **serve the same application code**:

| File | When it runs | What it does |
|------|-------------|--------------|
| `server.py` | Local dev (`uv run python server.py`) | Plain FastAPI app — defines every `/api/*` route. |
| `app.py`    | Hugging Face Spaces               | Gradio Server that imports `server.py`'s router and wraps it with `@spaces.GPU` for ZeroGPU. |

`app.py` imports from `server.py`, so `server.py` must ship to HF. Do not strip it from the deploy.

## Deployment

### Hugging Face Spaces (production)

Push to `main` → GitHub Actions runs `.github/workflows/deploy-hf.yml` → both Spaces (`Rafii/videovoice` and `Rafii/videovoice-omni`) redeploy automatically. No manual step.

One-time CI setup:
1. Create an HF access token with write access to both Spaces: https://huggingface.co/settings/tokens
2. Add it as `HF_TOKEN` under **Settings → Secrets and variables → Actions** in the GitHub repo.

Manual fallback (from a local clean checkout with `space` and `space-omni` remotes configured):
```bash
./deploy.sh          # skips if remote is already at HEAD
./deploy.sh --force  # always redeploy
```

Files filtered out of every Space deploy are listed in `.gitattributes` (`export-ignore`).

### Branching

`main` is canonical. Use short-lived `feat/<thing>` branches, open a PR, merge, delete. Never maintain a parallel deploy branch — every change on main reaches both Spaces via CI.

### AWS (alternative GPU host)

```bash
# On a g4dn.xlarge instance
sudo apt update && sudo apt install -y ffmpeg
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --extra chatterbox   # or --extra omnivoice; pick one, as in first-time setup
uv run python server.py
```

Recommended: run the server as a `systemd` service for auto-restart, put CloudFront in front as the CDN, and store videos in S3 with a 24-hour auto-delete lifecycle policy.

---

## License

MIT License — see [LICENSE](LICENSE).