	---
	title: VideoVoice API
	sdk: gradio
	sdk_version: 6.12.0
	app_file: app.py
	python_version: "3.10"
	---

	<!--
	ZeroGPU is enabled from the Space Settings UI (not via frontmatter).
	PRO account required. `app.py` mounts the FastAPI pipeline onto Gradio
	so the React client keeps calling `/api/*` over CORS unchanged.
	-->


	# VideoVoice

	AI-powered short video translation with zero-shot voice cloning.

	Translate any short video (≤60s) into 23+ languages while preserving the original speaker's voice. Paste an Instagram Reel, YouTube Short, or upload any video file.

	---

	## How It Works

	1. Upload or Paste URL — Drop a video file or paste a social media link
	2. AI Translates & Clones — Our 6-step pipeline transcribes, translates, and synthesizes new speech using a voice clone of the original speaker
	3. Preview & Download — Watch your translated video and download in full quality

	### Pipeline Architecture

	```
	Video → Extract Audio → Whisper Transcription → LLM Translation
	→ Chatterbox Voice Clone + TTS → Time-Sync → Final Merge
	```

	| Step | Component | Description |
	|------|-----------|-------------|
	| 1 | FFmpeg | Extract audio track from video |
	| 2 | Whisper Large V3 | Transcribe with word-level timestamps |
	| 3 | GPT-4o-mini | Context-aware subtitle translation |
	| 4 | Chatterbox Multilingual | Zero-shot voice cloning + TTS synthesis |
	| 5 | Dynamic Time-Stretch | Align translated audio to original timing |
	| 6 | FFmpeg | Merge new audio track back into video |
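
Step 5 aligns each synthesized segment to the time slot of the original speech. The sketch below shows one way such a playback-rate factor could be computed. The function name and the clamp range of 0.8 to 1.25 are assumptions for illustration, not values taken from `steps/s5_sync.py`.

```python
def stretch_factor(synth_seconds, slot_seconds, min_rate=0.8, max_rate=1.25):
    """Return the playback-rate multiplier that fits synthesized audio into a slot.

    A value above 1.0 means the translated audio must be sped up,
    a value below 1.0 means it must be slowed down.
    The clamp range is a hypothetical guard against unnatural-sounding speech.
    """
    if synth_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    rate = synth_seconds / slot_seconds
    # Clamp so extreme mismatches do not distort the voice too much.
    return max(min_rate, min(max_rate, rate))
```

For example, a 3.0 s translated segment that must fit a 2.5 s slot needs a factor of 1.2, while a 5.0 s segment for a 2.0 s slot is clamped to 1.25 and would still overrun.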

	---

	## Running Locally

	### Prerequisites

	- Python 3.10+ (`requires-python = ">=3.10,<3.13"`)
	- FFmpeg (`brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Ubuntu)
	- An OpenAI API key

	### First-time setup

	```bash
	# 1. Install uv (skip if you already have it)
	curl -LsSf https://astral.sh/uv/install.sh | sh

	# 2. Clone and enter the repo
	git clone https://github.com/Video-Voice/VideoVoice-be.git
	cd VideoVoice-be

	# 3. Install deps with the chatterbox TTS engine (default for local dev)
	# Use `--extra omnivoice` instead if you want OmniVoice. The two extras
	# are mutually exclusive — pick one.
	uv sync --extra chatterbox

	# 4. Configure env vars
	cp .env.example .env
	# Edit .env — at minimum set OPENAI_API_KEY and ARTIFACTS_ROOT=./data
	```
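
Before starting the server it can help to confirm that `.env` sets the two variables named above. This is a minimal sketch with a deliberately simple `KEY=VALUE` parser; the helper names are hypothetical and the project itself may load `.env` differently.

```python
REQUIRED_VARS = ("OPENAI_API_KEY", "ARTIFACTS_ROOT")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes; no escape handling is attempted.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_vars(parse_env(open(".env").read()))` returns an empty list when both variables are set.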

	### One-time: hide the vendored chatterbox folder

	The repo ships a vendored `./chatterbox/` folder that the HF Chatterbox Space needs (it has ZeroGPU-specific tweaks). Locally we want Python to import the PyPI `chatterbox-tts` package instead, so tell git to ignore the working-tree state for that folder and delete it locally:

	```bash
	git ls-files chatterbox/ | xargs git update-index --skip-worktree
	rm -rf chatterbox/
	```

	HEAD still contains the folder, so HF deploys are unaffected. To reverse, run `git ls-files chatterbox/ | xargs git update-index --no-skip-worktree`, then `git checkout HEAD -- chatterbox/`.

	### Run the server

	```bash
	uv run python server.py
	```

	Open [http://localhost:8000](http://localhost:8000). `/api/*` are the backend routes; `/` serves the legacy static UI in `frontend/`. If the port is in use, set `PORT=8001`.

	Per-job artifacts land in `$ARTIFACTS_ROOT/<job_id>/`. With `ARTIFACTS_ROOT=./data` (in `.env`) that's `./data/<job_id>/` next to the repo — same layout the repo has always used.
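
The `$ARTIFACTS_ROOT/<job_id>/` layout can be resolved with `pathlib`. This is a sketch only: the function name is hypothetical, and the check that rejects path separators in a job ID is an added precaution, not documented server behavior.

```python
from pathlib import Path


def job_dir(artifacts_root, job_id):
    """Return the artifact directory for one job under the configured root."""
    # Reject empty IDs and anything that is not a single path component.
    if not job_id or job_id in {".", ".."} or Path(job_id).name != job_id:
        raise ValueError(f"invalid job id: {job_id!r}")
    return Path(artifacts_root) / job_id
```

With `ARTIFACTS_ROOT=./data`, `job_dir("./data", "abc123_1")` points at `data/abc123_1`.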

	### Run the pipeline headlessly

	```bash
	uv run python pipeline.py --input data/my_video.mp4 --target-lang Spanish
	```
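
To translate one video into several languages, the same command can be generated per language. The sketch below only builds the argument lists using the two flags shown above; the helper name is hypothetical, and running the commands (for example with `subprocess.run(cmd, check=True)`) is left to the caller.

```python
import shlex


def pipeline_command(input_path, target_lang):
    """Build the argv list for one headless pipeline run."""
    return [
        "uv", "run", "python", "pipeline.py",
        "--input", str(input_path),
        "--target-lang", target_lang,
    ]


# Print one shell-quoted command per target language.
for lang in ("Spanish", "French", "Hindi"):
    print(shlex.join(pipeline_command("data/my_video.mp4", lang)))
```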

	---

	## API Reference

	The following endpoints are available on the backend (FastAPI/Gradio Server). When running on Hugging Face, replace `localhost:8000` with your Space's API URL (e.g., `https://rafii-videovoice.hf.space`).

	### Core Endpoints

	#### `POST /api/jobs`
	Submit a video for translation. You can provide either a local file or a URL.

	Form Data:
	- `file`: (Optional) Video file upload (MP4, MOV, WebM, ≤90MB).
	- `url`: (Optional) Social media URL (Instagram, YouTube, TikTok).
	- `target_language`: (Required) Name of target language (e.g., "Spanish", "Hindi").
	- `source_language`: (Optional) ISO code of source (default: "en").
	- `voice_mode`: (Optional) `chatterbox` or `omnivoice` (must match Space engine).
	- `captions`: (Optional) "true" or "false" (default: "true").
	- `preserve_music`: (Optional) "true" or "false" (default: "false").

	Example:
	```bash
	curl -X POST http://localhost:8000/api/jobs \
	  -F "file=@my_video.mp4" \
	  -F "target_language=French"
	```
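
When calling this endpoint from Python, the text form fields can be assembled and validated before sending. This sketch assumes that exactly one of `file` or `url` is expected per request, which the description implies but does not state. The helper names are hypothetical, and the actual upload is left to whichever HTTP client you use.

```python
VOICE_MODES = {"chatterbox", "omnivoice"}


def _flag(value):
    """Render a bool as the "true"/"false" strings the form fields use."""
    return "true" if value else "false"


def job_form_fields(target_language, *, url=None, has_file=False,
                    source_language=None, voice_mode=None,
                    captions=None, preserve_music=None):
    """Return the non-file form fields for POST /api/jobs."""
    if not target_language:
        raise ValueError("target_language is required")
    # Assumption: a job needs exactly one video source.
    if bool(url) == bool(has_file):
        raise ValueError("provide exactly one of a file upload or a url")
    if voice_mode is not None and voice_mode not in VOICE_MODES:
        raise ValueError(f"voice_mode must be one of {sorted(VOICE_MODES)}")

    fields = {"target_language": target_language}
    if url:
        fields["url"] = url
    if source_language is not None:
        fields["source_language"] = source_language
    if voice_mode is not None:
        fields["voice_mode"] = voice_mode
    # Omitted options fall back to the server defaults listed above.
    if captions is not None:
        fields["captions"] = _flag(captions)
    if preserve_music is not None:
        fields["preserve_music"] = _flag(preserve_music)
    return fields
```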

	#### `GET /api/jobs/{job_id}`
	Poll for the real-time status and progress messages of a specific job.

	Query Parameters:
	- `after`: (Optional) Index of the last message received to fetch only new ones.

	Example:
	```bash
	# Quote the URL so the shell does not treat "?" as a glob character
	curl "http://localhost:8000/api/jobs/abc123_1?after=5"
	```
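
A client typically polls this route in a loop, passing `after` so that only new messages come back. The response schema is not documented here, so the sketch below takes the HTTP call and the interpretation of the response as caller-supplied functions. `fetch`, `is_done`, and `next_after` are hypothetical hooks, not part of the API.

```python
import time
from urllib.parse import urlencode


def status_url(base_url, job_id, after=None):
    """Build the polling URL, adding ?after=N when a cursor is known."""
    url = f"{base_url.rstrip('/')}/api/jobs/{job_id}"
    if after is not None:
        url += "?" + urlencode({"after": after})
    return url


def poll_job(fetch, base_url, job_id, is_done, next_after=None,
             interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll until is_done(response) is true or max_polls is exhausted.

    fetch(url) performs the GET and returns the parsed response.
    next_after(response) returns the message index to send next, if any.
    """
    after = None
    for _ in range(max_polls):
        response = fetch(status_url(base_url, job_id, after))
        if is_done(response):
            return response
        if next_after is not None:
            after = next_after(response)
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Keeping the HTTP call injectable means the loop can be exercised without a running server.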

	#### `GET /api/jobs/{job_id}/result`
	Download the final translated video file.

	Example:
	```bash
	curl -O -L http://localhost:8000/api/jobs/abc123_1/result
	```

	---

	### Utility & Configuration

	#### `GET /api/config`
	Fetch server configuration, including supported languages, max file size, and the active TTS engine.

	#### `GET /api/health`
	Check if the server is alive and see GPU availability/queue depth.
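
Both `/api/config` and `/api/health` can be read with the standard library alone. The sketch assumes they return JSON bodies, which is not stated above, and makes no assumption about which keys the responses contain. The `fetch_json` helper is hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(base_url, path, opener=urlopen, timeout=10):
    """GET base_url + path and decode the body as JSON."""
    url = base_url.rstrip("/") + path
    # opener is injectable so the function can be tested without a server.
    with opener(url, timeout=timeout) as response:
        return json.load(response)
```

For a local server, `fetch_json("http://localhost:8000", "/api/health")` returns the decoded health payload.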

	#### `GET /api/showcase`
	Retrieve curated "before & after" demo entries defined in `data/showcase.json`.

	#### `GET /api/demo-videos`
	List all whitelisted demo videos available for streaming from the `outputs/` and `data/` folders.

	#### `GET /api/demo-videos/{video_id}/stream`
	Stream a specific demo video by its opaque ID.

	---

	### Interactive / Preview Endpoints

	#### `GET /api/jobs/{job_id}/preview/{model_name}`
	Retrieve a short audio snippet of the cloned voice for a specific TTS model before proceeding with full synthesis.

	#### `POST /api/jobs/{job_id}/select-model`
	Confirm which TTS model to use after listening to previews (used in multi-model workflows).

	---

	### ZeroGPU / Gradio Internal API

	#### `POST /run_pipeline` (Gradio API)
	Internal endpoint used by ZeroGPU to trigger the heavy ML processing logic. Recommended for use via `gradio_client`.

	Example (Python):
	```python
	from gradio_client import Client
	client = Client("Rafii/videovoice")
	client.predict(job_id="abc123_1", api_name="/run_pipeline")
	```

	---


	## Testing the API (Hugging Face Spaces)

	When running on Hugging Face Spaces (using `app.py`), you can test the API using standard HTTP tools or the Gradio Client. Choose the Space corresponding to the desired TTS engine:

	| TTS Engine | Space URL | API Endpoint |
	|------------|-----------|--------------|
	| Chatterbox | `Rafii/videovoice` | `https://rafii-videovoice.hf.space` |
	| OmniVoice | `Rafii/videovoice-omni` | `https://rafii-videovoice-omni.hf.space` |

	### 1. Using `curl` (FastAPI Routes)

	You can check the health of the API and verify that it's running:

	```bash
	# Chatterbox Space
	curl https://rafii-videovoice.hf.space/api/health

	# OmniVoice Space
	curl https://rafii-videovoice-omni.hf.space/api/health
	```

	To submit a job via the standard API:

	```bash
	curl -X POST https://rafii-videovoice.hf.space/api/jobs \
	  -F "url=https://www.instagram.com/reels/XYZ/" \
	  -F "target_language=Spanish"
	```

	### 2. Using `gradio_client` (Gradio API Routes)

	The `gradio.Server` endpoints are optimized for ZeroGPU and can be accessed using the Python `gradio_client`:

	```python
	from gradio_client import Client

	# Change to "Rafii/videovoice-omni" for OmniVoice
	client = Client("Rafii/videovoice")
	result = client.predict(
	    job_id="abc123",
	    api_name="/run_pipeline",
	)
	print(result)
	```

	### 3. Using JavaScript (Frontend)

	The new `gradio.Server` mode is designed for custom frontends. You can use the `@gradio/client` JS library:

	```javascript
	import { Client } from "@gradio/client";

	// Connect to the specific Space
	const client = await Client.connect("Rafii/videovoice");
	const result = await client.predict("/run_pipeline", {
	  job_id: "abc123",
	});
	```

	---

	## Supported Languages

	Spanish, French, German, Hindi, Portuguese, Italian, Japanese, Chinese, Arabic, Korean, and more. `GET /api/config` returns the languages supported by the running server.

	---

	## Project Structure

	```
	VideoVoice/
	├── server.py               # FastAPI backend
	├── pipeline.py             # Core translation pipeline
	├── steps/                  # Pipeline step modules
	│   ├── s1_extract_audio.py
	│   ├── s2_transcribe.py
	│   ├── s3_translate.py
	│   ├── s4_tts.py
	│   ├── s5_sync.py
	│   └── s6_merge.py
	├── frontend/               # Static web UI
	│   ├── index.html
	│   ├── style.css
	│   └── app.js
	├── pyproject.toml          # Dependencies & project config
	├── uv.lock                 # Lockfile (reproducible installs)
	├── .env.example
	└── README.md
	```

	---

	## Entrypoints

	Two entrypoint files exist by design. They run in different contexts but serve the same `/api/*` code:

	| File | When it runs | What it does |
	|------|-------------|--------------|
	| `server.py` | Local dev (`uv run python server.py`) | Plain FastAPI app that defines every `/api/*` route. |
	| `app.py` | Hugging Face Spaces | Gradio Server that imports `server.py`'s router and wraps it with `@spaces.GPU` for ZeroGPU. |

	`app.py` depends on `server.py`, so `server.py` must ship to HF. Do not strip it from the deploy.

	## Deployment

	### Hugging Face Spaces (production)

	Push to `main` → GitHub Actions runs `.github/workflows/deploy-hf.yml` → both Spaces (`Rafii/videovoice` and `Rafii/videovoice-omni`) redeploy automatically. No manual step.

	One-time CI setup:
	1. Create an HF access token with write access to both Spaces: https://huggingface.co/settings/tokens
	2. Add it as `HF_TOKEN` under Settings → Secrets and variables → Actions in the GitHub repo.

	Manual fallback (from a local clean checkout with `space` and `space-omni` remotes configured):
	```bash
	./deploy.sh # skips if remote is already at HEAD
	./deploy.sh --force # always redeploy
	```

	Files filtered out of every Space deploy are listed in `.gitattributes` (`export-ignore`).

	### Branching

	`main` is canonical. Use short-lived `feat/<thing>` branches, open a PR, merge, delete. Never maintain a parallel deploy branch — every change on main reaches both Spaces via CI.

	### AWS (alternative GPU host)

	```bash
	# On a g4dn.xlarge instance
	sudo apt update && sudo apt install -y ffmpeg
	curl -LsSf https://astral.sh/uv/install.sh | sh
	# Pick one TTS extra, as in the first-time setup (chatterbox or omnivoice)
	uv sync --extra chatterbox
	uv run python server.py
	```

	Recommended: use a `systemd` service for auto-restart, CloudFront as the CDN, and S3 for video storage with a 24h auto-delete lifecycle policy.
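
The auto-delete policy can be expressed as an S3 lifecycle configuration. This is a minimal sketch, assuming the dictionary is passed to boto3's `put_bucket_lifecycle_configuration` as `LifecycleConfiguration`. The rule ID and key prefix are hypothetical. S3 counts expiration in whole days, so one day is the closest match to 24 hours.

```python
import json


def one_day_expiry_rule(prefix=""):
    """Build a lifecycle configuration that expires objects after one day."""
    return {
        "Rules": [
            {
                "ID": "expire-videos-after-1-day",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "Expiration": {"Days": 1},
            }
        ]
    }


# Print the JSON form, which can also be saved and applied with the AWS CLI.
print(json.dumps(one_day_expiry_rule(), indent=2))
```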

	---

	## License

	MIT License — see [LICENSE](LICENSE).