Chatterbox French Cross-Lingual Voice Cloning LoRA (Rank 32)

This repository contains a highly optimized LoRA adapter for the Chatterbox Multilingual TTS model, fine-tuned specifically for high-fidelity English-to-French cross-lingual voice cloning.

Model Description

  • Base Architecture: Chatterbox Multilingual TTS
  • Adapter Type: LoRA (Low-Rank Adaptation)
  • Optimization Strategy: Sparse Dataset Formulation (2 references per target)
  • LoRA Rank: 32 | Alpha: 64
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Performance Metrics (RTX 4090, 100 Samples)

By switching to a low-redundancy "Sparse" dataset and training with Rank 32, this model achieves near-zero-shot levels of speaker preservation while significantly improving French phonetic accuracy.

  • Speaker Similarity: 0.8452 (Out of 1.0)
  • WER (Word Error Rate): 0.0758
  • chrF (Format/Prosody): 93.57
  • PESQ: 1.22
  • MCD (Mel-Cepstral Distortion): 13.29

How to use for Inference

You must have the Chatterbox framework installed to use this adapter.

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# 1. Define LoRA mapping helper
class LoRALayer(nn.Module):
    def __init__(self, original, rank=32, alpha=64.0, dropout=0.05):
        super().__init__()
        self.original_layer = original
        self.scaling = alpha / rank
        in_f, out_f = original.in_features, original.out_features
        dev, dt = original.weight.device, original.weight.dtype
        self.lora_A = nn.Parameter(torch.zeros(rank, in_f, device=dev, dtype=dt))
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank, device=dev, dtype=dt))
        self.lora_dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        for p in self.original_layer.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.original_layer(x) + (self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T * self.scaling)

# 2. Load Base Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChatterboxMultilingualTTS.from_pretrained(device=device)

# 3. Download and Inject LoRA
lora_path = hf_hub_download(repo_id="amanuelbyte/chatterbox-fr-lora-v2", filename="best_lora_adapter.pt")
payload = torch.load(lora_path, map_location=device, weights_only=True)
lora_sd = payload.get("lora_state_dict", payload)

targets = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
for name, module in model.t3.named_modules():
    if isinstance(module, nn.Linear) and any(x in name for x in targets):
        parent_name, child_name = ".".join(name.split(".")[:-1]), name.split(".")[-1]
        parent = model.t3.get_submodule(parent_name)
        setattr(parent, child_name, LoRALayer(module))

# Load the weights into the injected layers
current_sd = model.t3.state_dict()
for key, value in lora_sd.items():
    if key in current_sd:
        current_sd[key] = value.to(device)
model.t3.load_state_dict(current_sd, strict=False)
model.t3.eval()

# 4. Generate Audio
import torchaudio
text = "Bonjour, bienvenue à cette démonstration de clonage vocal cross-lingue!"
ref_audio_path = "path/to/english_voice.wav"

with torch.inference_mode():
    wav = model.generate(text, audio_prompt_path=ref_audio_path, language_id="fr")
    
torchaudio.save("output_french.wav", wav.cpu().unsqueeze(0), 16000)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support