Sesame CSM-1B Acholi TTS – v2
Fine-tuned version of Sesame CSM-1B for Acholi (ach) text-to-speech synthesis using LoRA, trained on an expanded multi-source Acholi speech dataset.
Part of a master's thesis at Makerere University on offline English-to-Acholi movie translation.
Model Details
| Field | Value |
|---|---|
| Base model | unsloth/csm-1b (Sesame Conversational Speech Model, 1B params) |
| Fine-tuning method | LoRA (Low-Rank Adaptation) |
| LoRA rank | r=32, alpha=64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training steps | 15,000 (early stopped at step 14,200) |
| Best checkpoint | step 14,200 |
| Best eval_loss | 8.368 |
| Test eval_loss | 8.325 |
| Training time | ~7h 37min (L40S 46GB GPU) |
| Language | Acholi (ach) |
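To give a rough sense of adapter size, a LoRA adapter on a weight of shape (d_out, d_in) adds r·(d_in + d_out) trainable parameters, so trainable parameters scale linearly with rank. The sketch below computes this per decoder block for hypothetical Llama-style layer dimensions (the true CSM-1B dimensions are not listed here and may differ); only the rank values and target-module names come from the tables in this card.

```python
# Each adapted weight W (d_out x d_in) gains two low-rank factors
# A (r x d_in) and B (d_out x r): r * (d_in + d_out) extra trainable params.

# Hypothetical shapes (d_out, d_in) for one decoder block; the real CSM-1B
# dimensions may differ. Module names match the target-module list above.
HIDDEN, INTERMEDIATE = 2048, 8192
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

def lora_params_per_block(rank: int) -> int:
    """Trainable LoRA parameters added to one decoder block at a given rank."""
    return sum(rank * (d_in + d_out) for d_out, d_in in MODULES.values())

# r=32 trains 4x fewer adapter parameters than the v1 setting of r=128.
print(lora_params_per_block(32), lora_params_per_block(128))
```

This 4x reduction is the memory/time saving behind the r=128 → r=32 change discussed under Key Design Decisions.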
Training Data
| Source | Split | Samples | Notes |
|---|---|---|---|
| Sunbird/salt studio-ach | train | 4,062 | Studio-quality, single speaker |
| google/WaxalNLP ach_tts | train | 309 | Community speech |
| acellam/acholi-english-movie-dialogue | train | 1,743 | Movie dialogue, 16 kHz |
| Total | | 6,114 | |
All audio resampled to 24,000 Hz. Samples filtered to 0.5–8 seconds duration.
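The duration filter above can be sketched as a simple predicate over raw sample counts; the sample rate and thresholds come from the text, while the helper name and example clip lengths are illustrative.

```python
TARGET_SR = 24_000           # training sample rate (Hz)
MIN_SEC, MAX_SEC = 0.5, 8.0  # duration bounds used when filtering samples

def keep_sample(num_samples: int, sr: int = TARGET_SR) -> bool:
    """Return True if an audio clip falls inside the 0.5-8 s window."""
    duration = num_samples / sr
    return MIN_SEC <= duration <= MAX_SEC

print(keep_sample(24_000))   # 1.0 s clip  -> True
print(keep_sample(6_000))    # 0.25 s clip -> False
print(keep_sample(240_000))  # 10 s clip   -> False
```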
Training Configuration
```
learning_rate = 5e-6
lr_scheduler_type = "cosine"
warmup_steps = 1000
max_steps = 15000
per_device_train_batch_size = 2
gradient_accumulation_steps = 4   # effective batch size = 8
weight_decay = 0.05
fp16 / bf16 = True (auto)
early_stopping_patience = 20      # 2000-step tolerance
early_stopping_threshold = 0.001
gradient_checkpointing = "unsloth"
```
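The warmup and cosine settings above imply the learning-rate curve sketched below. This is a plain re-implementation of standard linear-warmup cosine decay under the stated hyperparameters; the trainer's exact formula may differ slightly at the boundaries.

```python
import math

BASE_LR, WARMUP, MAX_STEPS = 5e-6, 1_000, 15_000

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR over WARMUP steps, then cosine decay to 0."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = (step - WARMUP) / (MAX_STEPS - WARMUP)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))       # 0.0 at the start of warmup
print(lr_at(1_000))   # peak learning rate: 5e-06
print(lr_at(15_000))  # decays to ~0 at max_steps
```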
Experiment History
| Version | Dataset | LoRA rank | Best eval_loss | Best step | Notes |
|---|---|---|---|---|---|
| v1 | SALT studio-ach only (~3,500 samples) | r=128, alpha=128 | 5.031 | 3,000 | Studio-only, early plateau |
| v2 | SALT + WaxalNLP + movie dialogue (6,114 total) | r=32, alpha=64 | 8.368 | 14,200 | Multi-source, longer training |
Note on loss comparison: v1's lower loss (5.031 vs 8.368) reflects training on a narrow, homogeneous single-speaker studio dataset, not a better model. v2 was trained on a more diverse, multi-speaker, multi-domain dataset which is inherently harder to fit. The expected benefit is better generalization to real movie speech. Perceptual quality comparison (MOS evaluation) is ongoing.
Key Design Decisions (v2)
- LoRA rank reduced from r=128 to r=32: mirrors Spark-TTS findings that rank has minimal impact when data is the bottleneck; reduces memory and training time.
- Early stopping patience increased from 10 to 20: allows 2,000 steps of tolerance to avoid premature stopping on the noisier multi-source dataset.
- save_total_limit=2: disk safety; each Sesame checkpoint is ~4 GB.
- speaker_id via cast_column: schema-level dtype fix required for multi-source dataset concatenation.
- Movie audio resampled from 44,100 Hz to 16 kHz: the native movie audio rate was corrected at preprocessing (all sources are then resampled to 24,000 Hz for training, as noted above).
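The speaker_id dtype issue can be illustrated without the datasets library: concatenation breaks when one source stores speaker IDs as strings and another as integers, so every source is cast to a single dtype first (in datasets this is `cast_column("speaker_id", Value("int64"))`; the toy normalizer below mimics that fix in plain Python, with hypothetical rows).

```python
# Toy illustration of the schema fix: two sources store speaker_id with
# different dtypes, which breaks naive concatenation. Casting every source
# to one dtype (here int) mirrors cast_column("speaker_id", Value("int64")).
def normalize_speaker_id(row: dict) -> dict:
    return {**row, "speaker_id": int(row["speaker_id"])}

salt_rows = [{"speaker_id": "0", "text": "salt sample"}]   # string ids
movie_rows = [{"speaker_id": 1, "text": "movie sample"}]   # integer ids

merged = [normalize_speaker_id(r) for r in salt_rows + movie_rows]
print({type(r["speaker_id"]).__name__ for r in merged})  # only int remains
```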
Usage
```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model and processor
base_model = CsmForConditionalGeneration.from_pretrained("unsloth/csm-1b")
processor = AutoProcessor.from_pretrained("unsloth/csm-1b")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "acellam/sesame-csm-salt-ach-v2")
model.eval()

# Inference
text = "Kwo me dano obedo maber tutwal."  # Example Acholi text
conversation = [{"role": "0", "content": [{"type": "text", "text": text}]}]
inputs = processor.apply_chat_template(
    conversation, tokenize=True, return_dict=True, return_tensors="pt"
)
with torch.no_grad():
    output = model.generate(**inputs)
```
Related Models & Resources
| Resource | Link |
|---|---|
| Spark-TTS v11 (Acholi) | acellam/spark-tts-salt-ach-v11 |
| Spark-TTS v10 (Acholi) | acellam/spark-tts-salt-ach-v10 |
| Sesame CSM-1B v1 (Acholi) | acellam/spark-tts-salt-sesame |
| Sunbird SparkTTS (SALT) | Sunbird/spark-tts-salt |
| Training dataset | acellam/acholi-english-movie-dialogue |
| Training script | GitLab: sesame-tts-training |
| Live demo / dashboard | lebsync.com |
Citation
@mastersthesis{acellam2026acholi,
title = {Offline Speech Translation System for English Movies into Acholi Using AI Techniques},
author = {Guy Acellam},
school = {Makerere University, School of Computing and Informatics Technology},
year = {2026},
note = {Supervisor: Dr. Rose Nakibuule. Student ID: 2017/HD05/84U}
}