# Voxtral 3B — Haitian Creole LoRA Adapter (v3)
First-ever fine-tuning of Mistral's Voxtral speech-to-text model for Haitian Creole, a language outside the base model's 13 supported languages. This LoRA adapter adds Haitian Creole (Kreyòl Ayisyen) transcription to a model that previously had no capability in the language.
## Model Description
This is a PEFT LoRA adapter for mistralai/Voxtral-Mini-3B-2507, Mistral's offline speech-to-text model based on a Whisper-style encoder with a Mistral LLM decoder. The adapter was trained on the CMU Haitian Creole Speech dataset to add Haitian Creole transcription capability.
- Developed by: Chad Hendren (AcmeClaw)
- Model type: LoRA adapter for encoder-decoder speech-to-text
- Language: Haitian Creole (ht)
- License: Apache 2.0
- Base model: mistralai/Voxtral-Mini-3B-2507
- Training infrastructure: HuggingFace Jobs, NVIDIA L40S 48GB
- Total training cost: ~$2.70
## Benchmark Results
### Quantitative Evaluation (500 held-out samples)
| Metric | Base Model | Fine-tuned (v2) | Improvement |
|---|---|---|---|
| Word Error Rate (WER) | 101.4% | 9.7% | 91.7 points (90.4% relative) |
| Character Error Rate (CER) | 64.0% | 3.1% | 60.9 points |
| Exact Match Rate | 0.0% | 38.4% | +38.4 points |
| Exact Matches | 0 / 500 | 192 / 500 | — |
The base model achieved zero exact matches on the 500 Haitian Creole audio samples and produced a WER above 100% (word insertions on top of substitutions push WER past 100%). The fine-tuned model transcribes 192 of the 500 sentences exactly and achieves a 9.7% word error rate.
### Sample Transcriptions (held-out test set)
| # | Reference (ground truth) | Fine-tuned Output | Match |
|---|---|---|---|
| 1 | dezyèm jou a, se te tou pa netanèl, pitit gason wa a, chèf branch fanmi izaka a. | dezyèm jou a, se te tou pa netanèl, pitit gason wa a, chèf branch fanmi izaka a. | EXACT |
| 2 | koulye a, mwen pral ba ou yon konsèy si ou vle sove lavi ou ansanm ak lavi salomon, pitit gason ou lan. | koulye a, mwen pral ba ou yon konsèy si ou vle sove lavi ou ansanm ak lavi salomon, pitit gason ou lan. | EXACT |
| 3 | nan mòn peyi efrayim yo, ant lavil rama ak lavil betèl, te gen yon pye palmis yo te rele palmis debora. | nan mòn peyi efrayim yo, ant lavil rama ak lavil betèl, te gen yon pye palmis yo te rele palmis debora. | EXACT |
| 4 | li finalman vin deside fè yon pa pou sitadèl ak palè sansousi wa kristòf la premye a fin restore 1988. | li finalman vin deside fè yon pa pou sitadèl ak palè sansousi wa kristòf la, premye a, fin restore an 1988. | Near (punctuation) |
| 5 | anndan vil pòdepè genyen 39 biwo vòt pou anviwonn 15600 moun ki enskri sou lis elektè yo. | an dan vil pòdepè genyen 39 biwo vòt pou anviwon, 15 600 moun ki enskri sou lis elektè yo. | Near (spacing) |
### Error Analysis
The remaining errors on the 500-sample benchmark fall into these categories:
- Punctuation and spacing (~40% of errors): Missing or added commas, word boundary differences (e.g., "anndan" vs "an dan"). The semantic content is correct.
- Number formatting (~15% of errors): Written numbers vs digit representation (e.g., "twasan" vs "300", "sèt" vs "7").
- Minor word substitutions (~30% of errors): Individual word differences in otherwise correct sentences (e.g., "yon" vs "on", "zansèt" vs "sansèl").
- Significant transcription errors (~15% of errors): Multiple wrong words, typically on proper nouns or rare vocabulary.
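Since over half of the residual errors are purely orthographic, a simple normalization pass makes the punctuation/spacing category easy to quantify. A minimal stdlib sketch (the `normalize` helper is illustrative, not part of the evaluation pipeline):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and remove all whitespace before comparing.

    Removing all whitespace (not just collapsing runs) also neutralizes
    digit-grouping differences such as "15600" vs. "15 600".
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", text)

# A "Near (punctuation)" fragment from the sample table collapses to an exact match:
ref = "wa kristòf la premye a"
hyp = "wa kristòf la, premye a,"
print(normalize(ref) == normalize(hyp))
```

Re-scoring with a normalizer like this turns the "Near" rows that differ only in punctuation or digit grouping into exact matches, isolating the genuinely wrong words.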
## Validation
### Validation 1: Quantitative Benchmark (500 samples)
Both the base model and the fine-tuned model were evaluated on 500 held-out samples from the CMU Haitian Creole Speech dataset. These samples were excluded from training using a fixed random seed (seed=42), so the model never saw them during training. Both models received identical audio inputs processed through the same VoxtralProcessor.apply_transcription_request() pipeline. Word Error Rate (WER), Character Error Rate (CER), and Exact Match Rate were computed with the jiwer library.
The base model produced 101.4% WER (above 100% due to word insertions) and zero exact matches, confirming it has no Haitian Creole capability. The fine-tuned model achieved 9.7% WER with 192 perfect transcriptions.
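For reference, the metric jiwer reports is word-level edit distance divided by the reference length. A self-contained sketch of the computation (the actual benchmark used `jiwer.wer`, not this helper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words.

    Because insertions count against a fixed reference length, WER can exceed
    100%, as it does for the base model above (101.4%).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```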
### Validation 2: Blind Test on Real-World Audio (External Source)
To verify the model generalizes beyond the training dataset, we tested on audio from a completely independent source: Pawòl Lakay, a Haitian Creole language instruction course hosted on the Internet Archive. This audio features native Creole speakers in educational dialogues and was recorded independently from the CMU training data.
Five 25-second clips were extracted from different segments of the CD and processed through both models. The results demonstrate the fundamental capability difference:
| Clip | Base Model | Fine-tuned Model |
|---|---|---|
| Greeting dialogue | "Good evening, sir. I am Roselot, little man. And you, sir?" | "bonswa mesye, mwen rele ozlò, pitit òm, e ou menm, ki jan mwen rele?" |
| Famous Haitians | "Let's listen to some famous Haitians. Wycliffe Jean is a Haitian musician..." | "an nou koute sa m famous haychens. wayklè ljan se yon mizisyen ayisyen ki konn jwe mizik rap..." |
| Practice prompt | "Let's practice." | "anou pratike, write down the missing words in the list." |
The base model translates the Creole audio into English summaries. The fine-tuned model transcribes the actual Creole words spoken. This is the critical distinction: the base model recognizes that speech is occurring and attempts to interpret it in a supported language, while the fine-tuned model produces the actual Creole text.
Note: The Pawòl Lakay course is bilingual (Creole + English instruction). The fine-tuned model correctly transcribes the Creole portions but sometimes attempts to render English instruction segments as Creole, which is expected behavior for a Creole-specific model.
### Validation 3: TTS Round-Trip Test
Five Haitian Creole sentences were synthesized using Mistral's Voxtral TTS model (voxtral-mini-tts-2603) with a standard voice preset and then transcribed by both models. This test used sentences not present in any training data:
- "Bonjou, ki jan ou ye jodi a? Mwen espere ou byen."
- "Ayiti se yon bèl peyi ki gen anpil moun ki travay di chak jou."
- "Timoun yo ap jwe nan lakou a pandan manman yo ap fè manje."
- "Nou bezwen travay ansanm pou nou ka bati yon avni pi bon pou tout moun."
- "Dlo a pa t pwòp, men yo te bwè l kanmenm paske yo te swaf anpil."
The base model produced English translations or gibberish (111.6% WER). The fine-tuned model achieved 55.1% WER on these samples, notably worse than the 9.7% on native speech, because the TTS voice speaks with a French/English accent rather than native Creole pronunciation. The model was trained on native Creole speakers and performs best on that acoustic profile; that it still far outperforms the base model on an unfamiliar accent suggests it has learned Creole phonology rather than memorized surface patterns.
### Validation 4: Training Convergence
Training loss across 5,000 steps showed healthy convergence without signs of catastrophic divergence:
| Epoch | Loss |
|---|---|
| 0.1 | 1.306 |
| 0.5 | 0.463 |
| 1.0 | 0.370 |
| 2.0 | 0.236 |
| 3.0 | 0.205 |
| 4.0 | 0.133 |
| 5.0 | 0.089 |
| 6.0 | 0.062 |
| 6.8 (final) | 0.049 |
Gradient norms remained stable throughout (0.3-1.2 range), with no NaN events on CUDA. The cosine learning rate schedule smoothly decayed from 3e-5 to near zero.
## Development Process
This model was developed through an iterative experimental process:
### Phase 1: Architecture Exploration
We first attempted fine-tuning the Voxtral Realtime 4B model (Voxtral-Mini-4B-Realtime-2602) on Apple Silicon (M4 Max, 128GB) — this was the first-ever training of that model. While the training pipeline worked (forward/backward pass, loss convergence), the Realtime model's DSM (Delayed Streams Modeling) architecture with element-wise SUM fusion between audio and text embeddings caused repetition collapse during autoregressive generation on unseen languages. Both frozen-encoder and unfrozen-encoder configurations were tested with forced alignment via torchaudio MMS. The model learned Creole vocabulary (outputting real Creole words instead of French) but could not learn when to produce them, resulting in degenerate repetition loops.
### Phase 2: Local Training on MLX
We then attempted the Voxtral 3B Offline model via mlx-tune on Apple Silicon. This uncovered three compounding bugs in mlx-tune v0.4.11's Voxtral integration: bf16 gradient accumulation producing NaN (~33% of steps), the STTModelWrapper.transcribe() method passing an unsupported task= parameter, and the processor wrapper replacing the native VoxtralProcessor and breaking generate(). A custom training loop with NaN-safe gradient clipping achieved loss convergence (14 to 7) but the model weights were numerically damaged from the high NaN skip rate, preventing valid inference.
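The NaN-safe gradient clipping idea from that custom loop can be sketched framework-independently: skip any optimizer step whose gradients contain NaN/Inf, and clip the rest by global norm. An illustrative pure-Python sketch (not the actual MLX training code; `nan_safe_clip` is a hypothetical helper name):

```python
import math

def nan_safe_clip(grads, max_norm=1.0):
    """Return (grads, applied).

    If any gradient value is NaN/Inf, skip the optimizer step entirely
    (returning None); otherwise clip the global gradient norm to `max_norm`.
    `grads` is a list of flat per-parameter gradient lists.
    """
    flat = [x for g in grads for x in g]
    if any(not math.isfinite(x) for x in flat):
        return None, False                      # skip this optimizer step
    norm = math.sqrt(sum(x * x for x in flat))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [[x * scale for x in g] for g in grads]
    return grads, True
```

The drawback seen in Phase 2 follows directly from this design: if a large fraction of steps trip the NaN check (~33% here), the effective number of updates drops and the surviving updates can leave the weights in a degraded state.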
### Phase 3: Cloud Training with Proven Patterns
We moved to CUDA-based training on HuggingFace Jobs (L40S GPU), using the proven data collator pattern from Deep-unlearning/Finetune-Voxtral-ASR. Key technical decisions:
- Used `VoxtralProcessor.apply_transcription_request()` for correct audio prompt construction with `[AUDIO]` tokens
- Pinned `datasets>=3.0,<4.0` to avoid torchcodec dependency issues
- Used `task_type="SEQ_2_SEQ_LM"` in the LoRA configuration (not `CAUSAL_LM`)
- Set `remove_unused_columns=False` in `TrainingArguments`
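Under those decisions, the v2 run can be sketched roughly as the following configuration. The `target_modules` list is an assumption (typical attention projections), since the card does not document which modules carried LoRA weights; the other values come from the Training Configuration table:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA setup for the v2 run; target_modules is an assumption, not documented here.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    task_type="SEQ_2_SEQ_LM",  # not CAUSAL_LM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="voxtral-creole-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch = 8
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=5000,
    bf16=True,
    max_grad_norm=1.0,
    remove_unused_columns=False,     # keep audio columns for the custom collator
)
```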
v1 (LoRA r=16, encoder frozen, 2000 steps) achieved 11.8% WER on 500 samples. v2 (LoRA r=32, encoder unfrozen, 5000 steps, cosine schedule) improved to 9.7% WER.
## Training Configuration
| Parameter | v1 | v2 (this model) |
|---|---|---|
| LoRA rank (r) | 16 | 32 |
| LoRA alpha | 32 | 64 |
| Trainable params | 14.7M (0.31%) | 29.5M (0.63%) |
| Encoder | Frozen | Unfrozen (LoRA) |
| Learning rate | 5e-5 (linear) | 3e-5 (cosine) |
| Steps | 2,000 | 5,000 |
| Epochs | ~2.5 | ~6.8 |
| Warmup steps | 50 | 100 |
| Batch size | 2 | 2 |
| Gradient accumulation | 4 | 4 |
| Effective batch | 8 | 8 |
| Precision | bf16 | bf16 |
| Max grad norm | 1.0 | 1.0 |
| Final training loss | 0.377 | 0.049 |
| WER (500 samples) | 11.8% | 9.7% |
| Training time | 38 min | 93 min |
| GPU | L40S 48GB | L40S 48GB |
| Cost | $1.14 | $2.70 |
## Usage
### Requirements

```shell
pip install transformers peft torch
```

### Inference
```python
import torch
import soundfile as sf
from transformers import VoxtralForConditionalGeneration, VoxtralProcessor
from peft import PeftModel

MODEL_ID = "mistralai/Voxtral-Mini-3B-2507"
ADAPTER_ID = "chendren/voxtral-creole-lora-v3"

# Load base model + LoRA adapter
processor = VoxtralProcessor.from_pretrained(MODEL_ID)
model = VoxtralForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.config = model.base_model.model.config  # Required for generate()

# Load audio (16 kHz mono)
audio, sr = sf.read("path/to/creole_audio.wav")

# Transcribe
inputs = processor.apply_transcription_request(
    language="en",
    model_id=MODEL_ID,
    audio=[audio],
    format=["WAV"],
    return_tensors="pt",
)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v
          for k, v in inputs.items()}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

text = processor.batch_decode(output, skip_special_tokens=True)[0].strip()
print(text)
```
### Important Notes
- Use `VoxtralProcessor`, not `AutoProcessor`
- Use `apply_transcription_request()` for both training and inference
- Set `model.config = model.base_model.model.config` after loading the PEFT adapter; this is required for `generate()` to work
- Audio must be 16 kHz mono WAV
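If your source audio is stereo or at a different sample rate, the 16 kHz mono requirement can be met with a quick conversion helper. A minimal numpy sketch (illustrative only; a dedicated resampler such as librosa or torchaudio is preferable for quality):

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz.

    Linear interpolation is a rough sketch; use a proper polyphase or
    band-limited resampler for production transcription.
    """
    if audio.ndim == 2:                     # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != target_sr:
        n_out = int(round(len(audio) * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)
```

The returned array can be passed directly as the `audio` argument to `apply_transcription_request()` in the inference snippet above.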
## Limitations and Bias
- Domain coverage: Trained primarily on biblical and educational text from the CMU Haitian Creole Speech dataset. Performance on other domains (news, casual conversation, medical, legal) has not been evaluated and may be lower.
- Speaker diversity: The training set contains a limited number of speakers. The model may perform worse on voices significantly different from the training distribution.
- Accent sensitivity: Optimized for native Haitian Creole pronunciation. Non-native speakers or speakers with strong regional accents may see higher error rates (confirmed by TTS round-trip test showing 55% WER on French-accented Creole vs 9.7% on native speech).
- Code-switching: In bilingual audio, the model attempts to transcribe English portions as Creole, producing incorrect output for non-Creole segments.
- Number handling: Numbers may appear inconsistently as digits or written words.
- Punctuation: The model occasionally differs from ground truth in comma placement and word boundary segmentation, though semantic content is preserved.
## Citation

```bibtex
@misc{hendren2026voxtralcreole,
  title={Voxtral 3B Haitian Creole LoRA Adapter},
  author={Chad Hendren},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/chendren/voxtral-creole-lora-v3}
}
```
## Acknowledgments
- Mistral AI for the Voxtral model family
- CMU Haitian Creole Speech dataset
- Deep-unlearning/Finetune-Voxtral-ASR for the proven data collator pattern
- Pawòl Lakay for real-world validation audio
- Built with Claude Code by Anthropic