Voxtral 3B — Haitian Creole LoRA Adapter (v3)

First-ever fine-tuning of Mistral's Voxtral speech-to-text model for Haitian Creole, a language not among the base model's 13 supported languages. This LoRA adapter adds Haitian Creole (Kreyòl Ayisyen) transcription to a model that previously had no capability in the language.

Model Description

This is a PEFT LoRA adapter for mistralai/Voxtral-Mini-3B-2507, Mistral's offline speech-to-text model based on a Whisper-style encoder with a Mistral LLM decoder. The adapter was trained on the CMU Haitian Creole Speech dataset to add Haitian Creole transcription capability.

  • Developed by: Chad Hendren (AcmeClaw)
  • Model type: LoRA adapter for encoder-decoder speech-to-text
  • Language: Haitian Creole (ht)
  • License: Apache 2.0
  • Base model: mistralai/Voxtral-Mini-3B-2507
  • Training infrastructure: HuggingFace Jobs, NVIDIA L40S 48GB
  • Total training cost: ~$2.70

Benchmark Results

Quantitative Evaluation (500 held-out samples)

| Metric | Base Model | Fine-tuned (v2) | Improvement |
|---|---|---|---|
| Word Error Rate (WER) | 101.4% | 9.7% | 91.7 points (90.4% relative) |
| Character Error Rate (CER) | 64.0% | 3.1% | 60.9 points |
| Exact Match Rate | 0.0% | 38.4% | +38.4 points |
| Exact Matches | 0 / 500 | 192 / 500 | |

The base model achieved zero exact matches on the 500 Haitian Creole audio samples and produced a WER above 100% (insertions can push WER past 100%). The fine-tuned model transcribes 192 of the 500 sentences exactly and achieves a 9.7% word error rate.

Sample Transcriptions (held-out test set)

| # | Reference (ground truth) | Fine-tuned Output | Match |
|---|---|---|---|
| 1 | dezyèm jou a, se te tou pa netanèl, pitit gason wa a, chèf branch fanmi izaka a. | dezyèm jou a, se te tou pa netanèl, pitit gason wa a, chèf branch fanmi izaka a. | EXACT |
| 2 | koulye a, mwen pral ba ou yon konsèy si ou vle sove lavi ou ansanm ak lavi salomon, pitit gason ou lan. | koulye a, mwen pral ba ou yon konsèy si ou vle sove lavi ou ansanm ak lavi salomon, pitit gason ou lan. | EXACT |
| 3 | nan mòn peyi efrayim yo, ant lavil rama ak lavil betèl, te gen yon pye palmis yo te rele palmis debora. | nan mòn peyi efrayim yo, ant lavil rama ak lavil betèl, te gen yon pye palmis yo te rele palmis debora. | EXACT |
| 4 | li finalman vin deside fè yon pa pou sitadèl ak palè sansousi wa kristòf la premye a fin restore 1988. | li finalman vin deside fè yon pa pou sitadèl ak palè sansousi wa kristòf la, premye a, fin restore an 1988. | Near (punctuation) |
| 5 | anndan vil pòdepè genyen 39 biwo vòt pou anviwonn 15600 moun ki enskri sou lis elektè yo. | an dan vil pòdepè genyen 39 biwo vòt pou anviwon, 15 600 moun ki enskri sou lis elektè yo. | Near (spacing) |

Error Analysis

The remaining errors on the 500-sample benchmark fall into these categories:

  • Punctuation and spacing (~40% of errors): Missing or added commas, word boundary differences (e.g., "anndan" vs "an dan"). The semantic content is correct.
  • Number formatting (~15% of errors): Written numbers vs digit representation (e.g., "twasan" vs "300", "sèt" vs "7").
  • Minor word substitutions (~30% of errors): Individual word differences in otherwise correct sentences (e.g., "yon" vs "on", "zansèt" vs "sansèl").
  • Significant transcription errors (~15% of errors): Multiple wrong words, typically on proper nouns or rare vocabulary.
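As a rough illustration of how the punctuation/spacing bucket can be separated from genuine word errors, each (reference, hypothesis) pair can be compared after stripping punctuation and whitespace. This is a sketch; the exact normalization behind the percentages above is an assumption:

```python
import re

def classify(reference, hypothesis):
    """Rough error-bucket assignment for one (reference, hypothesis) pair.
    The normalization rules here are illustrative, not necessarily the
    ones used to produce the benchmark percentages."""
    if reference == hypothesis:
        return "exact"
    # Drop punctuation, then all whitespace, so "15 600" == "15600"
    # and a stray comma does not count as a word error.
    norm = lambda t: re.sub(r"[^\w\s]", "", t.lower())
    squash = lambda t: re.sub(r"\s+", "", norm(t))
    if squash(reference) == squash(hypothesis):
        return "punctuation/spacing"
    return "word-level difference"
```

Pairs that survive this normalization carry the same letter/digit content and differ only in surface form; everything else falls into the substitution or significant-error buckets.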

Validation

Validation 1: Quantitative Benchmark (500 samples)

Both the base model and fine-tuned model were evaluated on 500 held-out samples from the CMU Haitian Creole Speech dataset. These samples were excluded from training using a fixed random seed (seed=42), ensuring the model never saw them during training. Both models received identical audio inputs processed through the same VoxtralProcessor.apply_transcription_request() pipeline. Word Error Rate (WER), Character Error Rate (CER), and Exact Match Rate were computed using the jiwer library.

The base model produced 101.4% WER (above 100% due to word insertions) and zero exact matches, confirming it has no Haitian Creole capability. The fine-tuned model achieved 9.7% WER with 192 perfect transcriptions.
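The three metrics can be reproduced with a small self-contained sketch. The benchmark itself used the jiwer library; this Levenshtein-based version is illustrative and also shows how insertions push WER above 100%:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling 1-D DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def exact_match_rate(refs, hyps):
    """Fraction of hypotheses identical to their references."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

A hypothesis much longer than its reference (all insertions) yields a WER above 1.0, which is how the base model's 101.4% arises.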

Validation 2: Blind Test on Real-World Audio (External Source)

To verify the model generalizes beyond the training dataset, we tested on audio from a completely independent source: Pawòl Lakay, a Haitian Creole language instruction course hosted on the Internet Archive. This audio features native Creole speakers in educational dialogues and was recorded independently from the CMU training data.

Five 25-second clips were extracted from different segments of the CD and processed through both models. The results demonstrate the fundamental capability difference:

| Clip | Base Model | Fine-tuned Model |
|---|---|---|
| Greeting dialogue | "Good evening, sir. I am Roselot, little man. And you, sir?" | "bonswa mesye, mwen rele ozlò, pitit òm, e ou menm, ki jan mwen rele?" |
| Famous Haitians | "Let's listen to some famous Haitians. Wycliffe Jean is a Haitian musician..." | "an nou koute sa m famous haychens. wayklè ljan se yon mizisyen ayisyen ki konn jwe mizik rap..." |
| Practice prompt | "Let's practice." | "anou pratike, write down the missing words in the list." |

The base model translates the Creole audio into English summaries. The fine-tuned model transcribes the actual Creole words spoken. This is the critical distinction: the base model recognizes that speech is occurring and attempts to interpret it in a supported language, while the fine-tuned model produces the actual Creole text.

Note: The Pawòl Lakay course is bilingual (Creole + English instruction). The fine-tuned model correctly transcribes the Creole portions but sometimes attempts to render English instruction segments as Creole, which is expected behavior for a Creole-specific model.

Validation 3: TTS Round-Trip Test

Five Haitian Creole sentences were synthesized using Mistral's Voxtral TTS model (voxtral-mini-tts-2603) with a standard voice preset and then transcribed by both models. This test used sentences not present in any training data:

  • "Bonjou, ki jan ou ye jodi a? Mwen espere ou byen."
  • "Ayiti se yon bèl peyi ki gen anpil moun ki travay di chak jou."
  • "Timoun yo ap jwe nan lakou a pandan manman yo ap fè manje."
  • "Nou bezwen travay ansanm pou nou ka bati yon avni pi bon pou tout moun."
  • "Dlo a pa t pwòp, men yo te bwè l kanmenm paske yo te swaf anpil."

The base model produced English translations or gibberish (111.6% WER). The fine-tuned model achieved 55.1% WER on these samples — notably worse than the 9.7% on native speech, because the TTS voice speaks with a French/English accent rather than native Creole pronunciation. The model was trained on native Creole speakers and performs best on that acoustic profile. This result confirms the model is genuinely learning Creole phonology, not just memorizing surface patterns.

Validation 4: Training Convergence

Training loss across 5,000 steps showed healthy convergence without signs of catastrophic divergence:

| Epoch | Loss |
|---|---|
| 0.1 | 1.306 |
| 0.5 | 0.463 |
| 1.0 | 0.370 |
| 2.0 | 0.236 |
| 3.0 | 0.205 |
| 4.0 | 0.133 |
| 5.0 | 0.089 |
| 6.0 | 0.062 |
| 6.8 (final) | 0.049 |

Gradient norms remained stable throughout (0.3-1.2 range), with no NaN events on CUDA. The cosine learning rate schedule smoothly decayed from 3e-5 to near zero.
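The schedule described above can be sketched as follows, using the v2 warmup and step counts; the exact scheduler internals in transformers may differ slightly:

```python
import math

PEAK_LR = 3e-5       # v2 peak learning rate
WARMUP_STEPS = 100   # v2 warmup
TOTAL_STEPS = 5000   # v2 training steps

def lr_at(step):
    """Linear warmup to the peak LR, then cosine decay toward zero,
    mirroring a standard 'cosine' schedule with warmup."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 2,550 (the midpoint of the decay phase) the rate is half the peak, and by step 5,000 it is effectively zero.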

Development Process

This model was developed through an iterative experimental process:

Phase 1: Architecture Exploration

We first attempted fine-tuning the Voxtral Realtime 4B model (Voxtral-Mini-4B-Realtime-2602) on Apple Silicon (M4 Max, 128GB) — this was the first-ever training of that model. While the training pipeline worked (forward/backward pass, loss convergence), the Realtime model's DSM (Delayed Streams Modeling) architecture with element-wise SUM fusion between audio and text embeddings caused repetition collapse during autoregressive generation on unseen languages. Both frozen-encoder and unfrozen-encoder configurations were tested with forced alignment via torchaudio MMS. The model learned Creole vocabulary (outputting real Creole words instead of French) but could not learn when to produce them, resulting in degenerate repetition loops.

Phase 2: Local Training on MLX

We then attempted the Voxtral 3B Offline model via mlx-tune on Apple Silicon. This uncovered three compounding bugs in mlx-tune v0.4.11's Voxtral integration: bf16 gradient accumulation producing NaN on roughly a third of steps, the STTModelWrapper.transcribe() method passing an unsupported task= parameter, and a processor wrapper that replaced the native VoxtralProcessor and broke generate(). A custom training loop with NaN-safe gradient clipping brought the loss from 14 down to 7, but the high NaN skip rate left the weights numerically damaged, preventing valid inference.

Phase 3: Cloud Training with Proven Patterns

We moved to CUDA-based training on HuggingFace Jobs (L40S GPU), using the proven data collator pattern from Deep-unlearning/Finetune-Voxtral-ASR. Key technical decisions:

  • Used VoxtralProcessor.apply_transcription_request() for correct audio prompt construction with [AUDIO] tokens
  • Pinned datasets>=3.0,<4.0 to avoid torchcodec dependency issues
  • Used task_type="SEQ_2_SEQ_LM" in LoRA configuration (not CAUSAL_LM)
  • Set remove_unused_columns=False in TrainingArguments
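A sketch of the corresponding LoRA and trainer configuration, assuming standard peft/transformers APIs. The target_modules list is hypothetical, since the card does not name which projections carry the LoRA weights:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# v2 hyperparameters from the Training Configuration table.
# target_modules is an assumption, not documented in the card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    task_type="SEQ_2_SEQ_LM",  # not CAUSAL_LM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # hypothetical
)

training_args = TrainingArguments(
    output_dir="voxtral-creole-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch 8
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=5000,
    bf16=True,
    max_grad_norm=1.0,
    remove_unused_columns=False,     # keep audio columns for the collator
)
```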

v1 (LoRA r=16, encoder frozen, 2000 steps) achieved 11.8% WER on 500 samples. v2 (LoRA r=32, encoder unfrozen, 5000 steps, cosine schedule) improved to 9.7% WER.

Training Configuration

| Parameter | v1 | v2 (this model) |
|---|---|---|
| LoRA rank (r) | 16 | 32 |
| LoRA alpha | 32 | 64 |
| Trainable params | 14.7M (0.31%) | 29.5M (0.63%) |
| Encoder | Frozen | Unfrozen (LoRA) |
| Learning rate | 5e-5 (linear) | 3e-5 (cosine) |
| Steps | 2,000 | 5,000 |
| Epochs | ~2.5 | ~6.8 |
| Warmup steps | 50 | 100 |
| Batch size | 2 | 2 |
| Gradient accumulation | 4 | 4 |
| Effective batch | 8 | 8 |
| Precision | bf16 | bf16 |
| Max grad norm | 1.0 | 1.0 |
| Final training loss | 0.377 | 0.049 |
| WER (500 samples) | 11.8% | 9.7% |
| Training time | 38 min | 93 min |
| GPU | L40S 48GB | L40S 48GB |
| Cost | $1.14 | $2.70 |

Usage

Requirements

```bash
pip install transformers peft torch soundfile
```

Inference

```python
import torch
import soundfile as sf
from transformers import VoxtralForConditionalGeneration, VoxtralProcessor
from peft import PeftModel

MODEL_ID = "mistralai/Voxtral-Mini-3B-2507"
ADAPTER_ID = "chendren/voxtral-creole-lora-v3"

# Load base model + LoRA adapter
processor = VoxtralProcessor.from_pretrained(MODEL_ID)
model = VoxtralForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.config = model.base_model.model.config  # Required for generate()

# Load audio (16 kHz mono)
audio, sr = sf.read("path/to/creole_audio.wav")

# Transcribe. Voxtral does not list "ht" among its language codes;
# the request uses "en" and the fine-tuned adapter returns Creole text.
inputs = processor.apply_transcription_request(
    language="en",
    model_id=MODEL_ID,
    audio=[audio],
    format=["WAV"],
    return_tensors="pt",
)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v
          for k, v in inputs.items()}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

text = processor.batch_decode(output, skip_special_tokens=True)[0].strip()
print(text)
```

Important Notes

  • Use VoxtralProcessor, not AutoProcessor
  • Use apply_transcription_request() for both training and inference
  • Set model.config = model.base_model.model.config after loading the PEFT adapter — this is required for generate() to work
  • Audio must be 16kHz mono WAV
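A minimal sketch of preparing arbitrary input audio to meet that requirement, using only numpy linear interpolation; a proper polyphase resampler (e.g. scipy or torchaudio) is preferable for audio quality:

```python
import numpy as np

TARGET_SR = 16_000

def to_16k_mono(audio, sr):
    """Downmix to mono and resample to 16 kHz via linear interpolation.
    Illustrative only; use a real resampler for production quality."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                      # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(audio) * TARGET_SR / sr))
        old_t = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio).astype(np.float32)
    return audio, TARGET_SR
```

The returned array can be passed directly as the audio= argument in the inference snippet above.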

Limitations and Bias

  • Domain coverage: Trained primarily on biblical and educational text from the CMU Haitian Creole Speech dataset. Performance on other domains (news, casual conversation, medical, legal) has not been evaluated and may be lower.
  • Speaker diversity: The training set contains a limited number of speakers. The model may perform worse on voices significantly different from the training distribution.
  • Accent sensitivity: Optimized for native Haitian Creole pronunciation. Non-native speakers or speakers with strong regional accents may see higher error rates (confirmed by TTS round-trip test showing 55% WER on French-accented Creole vs 9.7% on native speech).
  • Code-switching: In bilingual audio, the model attempts to transcribe English portions as Creole, producing incorrect output for non-Creole segments.
  • Number handling: Numbers may appear inconsistently as digits or written words.
  • Punctuation: The model occasionally differs from ground truth in comma placement and word boundary segmentation, though semantic content is preserved.

Citation

@misc{hendren2026voxtralcreole,
  title={Voxtral 3B Haitian Creole LoRA Adapter},
  author={Chad Hendren},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/chendren/voxtral-creole-lora-v3}
}
