Model Card for cohere-transcribe-03-2026-taiwanese-hakka

This model is a fine-tuned version of CohereLabs/cohere-transcribe-03-2026 for Taiwanese Hakka automatic speech recognition with Hanzi output.

Training process

The model was trained with the following hyperparameters:

  • Hardware: 4x NVIDIA L40S
  • Per-device batch size: 4
  • Gradient accumulation steps: 32
  • Total training steps: 4795
  • Best checkpoint: step 2877
  • Learning rate: 2e-4
  • Warmup ratio: 0.02
  • Optimizer: adamw_torch_fused
  • LR scheduler type: linear
  • Decoder prompt language: zh
  • Max audio length: 35 seconds
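Given the four GPUs, the per-device batch size, and the gradient-accumulation steps above, the effective global batch size works out as:

```python
# Effective global batch size implied by the training settings:
# 4 GPUs x 4 samples per device x 32 gradient-accumulation steps.
num_gpus = 4
per_device_batch_size = 4
grad_accum_steps = 32

effective_batch_size = num_gpus * per_device_batch_size * grad_accum_steps
print(effective_batch_size)  # 512
```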

Training data

The model was trained on the following datasets:

  • formospeech/hat_asr_sixian_reading_clean
  • formospeech/hat_asr_sixian_broadcast_clean
  • formospeech/hat_asr_nansixian_reading_clean
  • formospeech/hat_asr_hailu_reading_clean
  • formospeech/hat_tts_hailu_clean
  • formospeech/hat_tts_sixian_clean
  • formospeech/fsr23_eval_clean
  • formospeech/fsr25_warmup_reading_clean
  • formospeech/fsr25_train_clean
  • formospeech/fsr25_final_clean
  • formospeech/fsr25_warmup_media_clean
  • formospeech/hakka_elearning_example_clean
  • formospeech/hakkatv_hanzawa_clean
  • formospeech/hakka_elearning_yt_clean
  • formospeech/hakkaradio_news_clean

Comparison with formospeech/whisper-large-v2-taiwanese-hakka-v1

| Model                                                 | Hailu CER | Hailu Norm CER | Sixian CER | Sixian Norm CER | Speed (RTFx) |
|-------------------------------------------------------|-----------|----------------|------------|-----------------|--------------|
| formospeech/whisper-large-v2-taiwanese-hakka-v1       | 7.21      | 3.29           | 8.69       | 4.88            | 144.45       |
| formospeech/cohere-transcribe-03-2026-taiwanese-hakka | 10.80     | 3.99           | 13.26      | 5.58            | 524.88       |
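CER in the table is character error rate: the Levenshtein edit distance between hypothesis and reference, divided by the reference length. A self-contained sketch of the metric (a plain dynamic-programming implementation, not necessarily the exact scorer used for these numbers):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

The "Norm CER" columns apply text normalization before scoring, which is why they are consistently lower.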

The speed numbers above are provisional values taken from the Open ASR Leaderboard; they will be replaced with direct vLLM measurements.
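RTFx is the inverse real-time factor: seconds of audio transcribed per second of wall-clock time, so higher is faster. For example:

```python
# An RTFx of 524.88 means the model processes 524.88 seconds of audio
# per second of wall-clock time, so a 60-second clip takes about:
rtfx = 524.88
clip_seconds = 60.0
wall_clock = clip_seconds / rtfx
print(f"{wall_clock:.3f} s")  # prints "0.114 s"
```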

Usage

This model follows the same inference interface as Cohere Transcribe in transformers.

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "formospeech/cohere-transcribe-03-2026-taiwanese-hakka"

processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

audio = load_audio("path/to/audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="zh")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
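Since the model was fine-tuned with a 35-second maximum audio length, longer recordings may need to be split before transcription. A minimal sketch, assuming naive fixed-length chunking (the `chunk_audio` helper is illustrative, not part of the release; a production pipeline would cut at silences instead):

```python
def chunk_audio(audio, sampling_rate=16000, max_seconds=35.0):
    """Split a waveform (any indexable sequence of samples) into
    segments no longer than the model's maximum audio length."""
    max_samples = int(max_seconds * sampling_rate)
    return [audio[i:i + max_samples] for i in range(0, len(audio), max_samples)]
```

Each chunk can then be passed through the processor and `model.generate` as in the snippet above, and the partial transcripts concatenated.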

Notes

  • This release contains inference files only. Optimizer states and trainer checkpoints are intentionally excluded.
  • The tokenizer and processor format follow the upstream Cohere Transcribe release.