Model Card for cohere-transcribe-03-2026-taiwanese-hakka

This model is a fine-tuned version of CohereLabs/cohere-transcribe-03-2026 for Taiwanese Hakka automatic speech recognition with Hanzi output.

Training process

The model was trained with the following hyperparameters:

  • Hardware: 4x NVIDIA L40S
  • Per-device batch size: 4
  • Gradient accumulation steps: 32
  • Total training steps: 4795
  • Best checkpoint: step 2877
  • Learning rate: 2e-4
  • Warmup ratio: 0.02
  • Optimizer: adamw_torch_fused
  • LR scheduler type: linear
  • Decoder prompt language: zh
  • Max audio length: 35 seconds
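Given the four GPUs, the per-device batch size, and the gradient-accumulation steps above, the effective global batch size works out as:

```python
# Effective global batch size implied by the training settings:
# 4 GPUs x 4 samples per device x 32 gradient-accumulation steps.
num_gpus = 4
per_device_batch_size = 4
grad_accum_steps = 32

effective_batch_size = num_gpus * per_device_batch_size * grad_accum_steps
print(effective_batch_size)  # 512
```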

Training data

The model was trained on the following datasets:

  • formospeech/hat_asr_sixian_reading_clean
  • formospeech/hat_asr_sixian_broadcast_clean
  • formospeech/hat_asr_nansixian_reading_clean
  • formospeech/hat_asr_hailu_reading_clean
  • formospeech/hat_tts_hailu_clean
  • formospeech/hat_tts_sixian_clean
  • formospeech/fsr23_eval_clean
  • formospeech/fsr25_warmup_reading_clean
  • formospeech/fsr25_train_clean
  • formospeech/fsr25_final_clean
  • formospeech/fsr25_warmup_media_clean
  • formospeech/hakka_elearning_example_clean
  • formospeech/hakkatv_hanzawa_clean
  • formospeech/hakka_elearning_yt_clean
  • formospeech/hakkaradio_news_clean

Comparison with formospeech/whisper-large-v2-taiwanese-hakka-v1

| Model                                                 | Hailu CER | Hailu Norm CER | Sixian CER | Sixian Norm CER | Speed (RTFx) |
|-------------------------------------------------------|-----------|----------------|------------|-----------------|--------------|
| formospeech/whisper-large-v2-taiwanese-hakka-v1       | 7.21      | 3.29           | 8.69       | 4.88            | 144.45       |
| formospeech/cohere-transcribe-03-2026-taiwanese-hakka | 10.80     | 3.99           | 13.26      | 5.58            | 524.88       |
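CER in the table is character error rate: the Levenshtein edit distance between hypothesis and reference, divided by the reference length. A self-contained sketch of the metric (a plain dynamic-programming implementation, not necessarily the exact scorer used for these numbers):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

The "Norm CER" columns apply text normalization before scoring, which is why they are consistently lower.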

The speed numbers above are provisional values taken from the Open ASR Leaderboard; they will be replaced with direct vLLM measurements.
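RTFx is the inverse real-time factor: seconds of audio transcribed per second of wall-clock time, so higher is faster. For example:

```python
# An RTFx of 524.88 means the model processes 524.88 seconds of audio
# per second of wall-clock time, so a 60-second clip takes about:
rtfx = 524.88
clip_seconds = 60.0
wall_clock = clip_seconds / rtfx
print(f"{wall_clock:.3f} s")  # prints "0.114 s"
```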

Usage

This model follows the same inference interface as Cohere Transcribe in transformers.

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

model_id = "formospeech/cohere-transcribe-03-2026-taiwanese-hakka"

processor = AutoProcessor.from_pretrained(model_id)
model = CohereAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

audio = load_audio("path/to/audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="zh")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
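Since the model was fine-tuned with a 35-second maximum audio length, longer recordings may need to be split before transcription. A minimal sketch, assuming naive fixed-length chunking (the `chunk_audio` helper is illustrative, not part of the release; a production pipeline would cut at silences instead):

```python
def chunk_audio(audio, sampling_rate=16000, max_seconds=35.0):
    """Split a waveform (any indexable sequence of samples) into
    segments no longer than the model's maximum audio length."""
    max_samples = int(max_seconds * sampling_rate)
    return [audio[i:i + max_samples] for i in range(0, len(audio), max_samples)]
```

Each chunk can then be passed through the processor and `model.generate` as in the snippet above, and the partial transcripts concatenated.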

Notes

  • This release contains inference files only. Optimizer states and trainer checkpoints are intentionally excluded.
  • The tokenizer and processor format follow the upstream Cohere Transcribe release.