# Higgs-Audio-STT
A speech-to-text model combining a Whisper-Large-v3 encoder with a Qwen3-8B decoder (8.91B total parameters), fine-tuned with LoRA on diverse ASR benchmarks.
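LoRA fine-tuning freezes the pretrained weights and trains only a low-rank additive update to selected linear layers. A minimal sketch of the idea in PyTorch (the class name, rank, and scaling below are illustrative, not this model's actual configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (B A) x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter matrices are trained
        # A is small random, B is zero, so training starts from the base model exactly
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because `B` is initialized to zero, the adapted layer is numerically identical to the frozen base layer before any training step.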
```python
import importlib.util

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from transformers.utils import cached_file

# Load the model (custom modeling code in the repo requires trust_remote_code=True)
model = AutoModel.from_pretrained(
    "bosonai/higgs-audio-v3-8b-stt",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("bosonai/higgs-audio-v3-8b-stt")

# Load the transcribe() helper shipped in the model repo
spec = importlib.util.spec_from_file_location(
    "transcribe",
    cached_file(
        "bosonai/higgs-audio-v3-8b-stt",
        "transcribe.py",
        _raise_exceptions_for_connection_errors=False,
    ),
)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Transcribe audio (16 kHz mono float32 numpy array)
audio_np = np.random.randn(16000).astype(np.float32)  # replace with your audio
text = mod.transcribe(model, tokenizer, audio_np)
print(text)
```
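The snippet above uses random noise as a stand-in; `transcribe()` expects a 16 kHz mono float32 array. A minimal sketch of loading one from a WAV file using only the standard library (the helper name is my own; it assumes the file is already 16 kHz mono 16-bit PCM and does no resampling):

```python
import wave

import numpy as np

def load_wav_16k_mono(path: str) -> np.ndarray:
    """Read a 16 kHz mono 16-bit PCM WAV into a float32 array in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == 16000, "resample to 16 kHz first"
        assert wf.getnchannels() == 1, "downmix to mono first"
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    # Scale int16 samples to the float range the model expects
    return pcm.astype(np.float32) / 32768.0
```

For arbitrary input formats (MP3, stereo, other sample rates), a library such as `librosa` or `soundfile` with resampling would replace this helper.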
Requirements:

```
torch
transformers>=4.51.0
whisper  # for audio preprocessing (WhisperProcessor)
```
Word error rate (WER) on standard English ASR benchmarks:

| Dataset | WER |
|---|---|
| AMI | 6.23% |
| Earnings22 | 11.33% |
| GigaSpeech | 9.34% |
| LibriSpeech Clean | 1.24% |
| LibriSpeech Other | 2.34% |
| SPGISpeech | 3.14% |
| TED-LIUM | 3.14% |
| VoxPopuli | 5.63% |
| **Average** | **5.30%** |
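For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between the reference and hypothesis transcripts, divided by the number of reference words. A minimal sketch (function name is my own; production evaluation would also normalize casing and punctuation first):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming table for edit distance over words
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            # deletion, insertion, substitution/match
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
            prev = cur
    return d[len(h)] / len(r)
```

A perfect transcript scores 0.0; one wrong word out of three scores roughly 0.333 (33.3%), matching the percentage scale used in the table above.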