SALT - Speech and Language Transformer

SALT is a multimodal model that extends pre-trained large language models (LLMs) with new audio tokens to handle both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. Our approach bridges the gap between the text and audio modalities by expanding the model's vocabulary with discrete audio representations produced by SpeechTokenizer.
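The vocabulary-expansion idea above can be sketched with a few lines of PyTorch. This is a minimal illustration, not SALT's actual code: the vocabulary and codebook sizes, hidden dimension, and initialisation are all assumed for the example; a SpeechTokenizer-style codebook is represented simply by a block of extra embedding rows appended after the text vocabulary.

```python
import torch
import torch.nn as nn

# Hypothetical sizes (illustrative, not SALT's real configuration):
# a 32k text vocabulary extended with 1,024 discrete audio codes.
TEXT_VOCAB = 32_000
AUDIO_CODES = 1_024
HIDDEN = 256

# Stand-in for the pre-trained LLM's text embedding table.
old_emb = nn.Embedding(TEXT_VOCAB, HIDDEN)

# Extended table: copy the pre-trained text rows, leave the new
# audio rows randomly initialised for fine-tuning.
new_emb = nn.Embedding(TEXT_VOCAB + AUDIO_CODES, HIDDEN)
with torch.no_grad():
    new_emb.weight[:TEXT_VOCAB] = old_emb.weight

# An audio code c then maps to token id TEXT_VOCAB + c.
audio_token_id = TEXT_VOCAB + 7
vec = new_emb(torch.tensor([audio_token_id]))
print(vec.shape)  # torch.Size([1, 256])
```

The same copy-then-extend step would apply to the LM head (the output projection), so that the model can both read and emit the new audio tokens.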

Unlike existing models that require full retraining or rely on adapters, SALT leverages pre-trained LLM knowledge while fine-tuning for speech-specific tasks.

After resolving training-stability challenges through precision adjustments (e.g., TF32), SALT performs stably and effectively on both TTS and ASR.
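The TF32 precision adjustment mentioned above is controlled by a pair of PyTorch switches. The sketch below shows the relevant knobs; the card does not say which values SALT actually used, so the settings here are illustrative only.

```python
import torch

# TF32 trades a few mantissa bits for speed on Ampere+ GPUs.
# Whether SALT enabled or disabled it is not stated here; these
# values are only an example of where the switch lives.
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

# Equivalent high-level knob (PyTorch >= 1.12):
# "highest" keeps full FP32, "high" permits TF32.
torch.set_float32_matmul_precision("high")
print(torch.backends.cuda.matmul.allow_tf32)  # True
```

Flipping these flags changes numerical behaviour of every float32 matmul, so precision-sensitive training runs are a common reason to adjust them.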

Model size: 1B params · Tensor type: F32 · Format: Safetensors

Model: Vikhrmodels/llama_asr_tts_35000