SALT - Speech and Language Transformer
SALT is a multimodal model that extends pre-trained large language models (LLMs) with new audio tokens to handle both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. Our approach bridges the gap between the text and audio modalities by expanding the model's vocabulary with discrete audio tokens produced by SpeechTokenizer.
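The vocabulary-expansion idea can be sketched in a few lines. This is a minimal illustration, not SALT's actual code: the token naming scheme (`<aud_k>`) and the codebook size are assumptions for demonstration, and a real implementation would also resize the LLM's embedding matrix accordingly.

```python
# Hedged sketch: appending discrete audio tokens after an LLM's text vocabulary.
# Token names and codebook size below are illustrative assumptions.

AUDIO_CODEBOOK_SIZE = 1024  # assumed size of one SpeechTokenizer codebook


def extend_vocab(base_vocab: dict, codebook_size: int) -> dict:
    """Return a copy of the vocabulary with one new token per audio code."""
    vocab = dict(base_vocab)
    offset = len(vocab)
    for k in range(codebook_size):
        vocab[f"<aud_{k}>"] = offset + k
    return vocab


def audio_codes_to_ids(codes: list, base_size: int) -> list:
    """Map raw SpeechTokenizer codes to the ids of the appended audio tokens."""
    return [base_size + c for c in codes]


base = {"<s>": 0, "</s>": 1, "the": 2}  # toy stand-in for the LLM's text vocab
vocab = extend_vocab(base, AUDIO_CODEBOOK_SIZE)
ids = audio_codes_to_ids([5, 0, 7], len(base))
```

Because the audio ids simply continue where the text vocabulary ends, text and audio tokens can be interleaved in a single sequence, which is what lets one decoder handle both TTS and ASR.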
Unlike existing models that require full retraining or rely on adapters, SALT leverages pre-trained LLM knowledge while fine-tuning for speech-specific tasks.
After resolving training instabilities through precision adjustments (e.g., enabling TF32), SALT exhibits stable and effective performance across both TTS and ASR tasks.
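One common way to apply such a precision adjustment is via PyTorch's TF32 flags; this is a hedged config sketch of that general technique, not SALT's published training configuration.

```python
import torch

# Hedged sketch: allow TF32 math on Ampere+ GPUs. These are standard PyTorch
# flags; whether SALT used exactly this setting is an assumption.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

TF32 keeps FP32's dynamic range while reducing mantissa precision, which often stabilizes and speeds up training relative to pure FP16 without loss scaling.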
Model tree for Vikhrmodels/llama_asr_tts_35000
- Base model: TinyLlama/TinyLlama_v1.1