SALT - Speech and Language Transformer

SALT is a multimodal model that extends pre-trained large language models (LLMs) with new audio tokens to handle both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. Our approach bridges the gap between the text and audio modalities by expanding the model's vocabulary with discrete audio representations produced by SpeechTokenizer.
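The vocabulary-expansion idea above can be sketched with a few lines of PyTorch. This is a minimal illustration, not SALT's actual code: the vocabulary and codebook sizes, hidden dimension, and initialisation are all assumed for the example; a SpeechTokenizer-style codebook is represented simply by a block of extra embedding rows appended after the text vocabulary.

```python
import torch
import torch.nn as nn

# Hypothetical sizes (illustrative, not SALT's real configuration):
# a 32k text vocabulary extended with 1,024 discrete audio codes.
TEXT_VOCAB = 32_000
AUDIO_CODES = 1_024
HIDDEN = 256

# Stand-in for the pre-trained LLM's text embedding table.
old_emb = nn.Embedding(TEXT_VOCAB, HIDDEN)

# Extended table: copy the pre-trained text rows, leave the new
# audio rows randomly initialised for fine-tuning.
new_emb = nn.Embedding(TEXT_VOCAB + AUDIO_CODES, HIDDEN)
with torch.no_grad():
    new_emb.weight[:TEXT_VOCAB] = old_emb.weight

# An audio code c then maps to token id TEXT_VOCAB + c.
audio_token_id = TEXT_VOCAB + 7
vec = new_emb(torch.tensor([audio_token_id]))
print(vec.shape)  # torch.Size([1, 256])
```

The same copy-then-extend step would apply to the LM head (the output projection), so that the model can both read and emit the new audio tokens.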

Unlike existing models that require full retraining or rely on adapters, SALT leverages pre-trained LLM knowledge while fine-tuning for speech-specific tasks.

After resolving training-stability challenges through precision adjustments (e.g., TF32), SALT performs stably and effectively on both TTS and ASR.
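The TF32 precision adjustment mentioned above is controlled by a pair of PyTorch switches. The sketch below shows the relevant knobs; the card does not say which values SALT actually used, so the settings here are illustrative only.

```python
import torch

# TF32 trades a few mantissa bits for speed on Ampere+ GPUs.
# Whether SALT enabled or disabled it is not stated here; these
# values are only an example of where the switch lives.
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

# Equivalent high-level knob (PyTorch >= 1.12):
# "highest" keeps full FP32, "high" permits TF32.
torch.set_float32_matmul_precision("high")
print(torch.backends.cuda.matmul.allow_tf32)  # True
```

Flipping these flags changes numerical behaviour of every float32 matmul, so precision-sensitive training runs are a common reason to adjust them.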

Model size: 1B params · Tensor type: F32 · Format: Safetensors

Model: Vikhrmodels/llama_asr_tts_35000