SouraTTS v1

A lightweight, expressive, CPU-friendly Text-to-Speech engine built on top of Pocket-TTS by Kyutai, with emotional expressiveness powered by an EmoShift-inspired activation steering layer.

6 emotions. 8 built-in voices. 27KB of trained weights. Runs on CPU.

Demo

from emotts import EmoTTS

tts = EmoTTS(
    weights="SouraTTS.pt",
    meta="SouraTTS.json"
)

tts.synthesize(
    text      = "I just got the job! I cannot believe it!",
    voice     = "alba",
    emotion   = "happy",
    intensity = 1.0,
    output    = "out.wav"
)

Installation

pip install pocket-tts scipy torch

Then download the three files from this repo:

  • SouraTTS.pt
  • SouraTTS.json
  • emotts.py

Place all three in the same directory and run the demo above.

Supported Emotions

Emotion   Recommended Intensity
neutral   0.0
happy     0.8 – 1.0
sad       0.8 – 1.0
angry     0.8 – 1.0
fear      0.8 – 1.0
disgust   0.8 – 1.0

Intensities above 1.2 may cause generation instability on some voice and emotion combinations.
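The guidance above can be wrapped in a small guard before calling synthesize. Note that safe_intensity is a hypothetical helper, not part of the emotts API; it assumes out-of-range values should be clamped rather than rejected.

```python
import warnings

SAFE_MAX = 1.2  # above this, some voice/emotion pairs can become unstable

def safe_intensity(value, emotion):
    """Clamp a requested intensity into the recommended stable range."""
    if emotion == "neutral":
        return 0.0  # neutral is the unsteered baseline
    if value > SAFE_MAX:
        warnings.warn(f"intensity {value} > {SAFE_MAX}; clamping")
        return SAFE_MAX
    return max(0.0, value)

safe_intensity(1.5, "happy")  # returns 1.2 (with a warning)
```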

Built-in Voices

alba, marius, cosette, jean, fantine, eponine, azelma, javert

Alba is the recommended default: highest quality and most stable across all emotions.

How It Works

SouraTTS combines two ideas:

Pocket-TTS is a 100M parameter streaming TTS model by Kyutai, designed for CPU inference with fast first-audio latency and built-in voice cloning support.

EmoShift is an activation steering technique inspired by the EmoShift paper. Instead of fine-tuning the entire model, we learn one small steering vector per emotion (a 1024-dimensional float vector) and inject it into the output of transformer layer 5 during inference. The entire emotion control system is 27KB, less than 1/30th of the base model size.

Architecture

Text input ──→ [ Pocket-TTS frozen (100M params) ] ──→ Audio
                         ↑
               Layer 5 output + (intensity × steering_vector[emotion])
                         ↑
               [ EmoShift Layer (27KB, 6 × 1024 params) ]
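The injection step in the diagram can be sketched with a PyTorch forward hook. This is an illustrative stand-in, not the project's actual code: the Linear layer plays the role of Pocket-TTS transformer layer 5, and the random steering vector is a placeholder for the trained weights.

```python
import torch

HIDDEN = 1024
steering = {"happy": torch.randn(HIDDEN)}  # placeholder, not the trained vectors

def make_emotion_hook(emotion, intensity):
    """Forward hook that adds intensity * steering_vector[emotion] to a layer's output."""
    vec = steering[emotion]
    def hook(module, inputs, output):
        return output + intensity * vec  # returned value replaces the layer output
    return hook

layer5 = torch.nn.Linear(HIDDEN, HIDDEN)  # stand-in for transformer layer 5
handle = layer5.register_forward_hook(make_emotion_hook("happy", 1.0))

x = torch.randn(2, HIDDEN)
steered = layer5(x)   # layer output shifted toward "happy"
handle.remove()       # removing the hook restores neutral synthesis
neutral = layer5(x)
```

Because the base model stays frozen, switching emotions is just swapping which vector the hook adds; intensity 0.0 reduces to the unsteered model.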

Training

The steering vectors were trained on CREMA-D, a dataset of 7,442 emotional speech clips from 91 actors. We used a gender-balanced subset (100 male + 100 female samples per emotion) to ensure consistent quality across voice types.
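The gender-balanced subsetting can be sketched as follows. The "emotion" and "gender" metadata keys are assumptions about how the clips might be labeled, not the project's actual preprocessing code.

```python
import random
from collections import defaultdict

def balanced_subset(clips, per_group=100, seed=0):
    """Pick per_group clips for every (emotion, gender) pair.

    clips: list of dicts with (assumed) keys "emotion" and "gender".
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for clip in clips:
        groups[(clip["emotion"], clip["gender"])].append(clip)
    subset = []
    for key in sorted(groups):
        members = groups[key]
        subset.extend(rng.sample(members, min(per_group, len(members))))
    return subset
```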

Training objective:

  • Direction loss: steer hidden states consistently toward each emotion direction
  • Magnitude loss: prevent vectors from collapsing to zero
  • Variance loss: encourage emotion-specific activation patterns
  • Orthogonality loss: keep emotion vectors pointing in distinct directions

Pocket-TTS weights were fully frozen throughout. Only the 6 steering vectors were trained, using the Adam optimizer for 5 epochs on a Kaggle T4 GPU.
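A sketch of how those four objectives might combine. Only the four loss names come from the list above; the exact formulations, margins, and the per-emotion target directions (emotion_dirs, e.g. mean activation differences computed from CREMA-D) are assumptions.

```python
import torch
import torch.nn.functional as F

def steering_loss(vectors, emotion_dirs, min_norm=1.0, min_var=0.1):
    """Combine the four training objectives for the (6, 1024) steering vectors."""
    # Direction loss: align each vector with its emotion's target direction.
    direction = (1 - F.cosine_similarity(vectors, emotion_dirs, dim=1)).mean()
    # Magnitude loss: penalize vectors whose norm collapses below min_norm.
    magnitude = F.relu(min_norm - vectors.norm(dim=1)).mean()
    # Variance loss: push each vector toward a distinctive activation pattern.
    variance = F.relu(min_var - vectors.var(dim=1)).mean()
    # Orthogonality loss: penalize overlap between different emotion vectors.
    unit = F.normalize(vectors, dim=1)
    overlap = unit @ unit.T - torch.eye(len(vectors))
    orthogonality = overlap.pow(2).sum() / (len(vectors) * (len(vectors) - 1))
    return direction + magnitude + variance + orthogonality
```

Since the base model is frozen, only the 6 × 1024 vector entries receive gradients, which is why the trained artifact stays at 27KB.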

Inference Speed

Since Pocket-TTS is optimized for CPU inference, SouraTTS inherits those characteristics: fast first-audio latency with no GPU required. The EmoShift layer adds negligible overhead, a single 1024-element float32 vector addition at layer 5 per forward pass.

Known Limitations

  • The final word of longer sentences may occasionally be slightly truncated, a known characteristic of autoregressive TTS generation
  • Intensities above 1.2 may cause audio instability on some voice/emotion combinations
  • Voice cloning (custom voice upload) requires accepting the Kyutai Pocket-TTS terms; planned for SouraTTS v2
  • Non-alba voices show slightly reduced emotion stability, particularly on female voices at high intensities; improvements planned for v2 with expanded training data

Roadmap

  • v2: Voice cloning support, expanded training data for improved female voice quality, additional emotions
  • v3: Real-time streaming API, Gradio demo Space

Credits

  • Pocket-TTS by Kyutai: base TTS model, MIT licensed
  • EmoShift: activation steering technique for emotional expressiveness
  • CREMA-D: training dataset, Open Database License
  • Built by @Sourajit123

License

MIT, same as Pocket-TTS.

Please use responsibly. Do not use this model to clone voices without explicit consent from the speaker, or to generate content that misrepresents real individuals.

Citation

If you use SouraTTS in your work, please cite:

@misc{souratts2026,
  author    = {Sourajit123},
  title     = {SouraTTS v1: Expressive CPU TTS with EmoShift Activation Steering},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Sourajit123/SouraTTS}
}