Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b - 3.2 Billion voices?

#19
by mushroomfleet - opened

Prototype is fully working, source: https://github.com/MushroomFleet/ZeroVoice-Voxtral-mini-4b

ZeroBytes = " The Universe springs forth - from Zero Bytes ! "

How It Works

ZeroVoice applies the ZeroBytes System position-is-seed procedural generation paradigm to text-to-speech voice synthesis:

  1. Z-axis selects the voice family: z<100 = English, z=100-199 = European, z>=200 = Asian/Arabic
  2. Position hash (xxHash64 of packed coordinates) selects a cross-family voice pair -- the primary voice (A) and a tint voice (B) are always from different language families
  3. Coherent noise on the X/Y plane derives a smooth SLERP blend weight (capped at 0.20) so nearby coordinates sound similar
  4. Row-wise SLERP blends the two voice embeddings on the 3072-dimensional hypersphere with magnitude preservation and norm calibration
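
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the family contents and voice IDs are placeholders, BLAKE2b from the standard library stands in for xxHash64, and the packing layout, the way the seed is mixed in, and the assumption that the primary voice comes from the z-selected family are all guesses.

```python
import hashlib
import struct

# Placeholder family layout: these names and voice IDs are hypothetical,
# not the real 20 presets.
FAMILIES = {
    "english": ["en_a", "en_b", "en_c"],
    "european": ["eu_a", "eu_b", "eu_c"],
    "asian_arabic": ["as_a", "as_b"],
}


def family_for_z(z: int) -> str:
    """Step 1: the z-axis picks the voice family (thresholds from the post)."""
    if z < 100:
        return "english"
    if z < 200:
        return "european"
    return "asian_arabic"


def position_hash(x: int, y: int, z: int, seed: int = 0) -> int:
    """64-bit hash of the packed coordinates.

    BLAKE2b is a stand-in for xxHash64; the packing format is an assumption.
    """
    packed = struct.pack("<qqqq", seed, x, y, z)
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def voice_pair(x: int, y: int, z: int, seed: int = 0) -> tuple[str, str]:
    """Step 2: pick primary voice A and a tint voice B from a different family."""
    h = position_hash(x, y, z, seed)
    primary_family = family_for_z(z)
    primaries = FAMILIES[primary_family]
    # Every voice outside the primary family is a candidate tint.
    tints = [
        voice
        for name, voices in FAMILIES.items()
        if name != primary_family
        for voice in voices
    ]
    voice_a = primaries[h % len(primaries)]   # low bits choose A
    voice_b = tints[(h >> 32) % len(tints)]   # high bits choose B
    return voice_a, voice_b
```

Calling `voice_pair(10, 20, 5)` twice returns the same pair, since nothing but the coordinates and the seed feeds the hash.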

The result is a voice that sounds like the primary preset with subtle characteristics from the secondary, producing natural-sounding variation without artifacts.
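
A row-wise SLERP with magnitude preservation might look like the sketch below. It uses plain Python lists so it runs without dependencies; NumPy would be the natural choice for real 3072-dimensional rows. Restoring the primary voice's row norm is an assumption about what "magnitude preservation" means here, and the function names are illustrative.

```python
import math


def slerp(a, b, t):
    """Spherically interpolate from a toward b by t, keeping a's magnitude."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Degenerate row with no direction: fall back to linear interpolation.
        return [(1 - t) * p + t * q for p, q in zip(a, b)]

    # Project both rows onto the unit hypersphere.
    ua = [v / norm_a for v in a]
    ub = [v / norm_b for v in b]

    # Angle between the two directions, clamped against rounding error.
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(ua, ub))))
    omega = math.acos(dot)
    s = math.sin(omega)

    if s < 1e-6:
        # Nearly parallel or antiparallel: the arc is ill-defined,
        # so keep the primary direction.
        direction = ua
    else:
        wa = math.sin((1 - t) * omega) / s
        wb = math.sin(t * omega) / s
        direction = [wa * p + wb * q for p, q in zip(ua, ub)]

    # Assumption: restore the primary voice's magnitude after blending.
    return [norm_a * d for d in direction]


def slerp_rows(emb_a, emb_b, t):
    """Apply slerp independently to each pair of rows of two embeddings."""
    return [slerp(row_a, row_b, t) for row_a, row_b in zip(emb_a, emb_b)]
```

With a weight capped at 0.20, each blended row stays within a fifth of the arc from the primary voice, which is in line with the result sounding like the primary preset.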

Key Numbers

| Metric | Value |
|---|---|
| Base preset voices | 20 (9 languages, 10 male / 10 female) |
| Cross-family voice pairs | 156 (A and B always from different families) |
| Perceptually distinct voices per seed | ~3,276 |
| Total addressable voices (all seeds) | ~3.28 billion |
| Coordinate addresses (practical range) | 301 million+ |
| Additional voice data stored | 0 bytes |
| Audio output | 24 kHz WAV |

Full statistics and voice inventory: ZeroVoice-stats.md
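
The figures above can be cross-checked with simple arithmetic. The sketch below only divides the quoted numbers; the interpretation in the comments is an inference, not something stated in the post or checked against ZeroVoice-stats.md.

```python
pairs = 156            # cross-family voice pairs
per_seed = 3276        # perceptually distinct voices per seed
total = 3.28e9         # total addressable voices (all seeds)

# 3,276 / 156 = 21 variants per pair. That would match a blend weight
# stepped in 0.01 increments from 0.00 to 0.20 (an assumption).
variants_per_pair = per_seed // pairs

# 3.28 billion / 3,276 implies roughly one million seeds.
implied_seeds = total / per_seed

print(variants_per_pair, round(implied_seeds))
```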

ZeroBytes Law Compliance

| Law | Status |
|---|---|
| O(1) Access | Any voice computed directly from (x,y,z) -- no iteration |
| Parallelism | Each coordinate depends only on its own values, never neighbors |
| Coherence | Adjacent coordinates produce similar voices via multi-octave noise |
| Hierarchy | Z selects family, X/Y explore variation within |
| Determinism | Same inputs produce identical output on any machine |
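
The coherence and determinism properties can be illustrated with a small multi-octave value-noise function that maps (x, y) to a blend weight. This is a sketch under assumptions: value noise is swapped in for whatever noise the prototype actually uses, and the octave count, scale, and hashing scheme are illustrative. Only the 0.20 cap comes from the post.

```python
import hashlib
import math
import struct

MAX_BLEND = 0.20  # cap on the SLERP blend weight, as stated in the post


def lattice_value(ix: int, iy: int, seed: int) -> float:
    """Deterministic pseudo-random value in [0, 1] for an integer lattice point."""
    packed = struct.pack("<qqq", seed, ix, iy)
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "little") / 2**64


def smoothstep(t: float) -> float:
    """Ease curve so the noise has no visible creases at cell borders."""
    return t * t * (3.0 - 2.0 * t)


def value_noise(x: float, y: float, seed: int) -> float:
    """Bilinear blend of the four lattice values around (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    tx, ty = smoothstep(x - x0), smoothstep(y - y0)
    v00 = lattice_value(x0, y0, seed)
    v10 = lattice_value(x0 + 1, y0, seed)
    v01 = lattice_value(x0, y0 + 1, seed)
    v11 = lattice_value(x0 + 1, y0 + 1, seed)
    top = v00 + (v10 - v00) * tx
    bottom = v01 + (v11 - v01) * tx
    return top + (bottom - top) * ty


def blend_weight(x: float, y: float, seed: int = 0,
                 octaves: int = 3, scale: float = 0.05) -> float:
    """Sum several octaves of noise and map the result into [0, MAX_BLEND]."""
    total = 0.0
    amplitude, frequency, norm = 1.0, scale, 0.0
    for octave in range(octaves):
        # Each octave gets its own seed so the layers are uncorrelated.
        total += amplitude * value_noise(x * frequency, y * frequency, seed + octave)
        norm += amplitude
        amplitude *= 0.5
        frequency *= 2.0
    return MAX_BLEND * total / norm
```

Because the lattice values come from a hash rather than a stateful random generator, any coordinate can be evaluated on its own in constant time, and neighbouring coordinates such as (10, 10) and (11, 10) receive nearby weights.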

I already built this system for KokoroTTS. There, it lets a 150 MB model give NPCs in video games unique and consistent voices, using the world spawn position (X, Y, Z) as the voice ID. Split into male and female, this offered 1 billion voices for each gender. This was achieved by applying the determinism scope to the SLERP calculations on the NumPy voice data. I wanted to see whether the same approach could be applied to Voxtral, which stores its voices differently, as a prototype proof of concept.

This will never beat "fine tuning" or "cloning", but offers consistent voices at large scale, easily integrated into procedurally deterministic applications.


Thanks to the Team for the weights release !

3 billion voices is the technical total, but many will be strange blends. Still, it offers millions of practically unique voices that are just as easy to reference, so it is well suited to gamedev and other development applications.

mushroomfleet changed discussion title from Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b to Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b - 3.2 Billion voices?
mushroomfleet changed discussion status to closed
mushroomfleet changed discussion status to open
