Deterministic SLERP (ZeroVoice) for Voxtral-mini-4b - 3.2 Billion voices?
Prototype is fully working, source: https://github.com/MushroomFleet/ZeroVoice-Voxtral-mini-4b
ZeroBytes = " The Universe springs forth - from Zero Bytes ! "
How It Works
ZeroVoice applies the ZeroBytes System position-is-seed procedural generation paradigm to text-to-speech voice synthesis:
- Z-axis selects the voice family: z<100 = English, z=100-199 = European, z>=200 = Asian/Arabic
- Position hash (xxHash64 of packed coordinates) selects a cross-family voice pair -- the primary voice (A) and a tint voice (B) are always from different language families
- Coherent noise on the X/Y plane derives a smooth SLERP blend weight (capped at 0.20) so nearby coordinates sound similar
- Row-wise SLERP blends the two voice embeddings on the 3072-dimensional hypersphere with magnitude preservation and norm calibration
The result is a voice that sounds like the primary preset with subtle characteristics from the secondary, producing natural-sounding variation without artifacts.
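The blend step above can be sketched in NumPy. This is a minimal illustration, not the repository's code: the function name `slerp_rows`, the `(rows, 3072)` array layout, and the choice to linearly interpolate each row's norm as the "magnitude preservation" step are all assumptions, and the norm calibration pass is not reproduced here.

```python
import numpy as np

MAX_BLEND = 0.20  # cap on the tint weight, as stated in the post


def slerp_rows(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Blend two (rows, dim) embeddings row by row with SLERP.

    Assumption: each row is treated as a point on its own hypersphere, and
    the output row norm is a linear mix of the two input row norms.
    """
    t = float(np.clip(t, 0.0, MAX_BLEND))

    # Split every row into a magnitude and a unit direction.
    norm_a = np.linalg.norm(a, axis=1, keepdims=True)
    norm_b = np.linalg.norm(b, axis=1, keepdims=True)
    unit_a = a / np.maximum(norm_a, eps)
    unit_b = b / np.maximum(norm_b, eps)

    # Angle between the two directions, one angle per row.
    cos_omega = np.clip(np.sum(unit_a * unit_b, axis=1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)

    # Rows that are (anti)parallel fall back to linear weights to avoid 0/0.
    degenerate = sin_omega < 1e-6
    safe_sin = np.where(degenerate, 1.0, sin_omega)
    weight_a = np.where(degenerate, 1.0 - t, np.sin((1.0 - t) * omega) / safe_sin)
    weight_b = np.where(degenerate, t, np.sin(t * omega) / safe_sin)

    direction = weight_a * unit_a + weight_b * unit_b
    direction /= np.maximum(np.linalg.norm(direction, axis=1, keepdims=True), eps)

    # Restore a magnitude so the blended rows keep a realistic scale.
    magnitude = (1.0 - t) * norm_a + t * norm_b
    return direction * magnitude
```

With `t = 0.0` the function returns the primary voice unchanged, and any weight above 0.20 is clipped, so the tint voice can only ever nudge the primary.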
Key Numbers
| Metric | Value |
|---|---|
| Base preset voices | 20 (9 languages, 10 male / 10 female) |
| Cross-family voice pairs | 156 (A and B always from different families) |
| Perceptually distinct voices per seed | ~3,276 |
| Total addressable voices (all seeds) | ~3.28 billion |
| Coordinate addresses (practical range) | 301 million+ |
| Additional voice data stored | 0 bytes |
| Audio output | 24 kHz WAV |
Full statistics and voice inventory: ZeroVoice-stats.md
ZeroBytes Law Compliance
| Law | Status |
|---|---|
| O(1) Access | Any voice computed directly from (x,y,z) -- no iteration |
| Parallelism | Each coordinate depends only on its own values, never neighbors |
| Coherence | Adjacent coordinates produce similar voices via multi-octave noise |
| Hierarchy | Z selects family, X/Y explore variation within |
| Determinism | Same inputs produce identical output on any machine |
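A rough sketch of how a coordinate could resolve to a voice while respecting these laws. Everything here is hypothetical: the placeholder voice names, the `seed` parameter, the noise scale, and the way the hash bits pick A and B are assumptions for illustration. To keep the example dependency-free, a 64-bit BLAKE2b digest from the standard library stands in for xxHash64, and simple value noise stands in for whatever coherent noise the project actually uses.

```python
import hashlib
import math
import struct

# Hypothetical placeholder inventory; the real presets are listed in the repo.
FAMILIES = {
    "english": ["en_a", "en_b", "en_c"],
    "european": ["eu_a", "eu_b", "eu_c"],
    "asian_arabic": ["aa_a", "aa_b"],
}
MAX_BLEND = 0.20  # cap on the tint weight, as stated in the post


def hash64(*values: int) -> int:
    """Stand-in for xxHash64: 64-bit BLAKE2b digest of packed signed ints."""
    packed = struct.pack(f"<{len(values)}q", *values)
    return int.from_bytes(hashlib.blake2b(packed, digest_size=8).digest(), "little")


def family_for_z(z: int) -> str:
    """Z bands from the post: z<100 English, 100-199 European, z>=200 Asian/Arabic."""
    if z < 100:
        return "english"
    if z < 200:
        return "european"
    return "asian_arabic"


def lattice(ix: int, iy: int, seed: int) -> float:
    """Pseudo-random value in [0, 1] at an integer lattice point."""
    return hash64(ix, iy, seed) / 2**64


def value_noise(x: float, y: float, seed: int) -> float:
    """Smoothly interpolated lattice noise; continuous in x and y."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    sx, sy = fx * fx * (3 - 2 * fx), fy * fy * (3 - 2 * fy)  # smoothstep
    top = lattice(x0, y0, seed) * (1 - sx) + lattice(x0 + 1, y0, seed) * sx
    bottom = lattice(x0, y0 + 1, seed) * (1 - sx) + lattice(x0 + 1, y0 + 1, seed) * sx
    return top * (1 - sy) + bottom * sy


def blend_weight(x: int, y: int, seed: int, scale: float = 0.05, octaves: int = 3) -> float:
    """Multi-octave noise mapped into [0, MAX_BLEND]."""
    total, norm, amplitude, frequency = 0.0, 0.0, 1.0, scale
    for octave in range(octaves):
        total += amplitude * value_noise(x * frequency, y * frequency, seed + octave)
        norm += amplitude
        amplitude *= 0.5
        frequency *= 2.0
    return MAX_BLEND * total / norm


def voice_at(x: int, y: int, z: int, seed: int = 0) -> tuple[str, str, float]:
    """O(1) lookup: depends only on (x, y, z, seed), never on neighbours."""
    family = family_for_z(z)
    h = hash64(x, y, z, seed)
    primaries = FAMILIES[family]
    tints = [v for name, voices in FAMILIES.items() if name != family for v in voices]
    primary = primaries[h % len(primaries)]   # low bits pick voice A
    tint = tints[(h >> 32) % len(tints)]      # high bits pick voice B
    return primary, tint, blend_weight(x, y, seed)
```

Calling `voice_at(10, 20, 5)` twice returns the same tuple, and stepping one unit along X changes the blend weight only slightly, which is the coherence property described in the table.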
I already built this system for KokoroTTS. It lets a 150 MB model give NPCs in video games unique, consistent voices by using the world spawn position (X, Y, Z) as the voice ID. Split into male and female, that offered 1 billion voices per gender. This was achieved by making the SLERP calculations over the NumPy voice data deterministic. I wanted to see whether the same approach could be applied to Voxtral, which stores its voices differently, as a prototype proof of concept.
This will never beat "fine tuning" or "cloning", but offers consistent voices at large scale, easily integrated into procedurally deterministic applications.
Thanks to the Team for the weights release !
3 billion voices is the technical total, and many of those will be strange blends. Even so, it offers millions of practically unique voices that are just as easy to reference, which makes it a good fit for gamedev and other development applications.
