QORA-TTS 0.6B - Pure Rust Text-to-Speech

Pure Rust TTS engine with 9 built-in speakers. No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable text-to-speech that runs on any machine.

Smart system awareness: the engine detects your hardware (RAM, CPU threads) at startup and adjusts generation limits so TTS runs well even on constrained systems. 9 built-in voices work out of the box, with no reference audio needed. 10 languages supported.

Based on Qwen3-TTS-12Hz-0.6B-CustomVoice (Apache 2.0).

License

This project is licensed under Apache 2.0. The base model Qwen3-TTS-12Hz-0.6B-CustomVoice is released by the Qwen team under Apache 2.0.

What It Does

QORA-TTS 0.6B converts text to natural-sounding speech. Highlights:

  • 9 built-in voices - male and female speakers embedded in the model, no reference audio needed
  • 10 languages - English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian
  • 24 kHz output - high-quality mono WAV audio
  • Fast generation - lighter model for quicker speech synthesis
  • Controllable generation - adjust length, temperature, and sampling parameters

Quick Start

  1. Download all files into the same folder
  2. Run:
# Use built-in speaker
qora-tts.exe --speaker ryan --language english --text "Hello, how are you?"

# Different speaker
qora-tts.exe --speaker serena --language chinese --text "δ½ ε₯½δΈ–η•Œ"

# Japanese speaker
qora-tts.exe --speaker ono_anna --language japanese --text "こんにけは"

# Control length and output
qora-tts.exe --speaker aiden --language english --text "Good morning!" --max-codes 200 --output greeting.wav

# Reproducible output with seed
qora-tts.exe --speaker ryan --text "Same every time" --seed 42

Files

qora-tts.exe           4.1 MB   Inference engine
model.qora-tts         971 MB   Q4 weights (talker + predictor + decoder)
config.json            4.8 KB   Model configuration
tokenizer.json          11 MB   Tokenizer (151,936 vocab)
vocab.json             2.7 MB   Vocabulary
merges.txt             1.6 MB   BPE merges
tokenizer_config.json  7.2 KB   Tokenizer config

No safetensors needed. Everything loads from model.qora-tts. The exe auto-finds all files in its own directory.
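The "auto-find" behavior can be mimicked in a few lines. This is only a sketch of the idea (the engine's actual lookup code is not published here): resolve each asset relative to the executable's own directory.

```rust
use std::env;
use std::path::PathBuf;

/// Resolve a model file relative to the running executable's directory.
/// Sketch of the "auto-find" behavior: all assets live next to the binary.
fn resolve_asset(name: &str) -> Option<PathBuf> {
    let exe = env::current_exe().ok()?;        // path of the running binary
    let dir = exe.parent()?.to_path_buf();      // its containing directory
    Some(dir.join(name))                        // e.g. <dir>/config.json
}
```

With this scheme, moving the folder anywhere keeps the binary and its weights together.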

Architecture

Component        Details
Parameters       0.6B total
Talker           28 layers, hidden=1024, 16/8 GQA heads, SwiGLU FFN 3072
Code Predictor   5 layers, hidden=1024, 16 code groups
Speech Decoder   8-layer transformer + Vocos vocoder, 16 VQ codebooks
Quantization     Q4 (4-bit symmetric, group_size=32) with LUT-optimized dequantization
Sample Rate      24 kHz mono WAV
Code Rate        12.5 Hz (1 code = 80 ms of audio)
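The 12.5 Hz code rate makes duration math simple: each code covers 80 ms, so n codes yield n / 12.5 seconds of audio. A small helper (function names are illustrative, not part of the engine):

```rust
/// Milliseconds of audio produced by `n` speech codes at the 12.5 Hz code rate.
fn codes_to_ms(n: u32) -> u32 {
    n * 80 // 1 code = 1 / 12.5 s = 80 ms
}

/// Codes needed to cover `ms` milliseconds of audio (rounded up).
fn ms_to_codes(ms: u32) -> u32 {
    (ms + 79) / 80
}
```

For example, the default --max-codes 500 caps output at 500 / 12.5 = 40 seconds.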

How It Works

  1. Text encoding - tokenize input text with 151K BPE vocabulary
  2. Speaker selection - load built-in speaker embedding by name
  3. Code generation - 28-layer transformer (Talker) generates speech codes autoregressively
  4. Code expansion - 5-layer Code Predictor expands code0 into 16 codebooks (codes 0-15)
  5. Audio synthesis - VQ decoder + Vocos vocoder converts codes to 24 kHz waveform
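The steps above map onto a simple data flow. The sketch below uses stub stages and dummy values (all names hypothetical, not the engine's real API) just to show the shapes: code0 timesteps fan out into 16-codebook frames, and each frame becomes 80 ms of 24 kHz samples.

```rust
/// One timestep of the first codebook (code0), produced by the Talker.
type Code0 = u16;
/// One timestep expanded to all 16 codebooks by the Code Predictor.
type CodeFrame = [u16; 16];

const NUM_CODEBOOKS: usize = 16;
const SAMPLES_PER_FRAME: usize = 1920; // 80 ms at 24 kHz

/// Stub Talker: autoregressively emit up to `max_codes` code0 tokens.
fn talker_generate(max_codes: usize) -> Vec<Code0> {
    (0..max_codes as u16).collect()
}

/// Stub Code Predictor: expand each code0 into a full 16-codebook frame.
fn expand(code0: &[Code0]) -> Vec<CodeFrame> {
    code0.iter().map(|&c| [c; NUM_CODEBOOKS]).collect()
}

/// Stub decoder: each frame becomes 80 ms of 24 kHz samples (silence here).
fn decode(frames: &[CodeFrame]) -> Vec<f32> {
    vec![0.0; frames.len() * SAMPLES_PER_FRAME]
}
```

The real stages are transformers, but the tensor shapes flow exactly this way: N codes in, N frames of 16 codes out, N × 1920 samples decoded.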

Smart System Awareness

QORA-TTS detects your system at startup and automatically adjusts generation limits:

Example startup banner:

    QORA-TTS - Pure Rust Text-to-Speech Engine
    System: 16384 MB RAM (9856 MB free), 12 threads

Available RAM   Max Codes   Default   Audio Length
< 4 GB          200         100       ~8s
4-8 GB          500         300       ~20s
8-12 GB         1000        500       ~40s
>= 12 GB        2000        500       ~80s

Hard caps apply even to explicit user values: if you pass --max-codes 2000 on a system with 6 GB of free RAM, it is clamped to 500 automatically. This prevents the model from running too long on weak systems.
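The clamping rule can be expressed directly from the table above; a sketch with illustrative function names (thresholds copied from the table):

```rust
/// Hard cap on --max-codes as a function of free RAM, per the table above.
fn max_codes_cap(free_ram_gb: f64) -> u32 {
    if free_ram_gb < 4.0 {
        200
    } else if free_ram_gb < 8.0 {
        500
    } else if free_ram_gb < 12.0 {
        1000
    } else {
        2000
    }
}

/// A user-supplied --max-codes value is clamped to the system cap.
fn effective_max_codes(requested: u32, free_ram_gb: f64) -> u32 {
    requested.min(max_codes_cap(free_ram_gb))
}
```

So --max-codes 2000 with 6 GB free resolves to 500, exactly the clamping described above.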

CLI Arguments

Flag                Default                        Description
--text <text>       "Hello, how are you today?"    Text to synthesize
--speaker <name>    ryan                           Built-in speaker name
--language <name>   english                        Target language
--output <path>     output.wav                     Output WAV path
--max-codes <n>     500                            Max code timesteps (~n/12.5 seconds)
--temperature <f>   0.8                            Sampling temperature
--top-k <n>         50                             Top-K sampling
--seed <n>          random                         Random seed for reproducibility
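As a sketch, the defaults above could live in a plain struct (illustrative only; field and type names are not the binary's real internals):

```rust
/// CLI defaults from the table above (sketch; names are illustrative).
struct TtsArgs {
    text: String,
    speaker: String,
    language: String,
    output: String,
    max_codes: u32,
    temperature: f32,
    top_k: u32,
    seed: Option<u64>, // None = random seed
}

impl Default for TtsArgs {
    fn default() -> Self {
        TtsArgs {
            text: "Hello, how are you today?".into(),
            speaker: "ryan".into(),
            language: "english".into(),
            output: "output.wav".into(),
            max_codes: 500,
            temperature: 0.8,
            top_k: 50,
            seed: None,
        }
    }
}
```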

Built-in Speakers

Speaker    Language         Description
ryan       English          Dynamic male voice
aiden      English          Sunny American male
serena     Chinese          Warm, gentle female
vivian     Chinese          Bright young female
uncle_fu   Chinese          Seasoned male
dylan      Beijing dialect  Youthful male
eric       Sichuan dialect  Lively male
ono_anna   Japanese         Playful female
sohee      Korean           Warm female
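Validating a --speaker value against this table is a simple membership check; a sketch (the engine's real validation is not published here):

```rust
/// Built-in speaker names from the table above.
const SPEAKERS: [&str; 9] = [
    "ryan", "aiden", "serena", "vivian", "uncle_fu",
    "dylan", "eric", "ono_anna", "sohee",
];

/// Case-insensitive lookup of a built-in speaker name.
fn find_speaker(name: &str) -> Option<&'static str> {
    let lower = name.to_ascii_lowercase();
    SPEAKERS.iter().copied().find(|s| *s == lower.as_str())
}
```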

Supported Languages

Language     Flag Value
English      english
Chinese      chinese
German       german
Italian      italian
Portuguese   portuguese
Spanish      spanish
Japanese     japanese
Korean       korean
French       french
Russian      russian

Performance

Tested on i5-11500 (6C/12T), 16GB RAM, CPU-only:

Phase            Time          Notes
Model Load       ~0.6s         From binary, 971 MB
Prefill          ~2-5s         Text + speaker embedding processing
Code Generation  ~1.5s/code    Autoregressive, 12.5 codes/sec of audio
Code Expansion   ~0.1s         5-layer predictor, 16 codebooks
Audio Decode     ~0.5s/frame   VQ + Vocos vocoder
RAM Usage        ~970 MB       Q4 model in memory

Example: "Hello, how are you?" (~3 seconds of audio) takes ~10-15 seconds total.

Comparison with 1.7B

                    QORA-TTS 0.6B       QORA-TTS 1.7B
Parameters          0.6B                1.7B
Model size          971 MB              1559 MB
Voice cloning       No                  Yes (ECAPA-TDNN)
Built-in speakers   9 (embedded)        25 (via voice files)
Code generation     ~1.5s/code          ~2.5s/code
Quality             Good                Higher
Best for            Speed + simplicity  Quality + cloning

Built With

  • Language: Pure Rust (2024 edition)
  • Dependencies: half (f16), rayon (parallelism), tokenizers (HuggingFace tokenizer), memmap2 (mmap for converter), serde_json (config parsing)
  • No ML framework for inference - all matrix ops are hand-written Rust
  • Burn framework used only as a build dependency (for binary format types)
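Since the Architecture table names the scheme (Q4, 4-bit symmetric, group_size=32), each group of 32 weights shares one f32 scale, with nibble values in [-8, 7]. A hedged sketch of such a dequantizer (not the engine's actual code; the LUT optimization and nibble unpacking are omitted):

```rust
const GROUP_SIZE: usize = 32;

/// Dequantize 4-bit symmetric weights: one f32 scale per 32-value group,
/// each quantized value already unpacked to a signed integer in [-8, 7].
fn dequant_q4(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .iter()
        .enumerate()
        .map(|(i, &q)| q as f32 * scales[i / GROUP_SIZE])
        .collect()
}
```

Symmetric quantization needs no zero-point per group, which keeps the inner loop to a single multiply per weight.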

Model Binary Format (.qora-tts)

Custom binary format for fast loading:

Header:  "QTTS" magic + version + format byte
Talker:  28 transformer layers (Q4 quantized)
Predictor: 5 transformer layers + code embeddings
Decoder: VQ codebooks + 8 transformer layers + Vocos vocoder
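A minimal sketch of reading that header; beyond the 4-byte "QTTS" magic stated above, the one-byte widths for version and format are assumptions:

```rust
/// Parsed .qora-tts header (sketch; version/format widths are assumptions).
struct QttsHeader {
    version: u8,
    format: u8,
}

/// Check the "QTTS" magic, then pull out the version and format bytes.
fn parse_header(bytes: &[u8]) -> Option<QttsHeader> {
    if bytes.len() < 6 || &bytes[0..4] != b"QTTS" {
        return None; // too short or wrong magic
    }
    Some(QttsHeader { version: bytes[4], format: bytes[5] })
}
```

Checking the magic before touching any weights lets the loader reject a wrong or truncated file immediately.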