QORA-TTS 0.6B - Pure Rust Text-to-Speech
Pure Rust TTS engine with 9 built-in speakers. No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable text-to-speech that runs on any machine.
- Smart system awareness: automatically detects your hardware (RAM, CPU threads) and adjusts generation limits so TTS runs well even on constrained systems.
- 9 built-in voices: works out of the box with no reference audio needed.
- 10 languages supported.
Based on Qwen3-TTS-12Hz-0.6B-CustomVoice (Apache 2.0).
License
This project is licensed under Apache 2.0. The base model Qwen3-TTS-12Hz-0.6B-CustomVoice is released by the Qwen team under Apache 2.0.
What It Does
QORA-TTS 0.6B converts text to natural-sounding speech. Highlights:
- 9 built-in voices: male and female speakers embedded in the model, no reference audio needed
- 10 languages: English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian
- 24 kHz output: high-quality mono WAV audio
- Fast generation: the lighter 0.6B model synthesizes speech quicker than the 1.7B variant
- Controllable generation: adjust length, temperature, and sampling parameters
Quick Start
- Download all files into the same folder
- Run:

```
# Use a built-in speaker
qora-tts.exe --speaker ryan --language english --text "Hello, how are you?"

# Different speaker
qora-tts.exe --speaker serena --language chinese --text "你好世界"

# Japanese speaker
qora-tts.exe --speaker ono_anna --language japanese --text "こんにちは"

# Control length and output
qora-tts.exe --speaker aiden --language english --text "Good morning!" --max-codes 200 --output greeting.wav

# Reproducible output with a fixed seed
qora-tts.exe --speaker ryan --text "Same every time" --seed 42
```
Files
| File | Size | Purpose |
|---|---|---|
| qora-tts.exe | 4.1 MB | Inference engine |
| model.qora-tts | 971 MB | Q4 weights (talker + predictor + decoder) |
| config.json | 4.8 KB | Model configuration |
| tokenizer.json | 11 MB | Tokenizer (151,936 vocab) |
| vocab.json | 2.7 MB | Vocabulary |
| merges.txt | 1.6 MB | BPE merges |
| tokenizer_config.json | 7.2 KB | Tokenizer config |
No safetensors needed. Everything loads from model.qora-tts. The exe auto-finds all files in its own directory.
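Resolving files relative to the executable's own directory is a small amount of code; a sketch of that behavior (illustrative, not the engine's actual implementation):

```rust
use std::path::PathBuf;

/// Resolve a companion file next to the running executable,
/// mirroring the "auto-finds all files in its own directory" behavior.
fn sibling(name: &str) -> PathBuf {
    std::env::current_exe()
        .ok()
        .and_then(|exe| exe.parent().map(|dir| dir.join(name)))
        // Fall back to the current working directory if the exe path is unknown.
        .unwrap_or_else(|| PathBuf::from(name))
}

fn main() {
    println!("{}", sibling("model.qora-tts").display());
}
```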
Architecture
| Component | Details |
|---|---|
| Parameters | 0.6B total |
| Talker | 28 layers, hidden=1024, 16/8 GQA heads, SwiGLU FFN 3072 |
| Code Predictor | 5 layers, hidden=1024, 16 code groups |
| Speech Decoder | 8-layer transformer + Vocos vocoder, 16 VQ codebooks |
| Quantization | Q4 (4-bit symmetric, group_size=32) with LUT-optimized dequantization |
| Sample Rate | 24 kHz mono WAV |
| Code Rate | 12.5 Hz (1 code = 80ms of audio) |
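The Q4 row above can be made concrete. Below is a sketch of symmetric 4-bit dequantization over one 32-value group with a small lookup table for the nibble codes; the packing order and the -8..=7 code range are assumptions, not the engine's verified layout:

```rust
/// Dequantize one group of 32 weights stored as 4-bit codes, two per byte.
/// Symmetric Q4: each nibble maps to a signed integer, scaled per group.
fn dequant_group(packed: &[u8; 16], scale: f32, out: &mut [f32; 32]) {
    // LUT: precomputed nibble -> float, so the hot loop avoids sign arithmetic.
    let lut: [f32; 16] = core::array::from_fn(|i| (i as i32 - 8) as f32);
    for (i, byte) in packed.iter().enumerate() {
        out[2 * i] = lut[(byte & 0x0F) as usize] * scale;
        out[2 * i + 1] = lut[(byte >> 4) as usize] * scale;
    }
}

fn main() {
    let mut out = [0.0f32; 32];
    // Under this mapping, nibble 0x9 -> +1 and nibble 0x7 -> -1.
    dequant_group(&[0x79; 16], 0.5, &mut out);
    println!("{} {}", out[0], out[1]);
}
```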
How It Works
1. Text encoding: tokenize input text with the 151K-entry BPE vocabulary
2. Speaker selection: load a built-in speaker embedding by name
3. Code generation: the 28-layer Talker transformer generates speech codes autoregressively
4. Code expansion: the 5-layer Code Predictor expands code0 into 16 codebooks (codes 0-15)
5. Audio synthesis: the VQ decoder + Vocos vocoder convert the codes into a 24 kHz waveform
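At the 12.5 Hz code rate, code counts and audio duration convert directly (80 ms per code):

```rust
/// 12.5 code timesteps per second of audio, i.e. 80 ms per code.
const CODE_RATE_HZ: f32 = 12.5;

fn codes_to_seconds(codes: u32) -> f32 {
    codes as f32 / CODE_RATE_HZ
}

fn seconds_to_codes(seconds: f32) -> u32 {
    (seconds * CODE_RATE_HZ).ceil() as u32
}

fn main() {
    // The default --max-codes of 500 corresponds to 40 s of audio.
    println!("{}", codes_to_seconds(500));
}
```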
Smart System Awareness
QORA-TTS detects your system at startup and automatically adjusts generation limits:
```
QORA-TTS - Pure Rust Text-to-Speech Engine
System: 16384 MB RAM (9856 MB free), 12 threads
```
| Available RAM | Max Codes | Default | Audio Length |
|---|---|---|---|
| < 4 GB | 200 | 100 | ~8s |
| 4-8 GB | 500 | 300 | ~20s |
| 8-12 GB | 1000 | 500 | ~40s |
| >= 12 GB | 2000 | 500 | ~80s |
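The tiers above amount to a clamp on --max-codes. A sketch of the selection logic, assuming the engine's thresholds match the table exactly:

```rust
/// (hard cap, default) for --max-codes by free RAM in MB, per the table above.
fn code_limits(free_ram_mb: u64) -> (u32, u32) {
    match free_ram_mb {
        0..=4095 => (200, 100),
        4096..=8191 => (500, 300),
        8192..=12287 => (1000, 500),
        _ => (2000, 500),
    }
}

/// User-supplied values are clamped to the tier's hard cap.
fn effective_max_codes(requested: u32, free_ram_mb: u64) -> u32 {
    requested.min(code_limits(free_ram_mb).0)
}

fn main() {
    // With 6 GB free, a request of 2000 codes is clamped to 500.
    println!("{}", effective_max_codes(2000, 6 * 1024));
}
```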
Hard caps apply even to explicit user values: if you pass --max-codes 2000 on a system with 6 GB of free RAM, the value is clamped to 500 automatically. This keeps generation from running too long on constrained systems.
CLI Arguments
| Flag | Default | Description |
|---|---|---|
| --text <text> | "Hello, how are you today?" | Text to synthesize |
| --speaker <name> | ryan | Built-in speaker name |
| --language <name> | english | Target language |
| --output <path> | output.wav | Output WAV path |
| --max-codes <n> | 500 | Max code timesteps (~n/12.5 seconds of audio) |
| --temperature <f> | 0.8 | Sampling temperature |
| --top-k <n> | 50 | Top-K sampling |
| --seed <n> | random | Random seed for reproducibility |
Built-in Speakers
| Speaker | Language | Description |
|---|---|---|
| ryan | English | Dynamic male voice |
| aiden | English | Sunny American male |
| serena | Chinese | Warm, gentle female |
| vivian | Chinese | Bright young female |
| uncle_fu | Chinese | Seasoned male |
| dylan | Beijing dialect | Youthful male |
| eric | Sichuan dialect | Lively male |
| ono_anna | Japanese | Playful female |
| sohee | Korean | Warm female |
Supported Languages
| Language | Flag Value |
|---|---|
| English | english |
| Chinese | chinese |
| German | german |
| Italian | italian |
| Portuguese | portuguese |
| Spanish | spanish |
| Japanese | japanese |
| Korean | korean |
| French | french |
| Russian | russian |
Performance
Tested on i5-11500 (6C/12T), 16GB RAM, CPU-only:
| Phase | Time | Notes |
|---|---|---|
| Model Load | ~0.6s | From binary, 971 MB |
| Prefill | ~2-5s | Text + speaker embedding processing |
| Code Generation | ~1.5s/code | Autoregressive, 12.5 codes/sec of audio |
| Code Expansion | ~0.1s | 5-layer predictor, 16 codebooks |
| Audio Decode | ~0.5s/frame | VQ + Vocos vocoder |
| RAM Usage | ~970 MB | Q4 model in memory |
Example: "Hello, how are you?" (~3 seconds of audio) takes ~10-15 seconds total.
Comparison with 1.7B
| | QORA-TTS 0.6B | QORA-TTS 1.7B |
|---|---|---|
| Parameters | 0.6B | 1.7B |
| Model size | 971 MB | 1559 MB |
| Voice cloning | No | Yes (ECAPA-TDNN) |
| Built-in speakers | 9 (embedded) | 25 (via voice files) |
| Code generation | ~1.5s/code | ~2.5s/code |
| Quality | Good | Higher |
| Best for | Speed + simplicity | Quality + cloning |
Built With
- Language: Pure Rust (2024 edition)
- Dependencies: half (f16), rayon (parallelism), tokenizers (HuggingFace tokenizer), memmap2 (mmap for the converter), serde_json (config parsing)
- No ML framework for inference: all matrix ops are hand-written Rust
- Burn framework used only as a build dependency (for binary format types)
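Because inference uses no ML framework, kernels such as matrix-vector products are written directly in Rust. A naive, unquantized sketch of such a kernel (the engine's real kernels operate on Q4 weights and are parallelized with rayon):

```rust
/// Row-major matrix-vector product: y[r] = sum over c of w[r][c] * x[c].
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum()
        })
        .collect()
}

fn main() {
    // [[1, 2], [3, 4]] * [1, 1] = [3, 7]
    println!("{:?}", matvec(&[1.0, 2.0, 3.0, 4.0], &[1.0, 1.0], 2, 2));
}
```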
Model Binary Format (.qora-tts)
Custom binary format for fast loading:
```
Header:    "QTTS" magic + version + format byte
Talker:    28 transformer layers (Q4 quantized)
Predictor: 5 transformer layers + code embeddings
Decoder:   VQ codebooks + 8 transformer layers + Vocos vocoder
```
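Loading can start with a cheap sanity check on the magic bytes. A sketch, where only the 4-byte "QTTS" magic is taken from the layout above and the rest of the header is not parsed:

```rust
use std::io::Read;

/// Verify that a .qora-tts file starts with the "QTTS" magic.
fn check_magic(path: &str) -> std::io::Result<bool> {
    let mut magic = [0u8; 4];
    std::fs::File::open(path)?.read_exact(&mut magic)?;
    Ok(&magic == b"QTTS")
}

fn main() {
    // Fabricate a tiny file with the magic plus two placeholder header bytes.
    std::fs::write("demo.qora-tts", b"QTTS\x01\x00").unwrap();
    println!("{}", check_magic("demo.qora-tts").unwrap()); // true
}
```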