MOSS-TTS-Local-Transformer GGUF

First known GGUF quantization of MOSS-TTS-Local-Transformer (1.7B parameters).

The OpenMOSS team has only published GGUFs for their 8B Delay model. This repo provides quantized GGUFs for the lightweight 1.7B Local variant, suitable for edge/on-device TTS deployment.

Available Quantizations

File                         Quant    Size     BPW    Notes
moss-tts-local-Q2_K.gguf     Q2_K     1.1 GB   3.31   Smallest, aggressive
moss-tts-local-Q3_K_M.gguf   Q3_K_M   1.4 GB   4.10   Recommended balance
moss-tts-local-Q4_K_M.gguf   Q4_K_M   1.9 GB   5.06   Same tier as OpenMOSS 8B GGUF
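The BPW column relates file size to the total number of stored parameters (BPW = file size in bits / parameter count). Note that this total includes the audio components on top of the 1.7B backbone. A quick sanity check, using the table's figures (the parameter total is inferred, not an official number):

```python
# Bits-per-weight sanity check: BPW = file_size_in_bits / total_params.
# The parameter count below is back-solved from the table, not official.

def bpw(file_size_gb: float, total_params: int) -> float:
    """Average bits stored per model parameter."""
    return file_size_gb * 8e9 / total_params

# Solving the Q3_K_M row (1.4 GB at 4.10 BPW) for the parameter count:
params = 1.4 * 8e9 / 4.10
print(f"{params / 1e9:.1f}B params")  # ~2.7B: the 1.7B backbone plus audio tensors
```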

Architecture

MOSS-TTS-Local is a two-component TTS system:

Text → [Qwen3 backbone (1.7B, 28 layers)] → audio tokens → [ONNX decoder] → 24kHz PCM
                    ↕
         [Local transformer (4 layers)]
                    ↕
         [Bridge MLPs (33 codebooks)]
  • Backbone: Qwen3 architecture (hidden_size=2048, 28 layers, 16 attention heads, GQA 8 KV heads)
  • Audio components: 32 codebook embeddings, 33 LM heads, 33 layer norms, 4-layer local transformer, 33+1 bridge MLPs
  • Total tensors: 555 (after skipping duplicate text embedding)
  • Audio decoder: MOSS-Audio-Tokenizer-ONNX (encoder 1.5MB + decoder 14MB)

How This Was Converted

The standard convert_hf_to_gguf.py (both upstream llama.cpp and the OpenMOSS fork) fails on the Local variant because 213 audio-specific tensors lack GGUF name mappings. We added:

  1. 19 new MODEL_TENSOR entries in gguf-py/gguf/constants.py for local transformer layers, bridge MLPs, and audio layer norms
  2. ~70 lines of tensor mapping in convert_hf_to_gguf.py handling:
    • Embedding offset: model.embedding_list.0 = text (skip), 1-32 = audio codebooks 0-31
    • Local transformer with layer indexing
    • Indexed bridge MLPs (33 per projection type)
  3. A quantizer built from the OpenMOSS fork (upstream llama.cpp doesn't recognize the moss-tts-delay architecture)

Full conversion guide: see the accompanying PR on OpenMOSS/llama.cpp.
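The embedding-offset rule from step 2 can be sketched as a name-remapping function. The GGUF-side names below are placeholders for illustration; the real strings live in the fork's gguf-py/gguf/constants.py (see the PR):

```python
# Illustrative sketch of the embedding-offset remapping described above.
# HF index 0 of model.embedding_list is the text embedding (skipped: it
# duplicates the backbone's token embedding); indices 1-32 become audio
# codebooks 0-31. The GGUF-side names are placeholders, not the exact
# strings used in the fork.
import re

def map_embedding_name(hf_name: str):
    m = re.fullmatch(r"model\.embedding_list\.(\d+)\.weight", hf_name)
    if not m:
        return hf_name                       # not an embedding-list tensor
    idx = int(m.group(1))
    if idx == 0:
        return None                          # duplicate text embedding: skip
    return f"audio_embd.{idx - 1}.weight"    # placeholder GGUF name

print(map_embedding_name("model.embedding_list.0.weight"))  # None (skipped)
print(map_embedding_name("model.embedding_list.5.weight"))  # audio_embd.4.weight
```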

Inference

Current Status

The GGUF files contain all 555 tensors correctly. However, the llama-moss-tts C++ binary currently only supports the 8B Delay model's tensor layout (375 tensors). The Local variant's extra tensors need C++ loading + inference code in moss-tts-delay.cpp.

PyTorch inference works: audio quality was verified with the original safetensors, producing clean English speech at 24kHz.
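The tensor count of a produced GGUF can be checked without llama.cpp by reading the fixed-size file header directly. A minimal stdlib-only reader, assuming the GGUF v3 layout (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count, little-endian):

```python
# Minimal GGUF header reader (stdlib only) to verify the tensor count without
# loading the model. Assumes GGUF v3: magic "GGUF", uint32 version,
# uint64 tensor_count, uint64 metadata_kv_count, all little-endian.
import struct

def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return version, n_tensors, n_kv

# e.g. read_gguf_header("moss-tts-local-Q3_K_M.gguf") should report 555 tensors
```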

Audio Quality Verification

Tested with PyTorch bf16 on RTX 3090:

  • "Hello, I am LUI, your personal AI assistant. How can I help you today?" β†’ 5.8s clean audio
  • "Your stress has been elevated for the past hour..." β†’ 5.6s clean audio
  • "Playing rain sounds. Notifications muted..." β†’ 7.0s clean audio

Future: On-Device Mobile TTS

Target: Android phone with 8GB RAM using hot-swap loading:

  1. Unload chat LLM (Qwen 2.5 1.5B, 940MB)
  2. Load MOSS-TTS-Local Q3_K_M (1.4GB) via llama.cpp
  3. Generate audio tokens
  4. Decode with ONNX audio decoder (15.5MB)
  5. Play audio
  6. Unload TTS, reload chat LLM
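The steps above can be sketched as a memory-budget check. Model sizes are taken from the list; the 3 GB OS/app baseline is an assumption, and the actual llama.cpp / ONNX Runtime load and unload calls are elided:

```python
# Sketch of the hot-swap cycle above, tracking peak memory against the 8 GB
# device budget. Sizes (MB) come from the steps listed; load/unload calls are
# placeholders for llama.cpp (chat/TTS) and ONNX Runtime (decoder).
BUDGET_MB = 8 * 1024
CHAT_MB, TTS_MB, DECODER_MB = 940, 1400, 16

def tts_roundtrip(baseline_mb: int) -> int:
    """Return peak memory (MB) across one chat -> TTS -> chat cycle."""
    peak = baseline_mb + CHAT_MB           # chat LLM resident before the cycle
    resident = baseline_mb                 # 1. unload chat LLM
    resident += TTS_MB                     # 2. load MOSS-TTS-Local Q3_K_M
    resident += DECODER_MB                 # 3-5. generate tokens, decode, play
    peak = max(peak, resident)
    resident -= TTS_MB + DECODER_MB        # 6. unload TTS...
    resident += CHAT_MB                    #    ...and reload the chat LLM
    peak = max(peak, resident)
    assert peak < BUDGET_MB, "would exceed the 8 GB device budget"
    return peak

print(tts_roundtrip(baseline_mb=3000))  # assumed OS + app baseline
```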

Reproduction

# Prerequisites
pip install torch transformers==5.0.0 numpy sentencepiece gguf
git clone https://github.com/OpenMOSS/llama.cpp.git moss-fork

# Apply tensor mapping patches (see PR)
# Then:

# Download model
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('OpenMOSS-Team/MOSS-TTS-Local-Transformer', local_dir='moss-tts-local')"

# Convert
python3 moss-fork/convert_hf_to_gguf.py moss-tts-local/ --outfile moss-tts-local-bf16.gguf --outtype bf16

# Quantize (must use fork's quantizer)
cd moss-fork && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --target llama-quantize -j$(nproc)
./bin/llama-quantize ../../moss-tts-local-bf16.gguf ../../moss-tts-local-Q3_K_M.gguf Q3_K_M
