MOSS-TTS-Local-Transformer GGUF

First known GGUF quantization of MOSS-TTS-Local-Transformer (1.7B parameters).

The OpenMOSS team has only published GGUFs for their 8B Delay model. This repo provides quantized GGUFs for the lightweight 1.7B Local variant, suitable for edge/on-device TTS deployment.

Available Quantizations

File                         Quant    Size     BPW    Notes
moss-tts-local-Q2_K.gguf     Q2_K     1.1 GB   3.31   Smallest, aggressive
moss-tts-local-Q3_K_M.gguf   Q3_K_M   1.4 GB   4.10   Recommended balance
moss-tts-local-Q4_K_M.gguf   Q4_K_M   1.9 GB   5.06   Same tier as OpenMOSS 8B GGUF
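The BPW column relates file size to the total number of stored parameters (BPW = file size in bits / parameter count). Note that this total includes the audio components on top of the 1.7B backbone. A quick sanity check, using the table's figures (the parameter total is inferred, not an official number):

```python
# Bits-per-weight sanity check: BPW = file_size_in_bits / total_params.
# The parameter count below is back-solved from the table, not official.

def bpw(file_size_gb: float, total_params: int) -> float:
    """Average bits stored per model parameter."""
    return file_size_gb * 8e9 / total_params

# Solving the Q3_K_M row (1.4 GB at 4.10 BPW) for the parameter count:
params = 1.4 * 8e9 / 4.10
print(f"{params / 1e9:.1f}B params")  # ~2.7B: the 1.7B backbone plus audio tensors
```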

Architecture

MOSS-TTS-Local is a two-component TTS system:

Text → [Qwen3 backbone (1.7B, 28 layers)] → audio tokens → [ONNX decoder] → 24kHz PCM
                    ↕
         [Local transformer (4 layers)]
                    ↕
         [Bridge MLPs (33 codebooks)]
  • Backbone: Qwen3 architecture (hidden_size=2048, 28 layers, 16 attention heads, GQA 8 KV heads)
  • Audio components: 32 codebook embeddings, 33 LM heads, 33 layer norms, 4-layer local transformer, 33+1 bridge MLPs
  • Total tensors: 555 (after skipping duplicate text embedding)
  • Audio decoder: MOSS-Audio-Tokenizer-ONNX (encoder 1.5MB + decoder 14MB)

How This Was Converted

The standard convert_hf_to_gguf.py (both upstream llama.cpp and the OpenMOSS fork) fails on the Local variant because 213 audio-specific tensors lack GGUF name mappings. We added:

  1. 19 new MODEL_TENSOR entries in gguf-py/gguf/constants.py for local transformer layers, bridge MLPs, and audio layer norms
  2. ~70 lines of tensor mapping in convert_hf_to_gguf.py handling:
    • Embedding offset: model.embedding_list.0 = text (skip), 1-32 = audio codebooks 0-31
    • Local transformer with layer indexing
    • Indexed bridge MLPs (33 per projection type)
  3. A quantizer built from the OpenMOSS fork (upstream llama.cpp doesn't recognize the moss-tts-delay architecture)

Full conversion guide: see the accompanying PR on OpenMOSS/llama.cpp.
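The embedding-offset rule from step 2 can be sketched as a name-remapping function. The GGUF-side names below are placeholders for illustration; the real strings live in the fork's gguf-py/gguf/constants.py (see the PR):

```python
# Illustrative sketch of the embedding-offset remapping described above.
# HF index 0 of model.embedding_list is the text embedding (skipped: it
# duplicates the backbone's token embedding); indices 1-32 become audio
# codebooks 0-31. The GGUF-side names are placeholders, not the exact
# strings used in the fork.
import re

def map_embedding_name(hf_name: str):
    m = re.fullmatch(r"model\.embedding_list\.(\d+)\.weight", hf_name)
    if not m:
        return hf_name                       # not an embedding-list tensor
    idx = int(m.group(1))
    if idx == 0:
        return None                          # duplicate text embedding: skip
    return f"audio_embd.{idx - 1}.weight"    # placeholder GGUF name

print(map_embedding_name("model.embedding_list.0.weight"))  # None (skipped)
print(map_embedding_name("model.embedding_list.5.weight"))  # audio_embd.4.weight
```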

Inference

Current Status

The GGUF files contain all 555 tensors correctly. However, the llama-moss-tts C++ binary currently only supports the 8B Delay model's tensor layout (375 tensors). The Local variant's extra tensors need C++ loading + inference code in moss-tts-delay.cpp.

PyTorch inference works: audio quality was verified with the original safetensors, producing clean English speech at 24kHz.
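The tensor count of a produced GGUF can be checked without llama.cpp by reading the fixed-size file header directly. A minimal stdlib-only reader, assuming the GGUF v3 layout (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count, little-endian):

```python
# Minimal GGUF header reader (stdlib only) to verify the tensor count without
# loading the model. Assumes GGUF v3: magic "GGUF", uint32 version,
# uint64 tensor_count, uint64 metadata_kv_count, all little-endian.
import struct

def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return version, n_tensors, n_kv

# e.g. read_gguf_header("moss-tts-local-Q3_K_M.gguf") should report 555 tensors
```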

Audio Quality Verification

Tested with PyTorch bf16 on RTX 3090:

  • "Hello, I am LUI, your personal AI assistant. How can I help you today?" β†’ 5.8s clean audio
  • "Your stress has been elevated for the past hour..." β†’ 5.6s clean audio
  • "Playing rain sounds. Notifications muted..." β†’ 7.0s clean audio

Future: On-Device Mobile TTS

Target: Android phone with 8GB RAM using hot-swap loading:

  1. Unload chat LLM (Qwen 2.5 1.5B, 940MB)
  2. Load MOSS-TTS-Local Q3_K_M (1.4GB) via llama.cpp
  3. Generate audio tokens
  4. Decode with ONNX audio decoder (15.5MB)
  5. Play audio
  6. Unload TTS, reload chat LLM
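The steps above can be sketched as a memory-budget check. Model sizes are taken from the list; the 3 GB OS/app baseline is an assumption, and the actual llama.cpp / ONNX Runtime load and unload calls are elided:

```python
# Sketch of the hot-swap cycle above, tracking peak memory against the 8 GB
# device budget. Sizes (MB) come from the steps listed; load/unload calls are
# placeholders for llama.cpp (chat/TTS) and ONNX Runtime (decoder).
BUDGET_MB = 8 * 1024
CHAT_MB, TTS_MB, DECODER_MB = 940, 1400, 16

def tts_roundtrip(baseline_mb: int) -> int:
    """Return peak memory (MB) across one chat -> TTS -> chat cycle."""
    peak = baseline_mb + CHAT_MB           # chat LLM resident before the cycle
    resident = baseline_mb                 # 1. unload chat LLM
    resident += TTS_MB                     # 2. load MOSS-TTS-Local Q3_K_M
    resident += DECODER_MB                 # 3-5. generate tokens, decode, play
    peak = max(peak, resident)
    resident -= TTS_MB + DECODER_MB        # 6. unload TTS...
    resident += CHAT_MB                    #    ...and reload the chat LLM
    peak = max(peak, resident)
    assert peak < BUDGET_MB, "would exceed the 8 GB device budget"
    return peak

print(tts_roundtrip(baseline_mb=3000))  # assumed OS + app baseline
```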

Reproduction

# Prerequisites
pip install torch transformers==5.0.0 numpy sentencepiece gguf
git clone https://github.com/OpenMOSS/llama.cpp.git moss-fork

# Apply tensor mapping patches (see PR)
# Then:

# Download model
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('OpenMOSS-Team/MOSS-TTS-Local-Transformer', local_dir='moss-tts-local')"

# Convert
python3 moss-fork/convert_hf_to_gguf.py moss-tts-local/ --outfile moss-tts-local-bf16.gguf --outtype bf16

# Quantize (must use fork's quantizer)
cd moss-fork && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --target llama-quantize -j$(nproc)
./bin/llama-quantize ../../moss-tts-local-bf16.gguf ../../moss-tts-local-Q3_K_M.gguf Q3_K_M
