---
license: mit
base_model: microsoft/VibeVoice-ASR
tags:
- automatic-speech-recognition
- vibevoice
- bitsandbytes
- 8-bit
- int8
- quantized
- diarization
- multilingual
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# VibeVoice-ASR — Selective INT8 Quantization

Selectively quantized version of [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) for low-VRAM deployment. **Only the Qwen2.5-7B LLM backbone is quantized to INT8.** Audio tokenizers, connectors, and the `lm_head` remain in full BF16 precision, preserving diarization accuracy and transcription quality.

> ⚠️ This model uses the **standalone** `vibevoice` package (`pip install git+https://github.com/microsoft/VibeVoice.git`), NOT the HF-native `transformers >= 5.3.0` variant. It requires `transformers == 4.57.3`.

## Key details

| | |
|---|---|
| Base model | [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR) |
| Quantization | INT8 (bitsandbytes `Linear8bitLt`) |
| Modules quantized | `model.language_model.model.layers.*` (196 layers) |
| Modules in BF16 | `acoustic_tokenizer`, `semantic_tokenizer`, `acoustic_connector`, `semantic_connector`, `lm_head` (161 layers) |
| Model size | ~9.2 GB (down from 17.3 GB) |
| Peak VRAM | ~12.5 GB (including inference activations) |
| Transformers | == 4.57.3 |
| bitsandbytes | >= 0.48.1 |

## Why selective quantization?

Naive INT8 quantization of the entire model produces `[Unintelligible Speech]`: the model detects speech boundaries but cannot decode the content. The acoustic and semantic tokenizer encoders process raw audio signals, where quantization errors propagate catastrophically. The LLM backbone (Qwen2.5-7B) handles INT8 quantization gracefully.

**Critical discovery:** The standalone `vibevoice` package uses different module names than the HF-native variant.
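As a guard against mixing the two naming schemes, here is a small sketch of a translation helper. `to_standalone_skip_modules` and `HF_NATIVE_TO_STANDALONE` are hypothetical names (not part of `vibevoice` or `transformers`); the mapping itself comes from the table below.

```python
# Hypothetical helper: map HF-native module names onto the standalone
# package's names so a skip list is never silently wrong.
HF_NATIVE_TO_STANDALONE = {
    "acoustic_tokenizer_encoder": "acoustic_tokenizer",
    "semantic_tokenizer_encoder": "semantic_tokenizer",
    "acoustic_projection": "acoustic_connector",
    "semantic_projection": "semantic_connector",
}

def to_standalone_skip_modules(names):
    """Translate a skip list written for the HF-native variant.

    Names shared by both variants (e.g. "lm_head") pass through unchanged.
    """
    return [HF_NATIVE_TO_STANDALONE.get(n, n) for n in names]
```

For example, `to_standalone_skip_modules(["acoustic_projection", "lm_head"])` returns `["acoustic_connector", "lm_head"]`.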
The correct skip list for the standalone model is:

| Standalone (this model) | HF-native (won't work here) |
|---|---|
| `acoustic_tokenizer` | `acoustic_tokenizer_encoder` |
| `semantic_tokenizer` | `semantic_tokenizer_encoder` |
| `acoustic_connector` | `acoustic_projection` |
| `semantic_connector` | `semantic_projection` |

Using the HF-native names with the standalone package silently quantizes the audio-critical modules, producing garbage output.

## Usage

```python
import torch

from vibevoice.modular.modeling_vibevoice_asr import VibeVoiceASRForConditionalGeneration
from vibevoice.processor.vibevoice_asr_processor import VibeVoiceASRProcessor

model_id = "Dubedo/VibeVoice-ASR-INT8"

# Load the processor; no preprocessor_config.json is needed
# (the default compress ratio of 3200 is correct)
processor = VibeVoiceASRProcessor.from_pretrained(
    model_id,
    language_model_pretrained_name="Qwen/Qwen2.5-7B",
)

# Load the quantized model
model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Transcribe
inputs = processor(
    audio=["path/to/audio.wav"],
    sampling_rate=None,
    return_tensors="pt",
    padding=True,
    add_generation_prompt=True,
)
inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        pad_token_id=processor.pad_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )

# Decode only the newly generated tokens, then split into segments
input_length = inputs["input_ids"].shape[1]
generated_ids = output_ids[0, input_length:]
text = processor.decode(generated_ids, skip_special_tokens=True)
segments = processor.post_process_transcription(text)
```

## Quantization method

Quantized on an NVIDIA L4 (22 GB) using the standalone `vibevoice` package with `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)

model = VibeVoiceASRForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

## Important notes

- **Do NOT create a `preprocessor_config.json`.** The standalone processor's default fallback sets `speech_tok_compress_ratio=3200`, which is correct; creating one with `ratio=320` causes a 10x mask-shape mismatch and an `IndexError`.
- **Requires `bitsandbytes >= 0.48.1`.** v0.48.0 has a confirmed critical bug that breaks INT8 quantization.
- **INT8 models cannot be moved between CPU and GPU.** Use a delete-and-reload pattern for VRAM management instead of offloading.

## Acknowledgments

Based on [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR). Built for the [Dubedo](https://dubedo.com) AI video dubbing platform.
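As a footnote to the VRAM-management note above, here is a minimal sketch of the delete-and-reload pattern. The helper name `free_cuda_memory` is an assumption, not a library API, and the `torch` import is guarded so the sketch stays self-contained.

```python
import gc

def free_cuda_memory():
    # Collect any dropped Python references, then return the cached
    # CUDA blocks to the driver (a no-op when CUDA is unavailable).
    gc.collect()
    try:
        import torch
        torch.cuda.empty_cache()
    except ImportError:
        pass

# Usage sketch:
#   del model            # drop EVERY reference to the INT8 model first
#   free_cuda_memory()
#   ...later, reload with from_pretrained(...) when the model is needed again
```

The key point is that `del` alone only removes a name; the VRAM is reclaimed once no references remain and the cache is emptied, after which the only way to get the model back is a fresh `from_pretrained` load.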