MMAudio NSFW - FP16 Optimized
This repository contains an FP16 safetensors version of the fine-tuned MMAudio model from cloud19/NSFW_MMaudio, optimized for improved memory efficiency and faster loading times.
Base Model: cloud19/NSFW_MMaudio
Original Project: hkchengrex/MMAudio
Model Details
- Base Architecture: large_44k (from the original MMAudio)
- Fine-tuning: Fine-tuned on NSFW content (see base model for details)
- Optimization: Converted from FP32 PyTorch checkpoint to FP16 safetensors
- Capabilities: Video-to-Audio, Image-to-Audio, Text-to-Audio
- Format: Safetensors (.safetensors)
- Precision: 16-bit floating point
Improvements Over Base Model
- ~50% smaller file size (FP32 → FP16 conversion)
- Faster loading with safetensors format
- Lower GPU memory usage during inference
- Same quality output (minimal precision loss with FP16)
- Better compatibility with modern ML frameworks
How to Use
This model can be used as a drop-in replacement for the original model. Load the safetensors file instead of the original PyTorch checkpoint:
from safetensors.torch import load_file
# Load the FP16 model weights
model_weights = load_file("model_fp16.safetensors")
# Load into your MMAudio model architecture
# (follow the same usage pattern as the base model)
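Before loading the weights, it can be useful to confirm that a downloaded file really stores 16-bit tensors. The following is a minimal sketch that reads only the file header, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_header` and `dtype_counts` are hypothetical and not part of any library; the file name is the one used in the snippet above.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned
        # little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def dtype_counts(header):
    """Count tensors per dtype string (e.g. "F16", "F32"), skipping metadata."""
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


# Example usage (assumes the file is in the working directory):
# print(dtype_counts(read_safetensors_header("model_fp16.safetensors")))
```

If the conversion worked as described, the floating-point tensors should be reported as `F16`. Any tensors deliberately kept in another dtype would show up under their own key.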
System Requirements:
- GPU: 8-12 GB VRAM (reduced from 12-16 GB due to FP16 optimization)
- Python 3.10+
- PyTorch with CUDA support
Installation
For installation and usage instructions, refer to the base model repository and replace the model loading step with the FP16 safetensors file.
Technical Details
- Original Format: FP32 PyTorch (.pth) - ~2.5GB
- Optimized Format: FP16 Safetensors (.safetensors) - ~1.25GB
- Conversion Method: Direct FP32 → FP16 tensor conversion
- Quality Impact: Negligible quality loss in practice
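The size and precision figures above follow from the storage width of each value: 4 bytes for FP32 against 2 bytes for FP16. The sketch below illustrates this with the Python standard library only. It is not the conversion script used for this repository (which is not shown here and presumably operated on tensors), and the helper names are hypothetical.

```python
import struct


def to_fp16_and_back(x):
    """Round a Python float to IEEE 754 half precision and return the result."""
    # The "e" format code is the 2-byte half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def size_ratio():
    """Bytes per FP16 value divided by bytes per FP32 value."""
    return struct.calcsize("<e") / struct.calcsize("<f")


def max_abs_error(values):
    """Largest absolute rounding error introduced by the FP16 round trip."""
    return max(abs(v - to_fp16_and_back(v)) for v in values)


# Example usage:
# size_ratio()                      # storage halves per value
# max_abs_error([0.1, 0.25, 1.0])   # small but non-zero for 0.1
```

Values that are exactly representable in half precision, such as 0.25 or 1.0, survive the round trip unchanged, while others pick up a small rounding error. Note that FP16 has a largest finite value of 65504, so `struct.pack` raises `OverflowError` for inputs beyond that range.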
Limitations
- Same limitations as the base model apply
- Content Warning: Due to the NSFW nature of the fine-tuning dataset, the model may generate explicit or mature audio content. User discretion is advised.
- FP16 precision may introduce minimal numerical differences compared to FP32
Credits & Citation
Base Model: cloud19/NSFW_MMaudio
Original MMAudio: hkchengrex/MMAudio
Optimization: FP16 conversion for improved efficiency
All credit for the original architecture, fine-tuning, and model development goes to the respective authors. This repository only provides format optimization.
@inproceedings{cheng2025taming,
title={{MMAudio}: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
booktitle={CVPR},
year={2025}
}