MMAudio NSFW - FP16 Optimized

This repository contains an FP16 safetensors version of the fine-tuned MMAudio model from cloud19/NSFW_MMaudio, optimized for improved memory efficiency and faster loading times.

Base Model: cloud19/NSFW_MMaudio
Original Project: hkchengrex/MMAudio

Model Details

  • Base Architecture: large_44k (from the original MMAudio)
  • Fine-tuning: Fine-tuned on NSFW content (see base model for details)
  • Optimization: Converted from FP32 PyTorch checkpoint to FP16 safetensors
  • Capabilities: Video-to-Audio, Image-to-Audio, Text-to-Audio
  • Format: Safetensors (.safetensors)
  • Precision: 16-bit floating point

Improvements Over Base Model

✅ ~50% smaller file size (FP32 → FP16 conversion)
✅ Faster loading with safetensors format
✅ Lower GPU memory usage during inference
✅ Near-identical output quality (FP16 introduces only small numerical differences)
✅ Better compatibility with modern ML frameworks

How to Use

This model can be used as a drop-in replacement for the original model. Load the safetensors file instead of the original PyTorch checkpoint:

from safetensors.torch import load_file

# Load the FP16 model weights
model_weights = load_file("model_fp16.safetensors")

# Load into your MMAudio model architecture
# (follow the same usage pattern as the base model)
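
Before loading the weights, it can be useful to confirm that a downloaded file really stores 16-bit tensors. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, shapes, and offsets), so no tensor data is loaded. The helper names read_safetensors_header and dtype_counts are illustrative and are not part of the safetensors library.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_counts(header):
    """Count tensors per dtype string, for example "F16" or "F32"."""
    return Counter(entry["dtype"] for entry in header.values())
```

Calling dtype_counts(read_safetensors_header("model_fp16.safetensors")) on a fully converted file should report only "F16" entries, assuming every tensor in the checkpoint was cast during conversion.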

System Requirements:

  • GPU: 8-12 GB VRAM (reduced from 12-16 GB due to FP16 optimization)
  • Python 3.10+
  • PyTorch with CUDA support

Installation

For installation and usage instructions, refer to the base model repository and the original MMAudio project. The only change needed is to load the FP16 safetensors file in place of the original PyTorch checkpoint.

Technical Details

  • Original Format: FP32 PyTorch (.pth) - ~2.5 GB
  • Optimized Format: FP16 Safetensors (.safetensors) - ~1.25 GB
  • Conversion Method: Direct FP32 → FP16 tensor conversion
  • Quality Impact: Negligible quality loss in practice
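
The size figures above follow from the element width: FP32 stores 4 bytes per value and FP16 stores 2. A minimal arithmetic sketch, assuming the checkpoint consists almost entirely of floating-point weights and treating the ~2.5 GB figure as approximate:

```python
# Bytes used to store one tensor element at each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def converted_size_gb(size_gb, src="fp32", dst="fp16"):
    """Scale a checkpoint size by the ratio of bytes per element."""
    return size_gb * BYTES_PER_ELEMENT[dst] / BYTES_PER_ELEMENT[src]


fp32_size_gb = 2.5  # approximate size of the original .pth file, as listed above
fp16_size_gb = converted_size_gb(fp32_size_gb)
saving = 1 - fp16_size_gb / fp32_size_gb

print(f"{fp32_size_gb} GB -> {fp16_size_gb} GB ({saving:.0%} smaller)")
# prints "2.5 GB -> 1.25 GB (50% smaller)"
```

Real files can deviate slightly from this estimate because of the safetensors header and any tensors that are not stored as floating point.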

Limitations

  • Same limitations as the base model apply
  • Content Warning: Due to the NSFW nature of the fine-tuning dataset, the model may generate explicit or mature audio content. User discretion is advised.
  • FP16 precision may introduce minimal numerical differences compared to FP32
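
To give a feel for the size of these numerical differences, the sketch below rounds a few values to half precision using Python's standard struct module (format code "e"). It illustrates FP16 rounding in general and is not a measurement of this model; the sample values are arbitrary.

```python
import struct

FP16_MAX = 65504.0  # largest finite half-precision value


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x):
    """Relative difference between x and its half-precision rounding."""
    return abs(to_fp16(x) - x) / abs(x)


for value in (0.1, 1.0, 3.14159, 1234.5678):
    print(f"{value} -> {to_fp16(value)} (relative error {relative_error(value):.2e})")

# Values beyond the FP16 range cannot be packed; struct raises OverflowError.
try:
    to_fp16(70000.0)
except OverflowError:
    print(f"70000.0 is outside the FP16 range (max {FP16_MAX})")
```

For values in the normal FP16 range, the relative rounding error stays at or below 2^-11 (about 0.05%). Note that struct raises an error for out-of-range values, whereas PyTorch's half-precision cast produces inf instead, so a checkpoint with unusually large weights would need a check before conversion.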

Credits & Citation

Base Model: cloud19/NSFW_MMaudio
Original MMAudio: hkchengrex/MMAudio
Optimization: FP16 conversion for improved efficiency

All credit for the original architecture, fine-tuning, and model development goes to the respective authors. This repository only provides format optimization.

@inproceedings{cheng2025taming,
  title={{MMAudio}: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={CVPR},
  year={2025}
}