NanoCodec for Apple Silicon

This is an MLX implementation of NVIDIA NeMo NanoCodec, a lightweight neural audio codec.

Model Description

Architecture: a fully convolutional generator trained with three discriminators. The generator comprises an encoder, a vector quantization stage, and a HiFi-GAN-based decoder.
Sample Rate: 22.05 kHz
Framework: MLX
Parameters: 105M
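
The checkpoint name used in the Usage section ends in "22khz-0.6kbps-12.5fps". As a rough sketch, and assuming those suffixes mean a 22050 Hz sample rate, a 600 bit/s bitrate, and 12.5 token frames per second (the name suggests this, but the card does not spell it out), the per-frame figures work out as follows. The `expected_frames` helper is hypothetical and only approximates how the model might round partial frames.

```python
import math

# Figures assumed from the checkpoint name "22khz-0.6kbps-12.5fps".
SAMPLE_RATE_HZ = 22050   # stated in the model description
FRAME_RATE_FPS = 12.5    # assumption: token frames per second
BITRATE_BPS = 600        # assumption: 0.6 kbps = 600 bits per second

# Audio samples covered by one token frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_FPS

# Bits available to describe one token frame.
bits_per_frame = BITRATE_BPS / FRAME_RATE_FPS


def expected_frames(num_samples: int) -> int:
    """Approximate frame count for a clip, assuming partial frames round up."""
    return math.ceil(num_samples / samples_per_frame)


print(samples_per_frame, bits_per_frame, expected_frames(SAMPLE_RATE_HZ))
```

Under these assumptions each frame spans 1764 samples (80 ms) and carries 48 bits.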

Installation

pip install nanocodec-mlx soundfile

Usage

from nanocodec_mlx.models.audio_codec import AudioCodecModel
import soundfile as sf
import mlx.core as mx
import numpy as np

# Load model from HuggingFace Hub
model = AudioCodecModel.from_pretrained("nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-MLX")

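# Load audio (the model expects 22050 Hz mono input; see Input below)
audio, sr = sf.read("input.wav")
assert sr == 22050 and audio.ndim == 1, "expected 22050 Hz mono audio"
# Add two leading axes -> shape (1, 1, num_samples), and pass the sample count alongside
audio_mlx = mx.array(audio, dtype=mx.float32)[None, None, :]
audio_len = mx.array([len(audio)], dtype=mx.int32)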

# Encode and decode
tokens, tokens_len = model.encode(audio_mlx, audio_len)
reconstructed, recon_len = model.decode(tokens, tokens_len)

# Save output
output = np.array(reconstructed[0, 0, :int(recon_len[0])])
sf.write("output.wav", output, 22050)
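
Because `encode` takes a length array next to the audio tensor, it may be possible to process several clips at once by zero-padding them to a common length. The sketch below only builds the padded array and the lengths with NumPy; `pad_batch` is a hypothetical helper, and whether `model.encode` accepts a batch larger than one is an assumption this card does not confirm.

```python
import numpy as np


def pad_batch(clips):
    """Hypothetical helper: zero-pad 1D clips into shape (B, 1, T_max).

    Returns the padded float32 batch and an int32 array of original lengths.
    """
    lengths = np.array([len(c) for c in clips], dtype=np.int32)
    batch = np.zeros((len(clips), 1, int(lengths.max())), dtype=np.float32)
    for i, clip in enumerate(clips):
        # Copy each clip to the start of its row; the tail stays zero.
        batch[i, 0, : len(clip)] = clip
    return batch, lengths


# Two dummy clips of different lengths.
batch, lengths = pad_batch([np.ones(3), np.ones(5)])
print(batch.shape, lengths)
```

The results would then be wrapped with `mx.array(batch)` and `mx.array(lengths)` before being passed to the model, mirroring the single-clip example above.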

Input

Input Type: Audio
Input Format(s): .wav files
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: 22050 Hz Mono-channel Audio
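
Files that are stereo or recorded at another sample rate need converting before they match the requirements above. The following is a minimal sketch using only NumPy. `prepare_mono` is a hypothetical helper, not part of nanocodec-mlx, and its linear-interpolation resampling is a crude stand-in (no anti-aliasing filter) for a dedicated resampler.

```python
import numpy as np

TARGET_SR = 22050  # sample rate stated in the Input section


def prepare_mono(audio, sr, target_sr=TARGET_SR):
    """Hypothetical helper: return 1D float32 mono audio at target_sr."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile returns (frames, channels) for multi-channel files;
        # average the channels to downmix to mono.
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError("expected a 1D or 2D audio array")
    if sr != target_sr:
        # Crude resampling by linear interpolation on the time axis.
        n_out = int(round(len(audio) * target_sr / sr))
        old_t = np.arange(len(audio)) / sr
        new_t = np.arange(n_out) / target_sr
        audio = np.interp(new_t, old_t, audio).astype(np.float32)
    return audio


# One second of 44.1 kHz stereo becomes one second of 22.05 kHz mono.
stereo = np.ones((44100, 2))
mono = prepare_mono(stereo, 44100)
print(mono.shape, mono.dtype)
```

The returned array can be used in place of `audio` in the Usage example.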

Output

Output Type: Audio
Output Format: .wav files
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: 22050 Hz Mono-channel Audio

License

This code is licensed under the Apache License 2.0.

The original NVIDIA NeMo NanoCodec model weights and architecture are developed by NVIDIA and are licensed under the NVIDIA Open Model License. See NOTICE for attribution.

When using this project, you must comply with both licenses.

Citation

This is an MLX implementation of NVIDIA NeMo NanoCodec. If you use this work, please cite the original NVIDIA NeMo NanoCodec work.

Related Paper

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (arXiv:2010.05646, published Oct 12, 2020)