metadata
license: mit
library_name: transformers
pipeline_tag: audio-to-audio
WavCube
WavCube is a 128-dimensional, 50 Hz continuous speech representation that unifies speech understanding, reconstruction, and generation within a single space. It was introduced in the paper "WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling".
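To make the "128-dimensional, 50 Hz" figures concrete, the sketch below estimates the size of the representation for a clip of a given duration: roughly 50 frames per second, each holding 128 values. The function name `expected_feature_shape` is hypothetical, and rounding down is an assumption; the model's actual padding and framing may produce a slightly different frame count.

```python
import math

FRAME_RATE_HZ = 50   # frames per second, as stated in the description
FEATURE_DIM = 128    # values per frame, as stated in the description


def expected_feature_shape(duration_seconds: float) -> tuple[int, int]:
    """Estimate (frames, dim) for a clip. Rounding down is an assumption."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    frames = math.floor(duration_seconds * FRAME_RATE_HZ)
    return frames, FEATURE_DIM


print(expected_feature_shape(3.0))  # (150, 128)
```

A 3-second clip therefore corresponds to about 150 x 128 = 19,200 values, regardless of the audio sample rate.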
- Code: GitHub Repository
- Paper: arXiv:2605.06407
Usage
Before using the model, install the requirements listed in the official repository.
Extract Representation from Speech
Extract a continuous representation from a raw WAV file with the following command:
python wav_to_feature.py \
--audio 19_198_000000_000002.wav \
--config configs/WavCube-stage2.yaml \
--ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt \
--output 19_198_000000_000002.pt
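For more than one file, the same command can be generated per input. The following is a minimal sketch that only builds argument lists from the flags shown above (`--audio`, `--config`, `--ckpt`, `--output`); the helper names `build_extract_command` and `build_batch` are hypothetical, and writing each `.pt` file next to its source audio is an assumption rather than a convention of the repository.

```python
from pathlib import Path


def build_extract_command(audio: Path, config: str, ckpt: str) -> list[str]:
    """Build the wav_to_feature.py argument list for one audio file."""
    # Assumption: the feature file is saved next to the audio with a .pt suffix.
    output = audio.with_suffix(".pt")
    return [
        "python", "wav_to_feature.py",
        "--audio", str(audio),
        "--config", config,
        "--ckpt", ckpt,
        "--output", str(output),
    ]


def build_batch(audio_dir: str, config: str, ckpt: str) -> list[list[str]]:
    """Build one command per .wav file in audio_dir, in sorted order."""
    wavs = sorted(Path(audio_dir).glob("*.wav"))
    return [build_extract_command(wav, config, ckpt) for wav in wavs]


example = build_extract_command(
    Path("19_198_000000_000002.wav"),
    "configs/WavCube-stage2.yaml",
    "WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt",
)
print(" ".join(example))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that a failed extraction stops the batch instead of passing silently.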
Reconstruct Speech from Representation
Reconstruct the waveform from a saved representation with the following command:
python feature_to_wav.py \
--feature 19_198_000000_000002.pt \
--config configs/WavCube-stage2.yaml \
--ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt
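Extraction and reconstruction share the same config and checkpoint, and the feature file written by the first step is the input to the second. The sketch below makes that link explicit by producing both command lines for one clip. The function name `round_trip_commands` is hypothetical, and only flags that appear in the commands above are used; where `feature_to_wav.py` writes the reconstructed audio is not stated here, so no output flag is assumed.

```python
import shlex

CONFIG = "configs/WavCube-stage2.yaml"
CKPT = "WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt"


def round_trip_commands(audio: str, feature: str) -> list[str]:
    """Return the extract and reconstruct command lines for one clip."""
    extract = [
        "python", "wav_to_feature.py",
        "--audio", audio,
        "--config", CONFIG,
        "--ckpt", CKPT,
        "--output", feature,
    ]
    # The reconstruction step reads the feature file written by the extract step.
    reconstruct = [
        "python", "feature_to_wav.py",
        "--feature", feature,
        "--config", CONFIG,
        "--ckpt", CKPT,
    ]
    return [shlex.join(extract), shlex.join(reconstruct)]


for line in round_trip_commands("19_198_000000_000002.wav", "19_198_000000_000002.pt"):
    print(line)
```

Comparing the reconstructed audio against the original clip is a quick sanity check that the checkpoint and config match.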
Citation
@misc{yang2025wavcube,
title={WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling},
author={Haohan Yang and others},
year={2025},
eprint={2605.06407},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2605.06407},
}