---
license: mit
library_name: transformers
pipeline_tag: audio-to-audio
---

# WavCube

WavCube is a 128-dimensional, 50 Hz continuous representation that unifies speech understanding, reconstruction, and generation in a single space. It is introduced in the paper [WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling](https://huggingface.co/papers/2605.06407).

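
To make the 128-dim, 50 Hz figures concrete, the following minimal sketch estimates the size of the representation for a clip of a given length. The helper names, the rounding-up of the frame count, and the `(frames, dims)` layout are assumptions made for illustration; the tensor actually saved by the repository scripts may be ordered differently or carry a batch axis.

```python
import math

FRAME_RATE_HZ = 50   # frame rate stated in the model description
FEATURE_DIM = 128    # feature dimension stated in the model description


def expected_feature_shape(duration_s):
    """Approximate (frames, dims) for a clip lasting duration_s seconds.

    Assumption: the frame count is rounded up. The real count depends on
    how the encoder pads the signal, which is not documented here.
    """
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    return frames, FEATURE_DIM


def values_per_second():
    """Number of scalar values the representation uses per second of audio."""
    return FRAME_RATE_HZ * FEATURE_DIM


print(expected_feature_shape(2.0))  # (100, 128)
print(values_per_second())          # 6400
```

Under these assumptions, a two-second clip maps to roughly 100 frames of 128 values each.
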
- **Code:** [GitHub Repository](https://github.com/yanghaha0908/WavCube)
- **Paper:** [arXiv:2605.06407](https://arxiv.org/abs/2605.06407)

## Usage

Before using the model, install the requirements described in the [official repository](https://github.com/yanghaha0908/WavCube).

### Extract Representation from Speech
To extract the continuous representation from a raw WAV file, run:

```bash
python wav_to_feature.py \
    --audio 19_198_000000_000002.wav \
    --config configs/WavCube-stage2.yaml \
    --ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt \
    --output 19_198_000000_000002.pt
```

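
To process more than one file, the same command can be generated once per input. The sketch below only builds the argument lists and uses only the flags shown above; the helper names `build_extract_cmd` and `build_batch`, and the one-`.pt`-per-`.wav` output layout, are hypothetical and not part of the repository.

```python
from pathlib import Path


def build_extract_cmd(audio, config, ckpt, output):
    """Return the argument list for a single wav_to_feature.py call."""
    return [
        "python", "wav_to_feature.py",
        "--audio", str(audio),
        "--config", str(config),
        "--ckpt", str(ckpt),
        "--output", str(output),
    ]


def build_batch(wav_dir, out_dir, config, ckpt):
    """Build one extraction command per .wav file found in wav_dir.

    Assumption: each feature file is named after its input, with the
    extension changed to .pt, and placed in out_dir.
    """
    wav_dir, out_dir = Path(wav_dir), Path(out_dir)
    return [
        build_extract_cmd(wav, config, ckpt, out_dir / (wav.stem + ".pt"))
        for wav in sorted(wav_dir.glob("*.wav"))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Whether `wav_to_feature.py` creates a missing output directory is not stated, so creating `out_dir` beforehand is the safer choice.
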
### Reconstruct Speech from Representation
To reconstruct the waveform from a saved representation, run:

```bash
python feature_to_wav.py \
    --feature 19_198_000000_000002.pt \
    --config configs/WavCube-stage2.yaml \
    --ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt
```

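
The two commands can be chained into a round trip: extract features, confirm the feature file was written, then reconstruct. The sketch below is one possible way to do that, not the repository's own tooling. The function name `round_trip` and its injectable `run` parameter are hypothetical. No output flag is passed to `feature_to_wav.py` because none is shown above, so where the reconstructed audio is written is left to the script.

```python
import subprocess
from pathlib import Path


def round_trip(audio, feature, config, ckpt, run=subprocess.run):
    """Extract features from audio, then reconstruct speech from them.

    `run` defaults to subprocess.run and can be swapped for a stub
    when testing without the repository installed.
    """
    extract = [
        "python", "wav_to_feature.py",
        "--audio", str(audio),
        "--config", str(config),
        "--ckpt", str(ckpt),
        "--output", str(feature),
    ]
    # check=True makes subprocess.run raise CalledProcessError on failure.
    run(extract, check=True)

    # Stop early if extraction did not produce the expected file.
    if not Path(feature).exists():
        raise FileNotFoundError(f"no feature file at {feature}")

    reconstruct = [
        "python", "feature_to_wav.py",
        "--feature", str(feature),
        "--config", str(config),
        "--ckpt", str(ckpt),
    ]
    run(reconstruct, check=True)
    return extract, reconstruct
```

Both steps take the same `--config` and `--ckpt` values in the commands above, which is why the function accepts each of them only once.
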
## Citation

```bibtex
@misc{yang2025wavcube,
      title={WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling},
      author={Haohan Yang and others},
      year={2025},
      eprint={2605.06407},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.06407},
}
```