# WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
<p align="center">
<img src="doc/wavcube_logo.png" alt="WavCube Logo" width="400"/>
</p>
[Code](https://github.com/yanghaha0908/WavCube) | [Paper](https://arxiv.org/abs/2605.06407) | [Models](https://huggingface.co/yhaha/WavCube)
WavCube is a 128-dimensional, 50 Hz continuous speech representation that unifies speech understanding,
reconstruction, and generation within a single space.
This is the official code for the paper [WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling](https://arxiv.org/pdf/2605.06407) [[abs](https://arxiv.org/abs/2605.06407)].
## ✨ Key Features
- **Unified Speech Representation** – A single continuous latent space that simultaneously supports speech understanding, reconstruction, and generation.
- **Semantic-Acoustic Joint Modeling** – Harmonizes high-level semantic structure with low-level acoustic texture.
- **Compact & Diffusion-Friendly** – A compact 128-dimensional bottleneck (8x compression over standard SSL features) that makes diffusion modeling easier.
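As a back-of-envelope check of the compression claim above, here is a hypothetical footprint comparison; the 128-dim / 50 Hz figures come from this README, while the 1024-dim SSL baseline is an assumption (typical of models such as WavLM-large):

```python
# Hypothetical footprint comparison between a standard SSL feature stream
# and WavCube features at the same 50 Hz frame rate.
seconds = 10
frame_rate = 50                      # frames per second
ssl_dim, wavcube_dim = 1024, 128     # 1024 is an assumed SSL baseline dim

ssl_floats = seconds * frame_rate * ssl_dim
wavcube_floats = seconds * frame_rate * wavcube_dim
ratio = ssl_dim // wavcube_dim       # 8x channel compression
print(ssl_floats, wavcube_floats, ratio)
```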
<!-- By infusing fine-grained acoustic details into a distilled SSL semantic manifold, -->
## 🛠️ Installation
We recommend creating a fresh conda environment for installation.
### Env Setup
```bash
conda create -n WavCube python=3.10 -y
conda activate WavCube
```
### Basic Requirements
```bash
git clone https://github.com/yanghaha0908/WavCube.git
cd WavCube
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
conda install -c conda-forge sox ffmpeg libsndfile
pip install -e ".[train]"
```
## 🚀 Quick Start
### Checkpoint Download
Pre-trained model checkpoints are available for download at the links below:
| Representation | Dimension | Sample Rate | Frame Rate |
|----------------|-----------|-------------|------------|
| 🤗 [WavCube](https://huggingface.co/yhaha/WavCube/tree/main/WavCube) | 128 | 16 kHz | 50 Hz |
| 🤗 [WavCube-Pro](https://huggingface.co/yhaha/WavCube/tree/main/WavCube-Pro) | 128 | 16 kHz | 50 Hz |
### Extract Representation from Speech
You can extract continuous representations from a raw waveform with the following command:
```bash
python wav_to_feature.py \
--audio 19_198_000000_000002.wav \
--config configs/WavCube-stage2.yaml \
--ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt \
--output 19_198_000000_000002.pt
```
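Once extraction succeeds, the number of frames in the saved tensor should track the audio length at 50 frames per second. A minimal sketch of that relationship, where the 16 kHz / 50 Hz figures come from the checkpoint table above and the floor rounding is an assumption about the model's internals:

```python
# Map raw audio samples to the expected WavCube frame count.
sample_rate = 16_000     # audio samples per second (from the table above)
frame_rate = 50          # WavCube frames per second (from the table above)
samples_per_frame = sample_rate // frame_rate   # 320 samples per frame

num_samples = 52_480     # e.g. a ~3.28 s clip; illustrative value
num_frames = num_samples // samples_per_frame   # floor rounding is assumed
print(num_frames)        # each frame is a 128-dim vector
```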
### Reconstruct Speech from Representation
You can reconstruct the waveform from extracted representations with the following command:
```bash
python feature_to_wav.py \
--feature 19_198_000000_000002.pt \
--config configs/WavCube-stage2.yaml \
--ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt
```
<!-- ## π‘ Tips
- For devices that do not support BF16, you can manually disable PyTorch's mixed precision manager.
- If you encounter any issues or have questions, please feel free to open an issue. -->
## 🔧 Training
WavCube employs a **two-stage training** pipeline; all scripts are located in `scripts/train/`.
```bash
# ----------------- WavCube -----------------
bash scripts/train/train_WavCube_stage1.sh
bash scripts/train/train_WavCube_stage2.sh
# --------------- WavCube-Pro ---------------
bash scripts/train/train_WavCube_pro_stage1.sh
bash scripts/train/train_WavCube_pro_stage2.sh
# Note: Update `stage1_ckpt_path` in config to your Stage 1 checkpoint before running.
```
## 🤗 Additional Resources
### Evaluation Checkpoints
To make it easier to reproduce our results, we have uploaded supplementary resources to our 🤗 [WavCube](https://huggingface.co/yhaha/WavCube/tree/main/ckpts) repository. These include the `wavlm-large` weights and the evaluation checkpoints needed to compute metrics such as WER, Speaker Similarity, and UTMOS.
```bash
# For offline testing or if you experience network issues, you can manually copy the checkpoints to your local cache:
cp -r ckpts/hub ~/.cache/torch/
cp ckpts/utmos22_strong_step7459_v1.pt ~/.cache/torch/hub/checkpoints/
cp -r ckpts/s3prl ~/.cache
```
### Data Preparation
**Small-scale data** – uses `VocosDataModule`. Prepare filelists of audio paths for training and validation:
```bash
find $TRAIN_DATASET_DIR -name "*.wav" > filelist.train
find $VAL_DATASET_DIR -name "*.wav" > filelist.val
```
Each line is a plain audio path, for example:
```
/data/LibriSpeech/test-clean/672/122797/672-122797-0026.flac
/data/LibriSpeech/test-clean/672/122797/672-122797-0071.flac
/data/LibriSpeech/test-clean/672/122797/672-122797-0037.flac
```
**Large-scale data** – uses `VocosEmiliaDataModule`. Two files are required:
1. **Filelist** – same format as above for LibriSpeech; for LibriHeavy, each line is a JSON entry, for example:
```json
{"id": "medium/968/.../voyagesdolittle_55_lofting_64kb_38", "start": 22.32, "duration": 19.36, "channel": 0, "recording": {"sources": [{"source": "download/librilight/medium/968/.../voyagesdolittle_55_lofting_64kb.flac"}], "sampling_rate": 16000}, "type": "MonoCut"}
```
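Manifest entries in this format can be consumed with plain line-by-line `json` parsing. A minimal sketch, where the field names follow the example above but the concrete values are made up for illustration:

```python
import json

# One LibriHeavy-style manifest line (illustrative values, format as above).
line = ('{"id": "cut_0", "start": 22.32, "duration": 19.36, "channel": 0, '
        '"recording": {"sources": [{"source": "audio/cut_0.flac"}], '
        '"sampling_rate": 16000}, "type": "MonoCut"}')

cut = json.loads(line)
src = cut["recording"]["sources"][0]["source"]   # path of the backing recording
end = cut["start"] + cut["duration"]             # segment end time in seconds
print(src, round(end, 2))
```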
2. **Index file** (`.idx`) – a byte-offset index for fast random access, generated via:
```bash
python data/generate_idx.py
```
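For reference, here is what a byte-offset index enables, sketched on an in-memory file; the actual on-disk `.idx` format produced by `data/generate_idx.py` may differ:

```python
import io

# Stand-in for a manifest file; each line is one JSON entry.
manifest = io.BytesIO(b'{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n')

# Build the index: the byte offset at which each line starts.
offsets = []
while True:
    pos = manifest.tell()
    if not manifest.readline():
        break
    offsets.append(pos)

# Random access: seek straight to entry i without scanning earlier lines.
manifest.seek(offsets[2])
entry = manifest.readline()
print(entry)   # b'{"id": "c"}\n'
```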
Example data manifest files for both formats are provided in the `data/` directory for reference.
## ❤️ Acknowledgements
We sincerely thank the authors of the following open-source projects, whose excellent work laid the foundation for WavCube: [Semantic-VAE](https://github.com/ZhikangNiu/Semantic-VAE), [F5-TTS](https://github.com/swivid/f5-tts), [Vocos](https://github.com/gemelo-ai/vocos), [MiMo-Audio-Tokenizer](https://github.com/XiaomiMiMo/MiMo-Audio-Tokenizer), [s3prl](https://github.com/s3prl/s3prl).
## 📑 Citation
If you find this repo helpful, please cite our work:
```bibtex
@misc{wavcube2025,
      title={WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling},
      author={[Author List]},
      year={2025},
      eprint={2605.06407},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.06407},
}
```
## 📄 License
The code in this repository is released under the MIT license, see [LICENSE](LICENSE) for details.