# WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

<p align="center">
  <img src="doc/wavcube_logo.png" alt="WavCube Logo" width="400"/>
</p>

[![github](https://img.shields.io/badge/Code-Repo-black?logo=github)](https://github.com/yanghaha0908/WavCube)
[![arXiv](https://img.shields.io/badge/%F0%9F%93%84%20ArXiv-Paper-red.svg)](https://arxiv.org/abs/2605.06407)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20WavCube-Models-blueviolet)](https://huggingface.co/yhaha/WavCube)


WavCube is a 128-dimensional, 50 Hz continuous speech representation that unifies speech understanding,
reconstruction, and generation within a single space.
This is the official code for the paper [WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling](https://arxiv.org/pdf/2605.06407) [[abs](https://arxiv.org/abs/2605.06407)].

## ✨ Key Features
- **Unified Speech Representation** – A single continuous latent space that simultaneously supports speech understanding, reconstruction, and generation.
- **Semantic-Acoustic Joint Modeling** – Harmonizes high-level semantic structures with low-level acoustic textures.
- **Compact & Diffusion-Friendly** – Features a compact 128-dimensional bottleneck (8x compression relative to standard SSL features), making diffusion modeling easier.



## πŸ› οΈ Installation

We recommend creating a fresh conda environment for installation. 
### Env Setup
```bash
conda create -n WavCube python=3.10 -y
conda activate WavCube
```

### Basic Requirements
```bash
git clone https://github.com/yanghaha0908/WavCube.git
cd WavCube
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
conda install -c conda-forge sox ffmpeg libsndfile
pip install -e ".[train]"
```

## 🚀 Quick Start

### Checkpoint Download
Pre-trained model checkpoints are available for download:

| Representation | Dimension | Sample Rate | Frame Rate |
|----------------|-----------|-------------|------------|
| 🤗 [WavCube](https://huggingface.co/yhaha/WavCube/tree/main/WavCube) | 128 | 16 kHz | 50 Hz |
| 🤗 [WavCube-pro](https://huggingface.co/yhaha/WavCube/tree/main/WavCube-Pro) | 128 | 16 kHz | 50 Hz |


### Extract Representation from Speech
You can extract continuous representations from a raw waveform with the following command:

```bash
python wav_to_feature.py \
    --audio 19_198_000000_000002.wav \
    --config configs/WavCube-stage2.yaml \
    --ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt \
    --output 19_198_000000_000002.pt
```
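The extracted representation is frame-aligned with the input audio. As a sanity check, you can predict the feature shape from the clip length, assuming the documented 16 kHz input, 50 Hz frame rate, and 128 dimensions (the helper below is an illustrative sketch, not part of the repo):

```python
# Hypothetical helper: predict the shape of an extracted WavCube feature,
# assuming 16 kHz input, 50 Hz frame rate, and 128 dimensions as documented above.

def expected_feature_shape(num_samples: int, sample_rate: int = 16_000,
                           frame_rate: int = 50, dim: int = 128) -> tuple:
    """Return (num_frames, dim) for a waveform of `num_samples` samples."""
    num_frames = num_samples * frame_rate // sample_rate  # one frame per 320 samples
    return (num_frames, dim)

# A 2-second clip at 16 kHz (32,000 samples) maps to 100 frames of 128 dims.
print(expected_feature_shape(32_000))  # (100, 128)
```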

### Reconstruct Speech from Representation

You can reconstruct the waveform from representations with the following command:

```bash
python feature_to_wav.py \
    --feature 19_198_000000_000002.pt \
    --config configs/WavCube-stage2.yaml \
    --ckpt WavCube/checkpoints/vocos_checkpoint_epoch=177_step=195000_val_loss=3.3080.ckpt
```

<!-- ## 💡 Tips
- For devices that do not support BF16, you can manually disable PyTorch's mixed precision manager.
- If you encounter any issues or have questions, please feel free to open an issue. -->

## 🔧 Training

WavCube employs a **two-stage training** pipeline; all scripts are located in `scripts/train/`.

```bash
# ----------------- WavCube -----------------
bash scripts/train/train_WavCube_stage1.sh
bash scripts/train/train_WavCube_stage2.sh

# --------------- WavCube-Pro ---------------
bash scripts/train/train_WavCube_pro_stage1.sh
bash scripts/train/train_WavCube_pro_stage2.sh
# Note: Update `stage1_ckpt_path` in config to your Stage 1 checkpoint before running.
```

## 🤝 Additional Resources

### Evaluation Checkpoints

To make it easier to reproduce our results, we have uploaded supplementary resources to our 🤗 [WavCube](https://huggingface.co/yhaha/WavCube/tree/main/ckpts) repository. These include the `wavlm-large` weights and the evaluation checkpoints needed to compute metrics such as WER, Speaker Similarity, and UTMOS.

```bash
# For offline testing or if you experience network issues, you can manually copy the checkpoints to your local cache:
cp -r ckpts/hub ~/.cache/torch/
cp ckpts/utmos22_strong_step7459_v1.pt ~/.cache/torch/hub/checkpoints/ 
cp -r ckpts/s3prl ~/.cache
```

### Data Preparation

**Small-scale data** – uses `VocosDataModule`. Prepare a filelist of audio paths for training and validation:

```bash
find $TRAIN_DATASET_DIR -name "*.wav" > filelist.train
find $VAL_DATASET_DIR -name "*.wav" > filelist.val
```

Each line is a plain audio path, for example:
```
/data/LibriSpeech/test-clean/672/122797/672-122797-0026.flac
/data/LibriSpeech/test-clean/672/122797/672-122797-0071.flac
/data/LibriSpeech/test-clean/672/122797/672-122797-0037.flac
```
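If your dataset mixes extensions or you want deterministic ordering, the `find` commands above can also be done in Python. The function below is an illustrative sketch (the extensions and file names are assumptions, not a repo API):

```python
# Sketch: build a filelist (one absolute audio path per line) programmatically.
# Extensions and naming here are illustrative assumptions.
from pathlib import Path

def write_filelist(dataset_dir: str, out_path: str,
                   exts: tuple = (".wav", ".flac")) -> int:
    """Write one absolute audio path per line; return the number of entries."""
    paths = sorted(p for p in Path(dataset_dir).rglob("*") if p.suffix in exts)
    Path(out_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return len(paths)
```

For example, `write_filelist("/data/LibriSpeech/train-clean-100", "filelist.train")` would mirror the `find` command while also picking up `.flac` files.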

**Large-scale data** – uses `VocosEmiliaDataModule`. Two files are required:

1. **Filelist** – same format as above for LibriSpeech; for LibriHeavy, each line is a JSON entry, for example:
```json
{"id": "medium/968/.../voyagesdolittle_55_lofting_64kb_38", "start": 22.32, "duration": 19.36, "channel": 0, "recording": {"sources": [{"source": "download/librilight/medium/968/.../voyagesdolittle_55_lofting_64kb.flac"}], "sampling_rate": 16000}, "type": "MonoCut"}
```

2. **Index file** (`.idx`) – a byte-offset index for fast random access, generated via:
```bash
python data/generate_idx.py
```
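The idea behind the index is to record where each manifest line starts, so a data loader can `seek()` directly to a random entry instead of scanning the whole file. The actual `.idx` format produced by `data/generate_idx.py` may differ; the sketch below only illustrates the concept:

```python
# Illustrative sketch of a byte-offset index for a JSON-lines manifest.
# The real data/generate_idx.py may use a different on-disk format.
import json

def build_index(manifest_path: str) -> list:
    """Return the byte offset of the start of each line in the manifest."""
    offsets = []
    with open(manifest_path, "rb") as f:
        pos = f.tell()
        line = f.readline()
        while line:
            offsets.append(pos)
            pos = f.tell()
            line = f.readline()
    return offsets

def read_entry(manifest_path: str, offsets: list, i: int) -> dict:
    """Random-access the i-th JSON entry via its byte offset."""
    with open(manifest_path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline())
```

With such an index, a `Dataset.__getitem__` only needs one `seek()` and one `readline()` per sample, which keeps random access cheap even for multi-gigabyte manifests.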

Example data manifest files for both formats are provided in the `data/` directory for reference.


## ❤️ Acknowledgements

We sincerely thank the authors of the following open-source projects, whose excellent work laid the foundation for WavCube: [Semantic-VAE](https://github.com/ZhikangNiu/Semantic-VAE), [F5-TTS](https://github.com/swivid/f5-tts), [Vocos](https://github.com/gemelo-ai/vocos), [MiMo-Audio-Tokenizer](https://github.com/XiaomiMiMo/MiMo-Audio-Tokenizer), [s3prl](https://github.com/s3prl/s3prl).



## πŸ“ Citation

If you find this repo helpful, please cite our work:

```bibtex
@misc{[CITATION_KEY],
      title={[Paper Title Placeholder]},
      author={[Author List]},
      year={2025},
      eprint={[ARXIV_ID]},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/[ARXIV_ID]},
}
```

## 📄 License

The code in this repository is released under the MIT license; see [LICENSE](LICENSE) for details.