---
license: cc-by-nc-sa-4.0
language:
- bg
- cs
- da
- el
- es
- et
- fi
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
---

# HuBERT VP-20

HuBERT VP-20 is a HuBERT base model pretrained for the [DiscoPhon benchmark](https://benchmarks.cognitive-ml.fr/discophon)
on a 6k-hour subset of VoxPopuli covering 20 languages (all EU languages except English, French, and German).
It was pretrained using the [`minimal_hubert`](https://github.com/mxmpl/minimal_hubert) library.

You can load it with Hugging Face Transformers:

```python
from transformers import HubertModel

model = HubertModel.from_pretrained("coml/hubert-base-vp20")
```
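Once loaded, the model takes raw 16 kHz waveforms and returns frame-level features (roughly 50 frames per second, 768 dimensions for the base architecture). The sketch below uses a randomly initialized base-sized `HubertModel` in place of the downloaded checkpoint, purely so it runs without fetching weights; with the real model, substitute the `from_pretrained` call shown above.

```python
import torch
from transformers import HubertConfig, HubertModel

# Randomly initialized base-sized model, used here only to illustrate the API;
# in practice: model = HubertModel.from_pretrained("coml/hubert-base-vp20")
model = HubertModel(HubertConfig())
model.eval()

# One second of 16 kHz audio (HuBERT expects raw waveforms sampled at 16 kHz).
waveform = torch.randn(1, 16000)

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

features = outputs.last_hidden_state  # (batch, frames, 768)
layer10 = outputs.hidden_states[10]   # activations of the 10th transformer layer
```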

Or with `minimal_hubert`:
```python
from minimal_hubert import HuBERT, HuBERTPretrain

# Standard model
model = HuBERT.from_pretrained("coml/hubert-base-vp20")
# With pretraining head for classification
model_for_pretraining = HuBERTPretrain.from_pretrained("https://huggingface.co/coml/hubert-base-vp20/resolve/main/it2.pt")
```

Check out [`minimal_hubert`](https://github.com/mxmpl/minimal_hubert) if you are interested in pretraining or want
to load HuBERT checkpoints from different libraries.

## Files

- `model.safetensors` and `config.json`: HuggingFace Transformers checkpoint and config.
- `it1.pt`: 1st iteration checkpoint.
- `it2.pt`: 2nd iteration checkpoint. Converted to HuggingFace state_dict to get `model.safetensors`.
- `km100-mfcc.joblib`: K-means trained on MFCCs of VoxPopuli-20. Used to train the 1st iteration.
- `km500-it1-l10.joblib`: K-means trained on features from the 10th layer of the 1st iteration model. Used to train the 2nd iteration.
- `km256-it2-l11.joblib`: K-means trained on features from the 11th layer of the 2nd iteration model. Used for DiscoPhon finetuning.

## Citing

```bibtex
@misc{poli2026discophon,
  title={{DiscoPhon}: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units},
  author={Maxime Poli and Manel Khentout and Angelo Ortiz Tandazo and Ewan Dunbar and Emmanuel Chemla and Emmanuel Dupoux},
  year={2026},
  eprint={2603.18612},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.18612},
}
```