From Babble to Words
The models, tokenizers, and datasets used in "From Babble to Words", one of the winning BabyLM Challenge 2024 submissions, which explores training language models on phoneme-based input.
- Paper: arXiv 2410.22906
Datasets:
- phonemetransformers/IPA-BabyLM (12.5M rows)
- phonemetransformers/IPA-BabyLM-evaluation

Tokenizers:
- phonemetransformers/babble-tokenizers
Models (all GPT-2 with 85M non-embedding parameters; total parameter count in parentheses):
- phonemetransformers/GPT2-85M-BPE-PHON (97.5M), trained with the BPE-PHON tokenizer
- phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS (97.5M), trained with the BPE-PHON-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS (85.3M), trained with the CHAR-TXT-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-PHON (85.3M), trained with the CHAR-PHON tokenizer
- phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS (85.3M), trained with the CHAR-PHON-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-TXT (85.3M), trained with the CHAR-TXT tokenizer
- phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS (97.5M), trained with the BPE-TXT-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-BPE-TXT (97.5M), trained with the BPE-TXT tokenizer