From Babble to Words
The models, tokenizers, and datasets used in "From Babble to Words", one of the winning BabyLM Challenge 2024 submissions, which explores training language models on phoneme-based input.
- Paper: arXiv 2410.22906
Datasets:
- phonemetransformers/IPA-BabyLM (12.5M rows)
- phonemetransformers/IPA-BabyLM-evaluation

Tokenizers:
- phonemetransformers/babble-tokenizers
Models (all GPT-2 with 85M non-embedding parameters; total parameter count in parentheses):
- phonemetransformers/GPT2-85M-BPE-PHON (97.5M), trained with the BPE-PHON tokenizer
- phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS (97.5M), trained with the BPE-PHON-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS (85.3M), trained with the CHAR-TXT-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-PHON (85.3M), trained with the CHAR-PHON tokenizer
- phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS (85.3M), trained with the CHAR-PHON-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-CHAR-TXT (85.3M), trained with the CHAR-TXT tokenizer
- phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS (97.5M), trained with the BPE-TXT-SPACELESS tokenizer
- phonemetransformers/GPT2-85M-BPE-TXT (97.5M), trained with the BPE-TXT tokenizer