# PronounceAI G2P: British English Grapheme-to-Phoneme

A fine-tuned ByT5-small model that converts English word spellings into British English (Received Pronunciation) IPA phoneme sequences. Built on CharsiuG2P with unlikelihood contrastive training to actively suppress American English pronunciation patterns.
## Intended Use
- TTS engines requiring British English phoneme input (Kokoro, Piper, etc.)
- Language learning applications
- Accessibility tools and screen readers
- Dictionary and reference applications
- Any system needing British English IPA transcriptions
Not intended for: sentence-level processing, homograph disambiguation, or non-English languages.
## Quick Start

### Direct Model Inference
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the fine-tuned model; the tokenizer is the stock byte-level ByT5 tokenizer
model = T5ForConditionalGeneration.from_pretrained("inqVine/pronounceai-g2p")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# Single-word input with the "pronounce:" task prefix used in training
word = "hello"
inputs = tokenizer(f"pronounce: {word}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=8)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(phonemes)  # h e l əʊ
```
### Hosted API

A hosted public API is not yet available for direct use.
## Training

### Data
| Source | Words | Description |
|---|---|---|
| Britfone | 15,206 | British English pronunciation lexicon (Received Pronunciation) |
| Wiktionary IPA | 53,407 | IPA pronunciations from English Wiktionary |
| Combined | 67,092 | Merged with Britfone priority on conflicts |
| Contrastive pairs | ~300 | Curated British/American pronunciation differences, upsampled 10x |
Split: 60,382 train / 6,710 validation
### Method

- Fine-tuning: Standard seq2seq training on British English word-IPA pairs
- Unlikelihood contrastive loss: Additionally penalises the model when it assigns high probability to American English pronunciations:

```
L = L_british + alpha * max(0, margin - L_american)
```

where:

- alpha = 0.5 (contrastive loss weight)
- margin = 2.0 (loss threshold before the penalty fires)
This targets known British/American differences: BATH vowel (ɑː vs. æ), non-rhoticity, -ile endings (aɪl vs. əl), and word-specific pronunciations (lieutenant, herb, schedule, etc.).
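The combined objective above can be sketched in a few lines. This is an illustrative reconstruction from the formula, not the actual training code: it assumes per-example negative log-likelihoods for the British (target) and American (contrastive) pronunciations have already been computed by the seq2seq model, and the function name is ours.

```python
ALPHA = 0.5   # contrastive loss weight
MARGIN = 2.0  # penalty fires only while L_american is below this threshold

def contrastive_g2p_loss(l_british: float, l_american: float,
                         alpha: float = ALPHA, margin: float = MARGIN) -> float:
    """L = L_british + alpha * max(0, margin - L_american).

    Minimising the hinge term pushes the loss on the American pronunciation
    *up* (its probability down) until it clears the margin; beyond that the
    penalty vanishes and training reduces to plain fine-tuning.
    """
    return l_british + alpha * max(0.0, margin - l_american)
```

Because the hinge saturates at zero, unambiguous words (where the American loss is already above the margin) contribute only the standard fine-tuning loss.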
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | charsiu/g2p_multilingual_byT5_small_100 |
| Epochs | 5 |
| Batch size | 16 x 2 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| Beam search | 8 beams (inference default) |
| Hardware | NVIDIA RTX 3090 |
| Training time | ~35 minutes |
## Evaluation

### Phoneme Error Rate (PER)
| Test Set | Words | Exact Match | Good (<10% PER) | Poor (>30% PER) | Mean PER |
|---|---|---|---|---|---|
| Full validation | 7,815 | 74.41% | 76.29% | 5.72% | 5.87% |
| External (unseen) | 914 | 74.62% | 75.71% | 6.13% | 5.98% |
The 0.11 percentage point gap in mean PER between in-distribution and unseen words indicates strong generalisation.
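For reference, PER is conventionally computed as the Levenshtein edit distance between predicted and reference phoneme sequences, divided by the reference length. A minimal sketch (the function name is ours; the model card does not specify its exact implementation):

```python
def phoneme_error_rate(predicted: list, reference: list) -> float:
    """PER = edit_distance(predicted, reference) / len(reference)."""
    m, n = len(predicted), len(reference)
    # prev[j] holds the edit distance between predicted[:i-1] and reference[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / n if n else float(m > 0)

# One substituted vowel in a four-phoneme word gives 25% PER:
# phoneme_error_rate("h ɛ l əʊ".split(), "h e l əʊ".split()) -> 0.25
```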
### Performance by Word Length
| Category | Words | Mean PER |
|---|---|---|
| Short (1-4 chars) | 464 | 9.26% |
| Medium (5-7 chars) | 2,753 | 6.65% |
| Long (8-10 chars) | 3,221 | 5.12% |
| Very long (11+ chars) | 1,377 | 4.94% |
### Inference Latency
| Environment | Throughput | Latency (p50) | Latency (p95) |
|---|---|---|---|
| GPU (RTX 3090) | 3.06 req/s | 290 ms | 431 ms |
| CPU (HF Spaces free) | 0.76 req/s | 1,203 ms | 2,407 ms |
## Phoneme Inventory
Output uses standard British English (RP) IPA symbols with these normalisations applied during training:
| Normalisation | From | To | Reason |
|---|---|---|---|
| DRESS vowel | ɛ | e | British convention |
| R sound | ɹ | r | Simplified notation |
| STRUT vowel | ɐ | ʌ | British convention |
| IPA g | U+0067 (g) | U+0261 (ɡ) | Correct IPA codepoint |
Stress markers are stripped from output. Phonemes are space-separated in the output string.
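The table's normalisations amount to simple per-phoneme substitutions on the space-separated output. A minimal sketch, with the mapping taken from the table (the helper name is illustrative, not part of the model's API):

```python
# Mapping from the normalisation table above
NORMALISATIONS = {
    "ɛ": "e",            # DRESS vowel, British convention
    "ɹ": "r",            # simplified r notation
    "ɐ": "ʌ",            # STRUT vowel, British convention
    "\u0067": "\u0261",  # ASCII g (U+0067) -> IPA ɡ (U+0261)
}

def normalise_ipa(phonemes: str) -> str:
    """Apply the training-time normalisations to a space-separated IPA string."""
    return " ".join(NORMALISATIONS.get(p, p) for p in phonemes.split())
```

Applying the same normalisations to any reference lexicon before comparison avoids spurious mismatches against the model's output.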
## Limitations
- Short words (1-4 characters) have higher error rates (9.26% PER vs. 5.87% overall)
- Proper nouns and loanwords are higher risk as they often follow non-English spelling conventions
- British place names with irregular pronunciation (e.g., Cholmondeston, Alresford) may be unreliable
- Homographs (e.g., "read" present vs. past) cannot be disambiguated without sentence context
- Word-level only — does not process phrases or sentences
- British English (RP) only — not optimised for Scottish, Irish, or regional British accents
## Model Architecture
- Architecture: T5ForConditionalGeneration (encoder-decoder)
- Tokenizer: ByT5 (byte-level, no vocabulary limit)
- Parameters: ~300M
- d_model: 1,472
- Encoder layers: 12
- Decoder layers: 4
- Attention heads: 6
- Model size: ~1.14 GB (safetensors)
## License
Apache 2.0