PronounceAI G2P - British English Grapheme-to-Phoneme

A fine-tuned ByT5-small model that converts English word spellings into British English (Received Pronunciation) IPA phoneme sequences.

Built on CharsiuG2P with unlikelihood contrastive training to actively suppress American English pronunciation patterns.

Intended Use

  • TTS engines requiring British English phoneme input (Kokoro, Piper, etc.)
  • Language learning applications
  • Accessibility tools and screen readers
  • Dictionary and reference applications
  • Any system needing British English IPA transcriptions

Not intended for: sentence-level processing, homograph disambiguation, or non-English languages.

Quick Start

Direct Model Inference

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("inqVine/pronounceai-g2p")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

word = "hello"
inputs = tokenizer(f"pronounce: {word}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=8)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(phonemes)  # h e l əʊ

Hosted API

A hosted public API is not yet available for direct use.

Training

Data

Source             Words    Description
Britfone           15,206   British English pronunciation lexicon (Received Pronunciation)
Wiktionary IPA     53,407   IPA pronunciations from English Wiktionary
Combined           67,092   Merged dataset; Britfone takes priority on conflicts
Contrastive pairs  ~300     Curated British/American pronunciation differences, upsampled 10x

Split: 60,382 train / 6,710 validation

Method

  1. Fine-tuning: Standard seq2seq training on British English word-IPA pairs
  2. Unlikelihood contrastive loss: additionally penalises the model when it assigns high probability to American English pronunciations:

     L = L_british + alpha * max(0, margin - L_american)

  • alpha = 0.5 (contrastive loss weight)
  • margin = 2.0 (loss threshold before the penalty fires)

This targets known British/American differences: BATH vowel (ɑː vs. æ), non-rhoticity, -ile endings (aɪl vs. əl), and word-specific pronunciations (lieutenant, herb, schedule, etc.).
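The combined objective above can be sketched in plain Python. This is a minimal illustration of the formula only; in the actual training loop, L_british and L_american would be sequence-level cross-entropy losses computed by the model, which is not shown here.

```python
def combined_loss(loss_british, loss_american, alpha=0.5, margin=2.0):
    """Combined objective: standard British seq2seq loss plus a
    hinge-style penalty that fires only when the American-pronunciation
    loss drops below the margin, i.e. when the model finds the
    American form too likely."""
    penalty = max(0.0, margin - loss_american)
    return loss_british + alpha * penalty
```

For example, an American loss of 3.0 (well above the margin) adds no penalty, while an American loss of 0.8 adds 0.5 * (2.0 - 0.8) = 0.6 to the British loss.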

Hyperparameters

Parameter      Value
Base model     charsiu/g2p_multilingual_byT5_small_100
Epochs         5
Batch size     16 x 2 gradient accumulation = 32 effective
Learning rate  1e-5
Beam search    8 beams (inference default)
Hardware       NVIDIA RTX 3090
Training time  ~35 minutes
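The table above roughly corresponds to a transformers Seq2SeqTrainingArguments configuration like the following. This is a hedged sketch, not the published training script; the output_dir name is a placeholder, and only the arguments implied by the table are set.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the fine-tuning configuration implied by the table above.
args = Seq2SeqTrainingArguments(
    output_dir="pronounceai-g2p",      # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # 16 x 2 = 32 effective batch size
    learning_rate=1e-5,
)
```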

Evaluation

Phoneme Error Rate (PER)

Test set           Words   Exact match   Good (<10% PER)   Poor (>30% PER)   Mean PER
Full validation    7,815   74.41%        76.29%            5.72%             5.87%
External (unseen)  914     74.62%        75.71%            6.13%             5.98%

The 0.11 percentage point gap in mean PER between in-distribution and unseen words suggests strong generalisation.
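PER of the kind reported above can be computed as Levenshtein edit distance over phoneme tokens, normalised by reference length. A minimal sketch, assuming space-separated phoneme strings as produced by the model (the exact evaluation script is not part of this card):

```python
def phoneme_error_rate(ref, hyp):
    """Edit distance between phoneme sequences / reference length.
    Inputs are space-separated phoneme strings, e.g. "h e l əʊ"."""
    r, h = ref.split(), hyp.split()
    # Single-row dynamic-programming edit distance over phoneme tokens.
    d = list(range(len(h) + 1))
    for i, rp in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hp in enumerate(h, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,              # deletion
                d[j - 1] + 1,          # insertion
                prev + (rp != hp),     # substitution (0 if match)
            )
    return d[-1] / len(r)
```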

Performance by Word Length

Category               Words   Mean PER
Short (1-4 chars)      464     9.26%
Medium (5-7 chars)     2,753   6.65%
Long (8-10 chars)      3,221   5.12%
Very long (11+ chars)  1,377   4.94%

Inference Latency

Environment           Throughput   Latency (p50)   Latency (p95)
GPU (RTX 3090)        3.06 req/s   290 ms          431 ms
CPU (HF Spaces free)  0.76 req/s   1,203 ms        2,407 ms
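The p50/p95 figures above are percentiles over per-request latencies. A sketch of how such numbers might be derived from timing samples, using the nearest-rank definition (the actual benchmark harness is not part of this card):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least
    p% of all samples at or below it."""
    s = sorted(samples_ms)
    k = math.ceil(p / 100 * len(s))
    return s[max(0, k - 1)]
```

For example, with per-request latencies collected into a list, percentile(latencies, 50) and percentile(latencies, 95) give the p50 and p95 values.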

Phoneme Inventory

Output uses standard British English (RP) IPA symbols with these normalisations applied during training:

Normalisation  From         To           Reason
DRESS vowel    ɛ            e            British convention
R sound        ɹ            r            Simplified notation
STRUT vowel    ɐ            ʌ            British convention
IPA g          U+0067 (g)   U+0261 (ɡ)   Correct IPA codepoint

Stress markers are stripped from output. Phonemes are space-separated in the output string.
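The normalisations in the table, plus stress stripping, amount to a small character-level mapping. A minimal sketch, assuming standard IPA primary/secondary stress marks (ˈ, ˌ) and space-separated phoneme strings:

```python
# Character-level table matching the normalisation rules above.
_NORMALISE = str.maketrans({
    "ɛ": "e",    # DRESS vowel -> British convention
    "ɹ": "r",    # simplified r notation
    "ɐ": "ʌ",    # STRUT vowel -> British convention
    "g": "ɡ",    # ASCII g (U+0067) -> IPA g (U+0261)
    "ˈ": None,   # strip primary stress
    "ˌ": None,   # strip secondary stress
})

def normalise_phonemes(seq: str) -> str:
    """Apply the symbol normalisations and strip stress markers."""
    return seq.translate(_NORMALISE)
```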

Limitations

  • Short words (1-4 characters) have higher error rates (9.26% PER vs. 5.87% overall)
  • Proper nouns and loanwords carry higher error risk, as they often follow non-English spelling conventions
  • British place names with irregular pronunciation (e.g., Cholmondeston, Alresford) may be unreliable
  • Homographs (e.g., "read" present vs. past) cannot be disambiguated without sentence context
  • Word-level only — does not process phrases or sentences
  • British English (RP) only — not optimised for Scottish, Irish, or regional British accents

Model Architecture

  • Architecture: T5ForConditionalGeneration (encoder-decoder)
  • Tokenizer: ByT5 (byte-level, no vocabulary limit)
  • Parameters: ~300M
  • d_model: 1,472
  • Encoder layers: 12
  • Decoder layers: 4
  • Attention heads: 6
  • Model size: ~1.14 GB (safetensors)

License

Apache 2.0
