# PronounceAI G2P: British English Grapheme-to-Phoneme

A fine-tuned ByT5-small model that converts English word spellings into British English (Received Pronunciation) IPA phoneme sequences. Built on CharsiuG2P with unlikelihood contrastive training to actively suppress American English pronunciation patterns.
## Intended Use
- TTS engines requiring British English phoneme input (Kokoro, Piper, etc.)
- Language learning applications
- Accessibility tools and screen readers
- Dictionary and reference applications
- Any system needing British English IPA transcriptions
Not intended for: sentence-level processing, homograph disambiguation, or non-English languages.
## Quick Start

### Direct Model Inference
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the fine-tuned model; the tokenizer is the stock byte-level ByT5 tokenizer
model = T5ForConditionalGeneration.from_pretrained("inqVine/pronounceai-g2p")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# Single-word input with the "pronounce:" task prefix used in training
word = "hello"
inputs = tokenizer(f"pronounce: {word}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=8)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(phonemes)  # h e l əʊ
```
### Hosted API

A hosted public API is not yet available for direct use.
## Training

### Data
| Source | Words | Description |
|---|---|---|
| Britfone | 15,206 | British English pronunciation lexicon (Received Pronunciation) |
| Wiktionary IPA | 53,407 | IPA pronunciations from English Wiktionary |
| Combined | 67,092 | Merged with Britfone priority on conflicts |
| Contrastive pairs | ~300 | Curated British/American pronunciation differences, upsampled 10x |
Split: 60,382 train / 6,710 validation
### Method

- Fine-tuning: Standard seq2seq training on British English word-IPA pairs
- Unlikelihood contrastive loss: Additionally penalises the model when it assigns high probability to American English pronunciations:

```
L = L_british + alpha * max(0, margin - L_american)
```

where:

- alpha = 0.5 (contrastive loss weight)
- margin = 2.0 (loss threshold before the penalty fires)
This targets known British/American differences: BATH vowel (ɑː vs. æ), non-rhoticity, -ile endings (aɪl vs. əl), and word-specific pronunciations (lieutenant, herb, schedule, etc.).
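The combined objective above can be sketched in a few lines. This is an illustrative reconstruction from the formula, not the actual training code: it assumes per-example negative log-likelihoods for the British (target) and American (contrastive) pronunciations have already been computed by the seq2seq model, and the function name is ours.

```python
ALPHA = 0.5   # contrastive loss weight
MARGIN = 2.0  # penalty fires only while L_american is below this threshold

def contrastive_g2p_loss(l_british: float, l_american: float,
                         alpha: float = ALPHA, margin: float = MARGIN) -> float:
    """L = L_british + alpha * max(0, margin - L_american).

    Minimising the hinge term pushes the loss on the American pronunciation
    *up* (its probability down) until it clears the margin; beyond that the
    penalty vanishes and training reduces to plain fine-tuning.
    """
    return l_british + alpha * max(0.0, margin - l_american)
```

Because the hinge saturates at zero, unambiguous words (where the American loss is already above the margin) contribute only the standard fine-tuning loss.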
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | charsiu/g2p_multilingual_byT5_small_100 |
| Epochs | 5 |
| Batch size | 16 x 2 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| Beam search | 8 beams (inference default) |
| Hardware | NVIDIA RTX 3090 |
| Training time | ~35 minutes |
## Evaluation

### Phoneme Error Rate (PER)
| Test Set | Words | Exact Match | Good (<10% PER) | Poor (>30% PER) | Mean PER |
|---|---|---|---|---|---|
| Full validation | 7,815 | 74.41% | 76.29% | 5.72% | 5.87% |
| External (unseen) | 914 | 74.62% | 75.71% | 6.13% | 5.98% |
The 0.11 percentage point gap in mean PER between in-distribution and unseen words indicates strong generalisation.
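For reference, PER is conventionally computed as the Levenshtein edit distance between predicted and reference phoneme sequences, divided by the reference length. A minimal sketch (the function name is ours; the model card does not specify its exact implementation):

```python
def phoneme_error_rate(predicted: list, reference: list) -> float:
    """PER = edit_distance(predicted, reference) / len(reference)."""
    m, n = len(predicted), len(reference)
    # prev[j] holds the edit distance between predicted[:i-1] and reference[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / n if n else float(m > 0)

# One substituted vowel in a four-phoneme word gives 25% PER:
# phoneme_error_rate("h ɛ l əʊ".split(), "h e l əʊ".split()) -> 0.25
```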
### Performance by Word Length
| Category | Words | Mean PER |
|---|---|---|
| Short (1-4 chars) | 464 | 9.26% |
| Medium (5-7 chars) | 2,753 | 6.65% |
| Long (8-10 chars) | 3,221 | 5.12% |
| Very long (11+ chars) | 1,377 | 4.94% |
### Inference Latency
| Environment | Throughput | Latency (p50) | Latency (p95) |
|---|---|---|---|
| GPU (RTX 3090) | 3.06 req/s | 290 ms | 431 ms |
| CPU (HF Spaces free) | 0.76 req/s | 1,203 ms | 2,407 ms |
## Phoneme Inventory
Output uses standard British English (RP) IPA symbols with these normalisations applied during training:
| Normalisation | From | To | Reason |
|---|---|---|---|
| DRESS vowel | ɛ | e | British convention |
| R sound | ɹ | r | Simplified notation |
| STRUT vowel | ɐ | ʌ | British convention |
| IPA g | U+0067 (g) | U+0261 (ɡ) | Correct IPA codepoint |
Stress markers are stripped from output. Phonemes are space-separated in the output string.
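The table's normalisations amount to simple per-phoneme substitutions on the space-separated output. A minimal sketch, with the mapping taken from the table (the helper name is illustrative, not part of the model's API):

```python
# Mapping from the normalisation table above
NORMALISATIONS = {
    "ɛ": "e",            # DRESS vowel, British convention
    "ɹ": "r",            # simplified r notation
    "ɐ": "ʌ",            # STRUT vowel, British convention
    "\u0067": "\u0261",  # ASCII g (U+0067) -> IPA ɡ (U+0261)
}

def normalise_ipa(phonemes: str) -> str:
    """Apply the training-time normalisations to a space-separated IPA string."""
    return " ".join(NORMALISATIONS.get(p, p) for p in phonemes.split())
```

Applying the same normalisations to any reference lexicon before comparison avoids spurious mismatches against the model's output.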
## Limitations
- Short words (1-4 characters) have higher error rates (9.26% PER vs. 5.87% overall)
- Proper nouns and loanwords are higher risk as they often follow non-English spelling conventions
- British place names with irregular pronunciation (e.g., Cholmondeston, Alresford) may be unreliable
- Homographs (e.g., "read" present vs. past) cannot be disambiguated without sentence context
- Word-level only — does not process phrases or sentences
- British English (RP) only — not optimised for Scottish, Irish, or regional British accents
## Model Architecture
- Architecture: T5ForConditionalGeneration (encoder-decoder)
- Tokenizer: ByT5 (byte-level, no vocabulary limit)
- Parameters: ~300M
- d_model: 1,472
- Encoder layers: 12
- Decoder layers: 4
- Attention heads: 6
- Model size: ~1.14 GB (safetensors)
## License
Apache 2.0