Qwen3-0.6B Polish Text Normalization

A fine-tuned Qwen3-0.6B model for Polish text normalization — converting written text with numbers, dates, abbreviations, and units into fully spoken form for TTS (Text-to-Speech) systems.

This model replaces Folx/byt5-small-pl-text-normalization, roughly doubling exact-match accuracy (43.4% → 85.7%) at comparable latency when served with vLLM.

Performance Comparison

Model	Architecture	Params	Exact Match	Char Accuracy	Latency (vLLM)	Digits in Output
This model (Qwen3-0.6B)	Causal LM + LoRA	600M	85.7%	92.6%	~46ms	0
Folx/byt5-small-pl-text-normalization	Encoder-Decoder	300M	43.4%	~70%	~64ms	>0

Key improvements:

+42pp exact match (43.4% → 85.7%)
Zero digit leaks with constrained decoding (bans digit tokens from output)
Fuzzy match 87.7% (case-insensitive, punctuation-normalized)
~270 samples/sec throughput via vLLM with 32 concurrent workers

What It Does

Converts written Polish text into spoken form:

Input	Output
`Spotkanie odbędzie się 3 maja o godzinie 14:30.`	`Spotkanie odbędzie się trzeciego maja o godzinie czternastej trzydzieści.`
`Cena wynosi 1234,56 zł.`	`Cena wynosi tysiąc dwieście trzydzieści cztery złote pięćdziesiąt sześć groszy.`
`Na ul. Marszałkowskiej 123/45 mieści się sklep.`	`Na ulicy Marszałkowskiej sto dwadzieścia trzy łamane przez czterdzieści pięć mieści się sklep.`
`Temperatura wynosi -15°C, a ciśnienie 1013 hPa.`	`Temperatura wynosi minus piętnaście stopni Celsjusza, a ciśnienie tysiąc trzynaście hektopaskali.`
`Wzrost PKB wyniósł 3,5% r/r.`	`Wzrost PKB wyniósł trzy i pół procent rok do roku.`
`Samolot LO 3842 wylądował o 18:45.`	`Samolot LO trzy osiem cztery dwa wylądował o osiemnastej czterdzieści pięć.`
`Dnia 15 września 1631 roku odbyła się ceremonia.`	`Dnia piętnastego września tysiąc sześćset trzydziestego pierwszego roku odbyła się ceremonia.`
`Bank Millennium drugi raz z rzędu z tytułem Złoty Bank.`	`Bank Millennium drugi raz z rzędu z tytułem Złoty Bank.`

Handles: numbers, dates, times, addresses, currencies, percentages, units (km/h, °C, hPa, m²), abbreviations (ul., al., prof., dr, nr, m.in., tzw., itp.), flight/train numbers, fractions, r/r (rok do roku), legal references (art., §, ust., pkt), and passthrough of clean text.

Requirements

transformers>=5.0.0
torch>=2.1.0

Important: This model requires transformers >= 5.0.0. Older versions (e.g. 4.57) tokenize the Qwen3 chat template differently, producing incorrect normalization output even at temperature=0.
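
A minimal sketch of a startup guard for the version requirement above. The helper names `parse_version`, `meets_minimum`, and `check_transformers` are hypothetical, and the parser compares leading numeric components only, so it is not a full PEP 440 implementation (a pre-release such as `5.0.0rc1` is treated as `5.0.0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '4.57.1' -> (4, 57, 1)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """True if `installed` is at least `minimum` (numeric components only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '5.0' compares equal to '5.0.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers(minimum: str = "5.0.0") -> None:
    """Raise early instead of silently producing wrong normalizations."""
    try:
        installed = version("transformers")
    except PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if not meets_minimum(installed, minimum):
        raise RuntimeError(
            f"transformers {installed} found, but >= {minimum} is required"
        )
```

Calling `check_transformers()` once before loading the model is enough; the pure `meets_minimum` helper can be unit-tested without any package installed.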

Usage

With Transformers (simple)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Folx/qwen3-0.6b-pl-text-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def normalize(text: str) -> str:
    messages = [
        {"role": "system", "content": "Zamieniasz tekst na formę mówioną."},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=False,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()

# Example
print(normalize("Spotkanie o 14:30 przy ul. Głównej 12."))
# → Spotkanie o czternastej trzydzieści przy ulicy Głównej dwanaście.

With Constrained Decoding (zero digit leaks)

For production TTS, ban digit tokens from output to guarantee no numbers slip through:

def get_banned_token_ids(tokenizer):
    """Get token IDs for digits and symbols that should always be spoken as words."""
    banned = set()
    for tid in range(tokenizer.vocab_size):
        decoded = tokenizer.decode([tid]).strip()
        if decoded and decoded.isdigit():
            banned.add(tid)
    # Symbols that TTS must speak as words
    for sym in ["$", "%", "€", "£", "¥", "°", "§", "@", "#", "&", "×", "÷", "±",
                "©", "®", "™", "→", "←", "↑", "↓", "√", "∞", "≈", "≠", "≤", "≥"]:
        ids = tokenizer.encode(sym, add_special_tokens=False)
        if len(ids) == 1:
            banned.add(ids[0])
    return sorted(banned)

banned_ids = get_banned_token_ids(tokenizer)

def normalize_safe(text: str) -> str:
    messages = [
        {"role": "system", "content": "Zamieniasz tekst na formę mówioną."},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=False,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id,
            suppress_tokens=banned_ids,  # Ban digits and symbols
        )
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()
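
The token ban above only covers tokens whose decoded text is purely digits, plus symbols that encode to a single token, so a string-level check on the final output can serve as a backstop. The sketch below is an illustration, not part of the released code; `find_leaks` and `is_tts_safe` are hypothetical names, and the symbol set is copied from the list in `get_banned_token_ids`.

```python
# Symbols copied from the banned list in get_banned_token_ids above.
BANNED_SYMBOLS = set("$%€£¥°§@#&×÷±©®™→←↑↓√∞≈≠≤≥")


def find_leaks(text: str) -> list[str]:
    """Return every digit or banned symbol still present in normalized text."""
    # str.isdigit() is also True for characters such as the superscript in 'm²'.
    return [ch for ch in text if ch.isdigit() or ch in BANNED_SYMBOLS]


def is_tts_safe(text: str) -> bool:
    """True if the text contains no digits and no banned symbols."""
    return not find_leaks(text)
```

Counting outputs for which `is_tts_safe` is false over an evaluation set would give a figure comparable to the "Digits in Output" column, assuming that column counts outputs containing at least one digit.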

With vLLM (production, high throughput)

python -m vllm.entrypoints.openai.api_server \
    --model Folx/qwen3-0.6b-pl-text-normalization \
    --port 8094 --max-model-len 512

import asyncio

import aiohttp

async def normalize(text: str) -> str:
    # A new session per call keeps the example short; reuse one session in long-running services.
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:8094/v1/chat/completions", json={
            "model": "Folx/qwen3-0.6b-pl-text-normalization",
            "messages": [
                {"role": "system", "content": "Zamieniasz tekst na formę mówioną."},
                {"role": "user", "content": text}
            ],
            "max_tokens": 400, "temperature": 0,
        }) as resp:
            resp.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
            data = await resp.json()
            return data["choices"][0]["message"]["content"].strip()

print(asyncio.run(normalize("Spotkanie o 14:30 przy ul. Głównej 12.")))

# ~270 samples/sec with 32 concurrent workers
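
The card does not show how the 32 concurrent workers were driven. One plausible way, sketched here under that assumption, is to bound in-flight requests with an `asyncio.Semaphore`. `normalize_many` is a hypothetical helper; it accepts any coroutine function with the same shape as `normalize` above.

```python
import asyncio
from typing import Awaitable, Callable


async def normalize_many(
    texts: list[str],
    normalize_fn: Callable[[str], Awaitable[str]],
    concurrency: int = 32,
) -> list[str]:
    """Run normalize_fn over texts with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(text: str) -> str:
        async with semaphore:
            return await normalize_fn(text)

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(worker(t) for t in texts))
```

Usage would look like `results = asyncio.run(normalize_many(texts, normalize))`. Actual throughput depends on the server, hardware, and input length, so the 270 samples/sec figure should not be expected from this sketch on other setups.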

Model Details

Property	Value
Base Model	Qwen/Qwen3-0.6B
Architecture	Qwen3ForCausalLM (Causal LM)
Parameters	600M
Training Method	LoRA (rank 128, alpha 256) merged
Precision	bfloat16
Model Size	~1.1 GB (safetensors)
Max Sequence Length	512 tokens
Language	Polish
License	MIT
System Prompt	`Zamieniasz tekst na formę mówioną.`

Training

Dataset: 64K curated Polish normalization pairs (private)
Sources: Human-curated examples, Speakleash Polish corpora (Wikipedia, news, financial), Faker-generated structured data (addresses, phone numbers, codes), targeted gap-filling for edge cases
Teacher model: Bielik-11B-v3.0-Instruct for synthetic data generation and validation
Quality pipeline: Reverse-normalization consistency checking — each pair verified by converting output back to written form and comparing with input (>75% similarity threshold)
Training config: 5 epochs, lr=1e-4, cosine schedule, batch 4 × grad_accum 4, eval every 250 steps with best-by-exact-match checkpoint selection
Hardware: Single NVIDIA RTX 5090 32GB
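
The reverse-normalization check in the quality pipeline can be sketched as follows. The card does not say which similarity measure was used; this example assumes `difflib.SequenceMatcher.ratio()` on lowercased text. `denormalize` is a hypothetical callable standing in for whatever converts spoken form back to written form (for instance a call to the teacher model).

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def keep_pair(
    written: str,
    spoken: str,
    denormalize: Callable[[str], str],
    threshold: float = 0.75,
) -> bool:
    """Keep a (written, spoken) pair only if the round trip stays close to the input."""
    reconstructed = denormalize(spoken)
    return similarity(written, reconstructed) > threshold
```

A pair is dropped when converting the spoken form back yields text that differs too much from the original written input, which filters out hallucinated or truncated normalizations.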

Evaluation

Evaluated on 1,709 held-out samples from the original human-curated dataset:

Metric	Score
Exact Match	85.7%
Fuzzy Match (case+punct normalized)	87.7%
Character Accuracy	92.6%
Digits in Output	0
Throughput (vLLM, 32 workers)	270 samples/sec
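
A minimal sketch of how the three text metrics could be computed. Exact match and fuzzy match follow the descriptions in this card (case-insensitive, punctuation-normalized). The card does not define character accuracy, so `char_accuracy` below assumes a `difflib.SequenceMatcher` ratio and may not reproduce the reported 92.6%. All function names are hypothetical.

```python
import string
from difflib import SequenceMatcher

# ASCII punctuation plus typographic quotes, dashes, and ellipsis.
_PUNCT = string.punctuation + "\u201e\u201d\u201c\u2013\u2014\u2026"
_STRIP_PUNCT = str.maketrans("", "", _PUNCT)


def exact_match(pred: str, ref: str) -> bool:
    return pred.strip() == ref.strip()


def fuzzy_key(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_STRIP_PUNCT).split())


def fuzzy_match(pred: str, ref: str) -> bool:
    return fuzzy_key(pred) == fuzzy_key(ref)


def char_accuracy(pred: str, ref: str) -> float:
    """Assumed definition: similarity ratio between prediction and reference."""
    return SequenceMatcher(None, pred, ref).ratio()


def score(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average each metric over (prediction, reference) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "fuzzy_match": sum(fuzzy_match(p, r) for p, r in pairs) / n,
        "char_accuracy": sum(char_accuracy(p, r) for p, r in pairs) / n,
    }
```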

Remaining "errors" analysis: ~60% of mismatches are stylistic differences where both model output and reference are valid Polish (case differences, punctuation, grammatical case variants like "pięć procent" vs "pięciu procent"). True error rate is estimated at ~8-9%.

Limitations

Designed specifically for Polish language
Optimized for TTS preprocessing — the model normalizes text for pronunciation, not for grammatical correctness
Address format 1/34 (building/apartment numbers) is sometimes read as a single whole number rather than "jeden łamane przez trzydzieści cztery"
Very rare abbreviations or domain-specific terminology may not be expanded
Performance degrades for inputs longer than ~300 characters
Requires GPU for reasonable latency (~46ms on RTX 5090 via vLLM)
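
Given the degradation on inputs longer than roughly 300 characters, one possible workaround is to split long text at sentence boundaries and normalize each chunk separately. This is a sketch under assumptions, not a recommendation from the model authors: `split_for_normalization` is a hypothetical helper, and its regex splitter is naive, so it can also split after abbreviations such as `ul.` or `prof.` when a chunk boundary happens to fall there.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_for_normalization(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The normalized chunks can then be joined with a single space. Note that this discards the original whitespace between sentences, including line breaks.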

Model Summary

Qwen3-0.6B Polish Text Normalization is an AI model for Polish text normalization developed by Folx, an AI boutique specializing in automatic speech recognition (ASR), text-to-speech (TTS), and voice solutions.

The model is based on the Qwen3-0.6B architecture and was fine-tuned with LoRA on 64 thousand normalization pairs. It converts written text (with numbers, dates, abbreviations, and units) into spoken form, ready for speech synthesis.

It replaces our previous model, byt5-small-pl-text-normalization, with twice the accuracy (85.7% vs 43.4% exact match) and a comparable response time (~46ms via vLLM).

Key Features

85.7% exact match accuracy on the test set (87.7% with case and punctuation normalization)
Zero digit leaks in the output: constrained decoding blocks digit tokens
~46ms latency via vLLM, 270 requests/second with 32 parallel workers
Handles: numbers, dates, times, addresses, currencies, percentages, units (km/h, °C, hPa, m²), abbreviations (ul., al., prof., dr, nr, m.in., tzw., itp.), flight/train numbers, fractions, r/r, legal references (art., §, ust., pkt)

Applications

Speech synthesis (TTS): natural pronunciation of Polish dates, numbers, and abbreviations
Voice assistants: processing commands in Polish
Audiobooks and podcasts: automatic conversion of text to audio format
IVR systems / call centers: professional voice messages
Accessibility: screen readers that pronounce Polish text correctly
Media and broadcasting: automating the preparation of audio scripts

Usage Notes

System prompt: Zamieniasz tekst na formę mówioną.

The model takes the text to normalize as the user message and returns the normalized text as the assistant response. It uses the Qwen3 chat template format.

For usage details and code examples, see the Usage section above.

Contact: krzysztof@folx.it

Folx — deep learning, transformer models, Polish NLP, speech recognition, text-to-speech, voice AI
