A newer version of this model is available: lunahr/CeluneNorm-0.6B-v1.3

Model Card for CeluneNorm-0.6B-v1.1

Model Details

Model Description

CeluneNorm is a lightweight text normalization model designed for text-to-speech (TTS) and general text-preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

  • Input: this is a badly formed sentence
  • Output: This is a badly formed sentence.

The model is conservative by design:

  • It does not rewrite sentences
  • It avoids changing meaning
  • It preserves domain-specific tokens (e.g. URLs, commands, names)

Usage

The model expects input in the following format:

YOUR INPUT<NORM>

It will generate the normalized version of the input.
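For pipelines that build the prompt manually rather than through a chat template, the format above can be produced with a small helper (a sketch; the helper name is illustrative, only the `<NORM>` marker comes from this card):

```python
def build_norm_prompt(text: str) -> str:
    # Append the <NORM> marker the model expects at the end of the input.
    return f"{text}<NORM>"

print(build_norm_prompt("this is a badly formed sentence"))
# → this is a badly formed sentence<NORM>
```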

Inference example:

from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,  # reuse the tokenizer loaded above
    device="cuda:0",      # use "cpu" for CPU-only inference (slower)
)

def normalize(text: str) -> str:
    history = [
        {"role": "user", "content": text}
    ]
    # Build the prompt with the model's chat template and append the
    # generation prompt so the model continues as the assistant.
    prompt = tokenizer.apply_chat_template(
        history,
        tokenize=False,
        add_generation_prompt=True,
    )

    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,        # greedy decoding; the model is deterministic
        return_full_text=False,
    )

    return out[0]["generated_text"].strip()

# example
print(normalize("if i type something more complicated into celune it will fix it"))

Key Characteristics

  • Deterministic (no sampling required)
  • Preserves structure and intent
  • Handles mixed text (natural language + technical content)
  • Conservative punctuation (prefers "." over "!" unless the exclamation is explicit in the input)
  • Supports multi-sentence normalization when boundaries are clear

  • Developed by: https://huggingface.co/lunahr
  • Model type: Causal Language Model
  • Language(s): English
  • License: MIT
  • Base model: Qwen/Qwen3-0.6B-Base

Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

  • May miss some punctuation or casing corrections
  • May be conservative with contractions (e.g. there s → unchanged)
  • May preserve ambiguous casing when intent is unclear
  • Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.


Training Details

Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

  • Formal text (Wikipedia-style)
  • Conversational text (PersonaChat)
  • Synthetic edge cases
  • Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.


Training Procedure

  • Fine-tuned from Qwen3-0.6B-Base
  • Hardware: Kaggle dual NVIDIA T4 (FP16)
  • Training time: ~1.5 hours
  • Epochs: 3

Training configuration highlights:

  • Learning rate: 8e-5
  • Gradient clipping: 1.0
  • Warmup: 200 steps (~10%)
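The warmup figure implies roughly 2,000 optimizer steps in total (200 ≈ 10%). A linear warmup-then-decay schedule consistent with these numbers can be sketched as follows; note the total-step count is inferred from the ~10% figure, not stated on this card:

```python
def lr_at_step(step: int, base_lr: float = 8e-5,
               warmup_steps: int = 200, total_steps: int = 2000) -> float:
    # Linear warmup from 0 to base_lr over the first warmup_steps,
    # then linear decay back to 0 by total_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(100))  # halfway through warmup → 4e-05
print(lr_at_step(200))  # peak learning rate → 8e-05
```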

Metrics

  • Final training loss: 0.08841
  • Mean token accuracy: 97.53%

These metrics reflect token-level accuracy; end-to-end normalization quality is somewhat lower but more representative (roughly 90–95% human-judged correctness).
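Mean token accuracy is the fraction of predicted tokens that match the reference at each position. A minimal illustration with toy token IDs (not the model's actual tokenizer output):

```python
def token_accuracy(predicted: list[int], reference: list[int]) -> float:
    # Fraction of positions where the predicted token equals the reference token.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Toy example: 39 of 40 tokens correct → 97.5%, near the reported 97.53%.
pred = list(range(40))
pred[7] = -1  # one wrong token
print(token_accuracy(pred, list(range(40))))  # → 0.975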

