Malaysian-Normalizer-Qwen3-8B

Fine-tuned from Qwen/Qwen3-8B on mesolitica/Malaysian-Normalizer.

Prompt

given the text
text: {text}

normalize to {language} language

text is the text you want to normalize.
language is the language you want to normalize to. You can omit the line "normalize to {language} language"; the model then normalizes based on the language of the text.
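The template above can be assembled with a small helper. This is a minimal sketch: build_prompt is a hypothetical name, not something shipped with the model, and it assumes the exact line layout shown above.

```python
from typing import Optional


def build_prompt(text: str, language: Optional[str] = None) -> str:
    """Build the user message in the format described above.

    Hypothetical helper for illustration; not part of the model repository.
    """
    prompt = f"given the text\ntext: {text}"
    if language:
        # Add the instruction line only when a target language is requested.
        prompt += f"\n\nnormalize to {language} language"
    return prompt


# With a target language.
print(build_prompt("Oleochemical exports dropped 2.72 per cent", "english"))
# Without one, the model is left to follow the language of the text.
print(build_prompt("Oleochemical exports dropped 2.72 per cent"))
```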

Output

The model returns a JSON object:

{"normalized_text": "All suspects, aged twenty five to thirty seven, were remanded for seven days beginning today to facilitate investigations under Sections twelve open parenthesis two close parenthesis, thirty nine A open parenthesis one close parenthesis and thirty nine A open parenthesis two close parenthesis of the Dangerous Drugs Act one thousand nine hundred fifty two. dash Bernama", "normalizer_mapping": {"25": "twenty five", "37": "thirty seven", "12(2)": "twelve open parenthesis two close parenthesis", "39A(1)": "thirty nine A open parenthesis one close parenthesis", "39A(2)": "thirty nine A open parenthesis two close parenthesis", "1952": "one thousand nine hundred fifty two", "\u2014": "dash"}}
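Since the reply is a JSON string, it can be parsed with the standard library. The sketch below uses a shortened version of the example above as input. Checking that each replacement appears in normalized_text is only an illustrative sanity check, not something this model card specifies.

```python
import json

# Shortened from the example output above, for illustration only.
raw = (
    '{"normalized_text": "the Dangerous Drugs Act one thousand nine hundred '
    'fifty two. dash Bernama", "normalizer_mapping": '
    '{"1952": "one thousand nine hundred fifty two", "\\u2014": "dash"}}'
)

result = json.loads(raw)
normalized = result["normalized_text"]
mapping = result["normalizer_mapping"]

for original, spoken in mapping.items():
    # Report whether each spoken form occurs in the normalized text.
    print(f"{original!r} -> {spoken!r} (found: {spoken in normalized})")
```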

Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'malaysia-ai/Malaysian-Normalizer-Qwen3-8B',
    torch_dtype='auto'
).cuda()
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/Malaysian-Normalizer-Qwen3-8B')

user = """
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.

normalize to english language
"""
message = [
    {'role': 'user', 'content': user.strip()}
]
prompt = tokenizer.apply_chat_template(message, add_generation_prompt = True, tokenize = False)
generate_kwargs = dict(
    **tokenizer(prompt, return_tensors = 'pt').to('cuda'),
    max_new_tokens=1024,
    top_p=0.9,
    top_k=50,
    temperature=0.9,
    do_sample=True,
    repetition_penalty=1.0,
)
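generation_output = model.generate(**generate_kwargs)
# Decode without skipping special tokens so the chat markers stay visible.
print(tokenizer.decode(generation_output[0]))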

Output:

<|im_start|>user
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.

normalize to english language<|im_end|>
<|im_start|>assistant
<think>

</think>

{"normalized_text": "open quote Oleochemical exports dropped two point seven two per cent m dash o dash m to two hundred ten thousand nine hundred twenty four tonnes from two hundred sixteen thousand eight hundred sixteen tonnes while biodiesel exports fell forty eight point eight nine per cent m dash o dash m to twenty three thousand six hundred eighty nine tonnes from forty six thousand three hundred forty five tonnes, close quote it said.", "normalizer_mapping": {"\u201c": "open quote", "2.72": "two point seven two", "m-o-m": "m dash o dash m", "210,924": "two hundred ten thousand nine hundred twenty four", "216,816": "two hundred sixteen thousand eight hundred sixteen", "48.89": "forty eight point eight nine", "23,689": "twenty three thousand six hundred eighty nine", "46,345": "forty six thousand three hundred forty five", "\u201d": "close quote"}}<|im_end|>
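The decoded string wraps the JSON in chat template tokens and an empty think block. One way to recover the payload, assuming the output always follows the layout shown above, is to take what follows the closing think tag and drop the trailing end-of-turn token. extract_result is a hypothetical helper name.

```python
import json


def extract_result(decoded: str) -> dict:
    """Pull the JSON payload out of a decoded generation (hypothetical helper)."""
    # Assumes the reply contains a closing think tag, as in the output above;
    # everything after the last one is treated as the assistant's answer.
    reply = decoded.rsplit("</think>", 1)[-1]
    # Remove the end-of-turn token and surrounding whitespace.
    reply = reply.replace("<|im_end|>", "").strip()
    return json.loads(reply)


# Shortened stand-in for the decoded output shown above.
decoded = (
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    '{"normalized_text": "two point seven two", '
    '"normalizer_mapping": {"2.72": "two point seven two"}}<|im_end|>'
)
print(extract_result(decoded)["normalizer_mapping"])
```

If the model emits malformed JSON, json.loads raises json.JSONDecodeError, so callers may want to catch that and retry.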

Revision

current stage, 7e4483ac0c66fef90556113d8b32665c80786b5f

This revision was trained on mesolitica/Malaysian-SFT/malaysian_normalizer and mesolitica/Malaysian-SFT/malaysian_normalizer_pseudolabel.
This revision was trained on a proper train set.

older stage, 7b502263c605355fbc93a1b76f6712461812f863

This revision was initially trained on mesolitica/Malaysian-SFT/malaysian_normalizer.
This revision was used to pseudolabel more data, released at mesolitica/Malaysian-Normalizer#pseudolabel.
This revision was trained on a leaked test set.
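To run against a particular stage, the commit hash can be passed through the revision argument of from_pretrained. A sketch, assuming the hashes above are commits in the malaysia-ai/Malaysian-Normalizer-Qwen3-8B repository; REVISIONS and load_revision are hypothetical names.

```python
# Commit hashes copied from the revision list above.
REVISIONS = {
    "current": "7e4483ac0c66fef90556113d8b32665c80786b5f",
    "older": "7b502263c605355fbc93a1b76f6712461812f863",
}


def load_revision(stage: str = "current"):
    """Load the model and tokenizer pinned to one stage (hypothetical helper)."""
    # Imported lazily so this snippet can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "malaysia-ai/Malaysian-Normalizer-Qwen3-8B"
    revision = REVISIONS[stage]
    model = AutoModelForCausalLM.from_pretrained(
        repo, revision=revision, torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
    return model, tokenizer
```

Calling load_revision("older") would fetch the earlier stage, which is mainly useful for comparison given the test set leak noted above.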

Acknowledgement

Special thanks to the Lambda Research Grant program for the Lambda cloud credit!


Model size: 8B params (Safetensors, tensor type BF16)
