Fix: Use SentencePiece directly instead of AlbertTokenizer, which strips some important Khmer characters

#1

The current tokenizer strips combining characters and subscript markers.

Original

from transformers import AlbertTokenizer
import sentencepiece as spm

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

# Load the underlying SentencePiece model so we can decode ids back to text
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

inputs = tokenizer(text, return_tensors="pt")
print(f"Original text: {text}")
print(f"Tokenized: {inputs}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(inputs['input_ids'].squeeze().tolist())}")
Original text: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
Tokenized: {'input_ids': tensor([[   2,    5,  784,   77,  752,  520,  440,  242,    4,    5,   77, 3951,
          471,   64,   79,   86,  752,  242,    6,    3]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Inverse mapping: ['[CLS]', '▁', 'αž—', 'αž“', 'αž–', 'αž‰', 'αž‚', 'αž‡', '[MASK]', '▁', 'αž“', 'αž”αžš', 'αž‘', 'ស', 'αž€', 'ម', 'αž–', 'αž‡', 'αŸ”', '[SEP]']
Sentence piece decoding: αž—αž“αž–αž‰αž‚αž‡[MASK] αž“αž”αžšαž‘αžŸαž€αž˜αž–αž‡αŸ”

^ In the above we can see that Khmer characters are missing: the vowel signs and subscript markers have been stripped from the tokens.
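At least part of the loss appears to come from AlbertTokenizer's default keep_accents=False, which NFKD-normalizes the input and drops Unicode combining marks before tokenizing. Khmer's coeng (subscript) sign U+17D2 is a combining mark, so consonant clusters are destroyed. A minimal stdlib sketch of that stripping step (the exact preprocessing inside AlbertTokenizer may differ slightly; this only illustrates the mechanism):

import unicodedata

text = "αž—αŸ’αž“αŸ†"  # "Phnom": αž— + coeng (U+17D2) + αž“ + nikahit

# The same accent-stripping style applied when keep_accents=False
normalized = unicodedata.normalize("NFKD", text)
stripped = "".join(c for c in normalized if not unicodedata.combining(c))

print(stripped)  # the coeng sign is gone, so αž“ is no longer a subscript

If this is the root cause, passing keep_accents=True to AlbertTokenizer.from_pretrained may also help, though using SentencePiece directly sidesteps the preprocessing entirely.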


Fixed

However, if we use SentencePiece directly, we retain the correct components:

import torch
import sentencepiece as spm
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"

pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)

# Add the [CLS] (id 2) and [SEP] (id 3) special tokens manually
input_ids = [2] + ids + [3]

input_ids = torch.LongTensor(input_ids).unsqueeze(0)
attention_mask = torch.ones_like(input_ids)

print(f"Original text: {text}")
print(f"Tokenized: {input_ids} {attention_mask}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(input_ids.squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(input_ids.squeeze().tolist())}")
Original text: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
Tokenized: tensor([[    2,   791,   855,     4, 10456,     6,     3]]) tensor([[1, 1, 1, 1, 1, 1, 1]])
Inverse mapping: ['[CLS]', 'β–αž—αŸ’αž“αŸ†αž–αŸαž‰', 'αž‚αžΊαž‡αžΆ', '[MASK]', 'αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ', 'αŸ”', '[SEP]']
Sentence piece decoding: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

^ Retains the full Khmer text and correctly separates it into words.
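The manual special-token wrapping above can be factored into a small helper. A sketch, assuming the [CLS]=2 / [SEP]=3 ids used in the code above; the example ids are the ones printed in the fixed output:

CLS_ID, SEP_ID = 2, 3  # special-token ids used in the snippet above

def build_inputs(piece_ids):
    """Wrap raw SentencePiece ids with [CLS]/[SEP] and build a matching attention mask."""
    input_ids = [CLS_ID] + list(piece_ids) + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask

# Ids produced for the example sentence in the fixed output above
ids, mask = build_inputs([791, 855, 4, 10456, 6])
print(ids)   # [2, 791, 855, 4, 10456, 6, 3]
print(mask)  # [1, 1, 1, 1, 1, 1, 1]

The resulting lists can be turned into tensors exactly as in the snippet above.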

The model output is similar with both. From a few tests, the results using SentencePiece with the full Khmer characters are slightly better.

Original Outputs

1. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαŸˆαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž€αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
4. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž“αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
5. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž·αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

Fixed Outputs

1. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‘αžΉαž€αžŠαžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž”αŸαŸ‡αžŠαžΌαž„αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
4. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžŠαŸ’αž‹αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
5. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‚αŸ„αž›αžŠαŸ…αž‘αŸαžŸαž…αžšαžŽαŸαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
seanghay changed pull request status to merged

Thank you!
