Fix: Use SentencePiece directly instead of AlbertTokenizer, which strips some important Khmer characters

#1

The current tokenizer strips combining characters and subscript markers.

Original

from transformers import AlbertTokenizer
import sentencepiece as spm

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

# Load the underlying SentencePiece model so we can decode ids back to text
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

inputs = tokenizer(text, return_tensors="pt")
print(f"Original text: {text}")
print(f"Tokenized: {inputs}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(inputs['input_ids'].squeeze().tolist())}")
Original text: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
Tokenized: {'input_ids': tensor([[   2,    5,  784,   77,  752,  520,  440,  242,    4,    5,   77, 3951,
          471,   64,   79,   86,  752,  242,    6,    3]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Inverse mapping: ['[CLS]', '▁', 'αž—', 'αž“', 'αž–', 'αž‰', 'αž‚', 'αž‡', '[MASK]', '▁', 'αž“', 'αž”αžš', 'αž‘', 'ស', 'αž€', 'ម', 'αž–', 'αž‡', 'αŸ”', '[SEP]']
Sentence piece decoding: αž—αž“αž–αž‰αž‚αž‡[MASK] αž“αž”αžšαž‘αžŸαž€αž˜αž–αž‡αŸ”

^ In the above we can see that Khmer characters are missing: the vowel signs and subscript markers have been stripped from the tokens.
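At least part of the loss appears to come from AlbertTokenizer's default keep_accents=False, which NFKD-normalizes the input and drops Unicode combining marks before tokenizing. Khmer's coeng (subscript) sign U+17D2 is a combining mark, so consonant clusters are destroyed. A minimal stdlib sketch of that stripping step (the exact preprocessing inside AlbertTokenizer may differ slightly; this only illustrates the mechanism):

import unicodedata

text = "αž—αŸ’αž“αŸ†"  # "Phnom": αž— + coeng (U+17D2) + αž“ + nikahit

# The same accent-stripping style applied when keep_accents=False
normalized = unicodedata.normalize("NFKD", text)
stripped = "".join(c for c in normalized if not unicodedata.combining(c))

print(stripped)  # the coeng sign is gone, so αž“ is no longer a subscript

If this is the root cause, passing keep_accents=True to AlbertTokenizer.from_pretrained may also help, though using SentencePiece directly sidesteps the preprocessing entirely.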


Fixed

However, if we use SentencePiece directly, we retain the correct components:

import torch
import sentencepiece as spm
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"

pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)

# Add the [CLS] (id 2) and [SEP] (id 3) special tokens manually
input_ids = [2] + ids + [3]

input_ids = torch.LongTensor(input_ids).unsqueeze(0)
attention_mask = torch.ones_like(input_ids)

print(f"Original text: {text}")
print(f"Tokenized: {input_ids} {attention_mask}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(input_ids.squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(input_ids.squeeze().tolist())}")
Original text: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
Tokenized: tensor([[    2,   791,   855,     4, 10456,     6,     3]]) tensor([[1, 1, 1, 1, 1, 1, 1]])
Inverse mapping: ['[CLS]', 'β–αž—αŸ’αž“αŸ†αž–αŸαž‰', 'αž‚αžΊαž‡αžΆ', '[MASK]', 'αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆ', 'αŸ”', '[SEP]']
Sentence piece decoding: αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

^ Retains the full Khmer text and correctly separates it into words.
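The manual special-token wrapping above can be factored into a small helper. A sketch, assuming the [CLS]=2 / [SEP]=3 ids used in the code above; the example ids are the ones printed in the fixed output:

CLS_ID, SEP_ID = 2, 3  # special-token ids used in the snippet above

def build_inputs(piece_ids):
    """Wrap raw SentencePiece ids with [CLS]/[SEP] and build a matching attention mask."""
    input_ids = [CLS_ID] + list(piece_ids) + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask

# Ids produced for the example sentence in the fixed output above
ids, mask = build_inputs([791, 855, 4, 10456, 6])
print(ids)   # [2, 791, 855, 4, 10456, 6, 3]
print(mask)  # [1, 1, 1, 1, 1, 1, 1]

The resulting lists can be turned into tensors exactly as in the snippet above.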

The model output is similar with both. From a few tests, the results using SentencePiece with the full Khmer characters are slightly better.

Original Outputs

1. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαŸˆαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž€αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
4. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž“αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
5. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž·αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

Fixed Outputs

1. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‘αžΉαž€αžŠαžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž”αŸαŸ‡αžŠαžΌαž„αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
4. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžŠαŸ’αž‹αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
5. αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‚αŸ„αž›αžŠαŸ…αž‘αŸαžŸαž…αžšαžŽαŸαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
seanghay changed pull request status to merged

Thank you!
