Fix: Use SentencePiece directly instead of AlbertTokenizer, which strips some important Khmer characters
#1
by djsamseng - opened
The current tokenizer is stripping combining characters and subscript markers
Original
from transformers import AlbertTokenizer
import sentencepiece as spm

text = "αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα"
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
# Load the underlying SentencePiece model so its decoding can be compared
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)

inputs = tokenizer(text, return_tensors="pt")
print(f"Original text: {text}")
print(f"Tokenized: {inputs}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(inputs['input_ids'].squeeze().tolist())}")
Original text: αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα
Tokenized: {'input_ids': tensor([[ 2, 5, 784, 77, 752, 520, 440, 242, 4, 5, 77, 3951,
471, 64, 79, 86, 752, 242, 6, 3]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Inverse mapping: ['[CLS]', 'β', 'α', 'α', 'α', 'α', 'α', 'α', '[MASK]', 'β', 'α', 'αα', 'α', 'α', 'α', 'α', 'α', 'α', 'α', '[SEP]']
Sentence piece decoding: αααααα[MASK] αααααααααα
^ In the above we can see missing Khmer characters
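A likely cause (an assumption based on AlbertTokenizer's preprocessing, not verified against this model): by default the tokenizer runs with `keep_accents=False`, which NFKD-normalizes the input and then drops combining characters. The Khmer coeng (subscript) sign U+17D2 is a combining character, so subscript consonants are destroyed before SentencePiece ever sees the text. A minimal sketch of that stripping step on a real Khmer word:

```python
import unicodedata

# Replicates the accent-stripping step AlbertTokenizer applies when
# keep_accents=False: NFKD-normalize, then drop combining characters.
def strip_combining(text):
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))

# "knong" ("in"): KA + COENG (U+17D2) + NO + VOWEL SIGN U + NGO
word = "\u1780\u17d2\u1793\u17bb\u1784"
stripped = strip_combining(word)
# The coeng (combining class 9) is removed, so the subscript is lost.
print(len(word), len(stripped))  # 5 4
```

If that is indeed the cause, passing `keep_accents=True` to `AlbertTokenizer.from_pretrained` might also avoid the stripping, though only the direct SentencePiece route was tested here.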
Fixed
However, if we use SentencePiece directly, we retain the correct components
import torch
import sentencepiece as spm
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
sp = spm.SentencePieceProcessor()
sp.load(tokenizer.vocab_file)
text = "αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα"
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
# 2 and 3 are the [CLS] and [SEP] ids in this vocabulary
input_ids = [2] + ids + [3]
input_ids = torch.LongTensor(input_ids).unsqueeze(0)
attention_mask = torch.ones_like(input_ids)
print(f"Original text: {text}")
print(f"Tokenized: {input_ids} {attention_mask}")
print(f"Inverse mapping: {tokenizer.convert_ids_to_tokens(input_ids.squeeze())}")
print(f"Sentence piece decoding: {sp.decode_ids(input_ids.squeeze().tolist())}")
Original text: αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα
Tokenized: tensor([[ 2, 791, 855, 4, 10456, 6, 3]]) tensor([[1, 1, 1, 1, 1, 1, 1]])
Inverse mapping: ['[CLS]', 'βααααααα', 'ααΊααΆ', '[MASK]', 'ααααααααααααα»ααΆ', 'α', '[SEP]']
Sentence piece decoding: αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα
^ Retains the full Khmer. Correctly separates by words.
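One reason the direct SentencePiece path keeps everything intact: SentencePiece's own normalization (typically NFKC) preserves Khmer combining marks, and detokenization is essentially concatenating the pieces and turning the "▁" word-boundary marker back into spaces, so every original character survives the round trip. A rough sketch with plain strings (not this model's vocabulary):

```python
# SentencePiece marks word starts with "▁" (U+2581). Decoding pieces is
# concatenation plus replacing the marker with a space; nothing is
# normalized away, so all characters round-trip.
pieces = ["\u2581hello", "\u2581wor", "ld"]
decoded = "".join(pieces).replace("\u2581", " ").strip()
print(decoded)  # hello world
```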
The model output is similar with both. From a few tests, the results using SentencePiece with the full Khmer characters are slightly better.
Original Outputs
1. αααααααααΊααΆαααααααααααααα»ααΆα
2. αααααααααΊααΆαααααααααααααα»ααΆα
3. αααααααααΊααΆαααααααααααααα»ααΆα
4. αααααααααΊααΆαααααααααααααα»ααΆα
5. αααααααααΊααΆα·ααααααααααααα»ααΆα
Fixed Outputs
1. αααααααααΊααΆααΆαααΆααΈααααααααααααα»ααΆα
2. αααααααααΊααΆααΉαααΈααααααααααααα»ααΆα
3. αααααααααΊααΆαααααΌαααααααααααααα»ααΆα
4. αααααααααΊααΆααααααΆααΈααααααααααααα»ααΆα
5. αααααααααΊααΆαααααααααααααααααααααα»ααΆα
seanghay changed pull request status to merged
Thank you!