Strange behaviour of the tokenizer
#58
by andercorral - opened
Inspecting the tokenizer config in 'tokenizer.json', I noticed that the normalizer replaces blank spaces with the special character '▁'. The pre-tokenizer is applied afterwards, but it can never split on blank spaces, because the string produced by the normalizer is a single run of words joined by '▁' with no spaces left.
For example:
Original text: "Hello World!"
Normalized text: "Hello▁World!"
Pretokenized text: "Hello▁World!" (no split at all)
"normalizer": {
  "type": "Replace",
  "pattern": {
    "String": " "
  },
  "content": "▁"
},
"pre_tokenizer": {
  "type": "Split",
  "pattern": {
    "String": " "
  },
  "behavior": "MergedWithPrevious",
  "invert": false
}
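The behaviour is easy to reproduce in isolation. This is a minimal sketch, assuming the `tokenizers` library is installed; it builds the same Replace normalizer and Split pre-tokenizer as in the config above and runs them in the order the pipeline does:

```python
# Sketch reproducing the observation: the normalizer runs first and
# removes all spaces, so the space-based pre-tokenizer finds nothing to split.
from tokenizers import normalizers, pre_tokenizers

# Normalizer from the config: replace every space with '▁'
normalizer = normalizers.Replace(" ", "▁")

# Pre-tokenizer from the config: split on a literal space,
# merging the delimiter with the previous piece
pre_tokenizer = pre_tokenizers.Split(" ", behavior="merged_with_previous")

text = "Hello World!"
normalized = normalizer.normalize_str(text)
print(normalized)  # Hello▁World!

# No spaces remain after normalization, so the result is a single piece
pieces = pre_tokenizer.pre_tokenize_str(normalized)
print(pieces)
```

Running the pre-tokenizer on the raw text instead would split it, which is what makes the ordering here look odd.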
Is this behaviour expected? If so, what purpose does the pre-tokenizer serve here?