Strange behaviour of the tokenizer

#58
by andercorral - opened

Inspecting the tokenizer config in 'tokenizer.json', I noticed that the normalizer replaces blank spaces with the special character '▁'. The pre-tokenizer is applied afterwards, but it can never split on blank spaces, because the string coming out of the normalizer is a single run of words joined by '▁' with no spaces left in it.

For example:

Original text: "Hello World!"
Normalized text: "Hello▁World!"
Pre-tokenized text: "Hello▁World!" (no split at all)

"normalizer": {
  "type": "Replace",
  "pattern": {
    "String": " "
  },
  "content": "▁"
},
"pre_tokenizer": {
  "type": "Split",
  "pattern": {
    "String": " "
  },
  "behavior": "MergedWithPrevious",
  "invert": false
}
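To make the issue concrete, here is a minimal sketch in plain Python (no dependencies) of what this config appears to do: the Replace normalizer runs first and turns every space into '▁', and only then does the Split pre-tokenizer look for literal spaces. The helper names `normalize` and `pre_tokenize` are my own illustration, not the library's API.

```python
def normalize(text: str) -> str:
    # "Replace" normalizer from the config: every " " becomes "▁"
    return text.replace(" ", "▁")

def pre_tokenize(text: str) -> list:
    # "Split" pre-tokenizer with pattern " ": split on literal spaces.
    # (MergedWithPrevious would glue each delimiter onto the preceding
    # piece, but that never comes into play here, since no space
    # survives normalization.)
    return text.split(" ")

normalized = normalize("Hello World!")
pieces = pre_tokenize(normalized)

print(normalized)  # Hello▁World!
print(pieces)      # ['Hello▁World!']  -> one piece, no split happens
```

So, as far as I can tell, the pre-tokenizer's pattern can never match the normalizer's output.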

Is this behaviour expected? If so, what is the pre-tokenizer for, then?
