fix: set `clean_up_tokenization_spaces` to `false`

#356
by maxsloef - opened

tokenizer_config.json has "clean_up_tokenization_spaces": true, which causes tokenizer.decode() to silently corrupt text. This affects every Llama 3.x model on the Hub and every fine-tune or downstream model that inherits their tokenizer config. Both Llama 2 and Llama 4 ship with false.

The fix is a one-line change: "clean_up_tokenization_spaces": true → "clean_up_tokenization_spaces": false.
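
As a diff against tokenizer_config.json, the change looks like this (all other keys unchanged):

```diff
-  "clean_up_tokenization_spaces": true,
+  "clean_up_tokenization_spaces": false,
```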

What clean_up_tokenization_spaces does

When true, tokenizer.decode() strips spaces before punctuation marks during decoding. Specifically, it applies these string replacements to the decoded text:

text.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
    .replace(" ,", ",").replace(" ' ", "'")
    .replace(" n't", "n't").replace(" 'm", "'m")
    .replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")

This was designed for BERT-era WordPiece tokenizers (2019) where decoding produced artifacts like "Hello , world .". Llama 3's BPE tokenizer encodes spaces as part of tokens and does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from the decoded text.
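
The corruption can be shown without loading any model. The sketch below reimplements the replacement chain above as a standalone function (mirroring the cleanup transformers applies on decode, reproduced here so it runs without the library):

```python
def clean_up_tokenization(out_string: str) -> str:
    # The same replacement chain transformers applies when
    # clean_up_tokenization_spaces is true.
    return (
        out_string.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
        .replace(" ,", ",").replace(" ' ", "'")
        .replace(" n't", "n't").replace(" 'm", "'m")
        .replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
    )

# Legitimate spaces are silently destroyed:
print(clean_up_tokenization("x != y and a.b == c"))        # x!= y and a.b == c
print(clean_up_tokenization("font-weight: bold !important"))  # font-weight: bold!important
```

The " !" rule alone breaks every `!=` preceded by a space, which is why code-heavy text is hit hardest.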

Minimal reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)

# Default decode picks up clean_up_tokenization_spaces=true from the config
decoded = tokenizer.decode(ids)
print(repr(decoded))

# Explicit override restores the original text
decoded_fixed = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
print(repr(decoded_fixed))

Output:

'x!= y and a.b == c'    ← space before != silently dropped
'x != y and a.b == c'   ← correct

Impact

  • The bug is specific to HuggingFace's tokenizer implementation. No other implementation (Meta's own tiktoken-based tokenizer, vLLM, SGLang, OLMo, etc.) exhibits this behavior.
  • Every model that uses a Llama 3 tokenizer from HuggingFace has been decoding text incorrectly since release, and still is today: not just the official meta-llama repos, but all fine-tunes and derivatives that inherited the tokenizer config.

How Llama 3 got clean_up_tokenization_spaces=True

This was never an intentional choice by Meta:

  1. Llama 2 explicitly set it to False in LlamaTokenizer.__init__
  2. Llama 3 switched to PreTrainedTokenizerFast via a new Llama3Converter (PR #30334). The converter didn't pass clean_up_tokenization_spaces, so it inherited the HuggingFace transformers library default of True
  3. The uploaded tokenizer_config.json files on the Hub baked in True
  4. PR #33778 (Llama 3.2 support, Oct 2024) then hardcoded True in the conversion script for backward compatibility — without discussion of whether the value was correct
  5. The library default was changed to False in Sep 2024 (PR #31938), but the Llama 3 configs already had True frozen

This has been flagged multiple times:

@ArthurZucker acknowledged in #35175: "It should be set to False!"

Both Llama 2 and Llama 4 ship with false, confirming this is recognized as a bug.

Affected models

All 17 Llama 3.x text model repos on the Hub have "clean_up_tokenization_spaces": true:

  • Llama 3.0: Meta-Llama-3-8B, -8B-Instruct, -70B, -70B-Instruct
  • Llama 3.1: Llama-3.1-8B, -8B-Instruct, -70B, -70B-Instruct, -405B, -405B-FP8, -405B-Instruct, -405B-Instruct-FP8
  • Llama 3.2: Llama-3.2-1B, -1B-Instruct, -3B, -3B-Instruct
  • Llama 3.3: Llama-3.3-70B-Instruct

Companion PRs have been opened on each of these repos. Downstream models (fine-tunes and derivatives) that inherited the tokenizer config are not covered by these PRs and will need to be fixed independently.

(removed due to broken links - see below)


Companion PRs

The same one-line fix has been opened on all 24 meta-llama repos that have clean_up_tokenization_spaces=true in their tokenizer_config.json. I tested every version of transformers from 4.40.0 (first Llama 3 support, April 2024) through 5.3.0 (latest, March 2026); all of them produce the incorrect decoded text.

Llama 3.0:

Llama 3.1:

Llama 3.2:

Llama 3.3:

Llama Guard:

Prompt Guard:

The remaining 46 meta-llama repos either have false already (Llama 4, Llama-Guard-4) or don't have their own tokenizer_config.json (CodeLlama, Llama 2, quantized/vision/Original-format variants).
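
Maintainers of downstream repos can check whether their own tokenizer_config.json carries the flag with a small helper like this (a sketch; the hf_hub_download usage in the comment is illustrative, uses a placeholder repo id, and requires huggingface_hub):

```python
import json

def needs_cleanup_fix(tokenizer_config: dict) -> bool:
    """Return True if this tokenizer_config.json still carries the bad flag.

    Only an explicit true is flagged here. Note that on transformers versions
    predating PR #31938, a *missing* key also meant cleanup was applied,
    since the library default was True back then.
    """
    return tokenizer_config.get("clean_up_tokenization_spaces") is True

# Illustrative usage against a Hub repo:
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download("your-org/your-llama3-finetune", "tokenizer_config.json")
#   with open(path) as f:
#       print(needs_cleanup_fix(json.load(f)))

print(needs_cleanup_fix(json.loads('{"clean_up_tokenization_spaces": true}')))  # True
```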

High-download descendant PRs

I surveyed the top Llama 3 derivative models by download count on the Hub and opened fix PRs on the 13 highest-download non-meta-llama models that ship their own tokenizer_config.json with clean_up_tokenization_spaces=true. Together with the 24 official meta-llama PRs above, these cover ~90% of total downloads across all affected models found.

RedHatAI (quantizations):

AWQ quantizations:

unsloth (mirrors/quantizations):

Other:

Total PRs filed: 37 (24 official meta-llama + 13 high-download descendants). There are ~170 more affected models on the Hub with lower download counts not covered here.
