TokSuite – Llama-3.2 - Model Initialization seed=222
Model Summary
TokSuite–Llama-3.2 is part of TokSuite, a suite of language models designed to study the impact of tokenizer choice on language model behavior under controlled conditions.
This model uses the Llama-3.2 tokenizer and is otherwise identical to the other TokSuite-Llama models
(toksuite/meta-llama-Llama-3.2-1B and
toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888)
in architecture, training data, and training budget; the models differ only in their initialization, denoted by model_seed.
Tokenizer
- Tokenizer: Llama-3.2
- Tokenization method: BPE
- Vocabulary size: 128,256
- Out-of-vocabulary handling: Byte-fallback
- Language coverage: Multilingual
- Pretokenization source: GPT-4
Processing details:
- Numbers: Group by 3
- Contractions: GPT-4
- Unicode normalization: None
- Whitespace / boundary markers: Learned
- Zero-width characters: Token
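The byte-fallback behavior listed above can be illustrated with a minimal sketch. The helper `encode_with_byte_fallback` and the toy vocabulary are hypothetical, not the actual Llama-3.2 implementation; the point is only that a character missing from the vocabulary is decomposed into its UTF-8 bytes, each mapped to a dedicated byte token, so no input ever becomes an unknown token.

```python
# Minimal byte-fallback illustration (hypothetical helper, not the real
# Llama-3.2 tokenizer): characters absent from the vocabulary fall back
# to one token per UTF-8 byte.
def encode_with_byte_fallback(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per UTF-8 byte, e.g. "<0xE2>".
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# Toy vocabulary containing only ASCII lowercase letters and a space.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(encode_with_byte_fallback("ok →", toy_vocab))
# → ['o', 'k', ' ', '<0xE2>', '<0x86>', '<0x92>']
```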
Why Llama-3.2?
Llama-3.2 was included in TokSuite to represent a multilingual BPE tokenizer with modern GPT-4–style pretokenization and a medium-to-large vocabulary size. As described in the tokenizer selection rationale of the TokSuite paper, Llama-3.2 exemplifies a contemporary tokenizer design that combines subword segmentation with preprocessing conventions used in recent large language models.
Including Llama-3.2 enables TokSuite to study tokenizer behavior in settings where:
- BPE segmentation is paired with GPT-4–style pretokenization,
- vocabulary size is substantially larger than early English-centric tokenizers,
- and multilingual text is handled through a single shared tokenizer.
This makes Llama-3.2 a representative example of modern multilingual BPE tokenization.
Model Architecture
- Architecture: Decoder-only Transformer (Lingua's Llama-3.2-1B configuration)
- Non-embedding parameters: ~1B
- Context length: 4096 tokens
- Framework: Meta Lingua
- Initialization: Shared super-vocabulary initialization across TokSuite models
The architecture and training setup are identical across all TokSuite models; only the tokenizer differs.
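The shared-everything-but-the-seed design can be sketched as a configuration object. Field names below are illustrative; only the values stated in this card (context length, vocabulary size, seed) are taken from the model description, and `TokSuiteLlamaConfig` is not an actual Lingua class.

```python
from dataclasses import dataclass

# Illustrative configuration sketch (hypothetical class; only the values
# come from this card). Across TokSuite-Llama models, everything is held
# fixed except model_seed.
@dataclass(frozen=True)
class TokSuiteLlamaConfig:
    vocab_size: int = 128_256    # Llama-3.2 tokenizer vocabulary
    context_length: int = 4096   # maximum sequence length in tokens
    model_seed: int = 222        # initialization seed; the only field that
                                 # varies across TokSuite-Llama variants

cfg = TokSuiteLlamaConfig()
print(cfg)
```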
Training Data
The model was trained on a multilingual corpus totaling approximately 100B tokens, composed of:
- English: 40B tokens from FineWeb-Edu
- Multilingual: 60B tokens evenly distributed across:
- Chinese (ZH)
- Turkish (TR)
- Italian (IT)
- Farsi (FA)
You can find the pretraining dataset here: toksuite/toksuite_pretraining_data
All TokSuite models are trained using a fixed token budget, following common practice in large-scale language model training.
Training Procedure
- Training steps: 100,000
- Sequence length: 4096
- Batch size: 256 sequences
- Optimizer: AdamW
- Peak learning rate: 1e-3
- Learning rate schedule: Cosine decay with 2,000 warm-up steps
- Weight decay: 0.1
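The hyperparameters above determine the fixed token budget: steps × batch size × sequence length. A quick check, using only the numbers listed in this card:

```python
steps = 100_000   # training steps
batch_size = 256  # sequences per step
seq_len = 4096    # tokens per sequence

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,}")  # 104,857,600,000 ≈ 105B tokens
```

This is consistent with the "approximately 100B tokens" corpus figure stated in the Training Data section.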
TokSuite Robustness Benchmark
TokSuite–Llama-3.2 variants are evaluated on the TokSuite robustness benchmark, which measures sensitivity to real-world text perturbations, including:
- orthographic and spelling variations,
- diacritics presence and absence,
- keyboard and input-method noise,
- Unicode formatting and homoglyphs,
- OCR and spacing artifacts,
- LaTeX and STEM-style formatting.
Tokenization Robustness under Multilingual Text Perturbations
Values represent relative performance drop, computed as (Acc_clean − Acc_perturbed) / Acc_clean, where lower values indicate greater robustness.
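The relative drop defined above can be computed directly; a minimal sketch with a made-up example pair of accuracies:

```python
def relative_drop(acc_clean, acc_perturbed):
    """Relative performance drop: (Acc_clean - Acc_perturbed) / Acc_clean.
    Lower values indicate greater robustness."""
    return (acc_clean - acc_perturbed) / acc_clean

# Hypothetical example: clean accuracy 0.50, perturbed accuracy 0.44.
print(round(relative_drop(0.50, 0.44), 2))  # 0.12
```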
Perturbation types include:
- Input: non-native keyboard input and romanization
- Diacr.: optional diacritics
- Orth. & Gram.: orthographic and grammatical errors
- Morph: morphological variations including derivations, inflections, and contractions
- Noise: homoglyph substitutions, OCR artifacts, typos, and spacing errors
- LaTeX: LaTeX-style mathematical formatting
- STEM: scientific diagrams and notational conventions
- Unic.: Unicode styling characters
NEN denotes non-English inputs and EN denotes English inputs. The Avg column reports the average relative performance drop across all perturbation categories.
| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888 | 0.31 | 0.44 | 0.12 | 0.15 | 0.25 | 0.13 | 0.09 | 0.24 | 0.09 | 0.26 | 0.58 | 0.24 |
| toksuite/meta-llama-Llama-3.2-1B | 0.31 | 0.56 | 0.11 | 0.11 | 0.25 | 0.10 | 0.09 | 0.26 | 0.13 | 0.29 | 0.60 | 0.26 |
| toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222 | 0.29 | 0.54 | 0.09 | 0.12 | 0.20 | 0.13 | 0.14 | 0.29 | 0.19 | 0.37 | 0.59 | 0.27 |
| Avg | 0.30 | 0.51 | 0.11 | 0.12 | 0.24 | 0.12 | 0.10 | 0.26 | 0.13 | 0.31 | 0.59 | 0.25 |
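The Avg column is the unweighted mean of the eleven perturbation columns; for example, for the seed_222 row of the table above:

```python
# Per-category relative drops for
# toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222
# (the eleven perturbation columns from the table above).
drops = [0.29, 0.54, 0.09, 0.12, 0.20, 0.13, 0.14, 0.29, 0.19, 0.37, 0.59]
avg = sum(drops) / len(drops)
print(round(avg, 2))  # 0.27, matching the row's Avg entry
```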
Intended Use
This model is intended for:
- research on tokenization and robustness,
- multilingual NLP analysis,
- controlled ablation studies,
- benchmarking tokenizer behavior under noise.
It is not instruction-tuned, aligned, or optimized for deployment.
Limitations
- Trained on a limited set of five languages.
- Not optimized for instruction following or dialogue.
- Fixed token budget constrains exposure to raw text depending on tokenization efficiency.
- Intended strictly for research purposes.
Ethical Considerations
TokSuite models are released to support scientific investigation of tokenization effects.
They may reflect biases present in large-scale web data and should not be used in high-stakes or user-facing applications without additional safeguards.
Citation
If you use this model, please cite:
@article{toksuite2025,
  title={TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
  author={Altıntaş, Gul Sena and Ehghaghi, Malikeh and Lester, Brian and Liu, Fengyuan and Zhao, Wanru and Ciccone, Marco and Raffel, Colin},
  year={2025},
  url={https://arxiv.org/abs/2512.20757},
}