TokSuite – Llama-3.2 - Model Initialization seed=222
Model Summary
TokSuite–Llama-3.2 is part of TokSuite, a suite of language models designed to study the impact of tokenizer choice on language model behavior under controlled conditions.
This model uses the Llama-3.2 tokenizer and is otherwise identical to the other TokSuite-Llama models
(toksuite/meta-llama-Llama-3.2-1B and
toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888)
in architecture, training data, and training budget; the models differ only in their initialization, denoted by model_seed.
Tokenizer
- Tokenizer: Llama-3.2
- Tokenization method: BPE
- Vocabulary size: 128,256
- Out-of-vocabulary handling: Byte-fallback
- Language coverage: Multilingual
- Pretokenization source: GPT-4
Processing details:
- Numbers: Group by 3
- Contractions: GPT-4
- Unicode normalization: None
- Whitespace / boundary markers: Learned
- Zero-width characters: Token
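The byte-fallback behavior listed above can be illustrated with a minimal sketch. The helper `encode_with_byte_fallback` and the toy vocabulary are hypothetical, not the actual Llama-3.2 implementation; the point is only that a character missing from the vocabulary is decomposed into its UTF-8 bytes, each mapped to a dedicated byte token, so no input ever becomes an unknown token.

```python
# Minimal byte-fallback illustration (hypothetical helper, not the real
# Llama-3.2 tokenizer): characters absent from the vocabulary fall back
# to one token per UTF-8 byte.
def encode_with_byte_fallback(text, vocab):
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per UTF-8 byte, e.g. "<0xE2>".
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# Toy vocabulary containing only ASCII lowercase letters and a space.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
print(encode_with_byte_fallback("ok →", toy_vocab))
# → ['o', 'k', ' ', '<0xE2>', '<0x86>', '<0x92>']
```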
Why Llama-3.2?
Llama-3.2 was included in TokSuite to represent a multilingual BPE tokenizer with modern GPT-4–style pretokenization and a medium-to-large vocabulary size. As described in the tokenizer selection rationale of the TokSuite paper, Llama-3.2 exemplifies a contemporary tokenizer design that combines subword segmentation with preprocessing conventions used in recent large language models.
Including Llama-3.2 enables TokSuite to study tokenizer behavior in settings where:
- BPE segmentation is paired with GPT-4–style pretokenization,
- vocabulary size is substantially larger than early English-centric tokenizers,
- and multilingual text is handled through a single shared tokenizer.
This makes Llama-3.2 a representative example of modern multilingual BPE tokenization.
Model Architecture
- Architecture: Decoder-only Transformer (Lingua's Llama-3.2-1B configuration)
- Non-embedding parameters: ~1B
- Context length: 4096 tokens
- Framework: Meta Lingua
- Initialization: Shared super-vocabulary initialization across TokSuite models
The architecture and training setup are identical across all TokSuite models; only the tokenizer differs.
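The shared-everything-but-the-seed design can be sketched as a configuration object. Field names below are illustrative; only the values stated in this card (context length, vocabulary size, seed) are taken from the model description, and `TokSuiteLlamaConfig` is not an actual Lingua class.

```python
from dataclasses import dataclass

# Illustrative configuration sketch (hypothetical class; only the values
# come from this card). Across TokSuite-Llama models, everything is held
# fixed except model_seed.
@dataclass(frozen=True)
class TokSuiteLlamaConfig:
    vocab_size: int = 128_256    # Llama-3.2 tokenizer vocabulary
    context_length: int = 4096   # maximum sequence length in tokens
    model_seed: int = 222        # initialization seed; the only field that
                                 # varies across TokSuite-Llama variants

cfg = TokSuiteLlamaConfig()
print(cfg)
```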
Training Data
The model was trained on a multilingual corpus totaling approximately 100B tokens, composed of:
- English: 40B tokens from FineWeb-Edu
- Multilingual: 60B tokens evenly distributed across:
- Chinese (ZH)
- Turkish (TR)
- Italian (IT)
- Farsi (FA)
You can find the pretraining dataset here: toksuite/toksuite_pretraining_data
All TokSuite models are trained using a fixed token budget, following common practice in large-scale language model training.
Training Procedure
- Training steps: 100,000
- Sequence length: 4096
- Batch size: 256 sequences
- Optimizer: AdamW
- Peak learning rate: 1e-3
- Learning rate schedule: Cosine decay with 2,000 warm-up steps
- Weight decay: 0.1
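The hyperparameters above determine the fixed token budget: steps × batch size × sequence length. A quick check, using only the numbers listed in this card:

```python
steps = 100_000   # training steps
batch_size = 256  # sequences per step
seq_len = 4096    # tokens per sequence

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,}")  # 104,857,600,000 ≈ 105B tokens
```

This is consistent with the "approximately 100B tokens" corpus figure stated in the Training Data section.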
TokSuite Robustness Benchmark
TokSuite–Llama-3.2 variants are evaluated on the TokSuite robustness benchmark, which measures sensitivity to real-world text perturbations, including:
- orthographic and spelling variations,
- diacritics presence and absence,
- keyboard and input-method noise,
- Unicode formatting and homoglyphs,
- OCR and spacing artifacts,
- LaTeX and STEM-style formatting.
Tokenization Robustness under Multilingual Text Perturbations
Values represent relative performance drop, computed as (Acc_clean − Acc_perturbed) / Acc_clean, where lower values indicate greater robustness.
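The relative drop defined above can be computed directly; a minimal sketch with a made-up example pair of accuracies:

```python
def relative_drop(acc_clean, acc_perturbed):
    """Relative performance drop: (Acc_clean - Acc_perturbed) / Acc_clean.
    Lower values indicate greater robustness."""
    return (acc_clean - acc_perturbed) / acc_clean

# Hypothetical example: clean accuracy 0.50, perturbed accuracy 0.44.
print(round(relative_drop(0.50, 0.44), 2))  # 0.12
```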
Perturbation types include:
- Input: non-native keyboard input and romanization
- Diacr.: optional diacritics
- Orth. & Gram.: orthographic and grammatical errors
- Morph: morphological variations including derivations, inflections, and contractions
- Noise: homoglyph substitutions, OCR artifacts, typos, and spacing errors
- LaTeX: LaTeX-style mathematical formatting
- STEM: scientific diagrams and notational conventions
- Unic.: Unicode styling characters
NEN denotes non-English inputs and EN denotes English inputs. The Avg column reports the average relative performance drop across all perturbation categories.
| Model | Input (NEN) | Diacr. (NEN) | Orth. & Gram. (EN) | Orth. & Gram. (NEN) | Morph (EN) | Morph (NEN) | Noise (EN) | Noise (NEN) | LaTeX (EN) | STEM (EN) | Unic. (EN) | Avg ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888 | 0.31 | 0.44 | 0.12 | 0.15 | 0.25 | 0.13 | 0.09 | 0.24 | 0.09 | 0.26 | 0.58 | 0.24 |
| toksuite/meta-llama-Llama-3.2-1B | 0.31 | 0.56 | 0.11 | 0.11 | 0.25 | 0.10 | 0.09 | 0.26 | 0.13 | 0.29 | 0.60 | 0.26 |
| toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222 | 0.29 | 0.54 | 0.09 | 0.12 | 0.20 | 0.13 | 0.14 | 0.29 | 0.19 | 0.37 | 0.59 | 0.27 |
| Avg | 0.30 | 0.51 | 0.11 | 0.12 | 0.24 | 0.12 | 0.10 | 0.26 | 0.13 | 0.31 | 0.59 | 0.25 |
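The Avg column is the unweighted mean of the eleven perturbation columns; for example, for the seed_222 row of the table above:

```python
# Per-category relative drops for
# toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222
# (the eleven perturbation columns from the table above).
drops = [0.29, 0.54, 0.09, 0.12, 0.20, 0.13, 0.14, 0.29, 0.19, 0.37, 0.59]
avg = sum(drops) / len(drops)
print(round(avg, 2))  # 0.27, matching the row's Avg entry
```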
Intended Use
This model is intended for:
- research on tokenization and robustness,
- multilingual NLP analysis,
- controlled ablation studies,
- benchmarking tokenizer behavior under noise.
It is not instruction-tuned, aligned, or optimized for deployment.
Limitations
- Trained on a limited set of five languages.
- Not optimized for instruction following or dialogue.
- Fixed token budget constrains exposure to raw text depending on tokenization efficiency.
- Intended strictly for research purposes.
Ethical Considerations
TokSuite models are released to support scientific investigation of tokenization effects.
They may reflect biases present in large-scale web data and should not be used in high-stakes or user-facing applications without additional safeguards.
Citation
If you use this model, please cite:
@article{toksuite2025,
  title={TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior},
  author={Altıntaş, Gul Sena and Ehghaghi, Malikeh and Lester, Brian and Liu, Fengyuan and Zhao, Wanru and Ciccone, Marco and Raffel, Colin},
  year={2025},
  url={https://arxiv.org/abs/2512.20757},
}