# TokTok Spanish Tokenizer

A byte-pair encoding (BPE) tokenizer trained on Spanish Wikipedia using SentencePiece.
## Details

- Vocab size: 64,000
- Model type: BPE
- Training data: Spanish Wikipedia (100K articles)
- Character coverage: 99.95%
- Byte fallback: enabled
## Usage

```python
import sentencepiece as spm

# Load the trained model file shipped with this repo.
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

text = "El procesamiento de lenguaje natural es fascinante."
tokens = sp.encode_as_pieces(text)  # subword pieces as strings
print(tokens)

# Round-trip the pieces back to the original text.
print(sp.decode_pieces(tokens))
```
## Why?

English-centric tokenizers (e.g. those used by GPT-4 and Llama) typically need 20-40% more tokens to encode the same Spanish text. Because this tokenizer is trained specifically on Spanish, it achieves better compression, which translates directly into cheaper and faster inference.
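To make the cost implication concrete, here is a minimal sketch of the arithmetic. The token counts are hypothetical, chosen to sit in the middle of the 20-40% overhead range quoted above; they are not measured output of this tokenizer.

```python
# Illustrative arithmetic only: the token counts below are assumed,
# not measured with this tokenizer.
def cost_ratio(baseline_tokens: int, optimized_tokens: int) -> float:
    """Fraction of the baseline per-token cost that remains after switching."""
    return optimized_tokens / baseline_tokens

# Suppose an English-centric tokenizer needs 130 tokens for a Spanish
# passage that a Spanish-optimized tokenizer encodes in 100 tokens
# (a 30% overhead, mid-range of the figures above).
remaining = cost_ratio(130, 100)
print(f"Token cost relative to baseline: {remaining:.0%}")
```

Since inference cost scales roughly linearly with token count, a 30% token overhead in the baseline means the optimized tokenizer pays about 77% of the baseline cost for the same text.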