Pruned Small-100 for Singlish-Sinhala Translation
A vocabulary-optimized checkpoint of alirezamsh/small100, pruned from 128k to 12k tokens based on 1M Singlish-Sinhala sentence pairs. Designed specifically for fine-tuning on romanized Sinhala (Singlish) to native Sinhala script translation.
What This Is
Small-100 supports 100 languages but carries massive vocabulary overhead when you're only working with two. I extracted token usage statistics from a 1M-pair Singlish-Sinhala dataset and found that 12,212 tokens (9.5% of the original vocab) covered the entire corpus. This model keeps only those tokens, with original embeddings transferred to their new IDs.
Not a retrained tokenizer. Same BPE algorithm, same segmentation logic—just 90% fewer tokens and proportionally fewer parameters to update during fine-tuning.
Why Bother?
Training speed. The embedding layer shrinks from 128k×256 to 12k×256 parameters. Gradient computation is faster, memory footprint drops, and you avoid wasting updates on Swahili tokens when translating "kohomada" to "කොහොමද".
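A quick back-of-the-envelope check of the savings, using only the figures quoted above (128k original vocab, 12,212 kept tokens, 256-dimensional embeddings):

```python
# Embedding-parameter arithmetic from the numbers in this card.
EMB_DIM = 256
old_params = 128_000 * EMB_DIM   # original embedding table
new_params = 12_212 * EMB_DIM    # pruned embedding table
reduction = 1 - new_params / old_params

print(f"{old_params:,} -> {new_params:,} "
      f"({reduction:.1%} fewer embedding parameters)")
# 32,768,000 -> 3,126,272 (90.5% fewer embedding parameters)
```

This matches the "90% fewer tokens" claim: the embedding table drops from roughly 32.8M to 3.1M parameters.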
Limitations
- Zero-shot performance will be poor. This hasn't been fine-tuned yet.
- Only useful for Singlish-Sinhala tasks. Tokens for other languages were discarded.
- If your domain uses vocabulary outside the 1M training pairs (medical terms, technical jargon), you'll hit unknown tokens.
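Before committing to fine-tuning on a new domain, you can estimate how badly this limitation bites by measuring what fraction of your tokenized corpus falls outside the kept vocabulary. A minimal sketch with a toy ID set (`oov_rate` and the toy IDs are illustrative, not part of this checkpoint):

```python
def oov_rate(token_ids, kept_ids):
    """Fraction of corpus tokens that fall outside the pruned vocabulary
    (these would map to <unk> after pruning)."""
    if not token_ids:
        return 0.0
    missing = sum(1 for t in token_ids if t not in kept_ids)
    return missing / len(token_ids)

# Toy example: pretend the pruned vocab kept only IDs 0-9.
kept = set(range(10))
corpus = [1, 3, 42, 7, 99]      # 42 and 99 are outside the kept set
print(oov_rate(corpus, kept))   # 0.4
```

In practice you would tokenize a sample of your domain text with the original Small-100 tokenizer and compare the resulting IDs against the 12,212 retained ones.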
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/pruned-Small-100-for-fineTune")
model = AutoModelForSeq2SeqLM.from_pretrained("savinugunarathna/pruned-Small-100-for-fineTune")
# Example (not fine-tuned yet, so translations will be unreliable)
inputs = tokenizer("kohomada oyalata", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Fine-tune this on your Singlish-Sinhala pairs using standard seq2seq training. Check the original Small-100 paper for architecture details.
Technical Notes
Pruning was done by:
- Tokenizing 1M sentence pairs with the original Small-100 tokenizer
- Collecting all unique token IDs that appeared
- Filtering vocab.json to keep only those tokens
- Renumbering IDs contiguously (0, 1, 2, ...)
- Transferring old embedding weights to new positions
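The steps above can be sketched end-to-end on a toy vocabulary (the token strings, IDs, and 4-dimensional embeddings here are made up for illustration; the real pipeline operates on vocab.json and the checkpoint's embedding matrix):

```python
# Toy vocab.json analogue: token string -> ID, plus a matching embedding table.
old_vocab = {"<unk>": 0, "ko": 1, "homa": 2, "da": 3, "sw_tok": 4}
old_emb = [[float(i)] * 4 for i in range(len(old_vocab))]  # 5 x 4 embedding rows

# Steps 1-2: tokenize the corpus and collect every ID that appeared
# (specials like <unk> are always kept). Here, "sw_tok" (ID 4) never appeared.
used_ids = {0, 1, 2, 3}

# Steps 3-4: filter the vocab and renumber the surviving IDs contiguously.
new_vocab, old_to_new = {}, {}
for tok, old_id in sorted(old_vocab.items(), key=lambda kv: kv[1]):
    if old_id in used_ids:
        old_to_new[old_id] = len(new_vocab)
        new_vocab[tok] = len(new_vocab)

# Step 5: copy each kept row of the old embedding table to its new position.
new_emb = [old_emb[old_id]
           for old_id, _ in sorted(old_to_new.items(), key=lambda kv: kv[1])]

print(new_vocab)     # {'<unk>': 0, 'ko': 1, 'homa': 2, 'da': 3}
print(len(new_emb))  # 4
```

Because embeddings are copied row-for-row, a token keeps exactly the representation it had in the original checkpoint; only its integer ID changes.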
The sentencepiece.bpe.model file is unchanged—it's the vocabulary map (vocab.json) that got pruned.
Citation
If you use this pruned model, please cite:
@misc{gunarathna2025prunedsmall100,
author = {Savinu Gunarathna},
title = {Pruned Small-100 for Singlish-Sinhala Translation},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/savinugunarathna/pruned-Small-100-for-fineTune}}
}
And the original Small-100 work:
@misc{mohammadshahi2022small100,
title={SMALL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages},
author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
year={2022},
eprint={2210.11621},
archivePrefix={arXiv}
}
License
Inherits the license from alirezamsh/small100 (CC-BY-NC-4.0). Non-commercial use only.