Pruned Small-100 for Singlish-Sinhala Translation

A vocabulary-optimized checkpoint of alirezamsh/small100, pruned from 128k to 12k tokens based on 1M Singlish-Sinhala sentence pairs. Designed specifically for fine-tuning on romanized Sinhala (Singlish) to native Sinhala script translation.

What This Is

Small-100 supports 100 languages but carries massive vocabulary overhead when you're only working with two. I extracted token usage statistics from a 1M-pair Singlish-Sinhala dataset and found that 12,212 tokens (9.5% of the original vocab) covered the entire corpus. This model keeps only those tokens, with original embeddings transferred to their new IDs.

Not a retrained tokenizer. Same BPE algorithm, same segmentation logic—just 90% fewer tokens and proportionally fewer parameters to update during fine-tuning.

Why Bother?

Training speed. The embedding layer shrinks from 128k×256 to 12k×256 parameters. Gradient computation is faster, memory footprint drops, and you avoid wasting updates on Swahili tokens when translating "kohomada" to "කොහොමද".
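The savings claimed above are easy to sanity-check with quick arithmetic, using the vocabulary sizes and the 256 embedding width quoted in this card:

```python
EMB_DIM = 256          # embedding width quoted above
ORIG_VOCAB = 128_000   # original Small-100 vocabulary (approximate)
PRUNED_VOCAB = 12_212  # tokens kept after pruning

orig_params = ORIG_VOCAB * EMB_DIM
pruned_params = PRUNED_VOCAB * EMB_DIM
print(f"embedding params: {orig_params:,} -> {pruned_params:,}")
print(f"vocab reduction: {1 - PRUNED_VOCAB / ORIG_VOCAB:.1%}")
```

Every optimizer step now touches roughly a tenth of the embedding rows it otherwise would.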

Limitations

  • Zero-shot performance will be poor. This hasn't been fine-tuned yet.
  • Only useful for Singlish-Sinhala tasks. Tokens for other languages were discarded.
  • If your domain uses vocabulary outside the 1M training pairs (medical terms, technical jargon), you'll hit unknown tokens.
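Before committing to fine-tuning, it is worth measuring how much of your own corpus falls outside the pruned vocabulary. A minimal sketch, using a toy stand-in vocabulary (with the real checkpoint you would use `tokenizer.get_vocab()` and the tokenizer's actual segmentation instead):

```python
# Toy stand-in for the pruned vocabulary (hypothetical tokens).
pruned_vocab = {"koho", "mada", "oya", "lata"}

def unk_rate(tokens, vocab):
    """Fraction of tokens that fall outside the pruned vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# A domain-specific term ("stethoscope") that the corpus never covered.
domain_tokens = ["koho", "mada", "stethoscope"]
print(f"OOV rate: {unk_rate(domain_tokens, pruned_vocab):.1%}")
```

A high OOV rate on your domain text is a signal that this checkpoint's vocabulary is too narrow for your task.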

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/pruned-Small-100-for-fineTune")
model = AutoModelForSeq2SeqLM.from_pretrained("savinugunarathna/pruned-Small-100-for-fineTune")

# Example (no fine-tuning yet, so output quality will be poor)
# SMALL-100 marks the target language on the source side; if the loaded
# tokenizer exposes it, set the target language before encoding:
tokenizer.tgt_lang = "si"

inputs = tokenizer("kohomada oyalata", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Fine-tune this on your Singlish-Sinhala pairs using standard seq2seq training. Check the original Small-100 paper for architecture details.
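One seq2seq preprocessing detail worth getting right: padded positions in the label sequence should be replaced with -100 so the cross-entropy loss ignores them (this is what Hugging Face's `DataCollatorForSeq2Seq` does for you). A minimal sketch; the `PAD_ID` value is an assumption, so check `tokenizer.pad_token_id` on the real checkpoint:

```python
PAD_ID = 1  # assumption: verify against tokenizer.pad_token_id

def pad_and_mask_labels(label_ids, max_len, pad_id=PAD_ID):
    """Truncate/pad target ids, then mask padding with -100 for the loss."""
    padded = label_ids[:max_len] + [pad_id] * (max_len - len(label_ids))
    return [tok if tok != pad_id else -100 for tok in padded]

print(pad_and_mask_labels([5, 9, 2], max_len=6))  # [5, 9, 2, -100, -100, -100]
```

If you let the collator handle this, you only need to pass it the tokenizer and model; the sketch just shows what happens under the hood.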

Technical Notes

Pruning was done by:

  1. Tokenizing 1M sentence pairs with the original Small-100 tokenizer
  2. Collecting all unique token IDs that appeared
  3. Filtering vocab.json to keep only those tokens
  4. Renumbering IDs contiguously (0, 1, 2...)
  5. Transferring old embedding weights to new positions

The sentencepiece.bpe.model file is unchanged—it's the vocabulary map (vocab.json) that got pruned.
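Steps 3-5 above can be sketched in a few lines. This is a toy reconstruction with made-up IDs and a tiny list-based "embedding matrix", not the exact script used to build this checkpoint:

```python
# Toy inputs: token -> old ID map, and the set of IDs seen in the corpus.
old_vocab = {"<s>": 0, "a": 7, "b": 42, "c": 99}
used_ids = {0, 7, 99}

# Step 3-4: keep only used tokens, renumber contiguously in old-ID order.
kept = sorted(tid for tid in old_vocab.values() if tid in used_ids)
old_to_new = {old: new for new, old in enumerate(kept)}
new_vocab = {tok: old_to_new[tid]
             for tok, tid in old_vocab.items() if tid in used_ids}

# Step 5: move embedding rows to their new positions
# (nested lists stand in for the real weight tensor).
old_emb = [[float(i)] * 4 for i in range(100)]  # 100 x 4 toy matrix
new_emb = [old_emb[old] for old in kept]

print(new_vocab)  # {'<s>': 0, 'a': 1, 'c': 2}
```

With real weights you would do the row gather on the embedding tensor (e.g. `old_emb[kept]` in PyTorch) and then call `model.resize_token_embeddings` or assign the new matrix directly.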

Citation

If you use this pruned model, please cite:

@misc{gunarathna2025prunedsmall100,
  author = {Savinu Gunarathna},
  title = {Pruned Small-100 for Singlish-Sinhala Translation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/pruned-Small-100-for-fineTune}}
}

And the original Small-100 work:

@misc{mohammadshahi2022small100,
  title={SMALL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages}, 
  author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
  year={2022},
  eprint={2210.11621},
  archivePrefix={arXiv}
}

License

Inherits the license from alirezamsh/small100 (CC-BY-NC-4.0). Non-commercial use only.
