BoKenlm-syl-v0.4 - Tibetan KenLM Language Model
A KenLM n-gram language model trained on Tibetan text tokenized with a syllable tokenizer.
Model Details
| Parameter | Value |
|---|---|
| Model Type | Modified Kneser-Ney 5-gram |
| Tokenizer | Tibetan syllable-based (botok-rs SimpleTokenizer) |
| Training Corpus | bo_corpus.txt |
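| Pruning | 0 0 1 (count thresholds per order: orders 1 and 2 unpruned, singleton n-grams of order 3 and above pruned) |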
| Tokens | 166,004,503 |
| Vocabulary Size | 20,564 |
N-gram Statistics
| Order | Count | D1 | D2 | D3+ |
|---|---|---|---|---|
| 1 | 20,564 | 0.5906 | 0.9834 | 1.4625 |
| 2 | 1,542,980 | 0.6391 | 1.0500 | 1.4271 |
| 3 | 5,447,977 | 0.7179 | 1.0874 | 1.4057 |
| 4 | 11,954,834 | 0.7930 | 1.1368 | 1.3910 |
| 5 | 16,598,281 | 0.7284 | 1.2407 | 1.4638 |
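The D1, D2 and D3+ columns are the modified Kneser-Ney discounts applied to n-grams seen once, twice, and three or more times. As a minimal sketch, assuming the standard Chen and Goodman closed-form estimate, they can be derived from the count-of-counts of a given order (how many n-grams occur exactly 1, 2, 3 and 4 times). The function name and the input numbers below are hypothetical and are not taken from this model's training log.

```python
def modified_kneser_ney_discounts(n1, n2, n3, n4):
    """Return (D1, D2, D3+) for one n-gram order.

    n1..n4 are count-of-counts: the number of n-grams of this order
    whose (adjusted) count is exactly 1, 2, 3 and 4.
    """
    if min(n1, n2, n3) <= 0 or n4 < 0:
        raise ValueError("n1, n2, n3 must be positive and n4 non-negative")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus


# Hypothetical count-of-counts, chosen only to illustrate the formula.
d1, d2, d3_plus = modified_kneser_ney_discounts(100, 40, 20, 12)
print(f"D1={d1:.4f} D2={d2:.4f} D3+={d3_plus:.4f}")
```

Each discount is subtracted from the observed count before normalizing, and the mass removed this way is what gets redistributed to lower-order estimates.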
Memory Estimates
| Type | MB | Details |
|---|---|---|
| probing | 719 | assuming -p 1.5 |
| probing | 827 | assuming -r models -p 1.5 |
| trie | 321 | without quantization |
| trie | 170 | assuming -q 8 -b 8 quantization |
| trie | 285 | assuming -a 22 array pointer compression |
| trie | 133 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
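The flags in the Details column correspond to options of KenLM's `build_binary` tool, which converts an ARPA file into a binary model that loads faster. The sketch below only assembles the command line for the smallest configuration in the table; it assumes `build_binary` is installed and on `PATH` if you choose to run the result, and the output filename `BoKenlm-syl-v0.4.binary` is a hypothetical name, not a file shipped with this repository.

```python
import shlex


def build_binary_command(arpa_path, binary_path, structure="trie",
                         quantize_bits=None, backoff_bits=None,
                         pointer_bits=None):
    """Assemble a KenLM build_binary command as an argument list."""
    if structure not in ("probing", "trie"):
        raise ValueError("structure must be 'probing' or 'trie'")
    options = (quantize_bits, backoff_bits, pointer_bits)
    if structure == "probing" and any(o is not None for o in options):
        # Quantization and pointer compression apply to the trie only.
        raise ValueError("-q, -b and -a require the trie structure")
    cmd = ["build_binary"]
    if quantize_bits is not None:
        cmd += ["-q", str(quantize_bits)]   # bits for quantized probabilities
    if backoff_bits is not None:
        cmd += ["-b", str(backoff_bits)]    # bits for quantized backoffs
    if pointer_bits is not None:
        cmd += ["-a", str(pointer_bits)]    # max bits for array pointer compression
    cmd += [structure, arpa_path, binary_path]
    return cmd


cmd = build_binary_command(
    "BoKenlm-syl-v0.4.arpa",
    "BoKenlm-syl-v0.4.binary",  # hypothetical output name
    quantize_bits=8, backoff_bits=8, pointer_bits=22,
)
print(shlex.join(cmd))
# To actually build the file: subprocess.run(cmd, check=True)
```

Quantization trades some probability precision for the smaller footprint, so it is worth comparing scores from the quantized and unquantized models before relying on the 133 MB variant.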
Training Resources
| Metric | Value |
|---|---|
| Peak Virtual Memory | 12,333 MB |
| Peak RSS | 4,702 MB |
| Wall Time | 92.8s |
| User Time | 91.7s |
| System Time | 25.1s |
Usage
import kenlm
model = kenlm.Model("BoKenlm-syl-v0.4.arpa")
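# Score a sentence whose syllables are already separated by spaces.
# The result is a log10 probability (closer to 0 means more likely).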
score = model.score("བོད་ སྐད་ ཀྱི་ ཚིག་ གྲུབ་ འདི་ ཡིན།")
print(score)
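Because `model.score` returns a total log10 probability, longer sentences naturally score lower. A per-token perplexity is often easier to compare across sentences. The helper below is a minimal sketch of that conversion, assuming the score was computed with the default sentence boundaries so that one end-of-sentence token is counted; `perplexity_from_log10` is a hypothetical name, and the input score here is made up for illustration.

```python
def perplexity_from_log10(log10_score, sentence):
    """Convert a total log10 probability into per-token perplexity.

    The sentence is expected to be space-separated syllables; one extra
    token is counted for the end-of-sentence marker </s>.
    """
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)


# Made-up score for a three-token sentence, not produced by this model.
ppl = perplexity_from_log10(-8.0, "a b c")
print(ppl)

# With the model loaded as in the usage example, the same helper can be
# applied to a real score:
#   perplexity_from_log10(model.score(sentence), sentence)
# which should agree with kenlm's built-in model.perplexity(sentence).
```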
Files
- BoKenlm-syl-v0.4.arpa: ARPA format language model
- README.md: This model card
License
Apache 2.0