vishesh-t27 committed on
Commit 6c4fbcc · verified · 1 Parent(s): 655e347

added tokenizer's fertility score in Readme
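
The commit does not state how the fertility scores in the table were computed. Fertility is commonly defined as the average number of tokens a tokenizer produces per whitespace-separated word, so lower is better and 1.0 would mean one token per word. The sketch below follows that common definition and is an assumption, not the authors' evaluation script. The names `fertility` and `chunk_tokenize` are hypothetical, and `chunk_tokenize` is a toy stand-in used only so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word.

    Assumption: words are delimited by whitespace; the README may use a
    different word segmentation or corpus.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def chunk_tokenize(text: str, size: int = 3) -> List[str]:
    """Toy tokenizer (hypothetical): cut each word into chunks of at most `size` characters."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


sample = ["tokenizers split words", "into subword pieces"]
score = fertility(sample, chunk_tokenize)
print(f"fertility: {score:.2f}")  # 15 tokens / 6 words = 2.50
```

To score a real tokenizer, one could pass the `tokenize` method of a Hugging Face tokenizer loaded with `AutoTokenizer.from_pretrained` in place of `chunk_tokenize`, using the same text sample per language for every model being compared.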

Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -92,6 +92,24 @@ The model is trained on English and a diverse set of Indic languages, including:
  ### Note
  Mobile-LLM model checkpoints are not publicly available; their results are reported directly from the original paper. All other models have been evaluated using `lm-eval` under a consistent setup.
 
+ ## Tokenization Fertility Score across Languages
+
+ | Language | SmolLM3-3B | Qwen3-0.6B-Base | Sarvam-30B | Nandi-Mini-150M |
+ |----------|------------|-----------------|------------|------------------|
+ | English | 1.17 | 1.16 | 1.18 | 1.18 |
+ | Bengali | 8.66 | 7.51 | 1.46 | 1.44 |
+ | Gujarati | 10.47 | 9.37 | 1.70 | 1.53 |
+ | Hindi | 2.71 | 5.14 | 1.23 | 1.32 |
+ | Kannada | 16.43 | 12.96 | 2.08 | 1.90 |
+ | Malayalam| 17.77 | 14.56 | 2.81 | 2.05 |
+ | Marathi | 3.73 | 6.70 | 1.77 | 1.55 |
+ | Oriya | 19.07 | 15.75 | 1.77 | 2.68 |
+ | Punjabi | 9.23 | 8.66 | 1.42 | 1.42 |
+ | Tamil | 13.56 | 10.93 | 2.35 | 2.05 |
+ | Telugu | 15.40 | 13.38 | 2.09 | 1.77 |
+ | Assamese | 9.26 | 8.13 | 2.38 | 1.51 |
+
+
  ## 🚀 Usage
 
  ```python