Update README.md
README.md
CHANGED
@@ -45,6 +45,65 @@ Evaluated the tokenizer's performance on:
| **Hindi** | 49 | 14 | 9.07 | 0.928 |
| **English** | 65 | 16 | 4.06 | 0.937 |

### 4. Encoding-Decoding Capabilities
```
Hindi Analysis:
Original Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Token IDs Count: 14
Token Strings: ['नम', 'सà¥įतà¥ĩ', ',', 'Ġमà¥Īà¤Ĥ', 'Ġà¤Ńारत', 'Ġसà¥ĩ', 'Ġहà¥Ĥà¤ģ', '।', 'Ġदिलà¥įलà¥Ģ', 'Ġबहà¥ģत', 'Ġबड़ा', 'Ġशहर', 'Ġहà¥Ī', '।']
Decoded Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Text Reconstruction: True

Hindi Analysis:
Original Text: हिंदी भाषा बहुत सुंदर है।
Token IDs Count: 7
Token Strings: ['ह', 'िà¤Ĥदà¥Ģ', 'Ġà¤Ńाषा', 'Ġबहà¥ģत', 'Ġसà¥ģà¤Ĥदर', 'Ġहà¥Ī', '।']
Decoded Text: हिंदी भाषा बहुत सुंदर है।
Text Reconstruction: True

Hindi Analysis:
Original Text: मुझे किताबें पढ़ना पसंद है।
Token IDs Count: 7
Token Strings: ['म', 'à¥ģà¤Ŀà¥ĩ', 'Ġà¤ķिताबà¥ĩà¤Ĥ', 'Ġपढ़ना', 'Ġपसà¤Ĥद', 'Ġहà¥Ī', '।']
Decoded Text: मुझे किताबें पढ़ना पसंद है।
Text Reconstruction: True

Hindi Analysis:
Original Text: यह एक उदाहरण वाक्य है।
Token IDs Count: 6
Token Strings: ['यह', 'Ġà¤ıà¤ķ', 'Ġà¤īदाहरण', 'Ġवाà¤ķà¥įय', 'Ġहà¥Ī', '।']
Decoded Text: यह एक उदाहरण वाक्य है।
Text Reconstruction: True

English Analysis:
Original Text: Hello, I am from India. Delhi is a big city.
Token IDs Count: 13
Token Strings: ['Hello', ',', 'ĠI', 'Ġam', 'Ġfrom', 'ĠIndia', '.', 'ĠDelhi', 'Ġis', 'Ġa', 'Ġbig', 'Ġcity', '.']
Decoded Text: Hello, I am from India. Delhi is a big city.
Text Reconstruction: True

English Analysis:
Original Text: The English language is widely spoken.
Token IDs Count: 7
Token Strings: ['The', 'ĠEnglish', 'Ġlanguage', 'Ġis', 'Ġwidely', 'Ġspoken', '.']
Decoded Text: The English language is widely spoken.
Text Reconstruction: True

English Analysis:
Original Text: I enjoy reading books.
Token IDs Count: 5
Token Strings: ['I', 'Ġenjoy', 'Ġreading', 'Ġbooks', '.']
Decoded Text: I enjoy reading books.
Text Reconstruction: True

English Analysis:
Original Text: This is an example sentence.
Token IDs Count: 6
Token Strings: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġsentence', '.']
Decoded Text: This is an example sentence.
Text Reconstruction: True
```
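
Each report in this section follows the same pattern: encode the text, count the IDs, decode them, and compare the result with the input. The sketch below shows one way such a check could be written. It does not reproduce the script that generated the output above: `round_trip_report` is a hypothetical helper, and the UTF-8 byte encoder is only a stand-in so the example runs on its own. With the real tokenizer, its own encode and decode callables would be passed in instead.

```python
from typing import Callable, Dict, List


def round_trip_report(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Dict[str, object]:
    """Encode `text`, decode the IDs, and record whether the text survives."""
    ids = encode(text)
    decoded = decode(ids)
    return {
        "original": text,
        "token_count": len(ids),
        "decoded": decoded,
        "reconstructed": decoded == text,
    }


# Stand-in tokenizer (an assumption, not the tokenizer from this README):
# every UTF-8 byte becomes one token ID.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    for sample in ["यह एक उदाहरण वाक्य है।", "This is an example sentence."]:
        report = round_trip_report(sample, byte_encode, byte_decode)
        print("Original Text:", report["original"])
        print("Token IDs Count:", report["token_count"])
        print("Decoded Text:", report["decoded"])
        print("Text Reconstruction:", report["reconstructed"])
```

Comparing the decoded string with the original using plain equality is the strictest check: any normalization or dropped whitespace in the tokenizer would surface as `Text Reconstruction: False`.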
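
The `Ġ` prefix and the accented Latin characters in the token strings are characteristic of byte-level BPE, where every UTF-8 byte is displayed as a printable stand-in character. The sketch below rebuilds the GPT-2 style byte-to-character table and uses it to turn token strings back into readable text. That this tokenizer uses exactly this table is an assumption based on the `Ġ` marker, and `text_to_byte_level` and `tokens_to_text` are hypothetical helper names.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Map each byte value 0-255 to a printable character (GPT-2 convention)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (space, control bytes, ...) are moved to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def text_to_byte_level(text: str) -> str:
    """Show text the way a byte-level vocabulary would display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def tokens_to_text(tokens: List[str]) -> str:
    """Join byte-level token strings and decode them back to UTF-8 text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")
```

Under this table the space byte maps to `Ġ`, so `tokens_to_text(['Hello', ',', 'ĠI', 'Ġam'])` returns `'Hello, I am'`. The helper expects token strings in their raw byte-level form; any character outside the table raises a `KeyError`.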