vishesh-t27 committed on
Commit
ac5d535
·
verified ·
1 Parent(s): 0e55d2a

updated Readme.md

Files changed (1)
  1. README.md +210 -1
README.md CHANGED
@@ -1,3 +1,212 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - hi
6
+ - mr
7
+ - ta
8
+ - te
9
+ - kn
10
+ - ml
11
+ - bn
12
+ - pa
13
+ - gu
14
+ - or
15
+ pipeline_tag: text-generation
16
+ library_name: transformers
17
  ---
18
+
19
+ # Nandi-Mini-500M-Early-Checkpoint
20
+
21
+ ## Introduction
22
+
23
+ Nandi-Mini-500M-Early-Checkpoint is an early-stage checkpoint from the upcoming **Nandi-Mini-500M** model family, a set of compact multilingual language models built for efficiency, deployment flexibility, and Indic language support.
24
+
25
+ The model is being trained completely from scratch and is designed to deliver strong performance at low compute and memory budgets. This checkpoint is shared to provide an early look into the model’s scaling behavior and training progress.
26
+
27
+ Unlike many small-scale models optimized primarily for benchmark performance, Nandi-Mini is being built with practical downstream usability in mind — including fine-tuning, edge deployment, and enterprise inference workloads.
28
+
29
+ The broader Nandi family focuses on:
30
+
31
+ - Efficient multilingual modeling across English and Indic languages
32
+ - High performance per parameter
33
+ - Edge and on-prem deployment readiness
34
+ - Low-latency inference
35
+ - Strong tokenizer efficiency for Indic scripts
36
+
37
+ This release is an **early checkpoint** and not the final converged model. Performance is expected to improve further with continued training and scaling.
38
+
39
+ 📢 We will soon share detailed technical blogs covering:
40
+
41
+ - Architecture design choices
42
+ - Training setup and scaling insights
43
+ - Tokenization strategy
44
+ - Dataset composition
45
+ - Benchmark evaluations
46
+ - Deployment optimizations
47
+
48
+ Stay tuned!
49
+
50
+ ---
51
+
52
+ ## Model Overview
53
+
54
+ **Repository:** `FrontiersMind/Nandi-mini-500M-Early-Checkpoint`
55
+
56
+ ### Model Details
57
+
58
+ - Type: Causal Language Model
59
+ - Training Stage: Early Pretraining Checkpoint
60
+ - Parameters: ~500M
61
+ - Architecture: Transformer decoder
62
+ - Positional Encoding: RoPE
63
+ - Normalization: RMSNorm + QK Norm
64
+ - Activation: SwiGLU
65
+ - Attention: GQA + Shared KV
66
+ - Embeddings: Tied embeddings with factorized design
67
+ - Context Length: 2,048 tokens
68
+ - Vocabulary Size: 131,072
69
+
70
+
71
+ ### Architectural Highlights
72
+
73
+ Nandi-Mini-500M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.
74
+
75
+ #### Shared KV (Shared Key-Value Vectors)
76
+
77
+ One of the core ideas explored in Nandi-Mini is **Shared KV**, an attention mechanism in which the Key and Value representations partially share a learned vector space rather than being fully independent.
78
+
79
+ This approach is designed to:
80
+
81
+ - Reduce memory overhead during inference
82
+ - Improve parameter efficiency
83
+ - Lower KV-cache footprint for long-context generation
84
+ - Enable faster deployment on resource-constrained hardware
85
+ - Maintain strong quality despite smaller compute budgets
86
+
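To make the KV-cache point concrete, here is a rough sketch of how cache size scales with the number of key/value heads. The layer count, head counts, and head dimension are hypothetical placeholders, since this card does not publish them; only the 2,048-token context length and the bfloat16 dtype come from this README. The `tensors_per_layer=1` case models an idealized, fully shared K/V cache as a lower bound. Shared KV is described here only as partial sharing, so its real saving is not known.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, tensors_per_layer=2):
    """Approximate KV-cache size in bytes for a decoder-only transformer.

    tensors_per_layer=2 stores separate K and V tensors (the usual case);
    tensors_per_layer=1 is an idealized assumption of one fully shared tensor.
    bytes_per_value=2 corresponds to bfloat16.
    """
    return (tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical shape, NOT the published Nandi-Mini configuration.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 24, 16, 4, 64
CONTEXT = 2048  # context length listed under Model Details

# Multi-head attention: one K/V head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)
# Grouped-query attention: fewer K/V heads than query heads.
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Idealized lower bound: GQA with a single tensor serving as both K and V.
shared = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, tensors_per_layer=1)

for name, size in [("MHA", mha), ("GQA", gqa), ("GQA + fully shared K/V", shared)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

With these placeholder numbers, grouping query heads cuts the cache by the ratio of query heads to K/V heads, and any sharing between K and V reduces it further.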
87
+ Shared KV is part of our broader effort toward building deployable foundation models optimized for:
88
+
89
+ - Edge devices
90
+ - On-premise AI systems
91
+ - Low-latency enterprise inference
92
+ - Efficient multilingual serving
93
+
94
+ This is still an active research and optimization area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
95
+
96
+ ---
97
+
98
+ ## 🌍 Supported Languages
99
+
100
+ The model is trained on English and multiple Indic languages, including:
101
+
102
+ - Hindi
103
+ - Bengali
104
+ - Tamil
105
+ - Telugu
106
+ - Marathi
107
+ - Gujarati
108
+ - Kannada
109
+ - Malayalam
110
+ - Punjabi
111
+ - Odia
112
+
113
+ ---
114
+
115
+ # 📊 Benchmark Results
116
+
117
+ ## General Benchmarks
118
+
119
+ | Model | Training Budget (trillion tokens) | HellaSwag | WinoGrande | OBQA | PIQA | GPQA | ARC-e | ARC-c | MMLU | Average |
120
+ |---|---|---|---|---|---|---|---|---|---|---|
121
+ | MobiLlama-0.5B-Base | 1.3 | 39.65 | 53.67 | 30.60 | 70.35 | 24.33 | 52.82 | 23.63 | 24.18 | 39.90 |
122
+ | Qwen-2-0.5B-Base | 12 | 49.01 | 57.69 | 33.20 | 68.98 | 27.23 | 54.79 | 25.42 | 44.06 | 45.05 |
123
+ | Qwen2.5-0.5B-Base | 18 | 52.16 | 56.82 | 35.40 | 70.29 | 24.10 | 64.64 | 29.86 | 47.41 | 47.59 |
124
+ | Qwen3-0.6B-Base | 36 | 53.77 | 59.19 | 34.40 | 70.29 | 30.80 | 65.44 | 33.78 | 50.34 | 49.75 |
125
+ | Qwen3.5-0.8B-Base | 36 | 54.87 | 60.54 | 35.80 | 70.02 | 31.25 | 70.50 | 38.23 | 52.73 | 51.74 |
126
+ | SmolLM-360M-Base | 0.6 | 53.33 | 57.22 | 37.60 | 70.56 | 21.20 | 70.24 | 33.27 | 24.92 | 46.04 |
127
+ | SmolLM2-360M-Base | 4 | 56.30 | 59.19 | 37.60 | 71.81 | 25.22 | 67.88 | 36.68 | 25.55 | 47.53 |
128
+ | **Nandi-Mini-500M-Early-Checkpoint** | **0.5** | **44.86** | **54.77** | **34.80** | **68.60** | **26.33** | **64.73** | **29.70** | **29.01** | **44.10** |
129
+
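The Average column appears to be the unweighted mean of the eight task scores, rounded to two decimals. The sketch below recomputes it from the table values as a sanity check. Half-up rounding is an assumption about how the averages were rounded, and `Decimal` is used so that values such as 47.585 round the same way every time.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores and reported averages, copied from the table above.
ROWS = {
    "MobiLlama-0.5B-Base": ("39.65 53.67 30.60 70.35 24.33 52.82 23.63 24.18", "39.90"),
    "Qwen-2-0.5B-Base": ("49.01 57.69 33.20 68.98 27.23 54.79 25.42 44.06", "45.05"),
    "Qwen2.5-0.5B-Base": ("52.16 56.82 35.40 70.29 24.10 64.64 29.86 47.41", "47.59"),
    "Qwen3-0.6B-Base": ("53.77 59.19 34.40 70.29 30.80 65.44 33.78 50.34", "49.75"),
    "Qwen3.5-0.8B-Base": ("54.87 60.54 35.80 70.02 31.25 70.50 38.23 52.73", "51.74"),
    "SmolLM-360M-Base": ("53.33 57.22 37.60 70.56 21.20 70.24 33.27 24.92", "46.04"),
    "SmolLM2-360M-Base": ("56.30 59.19 37.60 71.81 25.22 67.88 36.68 25.55", "47.53"),
    "Nandi-Mini-500M-Early-Checkpoint": ("44.86 54.77 34.80 68.60 26.33 64.73 29.70 29.01", "44.10"),
}


def mean_half_up(scores: str) -> Decimal:
    """Unweighted mean of space-separated scores, rounded half-up to 2 decimals."""
    values = [Decimal(s) for s in scores.split()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Collect any rows whose recomputed mean differs from the reported average.
mismatches = {
    name: (mean_half_up(scores), Decimal(reported))
    for name, (scores, reported) in ROWS.items()
    if mean_half_up(scores) != Decimal(reported)
}
print("mismatches:", mismatches)
```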
130
+
131
+ ---
132
+
133
+ ## Tokenization Fertility Score Across Languages
134
+
135
+ | Language | SmolLM3-3B | Qwen3-0.6B-Base | Sarvam-1 | Nandi-Mini-500M |
136
+ |-----------|------------|-----------------|----------|------------------|
137
+ | English | 1.17 | **1.16** | 1.32 | 1.18 |
138
+ | Bengali | 8.66 | 7.51 | 1.55 | **1.44** |
139
+ | Gujarati | 10.47 | 9.37 | 1.55 | **1.53** |
140
+ | Hindi | 2.71 | 5.14 | **1.25** | 1.32 |
141
+ | Kannada | 16.43 | 12.96 | 2.10 | **1.90** |
142
+ | Malayalam | 17.77 | 14.56 | 2.49 | **2.05** |
143
+ | Marathi | 3.73 | 6.70 | **1.55** | **1.55** |
144
+ | Odia | 19.07 | 15.75 | **2.18** | 2.68 |
145
+ | Punjabi | 9.23 | 8.66 | 1.47 | **1.42** |
146
+ | Tamil | 13.56 | 10.93 | 2.06 | **2.05** |
147
+ | Telugu | 15.40 | 13.38 | 2.09 | **1.77** |
148
+ | Assamese | 9.26 | 8.13 | 4.31 | **1.51** |
149
+
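Fertility is commonly defined as the average number of tokens produced per word. The sketch below shows one way such a score could be computed. It assumes whitespace-separated words and uses a toy tokenizer (`toy_tokenize`, a hypothetical stand-in) so that it runs without downloading a model; the exact corpus and word-splitting method behind the table above are not stated in this README.

```python
def fertility(text, tokenize):
    """Average tokens per whitespace-separated word for the given tokenizer."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


def toy_tokenize(text, piece_len=3):
    """Hypothetical stand-in tokenizer: cut each word into 3-character pieces."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


sample = "tokenizers split words"
print(f"fertility = {fertility(sample, toy_tokenize):.2f}")
```

With a Hugging Face tokenizer loaded as in the Usage section, passing `tokenizer.tokenize` as the second argument should give the corresponding score for real text.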
150
+ ### Why Fertility Matters
151
+
152
+ Lower fertility scores indicate more efficient tokenization, meaning fewer tokens are needed to represent text in a language.
153
+
154
+ This leads to:
155
+
156
+ - Better context utilization
157
+ - Lower inference cost
158
+ - Reduced latency
159
+ - Improved multilingual efficiency
160
+
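As a rough illustration of the context-utilization point, the sketch below estimates how many words fit into the 2,048-token context window at a given fertility. The fertility values are taken from the table above. Treating fertility as tokens per word and applying it uniformly across a whole document are simplifying assumptions.

```python
CONTEXT_TOKENS = 2048  # context length listed under Model Details

# Fertility scores copied from the table above.
FERTILITY = {
    "Hindi": {"Qwen3-0.6B-Base": 5.14, "Nandi-Mini-500M": 1.32},
    "Bengali": {"Qwen3-0.6B-Base": 7.51, "Nandi-Mini-500M": 1.44},
    "Tamil": {"Qwen3-0.6B-Base": 10.93, "Nandi-Mini-500M": 2.05},
}


def words_per_context(fertility, context_tokens=CONTEXT_TOKENS):
    """Approximate number of whole words that fit in the context window."""
    return int(context_tokens / fertility)


for language, scores in FERTILITY.items():
    for tokenizer_name, score in scores.items():
        print(f"{language:8s} {tokenizer_name:16s} ~{words_per_context(score)} words")
```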
161
+ Nandi-Mini’s tokenizer is heavily optimized for Indic languages and demonstrates strong compression efficiency across several scripts.
162
+
163
+ ---
164
+
165
+ # 🚀 Usage
166
+
167
+ ```python
168
+ # Install dependencies first: pip install transformers torch accelerate
169
+
170
+ from transformers import AutoModelForCausalLM, AutoTokenizer
171
+ import torch
172
+
173
+ model_name = "FrontiersMind/Nandi-mini-500M-Early-Checkpoint"
174
+
175
+ tokenizer = AutoTokenizer.from_pretrained(
176
+     model_name,
177
+     trust_remote_code=True
178
+ )
179
+
180
+ model = AutoModelForCausalLM.from_pretrained(
181
+     model_name,
182
+     trust_remote_code=True,
183
+     device_map="auto",
184
+     torch_dtype=torch.bfloat16
185
+ ).eval()
186
+
187
+ prompt = """
188
+ The night was quiet and the streets were empty.
189
+ A single light flickered in the distance.
190
+ Someone was walking slowly, carrying a small bag. Suddenly,
191
+ """.strip()  # drop the leading/trailing newlines so generation continues the last sentence
192
+
193
+ model_inputs = tokenizer(
194
+     [prompt],
195
+     return_tensors="pt"
196
+ ).to(model.device)
197
+
198
+ outputs = model.generate(
199
+     **model_inputs,
200
+     max_new_tokens=64,
201
+     do_sample=True,
202
+     temperature=0.7,
203
+     top_p=0.95,
204
+     repetition_penalty=1.1
205
+ )
206
+
207
+ response = tokenizer.decode(
208
+     outputs[0],
209
+     skip_special_tokens=True
210
+ )
211
+
212
+ print(response)