---
license: apache-2.0
language:
- en
- hi
- mr
- ta
- te
- kn
- ml
- bn
- pa
- gu
- or
pipeline_tag: text-generation
library_name: transformers
---
# Nandi-Mini-600M-Early-Checkpoint
## Introduction
Nandi-Mini-600M-Early-Checkpoint is an early-stage checkpoint (after **250 billion tokens**) from the upcoming **Nandi-Mini-600M** model family, a set of compact multilingual language models focused on efficiency, deployment flexibility, and Indic language support. *This is not the final model.*
The model is being trained completely from scratch and is designed to deliver strong performance at low compute and memory budgets. This checkpoint is shared to provide an early look into the model’s scaling behavior and training progress.
This release is an **early checkpoint** and not the final converged model. Performance is expected to improve further with continued training and scaling.
📢 We will share a technical blog post soon. Stay tuned!
---
### Architectural Highlights
Nandi-Mini-600M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.
#### Shared KV (Shared Key-Value Vectors)
Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of computing separate Key and Value projections, both reuse a shared latent representation, while a lightweight Key normalization step is applied specifically for attention computation.
This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since RoPE and Key normalization are applied dynamically during attention computation.
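
The idea can be illustrated with a toy single-head sketch in plain Python. It is not the Nandi implementation: the function names, the unparameterized RMS normalization, and the interleaved RoPE layout are all assumptions made for illustration. The point it shows is that only one latent vector per token is cached; keys are derived from it at attention time (normalize, then RoPE), and the same latent is used directly as the value.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def attend_shared(query, pos, latent_cache):
    """Single-head attention where one cached latent per token serves as K and V."""
    q = rope(rms_norm(query), pos)
    dim = len(q)
    # Keys are derived on the fly: normalize the cached latent, then apply RoPE.
    scores = [
        sum(a * b for a, b in zip(q, rope(rms_norm(latent), t))) / math.sqrt(dim)
        for t, latent in enumerate(latent_cache)
    ]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Values are the cached latents themselves, so no second tensor is stored.
    return [sum(w * latent[d] for w, latent in zip(weights, latent_cache))
            for d in range(dim)]

# Hypothetical 4-dimensional latents for three cached positions (0, 1, 2).
latent_cache = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.9, -0.3, 0.1], [0.0, -1.0, 0.4, 0.7]]
output = attend_shared([0.3, -0.2, 0.8, 0.1], pos=2, latent_cache=latent_cache)
print(output)
```

Recomputing the key from the latent on every attention call is where the extra compute comes from; the saving is that the cache holds one tensor per token instead of two.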
Nandi supports two KV cache modes:
```json
"kv_cache_mode": "shared"
```
Uses Shared KV, reducing KV-cache memory by ~50% with slightly higher compute overhead.
```json
"kv_cache_mode": "vanilla"
```
Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
### KV-Cache Memory Comparison
<p align="center">
<img src="./shared_kv_cache_comparison_improved.png" width="650"/>
</p>
- Vanilla KV → Standard KV-cache memory usage
- Shared KV → ~50% lower KV-cache footprint
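
As a rough worked example of the comparison above, the following sketch estimates cache size for one sequence. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the published Nandi configuration; only the two context lengths come from this card. Shared mode is modeled simply as one cached tensor per layer instead of two.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, shared=False):
    """Estimate KV-cache size in bytes for a single sequence.

    Vanilla caching stores two tensors (K and V) per layer; the shared mode
    is modeled here as storing a single tensor per layer.
    """
    tensors_per_layer = 1 if shared else 2
    return (seq_len * num_layers * num_kv_heads * head_dim
            * bytes_per_value * tensors_per_layer)

# Hypothetical shape values, chosen only for illustration (bf16 = 2 bytes).
shape = dict(num_layers=24, num_kv_heads=4, head_dim=64)

for seq_len in (2_048, 32_000):
    vanilla = kv_cache_bytes(seq_len, **shape)
    shared = kv_cache_bytes(seq_len, shared=True, **shape)
    print(f"{seq_len:>6} tokens: vanilla {vanilla / 2**20:.1f} MiB, "
          f"shared {shared / 2**20:.1f} MiB")
```

Because the cache grows linearly with sequence length, the absolute saving is largest for long-context workloads.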
Shared KV is part of our broader focus on deployable foundation models optimized for:
- On-premise AI systems
- Memory-constrained deployments
- Edge devices
- Long-context inference workloads
This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
---
### Model Details
- Type: Causal Language Model
- Training Stage: Early Pretraining Checkpoint (**250 billion tokens**)
- Parameters: ~600M
- Architecture: Transformer decoder
- Positional Encoding: RoPE
- Normalization: RMSNorm + QK Norm
- Activation: SwiGLU
- Attention: GQA + Shared KV
- Embeddings: Tied embeddings with factorized design
- Context length: 2,048 tokens (planned to be extended to 32,000 tokens)
- Vocabulary Size: 131,072
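
To see why a factorized embedding matters with a 131,072-entry vocabulary, here is a small parameter-count sketch. It assumes "factorized" means a low-rank factorization (a vocab-by-rank table followed by a rank-by-hidden projection); this card does not specify the exact design, and the hidden size and rank below are hypothetical.

```python
def embedding_params(vocab_size, hidden_size, rank=None):
    """Count embedding parameters, optionally with a low-rank factorization."""
    if rank is None:
        # Full table: one hidden-size vector per vocabulary entry.
        return vocab_size * hidden_size
    # Factorized: small per-token table plus a shared up-projection.
    return vocab_size * rank + rank * hidden_size

VOCAB_SIZE = 131_072   # from the model details above
HIDDEN_SIZE = 1_024    # hypothetical, not the published value
RANK = 256             # hypothetical, not the published value

full = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, rank=RANK)
print(f"full: {full:,} parameters, factorized: {factorized:,} parameters")
```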
---
# 📊 Benchmark Results
This is an early checkpoint, not the final model, so these results are not final. Only about 20% of training is complete.
## General Benchmarks
| Model | Trained Tokens (trillions) | HellaSwag | WinoGrande | OBQA | PIQA | GPQA | ARC-e | ARC-c | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| MobiLlama-0.5B-Base | 1.3 | 39.65 | 53.67 | 30.60 | 70.35 | 24.33 | 52.82 | 23.63 | 24.18 | 39.90 |
| Qwen-2-0.5B-Base | 12 | 49.01 | 57.69 | 33.20 | 68.98 | 27.23 | 54.79 | 25.42 | 44.06 | 45.05 |
| Qwen2.5-0.5B-Base | 18 | 52.16 | 56.82 | 35.40 | 70.29 | 24.10 | 64.64 | 29.86 | 47.41 | 47.59 |
| Qwen3-0.6B-Base | 36 | 53.77 | 59.19 | 34.40 | 70.29 | 30.80 | 65.44 | 33.78 | 50.34 | 49.75 |
| Qwen3.5-0.8B-Base | 36 | 54.87 | 60.54 | 35.80 | 70.02 | 31.25 | 70.50 | 38.23 | 52.73 | 51.74 |
| SmolLM-360M-Base | 0.6 | 53.33 | 57.22 | 37.60 | 70.56 | 21.20 | 70.24 | 33.27 | 24.92 | 46.04 |
| SmolLM2-360M-Base | 4 | 56.30 | 59.19 | 37.60 | 71.81 | 25.22 | 67.88 | 36.68 | 25.55 | 47.53 |
| **Nandi-Mini-600M-Early-Checkpoint-Base** | **0.2** | 44.86 | 54.77 | 34.80 | 68.60 | 26.33 | 64.73 | 29.70 | 29.01 | 44.10 |
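
The Average column is consistent with an unweighted mean of the eight benchmark scores. The short check below reproduces it for the Nandi row; the dictionary name is only illustrative.

```python
# Scores copied from the Nandi-Mini-600M-Early-Checkpoint-Base row above.
nandi_scores = {
    "HellaSwag": 44.86,
    "WinoGrande": 54.77,
    "OBQA": 34.80,
    "PIQA": 68.60,
    "GPQA": 26.33,
    "ARC-e": 64.73,
    "ARC-c": 29.70,
    "MMLU": 29.01,
}

# Unweighted mean across the eight benchmarks.
average = sum(nandi_scores.values()) / len(nandi_scores)
print(f"{average:.2f}")  # prints 44.10
```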
---
## Tokenization Fertility Score Across Languages
| Language | SmolLM3-3B | Qwen3-0.6B-Base | Sarvam-1 | Nandi-Mini-600M |
|-----------|------------|-----------------|----------|------------------|
| English   | 1.17       | **1.16**        | 1.32     | 1.18             |
| Bengali | 8.66 | 7.51 | 1.55 | **1.44** |
| Gujarati | 10.47 | 9.37 | 1.55 | **1.53** |
| Hindi | 2.71 | 5.14 | **1.25** | 1.32 |
| Kannada | 16.43 | 12.96 | 2.10 | **1.90** |
| Malayalam | 17.77 | 14.56 | 2.49 | **2.05** |
| Marathi   | 3.73       | 6.70            | **1.55** | **1.55**         |
| Odia      | 19.07      | 15.75           | **2.18** | 2.68             |
| Punjabi | 9.23 | 8.66 | 1.47 | **1.42** |
| Tamil | 13.56 | 10.93 | 2.06 | **2.05** |
| Telugu | 15.40 | 13.38 | 2.09 | **1.77** |
| Assamese | 9.26 | 8.13 | 4.31 | **1.51** |
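
Fertility is commonly defined as the average number of tokens produced per word, so lower values mean a more compact encoding. This card does not state the exact corpus or word-splitting rule used for the table, so the sketch below assumes whitespace-separated words and uses a stand-in tokenizer (`toy_tokenize`, hypothetical) so that it runs without downloading anything.

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace-separated word over a list of texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words

def toy_tokenize(text, piece_len=4):
    """Stand-in tokenizer: split each word into chunks of at most piece_len chars."""
    return [word[i:i + piece_len]
            for word in text.split()
            for i in range(0, len(word), piece_len)]

sample = ["the night was quiet", "streets were empty"]
print(f"{fertility(sample, toy_tokenize):.2f}")
```

With a real tokenizer loaded through `AutoTokenizer`, its `tokenize` method could be passed in place of `toy_tokenize`.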
---
## 🌍 Supported Languages
The model is trained on English and a diverse set of Indic languages, including:
Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia
# 🚀 Usage
```python
# Install the pinned Transformers release first
# (in a notebook cell: !pip install transformers==5.4.0)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FrontiersMind/Nandi-Mini-600M-Early-Checkpoint"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Use a GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16
).to(device).eval()

# KV cache mode: "shared" saves ~50% of KV-cache memory at the cost of
# slightly more compute; "vanilla" uses standard separate K/V caching.
# model.config.kv_cache_mode = "shared"
model.config.kv_cache_mode = "vanilla"

prompt = """The night was quiet and the streets were empty"""

model_inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **model_inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_k=20,
    top_p=0.95,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,
)

# Decode the first (and only) sequence, dropping special tokens.
response = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True
)
print(response)
```