Update README.md
README.md CHANGED
@@ -74,24 +74,27 @@ Nandi-Mini-500M introduces several efficiency-focused architectural optimization

#### Shared KV (Shared Key-Value Vectors)

-

- This

-
- - Improve parameter efficiency
- - Lower KV-cache footprint for long-context generation
- - Enable faster deployment on resource-constrained hardware
- - Maintain strong quality despite smaller compute budgets

-

- -
- - On-premise AI systems
- - Low-latency enterprise inference
- - Efficient multilingual serving

-

---

#### Shared KV (Shared Key-Value Vectors)

+ Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of storing separate Key and Value vectors, both share the same underlying representation, while a lightweight Key normalization step is applied specifically for attention computation.

+ This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since Key normalization and RoPE transformations must be applied dynamically during attention computation.
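
The two paragraphs above describe caching one shared vector per token and deriving the Key from it at attention time. The following is a minimal single-head sketch of that idea, not the Nandi implementation. It assumes RMS normalization as the "lightweight Key normalization", the common rotate-half form of RoPE, and that the Value is the cached vector itself. All function names (`rms_norm`, `rope`, `shared_kv_attention`) and the toy inputs are hypothetical.

```python
import math

def rms_norm(vec, eps=1e-6):
    # Assumption: the README does not say which normalization is used.
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]

def rope(vec, position, base=10000.0):
    # Rotate-half RoPE: pair element i with element i + half and rotate
    # each pair by a position-dependent angle.
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def shared_kv_attention(query, query_pos, shared_cache):
    # shared_cache holds ONE vector per past token instead of a (key, value) pair.
    dim = len(query)
    q = rope(query, query_pos)
    scores = []
    for pos, shared in enumerate(shared_cache):
        # The Key is recomputed on the fly; this is the extra compute cost.
        key = rope(rms_norm(shared), pos)
        scores.append(sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim))
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Assumption: the Value is the cached shared vector, used as stored.
    return [sum(w * shared[d] for w, shared in zip(weights, shared_cache))
            for d in range(dim)]

# Toy data: 3 cached tokens, head dimension 4.
shared_cache = [[0.1 * (t + 1) * (d + 1) for d in range(4)] for t in range(3)]
query = [1.0, 0.0, -1.0, 0.5]
output = shared_kv_attention(query, query_pos=2, shared_cache=shared_cache)
```

The trade-off is visible in the loop: each attention call repeats the normalization and rotation for every cached position, in exchange for storing half as many vectors.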

+ Nandi supports two KV cache modes:

+ ```json
+ "kv_cache_mode": "shared"
+ ```

+ Uses Shared KV, reducing KV-cache memory by ~50% with slightly higher compute overhead.

+ ```json
+ "kv_cache_mode": "vanilla"
+ ```
+
+ Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
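
To make the ~50% figure concrete, here is a rough back-of-the-envelope calculation for the two modes. It is a sketch under stated assumptions: `shared` is taken to store exactly one tensor per token per layer and `vanilla` two, and the helper `kv_cache_bytes` and the model dimensions are hypothetical, not Nandi-Mini's published configuration.

```python
def kv_cache_bytes(kv_cache_mode, num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_value=2):
    # Assumption: "shared" caches one tensor per token, "vanilla" caches
    # separate Key and Value tensors. bytes_per_value=2 models fp16/bf16.
    tensors_per_token = {"shared": 1, "vanilla": 2}
    if kv_cache_mode not in tensors_per_token:
        raise ValueError(f"unknown kv_cache_mode: {kv_cache_mode!r}")
    return (tensors_per_token[kv_cache_mode] * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)

# Hypothetical dimensions, chosen only for illustration.
dims = dict(num_layers=24, num_kv_heads=4, head_dim=64, seq_len=8192)

vanilla_mib = kv_cache_bytes("vanilla", **dims) / 2**20
shared_mib = kv_cache_bytes("shared", **dims) / 2**20

print(f"vanilla: {vanilla_mib:.0f} MiB")  # vanilla: 192 MiB
print(f"shared:  {shared_mib:.0f} MiB")   # shared:  96 MiB
```

Real savings may deviate from exactly half if the cache stores anything besides the Key and Value tensors.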
+
+ Shared KV is part of our broader focus on deployable foundation models optimized for edge devices, on-premise AI systems, low-latency enterprise inference, and efficient multilingual serving.
+
+ This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.

---