vishesh-t27 committed 8 days ago

Commit 29e0f98 (verified) · 1 parent: e8e8af0

Update configuration_nandi.py

Files changed (1)

configuration_nandi.py +3 -31

@@ -1,17 +1,3 @@
-# Copyright 2026 RTA AI Labs. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from transformers.configuration_utils import PretrainedConfig
@@ -97,26 +83,12 @@ class NandiConfig(PretrainedConfig):
         self.factorized_embedding = factorized_embedding
         self.embedding_rank = embedding_rank
         self.layer_sharing = layer_sharing
-        # Smoltron training loops over `layer_sharing_repeats` unconditionally
-        # (it does NOT check `layer_sharing`). Preserve the raw value here so
-        # the modeling code can honor it; the `layer_sharing` bool is now just
-        # metadata describing intent.
         self.layer_sharing_repeats = max(1, int(layer_sharing_repeats or 1))
         self.qk_norm = qk_norm
-        # `shared_kv` records that V was tied to K at pretraining time. In the
-        # HF model V is recomputed from `k_proj` at runtime (no `v_proj` module
-        # is materialised); see `NandiAttention.forward`.
         self.shared_kv = shared_kv
-        # `kv_cache_mode` controls the inference-time K/V cache strategy when
-        # `shared_kv=True`. Both modes produce identical outputs (numerical
-        # round-off only); they trade memory for compute:
-        #   "shared"  -> cache ONLY raw K (single tensor per layer). Each
-        #                decode step re-applies k_norm + RoPE to the full
-        #                cached raw K. Halves KV-cache memory.
-        #   "vanilla" -> cache post-norm post-RoPE K AND raw V (two tensors
-        #                per layer). k_norm + RoPE are applied only to the
-        #                current step's tokens. Standard HF behavior.
-        # Ignored when `shared_kv=False`. Defaults to "shared".
         if kv_cache_mode not in ("shared", "vanilla"):
             raise ValueError(
                 f"`kv_cache_mode` must be 'shared' or 'vanilla', got {kv_cache_mode!r}."
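The comments removed by this commit were the only in-file description of how `layer_sharing_repeats` is normalized and what the two `kv_cache_mode` values mean. The sketch below restates that behavior as standalone Python. The expression in `normalize_repeats` and the error message are copied from the diff; the helper names (`normalize_repeats`, `check_kv_cache_mode`, `cached_tensors_per_layer`) are hypothetical and do not appear in `configuration_nandi.py`. The tensor counts follow the removed comment, and the `shared_kv=False` case is assumed to behave like a standard two-tensor K/V cache.

```python
def normalize_repeats(layer_sharing_repeats):
    """Same expression as in the diff: None, 0, and negatives collapse to 1."""
    return max(1, int(layer_sharing_repeats or 1))


def check_kv_cache_mode(kv_cache_mode):
    """Same validation as in the diff: only 'shared' and 'vanilla' pass."""
    if kv_cache_mode not in ("shared", "vanilla"):
        raise ValueError(
            f"`kv_cache_mode` must be 'shared' or 'vanilla', got {kv_cache_mode!r}."
        )
    return kv_cache_mode


def cached_tensors_per_layer(shared_kv, kv_cache_mode):
    """Tensors cached per layer, per the removed comment (hypothetical helper).

    'shared'  -> only raw K is cached (1 tensor) when shared_kv is True.
    'vanilla' -> post-norm, post-RoPE K plus raw V are cached (2 tensors).
    Assumption: with shared_kv=False the mode is ignored and both K and V
    are cached, as in a standard cache.
    """
    check_kv_cache_mode(kv_cache_mode)
    if shared_kv and kv_cache_mode == "shared":
        return 1
    return 2
```

For example, `normalize_repeats(None)` and `normalize_repeats(0)` both return 1, and `cached_tensors_per_layer(True, "shared")` returns 1 against 2 for `"vanilla"`, which is the halving of KV-cache memory that the removed comment describes.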
