aksinghyaani committed · Commit 918168d · verified · 1 Parent(s): 880e01b

Update README.md

Files changed (1)
  1. README.md +16 -13
README.md CHANGED
@@ -74,24 +74,27 @@ Nandi-Mini-500M introduces several efficiency-focused architectural optimization
 
  #### Shared KV (Shared Key-Value Vectors)
 
- One of the core ideas explored in Nandi-Mini is **Shared KV**, an efficient attention mechanism where Key and Value representations partially share learned vector space representations across attention computation.
+ Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of storing separate Key and Value vectors, both share the same underlying representation, while a lightweight Key normalization step is applied specifically for attention computation.
 
- This approach is designed to:
+ This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since Key normalization and RoPE transformations must be applied dynamically during attention computation.
 
- - Reduce memory overhead during inference
- - Improve parameter efficiency
- - Lower KV-cache footprint for long-context generation
- - Enable faster deployment on resource-constrained hardware
- - Maintain strong quality despite smaller compute budgets
+ Nandi supports two KV cache modes:
 
- Shared KV is part of our broader effort toward building deployable foundation models optimized for:
+ ```json
+ "kv_cache_mode": "shared"
+ ```
 
- - Edge devices
- - On-premise AI systems
- - Low-latency enterprise inference
- - Efficient multilingual serving
+ Uses Shared KV, reducing KV-cache memory by ~50% with slightly higher compute overhead.
 
- This is still an active research and optimization area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
+ ```json
+ "kv_cache_mode": "vanilla"
+ ```
+
+ Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
+
+ Shared KV is part of our broader focus on deployable foundation models optimized for edge devices, on-premise AI systems, low-latency enterprise inference, and efficient multilingual serving.
+
+ This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
 
  ---
 
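
The ~50% figure in the updated README text follows from caching one tensor per token instead of two. The arithmetic can be sketched roughly as below. This is a minimal illustration, not the Nandi implementation: the helper `kv_cache_bytes`, the `CacheShape` class, and every dimension used (layer count, KV-head count, head size, 2-byte elements) are hypothetical placeholders and not Nandi-Mini-500M's actual configuration. Only the two mode names, `"shared"` and `"vanilla"`, come from the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes a 16-bit dtype such as fp16/bf16


def kv_cache_bytes(shape: CacheShape, seq_len: int, kv_cache_mode: str,
                   batch_size: int = 1) -> int:
    """Estimate KV-cache size in bytes for a given cache mode.

    "vanilla" caches separate Key and Value tensors (two per token).
    "shared" caches a single representation per token; the Key view is
    assumed to be derived from it at attention time, so it is not stored.
    """
    if kv_cache_mode == "vanilla":
        tensors_per_token = 2
    elif kv_cache_mode == "shared":
        tensors_per_token = 1
    else:
        raise ValueError(f"unknown kv_cache_mode: {kv_cache_mode!r}")

    elements_per_tensor = (batch_size * seq_len * shape.num_layers
                           * shape.num_kv_heads * shape.head_dim)
    return elements_per_tensor * shape.bytes_per_element * tensors_per_token


# Placeholder dimensions, not taken from the model card.
shape = CacheShape(num_layers=16, num_kv_heads=4, head_dim=64)
vanilla = kv_cache_bytes(shape, seq_len=4096, kv_cache_mode="vanilla")
shared = kv_cache_bytes(shape, seq_len=4096, kv_cache_mode="shared")

print(f"vanilla: {vanilla / 2**20:.1f} MiB")   # vanilla: 64.0 MiB
print(f"shared:  {shared / 2**20:.1f} MiB")    # shared:  32.0 MiB
print(f"saving:  {1 - shared / vanilla:.0%}")  # saving:  50%
```

The saving is exactly half under these assumptions because the estimate counts only the cached tensors. Any extra state a real implementation keeps would move the number slightly, which is consistent with the README's "~50%" wording.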