Commit 3920cee by aksinghyaani (verified) · 1 Parent(s): 22a013d

Update README.md

  1. README.md +17 -37
README.md CHANGED
@@ -72,45 +72,11 @@ Stay tuned!
72
 
73
  Nandi-Mini-500M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.
74
 
75
-
76
- ### Shared KV KV-Cache Memory Comparison
77
-
78
- The following comparison illustrates the KV-cache memory reduction enabled by Shared KV mode.
79
-
80
- ```python
81
- import matplotlib.pyplot as plt
82
-
83
- modes = ["Vanilla KV", "Shared KV"]
84
- memory = [100, 50]
85
-
86
- plt.figure(figsize=(5,4))
87
- bars = plt.bar(modes, memory)
88
-
89
- plt.ylabel("Relative KV Cache Memory")
90
- plt.title("KV Cache Memory Usage")
91
-
92
- for bar, val in zip(bars, memory):
93
- plt.text(
94
- bar.get_x() + bar.get_width()/2,
95
- val + 2,
96
- f"{val}%",
97
- ha='center'
98
- )
99
-
100
- plt.ylim(0, 120)
101
- plt.show()
102
- ```
103
-
104
- Expected result:
105
-
106
- - Vanilla KV → 100% KV-cache memory
107
- - Shared KV → ~50% KV-cache memory
108
-
109
- Shared KV trades a small increase in compute overhead for significantly lower memory usage, since RoPE and Key normalization are applied dynamically during attention computation.
110
 
111
  Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of storing separate Key and Value vectors, both share the same underlying representation, while a lightweight Key normalization step is applied specifically for attention computation.
112
 
113
- This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since Key normalization and RoPE transformations must be applied dynamically during attention computation.
114
 
115
  Nandi supports two KV cache modes:
116
 
@@ -126,7 +92,21 @@ Uses Shared KV, reducing KV-cache memory by ~50% with slightly higher compute ov
126
 
127
  Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
128
 
129
- Shared KV is part of our broader focus on deployable foundation models optimized for edge devices, on-premise AI systems, low-latency enterprise inference, and efficient multilingual serving.
130
 
131
  This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
132
 
 
72
 
73
  Nandi-Mini-500M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.
74
 
75
+ #### Shared KV (Shared Key-Value Vectors)
76
 
77
  Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of storing separate Key and Value vectors, both share the same underlying representation, while a lightweight Key normalization step is applied specifically for attention computation.
78
 
79
+ This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since RoPE and Key normalization are applied dynamically during attention computation.
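
To make the compute/memory trade-off described above concrete, here is a minimal single-head sketch in plain Python. It is not the Nandi implementation: the function names, the use of RMS normalization for the Key normalization step, the order of normalization and RoPE, and the interleaved RoPE pairing are all assumptions made for illustration. The point it shows is that the cache holds one vector per token, and the Key is rebuilt from that vector at attention time while the Value is read directly.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Assumed Key normalization: scale the vector to unit RMS."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def apply_rope(vec, position, base=10000.0):
    """Rotary position embedding on adjacent pairs (x0, x1), (x2, x3), ..."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def shared_kv_attention(query, query_position, shared_cache):
    """Attend over a cache that stores ONE shared vector per token.

    shared_cache[t] is the cached vector for token position t.
    Keys are derived on the fly (normalize, then RoPE); values are
    the cached vectors themselves. Hypothetical layout, single head.
    """
    q = apply_rope(query, query_position)
    scale = 1.0 / math.sqrt(len(query))

    scores = []
    for pos, shared in enumerate(shared_cache):
        # Extra compute paid at attention time instead of extra memory.
        key = apply_rope(rms_normalize(shared), pos)
        scores.append(scale * sum(a * b for a, b in zip(q, key)))

    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(shared_cache[0])
    output = [
        sum(w * shared[d] for w, shared in zip(weights, shared_cache))
        for d in range(dim)
    ]
    return output, weights


if __name__ == "__main__":
    cache = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5], [0.0, 3.0, -2.0, 1.0]]
    out, w = shared_kv_attention([0.1, 0.2, 0.3, 0.4], 2, cache)
    print("attention weights:", w)
    print("output:", out)
```

In a vanilla cache the rotated Key would be stored alongside the Value, so the per-step loop would skip the normalization and rotation at the cost of a second cached tensor.
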
80
 
81
  Nandi supports two KV cache modes:
82
 
 
92
 
93
  Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.
94
 
95
+ ### KV-Cache Memory Comparison
96
+
97
+ <p align="center">
98
+ <img src="./assets/shared_kv_cache_comparison_improved.png" width="650"/>
99
+ </p>
100
+
101
+ - Vanilla KV → Standard KV-cache memory usage
102
+ - Shared KV → ~50% lower KV-cache footprint
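
The ~50% figure follows from simple counting: a vanilla cache stores two tensors (Key and Value) per layer, while a Shared KV cache stores one. The sketch below works through that arithmetic. The configuration values (24 layers, 4 KV heads, head dimension 64, 4096 tokens, 2 bytes per element) are hypothetical placeholders chosen for illustration and are not the published Nandi-Mini-500M configuration.

```python
def kv_cache_bytes(
    num_layers,
    num_kv_heads,
    head_dim,
    seq_len,
    batch_size=1,
    bytes_per_element=2,  # e.g. fp16/bf16
    shared_kv=False,
):
    """Estimate KV-cache size in bytes.

    Vanilla KV stores separate Key and Value tensors per layer (2 tensors);
    Shared KV stores a single shared tensor per layer (1 tensor).
    """
    tensors_per_layer = 1 if shared_kv else 2
    return (
        tensors_per_layer
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_element
    )


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    config = dict(num_layers=24, num_kv_heads=4, head_dim=64, seq_len=4096)

    vanilla = kv_cache_bytes(**config, shared_kv=False)
    shared = kv_cache_bytes(**config, shared_kv=True)

    print(f"Vanilla KV: {vanilla / 2**20:.1f} MiB")
    print(f"Shared KV:  {shared / 2**20:.1f} MiB")
    print(f"Ratio:      {shared / vanilla:.2f}")
```

The ratio is independent of the configuration, which is why the saving is stated as a fixed percentage of the KV cache rather than of total model memory.
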
103
+
104
+ Shared KV is part of our broader focus on deployable foundation models optimized for:
105
+
106
+ - Edge devices
107
+ - On-premise AI systems
108
+ - Low-latency enterprise inference
109
+ - Efficient multilingual serving
110
 
111
  This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.
112