---
license: apache-2.0
tags:
- pytorch
- transformer
- mamba
- hybrid
- matryoshka
- nanochat
- adaptive-compute
pipeline_tag: text-generation
---

# Adamba: Adaptive Mamba

> **Ad**aptive **Mamba**: Elastic compute with dynamic Matryoshka scaling

**Project repository: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)**

## Available Checkpoints

| Variant | Parameters | Dim | Features | Status | Download |
|---------|------------|-----|----------|--------|----------|
| phase1_6b_base | 6.4B | 2048 | mamba_integration | ✅ | [Download](./checkpoints/phase1_6b_base.pt) |
| phase2_6b_matryoshka | 6.4B | 2048 | matryoshka, early_exit | ⏳ | — |
| phase3_9b_matryoshka | 9.3B | 2560 | matryoshka, early_exit | ⏳ | — |
| phase3_20b_matryoshka | 20B | 4096 | matryoshka, early_exit | ⏳ | — |
| sft_20b | 20B | 4096 | matryoshka, early_exit, sft | ⏳ | — |
| rl_20b | 20B | 4096 | matryoshka, early_exit, rl_agent | ⏳ | — |

## Architecture Overview

Adamba combines three efficiency techniques:

| Technique | Implementation | Purpose |
|-----------|----------------|---------|
| **Matryoshka (MRL)** | Width: 128 → 4096 per layer | Elastic compute |
| **Early Exit** | ConfidenceGate per layer | Skip when confident |
| **Static SSM** | Mamba at full dim | Stable memory backbone |

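The Matryoshka idea can be illustrated with a minimal NumPy sketch: one full-width weight matrix is stored, and smaller nested models are just its leading rows and columns. All names and shapes here are illustrative assumptions, not the repository's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full-width projection weight (out x in); 2048 matches the 6.4B variants.
FULL_DIM = 2048
W = rng.standard_normal((FULL_DIM, FULL_DIM)).astype(np.float32)

def matryoshka_forward(x, W, dim):
    """Apply the projection at a reduced Matryoshka width.

    The leading `dim` rows and columns of the full weight form the
    nested sub-model, so every width shares one set of parameters.
    """
    return W[:dim, :dim] @ x[:dim]

x = rng.standard_normal(FULL_DIM).astype(np.float32)
y_small = matryoshka_forward(x, W, 512)      # cheap 512-wide slice
y_full = matryoshka_forward(x, W, FULL_DIM)  # full-width compute
print(y_small.shape, y_full.shape)  # (512,) (2048,)
```

Because every width is a prefix slice of the same weight, switching widths per layer costs nothing in extra parameters, which is what makes the per-layer dimension prediction below cheap to exploit.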
```
┌─────────────────────────────────────────────────┐
│ PROMPT → LayerDimPredictor → [dim per layer]    │
│                                                 │
│ Attention + MLP: Dynamic (Matryoshka sliced)    │
│ Mamba: Static (full dim)                        │
│                                                 │
│ Gate > 0.95 → EXIT EARLY                        │
│ Gate < 0.50 → EXPAND remaining layers           │
└─────────────────────────────────────────────────┘
```

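The two gate thresholds in the diagram can be expressed as a tiny routing function. This is a sketch only: the function name and the behaviour between the two thresholds (plain continuation at the current width) are assumptions for illustration.

```python
def route(confidence: float, exit_thresh: float = 0.95,
          expand_thresh: float = 0.50) -> str:
    """Per-layer decision from the confidence gate's score.

    Thresholds follow the diagram above; the middle band simply
    continues at the current width (an assumed default).
    """
    if confidence > exit_thresh:
        return "exit_early"   # confident enough to stop here
    if confidence < expand_thresh:
        return "expand"       # widen the remaining layers
    return "continue"         # run the next layer at current width

print(route(0.97), route(0.30), route(0.70))  # exit_early expand continue
```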
## Training Pipeline

```
nanochat-d32 (1.9B)
    ↓ Surgery (add 32 Mamba layers)
Phase 1: 6.4B (dim=2048) → Mamba integration
    ↓ Enable Matryoshka
Phase 2: 6.4B (dim=2048) → Full training
    ↓ Progressive expand
Phase 3: 9.3B → 20B (dim=4096)
    ↓ Fine-tuning
SFT: Instruction tuning
RL: Agent capabilities
```

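The "progressive expand" step (dim 2048 → 4096) can be pictured as embedding the trained weight in a larger zero-initialised one, so on the old subspace the grown layer computes the same function it did before expansion. This is a sketch of the general idea, not the repository's actual surgery code:

```python
import numpy as np

SMALL, LARGE = 2048, 4096
rng = np.random.default_rng(1)
W_small = rng.standard_normal((SMALL, SMALL)).astype(np.float32)

# Place the trained weight in the top-left corner; the new rows and
# columns start at zero, so the expanded layer initially ignores them.
W_large = np.zeros((LARGE, LARGE), dtype=np.float32)
W_large[:SMALL, :SMALL] = W_small

x = rng.standard_normal(SMALL).astype(np.float32)
x_pad = np.concatenate([x, np.zeros(LARGE - SMALL, dtype=np.float32)])

# On the old subspace, the small and expanded layers agree exactly.
assert np.allclose(W_small @ x, (W_large @ x_pad)[:SMALL])
```

In practice the new entries would be initialised with small noise rather than exact zeros to break symmetry, but the zero case makes the function-preservation property easy to verify.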
## Model Details

- **Base**: [karpathy/nanochat-d32](https://huggingface.co/karpathy/nanochat-d32)
- **Architecture**: 64 blocks (32 Attention + 32 Mamba, interleaved)
- **Vocabulary**: 65,536 tokens
- **Matryoshka Dims**: [128, 256, 512, 1024, 2048, 4096]

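The 64-block interleaved layout can be sketched as a simple alternating list, with attention blocks sliced to a dynamic Matryoshka width and Mamba blocks pinned to the full dimension. The dictionary fields are illustrative, not the repository's actual data structures.

```python
MATRYOSHKA_DIMS = [128, 256, 512, 1024, 2048, 4096]

# Interleave: attention blocks are dynamically sliced to a Matryoshka
# width, while Mamba blocks always run at the model's full dimension.
blocks = []
for i in range(32):
    blocks.append({"kind": "attention", "index": i, "width": "dynamic"})
    blocks.append({"kind": "mamba", "index": i, "width": "full"})

print(len(blocks))  # 64
```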
## Usage

```python
# Coming soon - inference code
# See: https://github.com/unixsysdev/adamba
```

## Links

- **GitHub**: [unixsysdev/adamba](https://github.com/unixsysdev/adamba)
- **Training**: [WandB](https://wandb.ai/dalletest123/nano-fractal)

## License

Apache 2.0