nyxia/Auron-279M

@@ -13,55 +13,30 @@ thumbnail: auron_banner.png
 ![Auron](auron_banner.png)
-# Auron-279M
-**Auron** — Chimera hybrid GDN-Attention language models with Ouroboros weight sharing.
-**Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron)
-**Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
-## Architecture
-- **Type:** Chimera (ChimeraConfig)
-- **Dim:** 1024
-- **Layers:** 16 virtual
-- **Params:** 278,664,160 (279M)
-- **Vocab:** 151936 (Qwen 3 tokenizer)
-- **Context:** 2048 tokens
-- **Topology:** 4 unique bottom + 4×3 shared top
-- **GDN:Attn ratio:** 3:1 (every 4th layer is attention)
-- **Virtual equivalent:** ~557,328,320 params
-## Training Curves
-![Training Curves](training_curves.png)
-## Training
-- **Step:** 250,000
-- **Data:** Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
-- **Optimizer:** Muon + AdamW (decoupled embedding LR)
-- **Schedule:** WSD (Warmup-Stable-Decay)
-## Usage
-```bash
-git clone https://github.com/Fy-/Auron && cd Auron && rye sync
-```
 ```python
 from ouro import load_model, generate
-model, tokenizer, device = load_model("nyxia/Auron-279M")
 generate(model, tokenizer, device, "The history of")
 ```
-## Sampling
-Default: T=0.7, top_k=20, top_p=0.9, rep_pen=1.0, presence_pen=1.5 (Ouroboros weight sharing requires presence penalty >= 1.5 to prevent attractor wells).
-## Links
-- **Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron/blob/master/Auron_chimera_topology_paper.pdf)
-- **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
-- **Models:** [huggingface.co/nyxia](https://huggingface.co/nyxia)
-Built by [Florian Gasquez](https://fyx.jp) ([@nyxia](https://huggingface.co/nyxia)). Part of the [Soulkyn](https://soulkyn.com) project.
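
The Sampling defaults removed above (T=0.7, top_k=20, top_p=0.9, presence_pen=1.5, with rep_pen=1.0 meaning no repetition penalty) are listed without showing how they combine. Below is a minimal sketch in plain Python. It assumes an additive presence penalty, subtracted once from the logit of every token that has already been generated, and the common ordering of penalty, temperature, top-k, then top-p. The names `filter_logits` and `sample` are hypothetical and not part of the `ouro` package, whose actual sampler may differ.

```python
import math
import random


def filter_logits(logits, seen, temperature=0.7, top_k=20, top_p=0.9, presence_pen=1.5):
    """Return {token_id: probability} after penalty, temperature, top-k and top-p.

    Assumption: the presence penalty is additive and applied once per seen token.
    """
    # Presence penalty: lower the logit of every token id that already appeared.
    adjusted = [x - presence_pen if i in seen else x for i, x in enumerate(logits)]
    # Temperature scaling (T < 1 sharpens the distribution).
    scaled = [x / temperature for x in adjusted]
    # Top-k: keep the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample(logits, seen, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_logits(logits, seen, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With `logits=[2.0, 1.0, 0.0, -1.0]` and `seen={0}`, the penalty drops token 0 below token 1, so a token that would otherwise dominate is no longer the most likely choice. This is the kind of repetition pressure the card attributes to the presence penalty.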

 ![Auron](auron_banner.png)
+# Auron-279M (Archived)
+> **Note:** This model is part of a scaling study. The 279M model reached a final val_loss of **3.188**, within 0.008 of the 4x larger 1.1B model (3.180), which points to a scaling wall in the Ouroboros weight-sharing mechanism. The **510M** model is the best-performing Chimera variant.
+>
+> **For inference and testing, use [Auron-510M](https://huggingface.co/nyxia/Auron-510M) (val_loss 3.035).**
+| Model | Params | Final Val Loss | Status |
+|-------|--------|---------------|--------|
+| Auron-279M | 279M | 3.188 | Archived |
+| **[Auron-510M](https://huggingface.co/nyxia/Auron-510M)** | **510M** | **3.035** | **Best** |
+| Auron-1.1B | 1.1B | 3.180 | Archived |
+**Paper:** [Auron](https://github.com/Fy-/Auron) | **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron) | **Blog:** [HuggingFace](https://huggingface.co/blog/nyxia/auron)
+## Architecture
+- **Type:** Chimera (4 bottom + 4×3 top = 16 virtual)
+- **Dim:** 1024, head_dim=64, expand_v=2
+- **Params:** 279M (123M unique + 155M embed)
+- **Trained:** 250K steps, 5B tokens, WSD schedule
 ```python
 from ouro import load_model, generate
+model, tokenizer, device = load_model("nyxia/Auron-510M")  # load the 510M checkpoint; the 279M is archived
 generate(model, tokenizer, device, "The history of")
 ```
+Built by [Florian Gasquez](https://fyx.jp) ([@nyxia](https://huggingface.co/nyxia)). Part of [Soulkyn](https://soulkyn.com).
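
The architecture figures in this diff (4 bottom + 4×3 top = 16 virtual layers, a 3:1 GDN:Attn ratio, 123M unique + 155M embed) can be checked with a short sketch. It assumes the shared top group runs in order and is repeated three times, and that "every 4th layer" means virtual layers 4, 8, 12 and 16 (1-based). Neither detail is stated in the card, and `virtual_layers` is a hypothetical helper, not part of the Auron code.

```python
VOCAB = 151_936             # Qwen 3 tokenizer vocabulary, from the card
DIM = 1024                  # model width, from the card
TOTAL_PARAMS = 278_664_160  # reported parameter count
N_BOTTOM, N_SHARED, N_LOOPS = 4, 4, 3


def virtual_layers():
    """List (virtual_index, physical_block, kind) for the 16-layer stack."""
    # Bottom blocks are unique: one physical block per virtual layer.
    physical = list(range(N_BOTTOM))
    # Assumption: the shared top group runs in order and is repeated N_LOOPS times.
    physical += [N_BOTTOM + j for _ in range(N_LOOPS) for j in range(N_SHARED)]
    # Assumption: attention sits at 1-based virtual layers 4, 8, 12, 16.
    return [(i, block, "attn" if (i + 1) % 4 == 0 else "gdn")
            for i, block in enumerate(physical)]


layers = virtual_layers()
embed_params = VOCAB * DIM                   # embedding table, counted once
block_params = TOTAL_PARAMS - embed_params   # everything outside the embedding

print(len(layers), "virtual layers over",
      len({block for _, block, _ in layers}), "physical blocks")
print(f"embedding: {embed_params:,}  blocks: {block_params:,}")
```

Under these assumptions each physical block keeps a single layer type across loops, because the shared group size equals the attention period. The embedding table accounts for a little over half of the reported parameters, which matches the "123M unique + 155M embed" split.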