nyxia committed · verified
Commit 481c01e · 1 Parent(s): 074fbc0

Archive: scaling wall, redirect to 510M

Files changed (1): README.md (+17 -42)
README.md CHANGED
@@ -13,55 +13,30 @@ thumbnail: auron_banner.png
 
 ![Auron](auron_banner.png)
 
- # Auron-279M
-
- **Auron** — Chimera hybrid GDN-Attention language models with Ouroboros weight sharing.
-
- **Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron)
- **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
-
- ## Architecture
- - **Type:** Chimera (ChimeraConfig)
- - **Dim:** 1024
- - **Layers:** 16 virtual
- - **Params:** 278,664,160 (279M)
- - **Vocab:** 151936 (Qwen 3 tokenizer)
- - **Context:** 2048 tokens
- - **Topology:** 4 unique bottom + 4×3 shared top
- - **GDN:Attn ratio:** 3:1 (every 4th layer is attention)
- - **Virtual equivalent:** ~557,328,320 params
-
- ## Training Curves
-
- ![Training Curves](training_curves.png)
-
- ## Training
- - **Step:** 250,000
- - **Data:** Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
- - **Optimizer:** Muon + AdamW (decoupled embedding LR)
- - **Schedule:** WSD (Warmup-Stable-Decay)
-
- ## Usage
-
- ```bash
- git clone https://github.com/Fy-/Auron && cd Auron && rye sync
- ```
+ # Auron-279M (Archived)
+
+ > **Note:** This model is part of a scaling study. The 279M model reached a final val_loss of **3.188**, virtually identical to the 4× larger 1.1B model (3.180), revealing a scaling wall in the Ouroboros weight-sharing mechanism. The **510M** model is the best-performing Chimera variant.
+ >
+ > **For inference and testing, use [Auron-510M](https://huggingface.co/nyxia/Auron-510M) (val_loss 3.035).**
+
+ | Model | Params | Final Val Loss | Status |
+ |-------|--------|----------------|--------|
+ | Auron-279M | 279M | 3.188 | Archived |
+ | **[Auron-510M](https://huggingface.co/nyxia/Auron-510M)** | **510M** | **3.035** | **Best** |
+ | Auron-1.1B | 1.1B | 3.180 | Archived |
+
+ **Paper:** [Auron](https://github.com/Fy-/Auron) | **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron) | **Blog:** [HuggingFace](https://huggingface.co/blog/nyxia/auron)
+
+ ## Architecture
+ - **Type:** Chimera (4 bottom + 4×3 top = 16 virtual)
+ - **Dim:** 1024, head_dim=64, expand_v=2
+ - **Params:** 279M (123M unique + 155M embed)
+ - **Trained:** 250K steps, 5B tokens, WSD schedule
 
 ```python
 from ouro import load_model, generate
-
- model, tokenizer, device = load_model("nyxia/Auron-279M")
+ model, tokenizer, device = load_model("nyxia/Auron-510M")  # Use 510M
 generate(model, tokenizer, device, "The history of")
 ```
 
- ## Sampling
-
- Default: T=0.7, top_k=20, top_p=0.9, rep_pen=1.0, presence_pen=1.5 (Ouroboros weight sharing requires presence penalty >= 1.5 to prevent attractor wells).
-
- ## Links
-
- - **Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron/blob/master/Auron_chimera_topology_paper.pdf)
- - **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
- - **Models:** [huggingface.co/nyxia](https://huggingface.co/nyxia)
-
- Built by [Florian Gasquez](https://fyx.jp) ([@nyxia](https://huggingface.co/nyxia)). Part of the [Soulkyn](https://soulkyn.com) project.
+ Built by [Florian Gasquez](https://fyx.jp) ([@nyxia](https://huggingface.co/nyxia)). Part of [Soulkyn](https://soulkyn.com).
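
The Chimera topology both card versions describe (4 unique bottom layers, then 4 shared top layers reused 3×, giving 16 virtual layers) can be expanded as a quick sketch. The `chimera_schedule` helper below is hypothetical, written only to illustrate the counting; it is not the actual Auron/ouro implementation.

```python
# Hypothetical sketch of the Chimera layer schedule described in the card:
# 4 unique bottom layers run once, then 4 shared top layers reused 3 times,
# for 16 virtual layers backed by only 8 sets of unique layer weights.

def chimera_schedule(n_bottom=4, n_top=4, n_cycles=3):
    """Return the unique-layer index executed at each virtual depth."""
    bottom = list(range(n_bottom))                 # unique layers 0..3
    top = list(range(n_bottom, n_bottom + n_top))  # shared layers 4..7
    return bottom + top * n_cycles

schedule = chimera_schedule()
print(len(schedule))       # 16 virtual layers
print(len(set(schedule)))  # 8 unique layers

# With the card's 3:1 GDN:Attn ratio, every 4th virtual layer is attention;
# the rest are Gated DeltaNet (GDN) layers.
attn_depths = [d for d in range(len(schedule)) if (d + 1) % 4 == 0]
print(attn_depths)         # [3, 7, 11, 15]
```

The 16-virtual/8-unique split is why the original card lists a virtual-equivalent parameter count (~557M) roughly double the stored 279M.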
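
The removed Sampling section's defaults (T=0.7, top_k=20, top_p=0.9, rep_pen=1.0, presence_pen=1.5) can be illustrated with a minimal sketch. This is not the ouro library's sampler: `sample_next` and its order of operations (presence penalty, then temperature, top-k, softmax, top-p) are assumptions for illustration, and rep_pen=1.0 is a no-op so it is omitted.

```python
# Minimal sketch of the card's default sampling settings; NOT the ouro
# library's sampler. The pipeline order here is an assumption.
import math
import random

def sample_next(logits, generated, temperature=0.7, top_k=20,
                top_p=0.9, presence_pen=1.5):
    """Sample one token id; `generated` is the set of ids already emitted."""
    # Presence penalty: flat subtraction from every already-seen token --
    # this is what discourages the "attractor wells" the card mentions.
    logits = [l - presence_pen if i in generated else l
              for i, l in enumerate(logits)]
    # Temperature scaling, then keep only the top_k highest scores.
    scored = sorted(((l / temperature, i) for i, l in enumerate(logits)),
                    reverse=True)[:top_k]
    # Softmax over the survivors.
    m = max(s for s, _ in scored)
    weights = [(math.exp(s - m), i) for s, i in scored]
    total = sum(w for w, _ in weights)
    probs = [(w / total, i) for w, i in weights]
    # Nucleus (top-p): smallest prefix whose cumulative mass >= top_p.
    kept, cum = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the kept prefix and draw from it.
    total = sum(p for p, _ in kept)
    r = random.random() * total
    for p, i in kept:
        r -= p
        if r <= 0.0:
            return i
    return kept[-1][1]

# A token that dominates the distribution is picked deterministically...
print(sample_next([10.0, 0.0, 0.0], set()))  # 0
# ...but once emitted, a strong presence penalty steers away from it.
print(sample_next([10.0, 0.0, 0.0], {0}, presence_pen=15.0))
```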