---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- causal-lm
- swarm-intelligence
- multi-agent
- pytorch
- transformers
- convergentintel
pipeline_tag: text-generation
model-index:
- name: SAGI
  results: []
---

# SAGI - Swarm AGI Language Model

SAGI is a novel causal language model that integrates **swarm intelligence dynamics** with a transformer architecture. The model treats cognition as a dynamic, adaptive system in which multiple internal "agents" collaborate through differentiable routing, trust mechanisms, and shared memory.

## Model Description

| Property | Value |
|----------|-------|
| Parameters | 52.72M |
| Architecture | Transformer Decoder + Swarm Dynamics |
| Hidden Size | 512 |
| Layers | 6 |
| Attention Heads | 8 |
| Context Length | 2048 |
| Vocabulary | GPT-2 tokenizer (50,257 tokens) |

### Key Innovations

- **Differentiable Routing**: Continuous mixture-of-experts via attention (`DiffRouter`) instead of hard module selection
- **Adaptive Gating & Trust**: `MetaController` activates capacity under resource constraints; trust dynamics bias reliable components
- **Episodic + Semantic Memory**: Dual memory system with trainable retrieval utility
- **Curiosity Engine**: Injects novel goals when surprise is low, promoting exploration
- **Self-Model & Rollback**: Predicts state transitions and detects anomalies for self-correction
- **Resource Dynamics**: Soft conservation with a learned converter; cognition consumes and recovers compute, memory, and energy
- **Value Monitoring**: Tracks alignment to core values and freezes plasticity under drift

## How It Works

```
┌─────────────────────────────────────────────────────────┐
│                       SAGI Model                        │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────┐      ┌─────────────────────────┐   │
│  │  Swarm-7 V2.2   │─────▶│    Swarm State S, T     │   │
│  │  (Cognitive     │      │    (Working Memory)     │   │
│  │   Dynamics)     │      └───────────┬─────────────┘   │
│  └────────▲────────┘                  │                 │
│           │                           ▼                 │
│           │              ┌─────────────────────────┐    │
│           │              │   Transformer Decoder   │    │
│           │              │   - Swarm-conditioned   │    │
│           │              │     attention & FFN     │    │
│           │              │   - RoPE embeddings     │    │
│           │              └───────────┬─────────────┘    │
│           │                          │                  │
│  ┌────────┴────────┐      ┌─────────────────────────┐   │
│  │   Observation   │◀─────│         LM Head         │   │
│  │  (from tokens)  │      └─────────────────────────┘   │
│  └─────────────────┘                                    │
└─────────────────────────────────────────────────────────┘
```

The swarm processes observations derived from token embeddings, updating its internal state **S**. This state conditions the transformer's attention patterns and feed-forward activations via learned projections, creating bidirectional information flow between symbolic (tokens) and subsymbolic (swarm dynamics) processing.

## Usage

### Installation

```bash
pip install torch transformers datasets
```

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("reaperdoesntknow/SAGI")
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/SAGI")

# Generate text
model.eval()
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Model Architecture Details

### Swarm Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| `max_agents` | 20 | Number of internal cognitive agents |
| `dim_s` | 64 | State dimension |
| `dim_t` | 32 | Task/goal dimension |
| `dim_obs` | 48 | Observation dimension |
| `topk_route` | 5 | Sparse routing top-k |
| `K_thought_max` | 5 | Maximum thinking iterations per step |

### Resource Budgets

| Resource | Budget | Description |
|----------|--------|-------------|
| Compute | 60.0 | Compute budget per step |
| Memory | 20.0 | Memory capacity |
| Energy | 25.0 | Energy budget |

### Trust & Plasticity

- **Trust Learning Rate**: 0.07
- **Fast EMA (Plasticity)**: 0.10
- **Slow EMA (Consolidation)**: 0.002
- **Core Values**: `["truth", "safety", "efficiency"]`

## Limitations

- **Early Research Model**: This is an experimental architecture exploring swarm-transformer integration
- **Training Data**: Currently trained on a TinyStories subset; may produce simple, story-like outputs
- **Compute Requirements**: Swarm dynamics add overhead compared to standard transformers
- **Generation Quality**: The model is undertrained; outputs may be repetitive or incoherent

## Intended Use

This model is intended for:

- Research into multi-agent cognitive architectures
- Exploration of dynamic, adaptive language models
- Educational purposes in understanding swarm intelligence + LLMs

Not intended for:

- Production applications
- Safety-critical systems
- Generation of factual content

## Training Details

- **Dataset**: TinyStories (subset)
- **Optimizer**: AdamW (lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01)
- **Scheduler**: Cosine annealing
- **Precision**: FP32
- **Hardware**: CPU training (compatible with CUDA)

## Citation

```bibtex
@software{sagi2026,
  title={SAGI: Swarm AGI Language Model},
  author={Reaperdoesntknow},
  year={2026},
  url={https://huggingface.co/reaperdoesntknow/SAGI}
}
```

---

## Convergent Intelligence Portfolio

*Part of the [Standalone Models](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*

## Mathematical Foundations: Discrepancy Calculus (DISC)

SAGI's swarm intelligence dynamics connect to Discrepancy Calculus through **Discrepancy Mechanics** (Ch. 16 of the DISC monograph) — a reformulation of dynamics that replaces Newton/Lagrange with four discrepancy laws:

- **DL0 (Co-Motion):** Agent kinematics via metric derivative and environment flow
- **DL1 (Discrepancy Energy):** $E_{\text{disc}}[f] = \frac{1}{2}\int w(x)\,(Df(x))^2\, d\mu(x)$ — stability through bounded discrepancy energy
- **DL2 (Force as Discrepancy Gradient):** Trust routing gradients as Euler-Lagrange equations of the discrepancy action
- **DL3 (Reciprocity):** Symplectic invariance preserved across agent interactions

The discrepancy operator

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

quantifies the local mismatch in each agent's contribution. The trust mechanism between agents is operationally a discrepancy energy minimization: agents whose outputs have high mutual discrepancy are weighted down; agents converging on shared structure are amplified.

Classical mechanics is recovered as a degenerate smooth limit of Discrepancy Mechanics — just as standard single-head attention is a degenerate limit of swarm routing.

Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division).
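The smooth-limit claim can be checked directly from the stated definition of $Df$. The short calculation below is a sketch (not taken from the monograph), assuming $f$ is continuously differentiable near $x$: by the mean value theorem,

$$\frac{|f(t) - f(x)|}{|t - x|} = |f'(\xi_t)| \quad \text{for some } \xi_t \in (x, t),$$

and since $f'$ is continuous, $|f'(\xi_t)| \to |f'(x)|$ as $t \downarrow x$, so

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} |f'(\xi_t)|\, dt = |f'(x)|.$$

In other words, on smooth functions the discrepancy operator degenerates to the ordinary speed $|f'|$, which is the sense in which classical dynamics sit inside Discrepancy Mechanics as a special case.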
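The trust-as-discrepancy-minimization idea can be sketched numerically. The snippet below is a hypothetical illustration, not SAGI's actual `DiffRouter`/`MetaController` code: each agent's discrepancy energy is taken as its mean squared disagreement with the other agents, and trust weights are a softmax over negative energy, so outlier agents are weighted down.

```python
import math

def trust_weights(outputs, temperature=1.0):
    """Hypothetical sketch: down-weight agents whose outputs disagree with the rest.

    outputs: list of equal-length float vectors, one per agent.
    Returns one trust weight per agent; the weights sum to 1.
    """
    n = len(outputs)
    dim = len(outputs[0])
    # Discrepancy energy of agent i = mean squared distance to every other agent.
    energies = []
    for i in range(n):
        e = 0.0
        for j in range(n):
            if i == j:
                continue
            e += sum((a - b) ** 2 for a, b in zip(outputs[i], outputs[j])) / dim
        energies.append(e / (n - 1))
    # Softmax over negative energy: low mutual discrepancy -> high trust.
    exps = [math.exp(-e / temperature) for e in energies]
    z = sum(exps)
    return [x / z for x in exps]

# Three agents converge on shared structure; the fourth is an outlier
# and receives the smallest trust weight.
agents = [[1.0, 2.0], [1.1, 2.0], [0.9, 1.9], [5.0, -3.0]]
print([round(w, 3) for w in trust_weights(agents)])
```

In the actual model this minimization is differentiable end-to-end, so trust assignments are shaped by the training gradient rather than by a fixed rule like the one above.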
## Related Models

| Model | Downloads | Format |
|-------|-----------|--------|
| [SMOLM2Prover](https://huggingface.co/reaperdoesntknow/SMOLM2Prover) | 56 | HF |
| [SMOLM2Prover-GGUF](https://huggingface.co/reaperdoesntknow/SMOLM2Prover-GGUF) | 150 | GGUF |
| [DeepReasoning_1R](https://huggingface.co/reaperdoesntknow/DeepReasoning_1R) | 16 | HF |
| [S-AGI](https://huggingface.co/reaperdoesntknow/S-AGI) | 0 | HF |

### Top Models from Our Lab

| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |

**Total Portfolio: 49 models, 22,598 total downloads**

*Last updated: 2026-03-28 12:58 UTC*

---

## From the Convergent Intelligence Portfolio

**[DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU; this collection shows what happens when you give the methodology real hardware.
Top model: [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) — 508 downloads

Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

*Convergent Intelligence LLC: Research Division*

## Discrepancy Calculus Foundation

This model is part of the [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow) portfolio. All models in this portfolio are developed under the Discrepancy Calculus (DISC) framework — a measure-theoretic approach to understanding and controlling the gap between what a model *should* produce and what it *actually* produces.

DISC treats training singularities (loss plateaus, mode collapse, catastrophic forgetting) not as failures to be smoothed over, but as **structural signals** that reveal the geometry of the learning problem. Key concepts:

- **Discrepancy Operator (D):** Measures the gap between expected and observed behavior at each training step
- **Jump Sets:** Boundaries where model behavior changes discontinuously — these are *features*, not bugs
- **Ghost Imprinting:** Teacher knowledge that transfers to student models through weight-space topology rather than explicit distillation signal

For the full mathematical treatment, see [Discrepancy Calculus: Foundations and Core Theory](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194).

**Citation chain:** [Structure Over Scale](https://huggingface.co/reaperdoesntknow/Structure-Over-Scale) (DOI: 10.57967/hf/8165) → [Three Teachers to Dual Cognition](https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy) (DOI: 10.57967/hf/8184) → [Discrepancy Calculus](https://huggingface.co/reaperdoesntknow/Discrepancy_Calculus) (DOI: 10.57967/hf/8194)
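As a toy illustration of the jump-set concept, a per-step discrepancy series can be scanned for discontinuous changes. This sketch is purely illustrative (the function name and threshold are hypothetical, not part of the DISC framework's actual machinery):

```python
def jump_steps(discrepancies, threshold=1.0):
    """Flag training steps where a discrepancy series changes discontinuously.

    Illustrative only: DISC treats such jumps as structural signals rather than
    noise; here a "jump" is simply a successive difference above `threshold`.
    """
    return [
        i for i in range(1, len(discrepancies))
        if abs(discrepancies[i] - discrepancies[i - 1]) > threshold
    ]

# A mostly smooth series with one abrupt change at step 3.
series = [0.50, 0.48, 0.47, 2.10, 2.05]
print(jump_steps(series))  # the abrupt change at index 3 is flagged
```

In the DISC view, the interesting analysis starts where a detector like this fires: the flagged boundary marks a change in the geometry of the learning problem, not a defect to be averaged away.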