ANDREA-12M

Autonomous Neural Data Recipe for Education and Agency

A 12.8M parameter language model grown on a single RTX 4090 using a bandit-controlled curriculum. Part of the permacomputer project — open source, open data, open weights.

Model Details

  • Parameters: 12.8M
  • Architecture: Transformer decoder, 384d/12h/6L
  • Embedding dim: 384
  • Heads: 12
  • Layers: 6
  • Context: 1024 tokens
  • Tokenizer: Harris morpheme (2048 segments, 2305 vocab)
  • Training steps: 43,587
  • Final SMMA loss: 2.0
  • Best single-step loss: 0.21
  • Training time: ~72 hours
  • Hardware: single NVIDIA RTX 4090 (24GB VRAM, 1.4GB used)
  • CUDA engine: microgpt_cuda.cu (custom, FP32)
  • Born: 2026-03-21 12:53 UTC (08:53 EDT)
  • License: AGPL-3.0

Files

  • ANDREA-12M.bin — step 43,587 — final checkpoint (SMMA 2.0)
  • ANDREA-12M-best.bin — step 42,300 — best checkpoint (lowest loss during training)
  • harris_segments.json — Harris tokenizer segments (required for inference and fine-tuning)
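The segment inventory in harris_segments.json can be consumed by a greedy longest-match encoder. A minimal sketch, assuming the JSON holds an iterable of segment strings; the actual Harris morpheme algorithm is specified in the white paper, so treat this as an illustration rather than the project's tokenizer:

```python
def segment(text, segments):
    """Greedy longest-match segmentation over a segment inventory (sketch).

    `segments` is assumed to be an iterable of segment strings, e.g. loaded
    from harris_segments.json with json.load(); the real Harris tokenizer
    may segment differently.
    """
    # try longer segments first so the longest match wins
    inventory = sorted(set(segments), key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for seg in inventory:
            if text.startswith(seg, i):
                out.append(seg)
                i += len(seg)
                break
        else:
            out.append(text[i])  # fall back to single characters
            i += 1
    return out
```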

Checkpoint format

Binary, little-endian: [int32 step][int32 n_params][n_params × float32 weights][n_params × float32 m][n_params × float32 v]

  • Weights: model parameters (12.8M floats, ~49MB)
  • m: Adam first moment (same size)
  • v: Adam second moment (same size)
  • Total: ~147MB per checkpoint

Use either checkpoint to resume fine-tuning (weights + optimizer state preserved) or extract weights only for inference (first n_params floats after the 8-byte header).
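Given the layout above, inference needs only the 8-byte header and the first n_params floats. A minimal reader sketch (load_weights is a hypothetical helper name, not part of the repo):

```python
import struct

import numpy as np


def load_weights(path):
    """Read an ANDREA checkpoint and return (step, weights) for inference.

    Format (little-endian): int32 step, int32 n_params, then n_params
    float32 weights. The Adam moments m and v that follow are not read.
    """
    with open(path, 'rb') as f:
        step, n_params = struct.unpack('<ii', f.read(8))
        weights = np.frombuffer(f.read(n_params * 4), dtype='<f4')
    return step, weights
```

To resume fine-tuning instead, keep reading: two more blocks of n_params float32 values (m, then v) follow the weights.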

Training Data

Trained on a curated mix of open conversational and educational data:

  • NousResearch/Hermes-3-Dataset (general, creative, roleplay) — 590K conversations
  • Dictionary — 88K word definitions distilled from Hermes 3 8B
  • Gutenberg — public domain literature (Project Gutenberg)
  • Additional: chat, smoltalk, oasst, dolly, IRC, repo-docs

Data mix controlled by a UCB1 multi-armed bandit with dice-based phase control. The bandit dynamically adjusts source weights during training based on per-source loss trajectories. Full curriculum specification in the white paper.
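The UCB1 rule itself is standard: each source arm is scored by its mean reward plus an exploration bonus of sqrt(2 ln N / n_i). A sketch of the selection step only; the project's reward shaping, dice-based phases, and source floors are not reproduced here:

```python
import math


def ucb1_pick(counts, rewards):
    """Pick a data-source arm by UCB1 (sketch).

    counts[i]  — number of times source i was sampled
    rewards[i] — cumulative reward for source i (e.g. loss improvement)
    """
    # sample every arm once before applying the bonus formula
    for i, c in enumerate(counts):
        if c == 0:
            return i
    total = sum(counts)
    scores = [rewards[i] / counts[i] + math.sqrt(2 * math.log(total) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)
```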

Training Recipe

  • Harris morpheme tokenizer (2048 segments)
  • Cosine LR schedule with warm restart at step 25K (0.0004 peak)
  • Phase-based bandit: 2 focus arms, 1d3 dice, source floors
  • Checkpoints every 100 steps, SIGTERM-safe
  • Per-source reward attribution, epoch penalty, coverage tracking
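The cosine schedule with a warm restart can be sketched as below; only the 0.0004 peak, the 25K restart step, and the total step count come from the recipe, while the warmup length and the decay-to-zero floor are assumptions:

```python
import math

PEAK_LR = 4e-4          # peak LR from the recipe
RESTART_STEP = 25_000   # warm restart point from the recipe
TOTAL_STEPS = 43_587    # final step count

def lr_at(step, warmup=500):
    """Cosine LR with one warm restart at RESTART_STEP (sketch).

    Each cycle warms up linearly for `warmup` steps (assumed length),
    then decays from PEAK_LR toward zero on a half cosine.
    """
    if step >= RESTART_STEP:
        start, end = RESTART_STEP, TOTAL_STEPS
    else:
        start, end = 0, RESTART_STEP
    t = step - start
    if t < warmup:
        return PEAK_LR * t / warmup
    progress = (t - warmup) / max(1, (end - start) - warmup)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))
```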

Capabilities

ANDREA-12M learns patterns, not facts. At 12.8M parameters it produces:

  • Correct Q&A turn structure (> question / < answer)
  • Definition-style responses
  • Multi-sentence outputs with plausible grammar
  • Instruction-following scaffolding ("explain", "define", "describe")

It does NOT produce factually accurate content — it's a pattern machine. Factual accuracy requires scaling to ANDREA-120M (planned).

Usage

```python
# Inference via microgpt
from microgpt import load_model, generate_fast

model = load_model('ANDREA-12M.json')
# positional args match the model details: 384 embedding dim,
# 12 heads, 6 layers, 1024-token context
results = generate_fast(model['state_dict'], model['uchars'], model['bos'],
                        384, 12, 6, 1024, prefix='> what is an apple? / <')
print(results[0][0])
```

White Paper

ANDREA-12M-WHITEPAPER.pdf — full technical paper covering architecture, bandit curriculum, data sources, training recipe, and results.

Source: whitepaper/ANDREA/WHITEPAPER.rst in the uncloseai-cli repository.

Citation

ANDREA: Autonomous Neural Data Recipe for Education and Agency
TimeHexOn, foxhop, russell@unturf
March 2026, permacomputer.com

License

AGPL-3.0. Code outlasts authors. Infrastructure outlasts builders.

