Nanosaur 250M

Nanosaur is a 250M parameter text-to-image model for illustrations. It consists of two trained components:

  1. PS-VAE: a VAE that compresses DINOv3's representations while retaining their semantic content
  2. DeCo: a diffusion transformer with a wide per-patch MLP head, similar to DDT

This model was trained from scratch in 42 hours on a single GPU, setting a new standard for compute efficiency in T2I illustration training.

This model is intended for research purposes. It is not a general-purpose T2I model, and it is not a fine-tune of an existing model. Do not expect visual quality to match corporate models.

Prompt format: tags or natural language

This repo includes a Gradio GUI for image generation, a tech report, and training scripts for both the VAE and the diffusion model.

Included Checkpoints

| Checkpoint | Path | Parameters | Description |
|---|---|---|---|
| PS-VAE | vae_checkpoint.pt | 150M | VAE with DINOv3 encoder, 96-channel latent space, 16x spatial compression |
| DeCo | diffusion_model_checkpoint.pth | 250M | Diffusion transformer with SPRINT and x-prediction |
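The checkpoint table above pins down the latent geometry: with 16x spatial compression and 96 channels, a 1024x1024 RGB image maps to a 64x64 grid of 96-channel latents. A quick sanity check:

```python
# Latent geometry implied by the PS-VAE: 16x spatial compression, 96 channels.
def latent_shape(height, width, compression=16, channels=96):
    """Return the (channels, h, w) latent shape for an input image."""
    assert height % compression == 0 and width % compression == 0
    return (channels, height // compression, width // compression)

print(latent_shape(1024, 1024))  # (96, 64, 64)
print(latent_shape(256, 256))    # (96, 16, 16) -- the AFHQ example resolution
```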

Text Encoder: Google Gemma 3 270M (downloaded from Hugging Face; you may need to agree to their terms to access the repo)

Model Architecture and Training Details

See TECH REPORT.md

Requirements

  • Python 3.12+
  • Linux
  • UV
  • CUDA-capable GPU (8 GB VRAM for 1024x1024 inference; 24 GB VRAM recommended for training)

Install dependencies:

uv sync

Quick Start

Image Generation (Inference)

Launch the Gradio web interface:

uv run inference_diffusion_model.py

This provides a web UI at http://localhost:7860 for text-to-image generation.


VAE Training

The PS-VAE is trained in two stages:

  1. S-VAE Stage: Train semantic encoder/decoder with frozen DINOv3
  2. PS-VAE Stage: Fine-tune the full model including DINOv3 encoder
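The stage split above amounts to toggling gradients on the DINOv3 encoder. A minimal PyTorch sketch of the idea (the module and attribute names here are illustrative, not the repo's actual classes):

```python
import torch.nn as nn

# Illustrative stand-in for the VAE; the real encoder is initialized
# from timm/vit_base_patch16_dinov3.lvd1689m.
class ToyVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.dino_encoder = nn.Linear(8, 8)   # frozen in the S-VAE stage
        self.semantic_head = nn.Linear(8, 8)  # trained in both stages

vae = ToyVAE()

# Stage 1 (S-VAE): freeze DINOv3, train only the rest of the model.
for p in vae.dino_encoder.parameters():
    p.requires_grad = False
trainable = [p for p in vae.parameters() if p.requires_grad]

# Stage 2 (PS-VAE): unfreeze everything and fine-tune end to end.
for p in vae.parameters():
    p.requires_grad = True
```

Only the `trainable` list would be handed to the stage-1 optimizer; stage 2 re-creates the optimizer over all parameters.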

Train VAE

Note: this script uses AFHQ 256x256 as an example dataset. Replace it with an anime illustration dataset to replicate the released model.

uv run train_vae.py

Diffusion Model Training

Step 1: Cache Dataset

Pre-compute VAE latents and text embeddings for faster training:

uv run cache_vae.py

This creates cache/ with pre-encoded image latents and text embeddings.
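Caching amounts to encoding each image and caption once, then reusing the arrays across epochs. A minimal sketch with NumPy (the file layout, names, and embedding dimensions here are assumptions for illustration, not what cache_vae.py actually writes):

```python
from pathlib import Path
import numpy as np

def cache_sample(cache_dir, sample_id, latent, text_emb):
    """Save a pre-encoded latent / text-embedding pair to the cache."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(exist_ok=True)
    np.savez(cache_dir / f"{sample_id}.npz", latent=latent, text_emb=text_emb)

def load_sample(cache_dir, sample_id):
    data = np.load(Path(cache_dir) / f"{sample_id}.npz")
    return data["latent"], data["text_emb"]

# Example: a 256x256 image -> (96, 16, 16) latent at 16x compression.
latent = np.random.randn(96, 16, 16).astype(np.float32)
text_emb = np.random.randn(77, 640).astype(np.float32)  # dims illustrative
cache_sample("cache", "img_0001", latent, text_emb)
lat, emb = load_sample("cache", "img_0001")
```

The training loader then reads `.npz` files instead of running the VAE and text encoder every step.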

Step 2: Train Diffusion Model

Note: this script uses AFHQ 256x256 as an example dataset. Replace it with an anime illustration dataset to replicate the released model.

uv run train_diffusion_model.py

Monitor training:

tensorboard --logdir runs

Credits & References

| Component | Source |
|---|---|
| VAE Encoder | Initialized from timm/vit_base_patch16_dinov3.lvd1689m |
| VAE Decoder | Trained from scratch, based on VA-VAE |
| Text Encoder | Google Gemma 3 270M |
| Flow Matching | Based on minRF |
| SIGReg Loss | From LeJEPA |
| VAE Training Recipe | PS-VAE |
| DeCo Architecture | DeCo |
| SPRINT | SPRINT; implementation code from SpeedrunDiT |
| X-prediction with V-loss | JiT |
| DINOv3 | Self-supervised Vision Transformers (Meta AI) |
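The flow-matching objective referenced above (rectified-flow interpolation with x-prediction and a velocity-space loss) can be sketched in a few lines. This is a generic illustration of the objective, not the repo's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectified-flow interpolation: x_t = (1 - t) * x0 + t * noise,
# with target velocity v = noise - x0 (equivalently (x_t - x0) / t).
x0 = rng.standard_normal((96, 16, 16))      # clean latent (data)
noise = rng.standard_normal((96, 16, 16))   # Gaussian noise sample
t = 0.3
x_t = (1 - t) * x0 + t * noise
v_target = noise - x0

# x-prediction: the network estimates x0; the estimate is converted to a
# velocity so the loss is still measured in v-space.
x0_pred = x0 + 0.01 * rng.standard_normal(x0.shape)  # stand-in for a network
v_pred = (x_t - x0_pred) / t
v_loss = np.mean((v_pred - v_target) ** 2)
```

Predicting x0 and converting to velocity keeps the target well-scaled at small t while retaining the standard flow-matching loss.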