Nanosaur 250M

Nanosaur is a 250M parameter text-to-image model for illustrations. It consists of two trained components:

  1. PS-VAE: a VAE that compresses DINOv3's representations while retaining their semantic content
  2. DeCo: a diffusion transformer with a wide per-patch MLP head, similar to DDT

This model was trained from scratch in 42 hours on a single GPU, setting a new standard for compute efficiency in T2I illustration training.

This model is intended for research purposes. It is not a general-purpose T2I model, and it is not a fine-tune of an existing model. Do not expect visual quality to match corporate models.

Prompt format: tags or natural language

This repo includes a Gradio GUI for image generation, a tech report, and training scripts for both the VAE and the diffusion model.

Included Checkpoints

| Checkpoint | Path | Parameters | Description |
|---|---|---|---|
| PS-VAE | vae_checkpoint.pt | 150M | VAE with DINOv3 encoder, 96-channel latent space, 16x spatial compression |
| DeCo | diffusion_model_checkpoint.pth | 250M | Diffusion transformer with SPRINT and x-prediction |
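The checkpoint table above pins down the latent geometry: with 16x spatial compression and 96 channels, a 1024x1024 RGB image maps to a 64x64 grid of 96-channel latents. A quick sanity check:

```python
# Latent geometry implied by the PS-VAE: 16x spatial compression, 96 channels.
def latent_shape(height, width, compression=16, channels=96):
    """Return the (channels, h, w) latent shape for an input image."""
    assert height % compression == 0 and width % compression == 0
    return (channels, height // compression, width // compression)

print(latent_shape(1024, 1024))  # (96, 64, 64)
print(latent_shape(256, 256))    # (96, 16, 16) -- the AFHQ example resolution
```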

Text Encoder: Google Gemma 3 270M (downloaded from Hugging Face; you may need to agree to their terms to access the repo)

Model Architecture and Training Details

See TECH REPORT.md

Requirements

  • Python 3.12+
  • Linux
  • UV
  • CUDA-capable GPU (8 GB VRAM for 1024x1024 inference; 24 GB VRAM recommended for training)

Install dependencies:

uv sync

Quick Start

Image Generation (Inference)

Launch the Gradio web interface:

uv run inference_diffusion_model.py

This provides a web UI at http://localhost:7860 for text-to-image generation.


VAE Training

The PS-VAE is trained in two stages:

  1. S-VAE Stage: Train semantic encoder/decoder with frozen DINOv3
  2. PS-VAE Stage: Fine-tune the full model including DINOv3 encoder
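The stage split above amounts to toggling gradients on the DINOv3 encoder. A minimal PyTorch sketch of the idea (the module and attribute names here are illustrative, not the repo's actual classes):

```python
import torch.nn as nn

# Illustrative stand-in for the VAE; the real encoder is initialized
# from timm/vit_base_patch16_dinov3.lvd1689m.
class ToyVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.dino_encoder = nn.Linear(8, 8)   # frozen in the S-VAE stage
        self.semantic_head = nn.Linear(8, 8)  # trained in both stages

vae = ToyVAE()

# Stage 1 (S-VAE): freeze DINOv3, train only the rest of the model.
for p in vae.dino_encoder.parameters():
    p.requires_grad = False
trainable = [p for p in vae.parameters() if p.requires_grad]

# Stage 2 (PS-VAE): unfreeze everything and fine-tune end to end.
for p in vae.parameters():
    p.requires_grad = True
```

Only the `trainable` list would be handed to the stage-1 optimizer; stage 2 re-creates the optimizer over all parameters.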

Train VAE

Note: this script uses AFHQ 256x256 as an example dataset. Replace it with an anime illustration dataset to replicate the released model.

uv run train_vae.py

Diffusion Model Training

Step 1: Cache Dataset

Pre-compute VAE latents and text embeddings for faster training:

uv run cache_vae.py

This creates cache/ with pre-encoded image latents and text embeddings.
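Caching amounts to encoding each image and caption once, then reusing the arrays across epochs. A minimal sketch with NumPy (the file layout, names, and embedding dimensions here are assumptions for illustration, not what cache_vae.py actually writes):

```python
from pathlib import Path
import numpy as np

def cache_sample(cache_dir, sample_id, latent, text_emb):
    """Save a pre-encoded latent / text-embedding pair to the cache."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(exist_ok=True)
    np.savez(cache_dir / f"{sample_id}.npz", latent=latent, text_emb=text_emb)

def load_sample(cache_dir, sample_id):
    data = np.load(Path(cache_dir) / f"{sample_id}.npz")
    return data["latent"], data["text_emb"]

# Example: a 256x256 image -> (96, 16, 16) latent at 16x compression.
latent = np.random.randn(96, 16, 16).astype(np.float32)
text_emb = np.random.randn(77, 640).astype(np.float32)  # dims illustrative
cache_sample("cache", "img_0001", latent, text_emb)
lat, emb = load_sample("cache", "img_0001")
```

The training loader then reads `.npz` files instead of running the VAE and text encoder every step.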

Step 2: Train Diffusion Model

Note: this script uses AFHQ 256x256 as an example dataset. Replace it with an anime illustration dataset to replicate the released model.

uv run train_diffusion_model.py

Monitor training:

tensorboard --logdir runs

Credits & References

| Component | Source |
|---|---|
| VAE Encoder | Initialized from timm/vit_base_patch16_dinov3.lvd1689m |
| VAE Decoder | Trained from scratch, based on VA-VAE |
| Text Encoder | Google Gemma 3 270M |
| Flow Matching | Based on minRF |
| SIGReg Loss | From LeJEPA |
| VAE Training Recipe | PS-VAE |
| DeCo Architecture | DeCo |
| SPRINT | SPRINT; implementation code from SpeedrunDiT |
| X-prediction with V-loss | JiT |
| DINOv3 | Self-supervised Vision Transformers (Meta AI) |
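The flow-matching objective referenced above (rectified-flow interpolation with x-prediction and a velocity-space loss) can be sketched in a few lines. This is a generic illustration of the objective, not the repo's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectified-flow interpolation: x_t = (1 - t) * x0 + t * noise,
# with target velocity v = noise - x0 (equivalently (x_t - x0) / t).
x0 = rng.standard_normal((96, 16, 16))      # clean latent (data)
noise = rng.standard_normal((96, 16, 16))   # Gaussian noise sample
t = 0.3
x_t = (1 - t) * x0 + t * noise
v_target = noise - x0

# x-prediction: the network estimates x0; the estimate is converted to a
# velocity so the loss is still measured in v-space.
x0_pred = x0 + 0.01 * rng.standard_normal(x0.shape)  # stand-in for a network
v_pred = (x_t - x0_pred) / t
v_loss = np.mean((v_pred - v_target) ** 2)
```

Predicting x0 and converting to velocity keeps the target well-scaled at small t while retaining the standard flow-matching loss.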