Toy-Diffusion: Flow Matching vs. DDPM Research
This repository contains the weights and experimental results for the Toy-Diffusion project. The goal of this research is to evaluate the effectiveness of Flow Matching (FM) compared to traditional DDPM modeling, specifically focusing on distribution shifting and "new concept" acquisition in latent space.
The dataset is available at `aipracticecafe/anime-faces-256px`.
Research Context
While DDPMs are naturally described by Stochastic Differential Equations (SDEs), Flow Matching is built on an Ordinary Differential Equation (ODE) framework. We hypothesize that:
- DDPM is more robust for general pre-training, as the stochastic (Markov-chain) sampling process allows more diverse path exploration.
- Flow Matching is better suited to fine-tuning and collapsing the distribution toward specific styles or characters, because the deterministic ODE mapping between each noise-data pair encourages faster convergence.
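To make the contrast concrete, here is a minimal sketch of the two training objectives. This is illustrative, not the project's exact code: `model(x_t, t)` is a generic network, DDPM is shown in its epsilon-prediction form, and FM uses the linear (rectified-flow-style) interpolation path.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    # DDPM: sample a random discrete timestep, noise the data along the
    # Markov chain, and regress the injected noise (epsilon-prediction).
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)

def flow_matching_loss(model, x0):
    # FM: every noise-data pair defines ONE straight path
    # x_t = (1 - t) * noise + t * x0, so the regression target is the
    # constant velocity (x0 - noise) -- a deterministic mapping per pair.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0]).view(-1, 1, 1, 1)
    x_t = (1 - t) * noise + t * x0
    return F.mse_loss(model(x_t, t.flatten()), x0 - noise)
```

The key difference: DDPM's target (the noise) is the same regardless of timestep, while FM's target ties each noise sample to a specific data point along a unique straight path, which is the intuition behind the fine-tuning hypothesis above.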
Technical Specifications
- Model: 281M Parameter U-Net (ADM-based architecture).
- Latent Space: Flux 2 VAE (Factor 8, 32 channels, 32x32 latent resolution).
- Optimizer: AdamW8bit, $2 \times 10^{-4}$ LR, 5% warmup.
- Precision: Training conducted in mixed-precision bfloat16.
- Data: 48K curated Anime Face samples (256x256).
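The optimizer setup above can be sketched as follows. The project uses `AdamW8bit` (from bitsandbytes); stdlib `AdamW` stands in here, and `model` and `total_steps` are placeholders.

```python
import torch

# Placeholder model and step count -- substitute the real U-Net and schedule.
model = torch.nn.Linear(32, 32)
total_steps = 100_000
warmup_steps = int(0.05 * total_steps)  # 5% linear warmup

# The project uses bitsandbytes AdamW8bit; torch.optim.AdamW shown here.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / warmup_steps),  # ramp to full LR, then hold
)
```

Call `optimizer.step()` then `scheduler.step()` once per training step.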
Repository Structure
The weights are organized into three main experimental stages:
- `/standard`: Baseline models trained on the full 48K-sample dataset.
- `/fine` (Exclusion Stage): Models trained while strictly excluding specific tags: `["souryuu asuka langley", "albedo (overlord)"]`. This serves as the "base" used to test how effectively the model can learn these concepts later.
- `/stage2` (Finetuning Stage): The `/fine` model finetuned specifically on the excluded subset. This is our primary benchmark for comparing how FM vs. DDPM adapts to "new" knowledge.
Key Findings (Latent Space)
Our experiments show that Flow Matching (FM) is clearly superior to DDPM when working in the Flux 2 latent space. FM achieves:
- Significantly lower FID scores on training samples.
- Higher qualitative sharpness in facial features (eyes, hair ornaments).
- Faster adaptation to character-specific features during Stage 2 finetuning.
Note: FID scores in this project are calculated specifically against the training distribution to measure reconstruction and learning accuracy rather than general diversity.
Usage
The U-Net architecture code (`unet.py`) is included in this repository. To load a model:
```python
import torch
from unet import UNetModel  # provided in this repo

# Flux 2 VAE latent space: 32 channels, 32x32 resolution
model = UNetModel(
    in_channels=32,
    model_channels=256,
    out_channels=32,
    num_res_blocks=2,
    attention_resolutions=(16, 8),
    # ... other ADM params
)

state_dict = torch.load("standard/model_ema.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # EMA weights are for inference
```
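Once an FM checkpoint is loaded, sampling amounts to integrating the learned ODE. A minimal Euler-integration sketch, assuming `model(x, t)` predicts velocity and the path runs from noise at t=0 to data at t=1 (the time/sign convention must match the checkpoint's training setup):

```python
import torch

@torch.no_grad()
def sample_fm(model, shape=(1, 32, 32, 32), steps=50):
    # Euler integration of the learned velocity field from noise to data.
    # `shape` matches the Flux 2 VAE latent: 32 channels, 32x32 resolution.
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + model(x, t) * dt
    return x  # latent sample; decode with the Flux 2 VAE for a 256x256 image
```

Higher-order ODE solvers (e.g. Heun) can reduce the required step count, but plain Euler keeps the deterministic-mapping point clear.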
Content Warning & Rights
- NSFW Content: This dataset contains sensitive and NSFW (Not Safe For Work) material. It is intended for research and generative modeling purposes.
- Legal Disclaimer: I do not own the rights to any of the images in this dataset. All images are the property of their respective creators and were scraped from Danbooru. This dataset is provided for educational and research purposes under fair use.