EgaraNet — Illustration Style Embedding Model

EgaraNet is an embedding model that encodes the artistic style of illustrations into 1024-dimensional L2-normalized vectors. It was trained on approximately 1.2 million illustrations from around 12,000 artists, learning to produce embeddings in which illustrations by the same artist lie close together in the vector space. The model was developed under Article 30-4 of the Japanese Copyright Act.

Model Description

  • Architecture: DINOv3 ViT-L/16 backbone + StyleNet (Transposed Attention Transformer) head
  • Embedding dim: 1024
  • Input: RGB images at any resolution whose height and width are multiples of 16
  • Output: L2-normalized style embedding vector
  • Training data: ~1.2M illustrations from ~12K artists

Architecture

  • Backbone: DINOv3 ViT-L/16 (frozen during training) — extracts rich visual features
  • StyleNet Head: 3 × Transposed Attention Transformer → Attention Pooling → Projection Head
    • Transposed Attention Transformer (TAT): Computes cross-covariance attention in channel space (C×C), discarding spatial information to isolate artistic style. Uses RMSNorm and SwiGLU FFN.
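The transposed-attention idea (cross-covariance attention in the style of Restormer) can be sketched in PyTorch. This is an illustrative single-head version, not EgaraNet's actual implementation; the learned projections, RMSNorm, and SwiGLU details are omitted:

```python
import torch
import torch.nn.functional as F

def transposed_attention(x):
    """Cross-covariance (channel) attention: the attention map is the
    C x C channel-covariance matrix rather than the usual N x N token
    matrix, so the spatial arrangement of tokens never enters it."""
    # x: [B, N, C] token features
    q = k = v = x  # illustrative; real models use learned q/k/v projections
    # Normalize along the token axis so q @ k^T is a covariance-like map
    q = F.normalize(q.transpose(1, 2), dim=-1)        # [B, C, N]
    k = F.normalize(k.transpose(1, 2), dim=-1)        # [B, C, N]
    attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # [B, C, C]
    out = attn @ v.transpose(1, 2)                    # [B, C, N]
    return out.transpose(1, 2)                        # [B, N, C]

x = torch.randn(2, 196, 64)
y = transposed_attention(x)
print(y.shape)  # torch.Size([2, 196, 64])
```

Because the C × C attention map aggregates over all tokens, the output is invariant to where in the image a texture appears, which is the property the StyleNet head exploits to isolate style from composition.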

Usage with Transformers

import torch
from PIL import Image
from transformers import AutoModel
import torchvision.transforms as T

# Load model
model = AutoModel.from_pretrained("Columba1198/EgaraNet", trust_remote_code=True)
model.eval()

# Preprocess: MaxResizeMod16(512) + ImageNet normalization
def preprocess(image_path, max_size=512):
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    scale = max_size / max(w, h)
    # Snap both dimensions to the nearest multiple of 16 (minimum 16)
    new_w = max(16, round(w * scale / 16) * 16)
    new_h = max(16, round(h * scale / 16) * 16)
    img = img.resize((new_w, new_h), Image.BICUBIC)
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return transform(img).unsqueeze(0)

# Extract style embedding
with torch.no_grad():
    pixel_values = preprocess("illustration.png")
    output = model(pixel_values=pixel_values)
    embedding = output.style_embedding  # [1, 1024], L2-normalized
    print(f"Embedding shape: {embedding.shape}")

Comparing Two Illustrations

import torch.nn.functional as F

with torch.no_grad():
    emb_a = model(pixel_values=preprocess("image_a.png")).style_embedding
    emb_b = model(pixel_values=preprocess("image_b.png")).style_embedding

similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Style similarity: {similarity:.4f} ({similarity*100:.1f}%)")
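Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product, which is convenient when comparing one query against many stored embeddings at once. A small sketch with random unit vectors standing in for real EgaraNet embeddings:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for EgaraNet embeddings (unit-norm, like the model output)
emb_a = F.normalize(torch.randn(1, 1024), dim=-1)
emb_b = F.normalize(torch.randn(1, 1024), dim=-1)

cos = F.cosine_similarity(emb_a, emb_b).item()
dot = (emb_a @ emb_b.T).item()
print(abs(cos - dot) < 1e-6)  # True: identical for unit-norm vectors
```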

Batch Inference

# Stack multiple preprocessed images
batch = torch.cat([preprocess(p) for p in image_paths], dim=0)  # [B, 3, H, W]
# Note: all images in the batch must have the same H, W
with torch.no_grad():
    embeddings = model(pixel_values=batch).style_embedding  # [B, 1024]

Input Requirements

  • Image format: RGB images (PNG, JPEG, WebP, etc.)
  • Resolution: Dynamic — the model accepts images of any resolution where height and width are multiples of 16. The recommended preprocessing is MaxResizeMod16(512), which scales the long edge to 512px while preserving aspect ratio and snapping both dimensions to multiples of 16.
  • Batch size: Dynamic. For batch inference, all images must have the same spatial dimensions. Process images with different aspect ratios individually or use padding.
  • Normalization: ImageNet statistics — mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
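One way to batch images with different aspect ratios, as suggested above, is to zero-pad each preprocessed tensor to the largest height and width in the batch. This is a sketch, not part of the model's API; whether padding pixels measurably shift the style embedding should be verified on your own data:

```python
import torch
import torch.nn.functional as F

def pad_batch(tensors):
    """Zero-pad [1, 3, H, W] tensors to a common H, W and stack to [B, 3, H, W].
    Each tensor is assumed to already have multiple-of-16 dimensions."""
    max_h = max(t.shape[2] for t in tensors)
    max_w = max(t.shape[3] for t in tensors)
    padded = [
        # F.pad order for the last two dims is (left, right, top, bottom)
        F.pad(t, (0, max_w - t.shape[3], 0, max_h - t.shape[2]))
        for t in tensors
    ]
    return torch.cat(padded, dim=0)

batch = pad_batch([torch.randn(1, 3, 512, 384), torch.randn(1, 3, 384, 512)])
print(batch.shape)  # torch.Size([2, 3, 512, 512])
```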

ONNX Model

An INT8 dynamically quantized ONNX model is available for browser-based inference:

import onnxruntime as ort

session = ort.InferenceSession("onnx/egara_net_int8.onnx")
# Reuse the same preprocessing as above; ONNX Runtime expects float32 NumPy input
pixel_values_numpy = preprocess("illustration.png").numpy()
result = session.run(None, {"pixel_values": pixel_values_numpy})
embedding = result[0]  # [1, 1024], L2-normalized

Model Output

The model outputs an EgaraNetOutput dataclass with:

  • style_embedding ([B, 1024]): L2-normalized style embedding vector
  • backbone_features ([B, N, 1024]): raw backbone features, returned when output_backbone_features=True

Intended Use

  • Style-based illustration search and retrieval
  • Clustering illustrations by artistic style
  • Analyzing style similarity between artists
  • Building style recommendation systems
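For style-based search, unit-norm embeddings make retrieval a single matrix multiply: rank a gallery by dot product with the query. A minimal sketch with random stand-in embeddings (a real index would store EgaraNet outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in gallery of 1000 L2-normalized 1024-d style embeddings
gallery = rng.normal(size=(1000, 1024)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

query = gallery[42]  # query with a known gallery member

scores = gallery @ query        # cosine similarities, shape [1000]
top5 = np.argsort(-scores)[:5]  # indices of the 5 most similar styles
print(top5[0])  # 42: the query matches itself first
```

For large galleries the same dot-product ranking drops into any approximate-nearest-neighbor index that supports inner-product search.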

Limitations

  • Optimized for 2D illustrations and digital art; may not perform well on photographs or 3D renders
  • Style embedding captures overall artistic style, not specific content or subject matter
  • Similarity scores are relative — calibrate thresholds for your specific use case
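Calibrating a similarity threshold, as recommended above, can be as simple as sweeping candidates over labeled same-style / different-style pairs and keeping the one that maximizes pair-classification accuracy. The similarity scores below are synthetic stand-ins; real scores would come from EgaraNet embeddings of labeled pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cosine similarities: same-artist pairs score higher on average
same = rng.normal(0.7, 0.1, 500)
diff = rng.normal(0.3, 0.1, 500)
scores = np.concatenate([same, diff])
labels = np.concatenate([np.ones(500), np.zeros(500)])

# Sweep candidate thresholds; keep the one with the best pair accuracy
thresholds = np.linspace(0.0, 1.0, 101)
accs = [np.mean((scores >= t) == labels) for t in thresholds]
best = thresholds[int(np.argmax(accs))]
print(best)  # lands near the midpoint of the two score distributions
```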

References

  • DINOv3: "DINOv3" arXiv:2508.10104
  • Style Transfer: "Image Style Transfer Using Convolutional Neural Networks" CVPR 2016
  • Restormer: "Restormer: Efficient Transformer for High-Resolution Image Restoration" arXiv:2111.09881
  • MHTA: "Multi-Head Transposed Attention Transformer for Sea Ice Segmentation in SAR Imagery" IGARSS 2024
  • MANIQA: "MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment" arXiv:2204.08958

Model size: ~0.3B parameters (F32, Safetensors)