EgaraNet — Illustration Style Embedding Model

EgaraNet is an embedding model that encodes the artistic style of illustrations into 1024-dimensional L2-normalized vectors. It was trained on approximately 1.2 million illustrations from around 12,000 artists, learning to produce embeddings in which illustrations by the same artist lie close together in the vector space. The model was developed under Article 30-4 of the Japanese Copyright Act.

Model Description

  • Architecture: DINOv3 ViT-L/16 backbone + StyleNet (Transposed Attention Transformer) head
  • Embedding dim: 1024
  • Input: RGB images at any resolution whose height and width are multiples of 16
  • Output: L2-normalized style embedding vector
  • Training data: ~1.2M illustrations from ~12K artists

Architecture

  • Backbone: DINOv3 ViT-L/16 (frozen during training) — extracts rich visual features
  • StyleNet Head: 3 × Transposed Attention Transformer → Attention Pooling → Projection Head
    • Transposed Attention Transformer (TAT): Computes cross-covariance attention in channel space (C×C), discarding spatial information to isolate artistic style. Uses RMSNorm and SwiGLU FFN.
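The transposed-attention idea (cross-covariance attention in the style of Restormer) can be sketched in PyTorch. This is an illustrative single-head version, not EgaraNet's actual implementation; the learned projections, RMSNorm, and SwiGLU details are omitted:

```python
import torch
import torch.nn.functional as F

def transposed_attention(x):
    """Cross-covariance (channel) attention: the attention map is the
    C x C channel-covariance matrix rather than the usual N x N token
    matrix, so the spatial arrangement of tokens never enters it."""
    # x: [B, N, C] token features
    q = k = v = x  # illustrative; real models use learned q/k/v projections
    # Normalize along the token axis so q @ k^T is a covariance-like map
    q = F.normalize(q.transpose(1, 2), dim=-1)        # [B, C, N]
    k = F.normalize(k.transpose(1, 2), dim=-1)        # [B, C, N]
    attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # [B, C, C]
    out = attn @ v.transpose(1, 2)                    # [B, C, N]
    return out.transpose(1, 2)                        # [B, N, C]

x = torch.randn(2, 196, 64)
y = transposed_attention(x)
print(y.shape)  # torch.Size([2, 196, 64])
```

Because the C × C attention map aggregates over all tokens, the output is invariant to where in the image a texture appears, which is the property the StyleNet head exploits to isolate style from composition.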

Usage with Transformers

import torch
from PIL import Image
from transformers import AutoModel
import torchvision.transforms as T

# Load model
model = AutoModel.from_pretrained("Columba1198/EgaraNet", trust_remote_code=True)
model.eval()

# Preprocess: MaxResizeMod16(512) + ImageNet normalization
def preprocess(image_path, max_size=512):
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    scale = max_size / max(w, h)
    # Snap both dimensions to the nearest multiple of 16 (minimum 16)
    new_w = max(16, round(w * scale / 16) * 16)
    new_h = max(16, round(h * scale / 16) * 16)
    img = img.resize((new_w, new_h), Image.BICUBIC)
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return transform(img).unsqueeze(0)

# Extract style embedding
with torch.no_grad():
    pixel_values = preprocess("illustration.png")
    output = model(pixel_values=pixel_values)
    embedding = output.style_embedding  # [1, 1024], L2-normalized
    print(f"Embedding shape: {embedding.shape}")

Comparing Two Illustrations

import torch.nn.functional as F

with torch.no_grad():
    emb_a = model(pixel_values=preprocess("image_a.png")).style_embedding
    emb_b = model(pixel_values=preprocess("image_b.png")).style_embedding

similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Style similarity: {similarity:.4f} ({similarity*100:.1f}%)")
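Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product, which is convenient when comparing one query against many stored embeddings at once. A small sketch with random unit vectors standing in for real EgaraNet embeddings:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for EgaraNet embeddings (unit-norm, like the model output)
emb_a = F.normalize(torch.randn(1, 1024), dim=-1)
emb_b = F.normalize(torch.randn(1, 1024), dim=-1)

cos = F.cosine_similarity(emb_a, emb_b).item()
dot = (emb_a @ emb_b.T).item()
print(abs(cos - dot) < 1e-6)  # True: identical for unit-norm vectors
```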

Batch Inference

# Stack multiple preprocessed images
batch = torch.cat([preprocess(p) for p in image_paths], dim=0)  # [B, 3, H, W]
# Note: all images in the batch must have the same H, W
with torch.no_grad():
    embeddings = model(pixel_values=batch).style_embedding  # [B, 1024]

Input Requirements

  • Image format: RGB images (PNG, JPEG, WebP, etc.)
  • Resolution: Dynamic — the model accepts images of any resolution where height and width are multiples of 16. The recommended preprocessing is MaxResizeMod16(512), which scales the long edge to 512px while preserving aspect ratio and snapping both dimensions to multiples of 16.
  • Batch size: Dynamic. For batch inference, all images must have the same spatial dimensions. Process images with different aspect ratios individually or use padding.
  • Normalization: ImageNet statistics — mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
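One way to batch images with different aspect ratios, as suggested above, is to zero-pad each preprocessed tensor to the largest height and width in the batch. This is a sketch, not part of the model's API; whether padding pixels measurably shift the style embedding should be verified on your own data:

```python
import torch
import torch.nn.functional as F

def pad_batch(tensors):
    """Zero-pad [1, 3, H, W] tensors to a common H, W and stack to [B, 3, H, W].
    Each tensor is assumed to already have multiple-of-16 dimensions."""
    max_h = max(t.shape[2] for t in tensors)
    max_w = max(t.shape[3] for t in tensors)
    padded = [
        # F.pad order for the last two dims is (left, right, top, bottom)
        F.pad(t, (0, max_w - t.shape[3], 0, max_h - t.shape[2]))
        for t in tensors
    ]
    return torch.cat(padded, dim=0)

batch = pad_batch([torch.randn(1, 3, 512, 384), torch.randn(1, 3, 384, 512)])
print(batch.shape)  # torch.Size([2, 3, 512, 512])
```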

ONNX Model

An INT8 dynamically quantized ONNX model is available for browser-based inference:

import onnxruntime as ort

session = ort.InferenceSession("onnx/egara_net_int8.onnx")
# Reuse the same preprocessing as above; ONNX Runtime expects float32 NumPy input
pixel_values_numpy = preprocess("illustration.png").numpy()
result = session.run(None, {"pixel_values": pixel_values_numpy})
embedding = result[0]  # [1, 1024], L2-normalized

Model Output

The model outputs an EgaraNetOutput dataclass with:

  • style_embedding ([B, 1024]): L2-normalized style embedding vector
  • backbone_features ([B, N, 1024]): raw backbone features, returned when output_backbone_features=True

Intended Use

  • Style-based illustration search and retrieval
  • Clustering illustrations by artistic style
  • Analyzing style similarity between artists
  • Building style recommendation systems
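For style-based search, unit-norm embeddings make retrieval a single matrix multiply: rank a gallery by dot product with the query. A minimal sketch with random stand-in embeddings (a real index would store EgaraNet outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in gallery of 1000 L2-normalized 1024-d style embeddings
gallery = rng.normal(size=(1000, 1024)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

query = gallery[42]  # query with a known gallery member

scores = gallery @ query        # cosine similarities, shape [1000]
top5 = np.argsort(-scores)[:5]  # indices of the 5 most similar styles
print(top5[0])  # 42: the query matches itself first
```

For large galleries the same dot-product ranking drops into any approximate-nearest-neighbor index that supports inner-product search.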

Limitations

  • Optimized for 2D illustrations and digital art; may not perform well on photographs or 3D renders
  • Style embedding captures overall artistic style, not specific content or subject matter
  • Similarity scores are relative — calibrate thresholds for your specific use case
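Calibrating a similarity threshold, as recommended above, can be as simple as sweeping candidates over labeled same-style / different-style pairs and keeping the one that maximizes pair-classification accuracy. The similarity scores below are synthetic stand-ins; real scores would come from EgaraNet embeddings of labeled pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cosine similarities: same-artist pairs score higher on average
same = rng.normal(0.7, 0.1, 500)
diff = rng.normal(0.3, 0.1, 500)
scores = np.concatenate([same, diff])
labels = np.concatenate([np.ones(500), np.zeros(500)])

# Sweep candidate thresholds; keep the one with the best pair accuracy
thresholds = np.linspace(0.0, 1.0, 101)
accs = [np.mean((scores >= t) == labels) for t in thresholds]
best = thresholds[int(np.argmax(accs))]
print(best)  # lands near the midpoint of the two score distributions
```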

References

  • DINOv3: "DINOv3" arXiv:2508.10104
  • Style Transfer: "Image Style Transfer Using Convolutional Neural Networks" CVPR 2016
  • Restormer: "Restormer: Efficient Transformer for High-Resolution Image Restoration" arXiv:2111.09881
  • MHTA: "Multi-Head Transposed Attention Transformer for Sea Ice Segmentation in SAR Imagery" IGARSS 2024
  • MANIQA: "MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment" arXiv:2204.08958

Model size: ~0.3B parameters (F32, Safetensors)