Backbone paper: DINOv3 ([arXiv:2508.10104](https://arxiv.org/abs/2508.10104))
EgaraNet is an embedding model that encodes the artistic style of illustrations into 1024-dimensional L2-normalized vectors. It was trained on approximately 1.2 million illustrations from around 12,000 artists, learning to produce embeddings in which illustrations by the same artist lie close together in the vector space. Training was conducted under Article 30-4 of the Japanese Copyright Act.
| Property | Value |
|---|---|
| Architecture | DINOv3 ViT-L/16 backbone + StyleNet (Transposed Attention Transformer) head |
| Embedding Dim | 1024 |
| Input | RGB images, any resolution (each dimension is snapped to a multiple of 16 during preprocessing) |
| Output | L2-normalized style embedding vector |
| Training Data | ~1.2M illustrations from ~12K artists |
```python
import torch
from PIL import Image
from transformers import AutoModel
import torchvision.transforms as T

# Load model
model = AutoModel.from_pretrained("Columba1198/EgaraNet", trust_remote_code=True)
model.eval()

# Preprocess: MaxResizeMod16(512) + ImageNet normalization
def preprocess(image_path, max_size=512):
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    scale = max_size / max(w, h)
    new_w = max(16, round(w * scale / 16) * 16)
    new_h = max(16, round(h * scale / 16) * 16)
    img = img.resize((new_w, new_h), Image.BICUBIC)
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return transform(img).unsqueeze(0)

# Extract style embedding
with torch.no_grad():
    pixel_values = preprocess("illustration.png")
    output = model(pixel_values=pixel_values)
    embedding = output.style_embedding  # [1, 1024], L2-normalized

print(f"Embedding shape: {embedding.shape}")
```
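The resize step above snaps both dimensions to multiples of 16. As a sanity check, the target-dimension computation can be reproduced in isolation (the helper name `max_resize_mod16_dims` is ours, not part of the model's API):

```python
def max_resize_mod16_dims(w, h, max_size=512):
    """Scale the long edge to max_size, then snap both dims to multiples of 16."""
    scale = max_size / max(w, h)
    new_w = max(16, round(w * scale / 16) * 16)
    new_h = max(16, round(h * scale / 16) * 16)
    return new_w, new_h

print(max_resize_mod16_dims(1000, 500))  # -> (512, 256): long edge to 512, short edge snapped
print(max_resize_mod16_dims(512, 512))   # -> (512, 512): already at target size
```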
```python
import torch.nn.functional as F

with torch.no_grad():
    emb_a = model(pixel_values=preprocess("image_a.png")).style_embedding
    emb_b = model(pixel_values=preprocess("image_b.png")).style_embedding

similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Style similarity: {similarity:.4f} ({similarity*100:.1f}%)")
```
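Because the embeddings are L2-normalized, cosine similarity against a whole gallery reduces to a single matrix product. A minimal retrieval sketch using random stand-in vectors (real ones would come from `style_embedding`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for real embeddings: one query, a gallery of 100 styles
query = F.normalize(torch.randn(1, 1024), dim=-1)      # [1, 1024]
gallery = F.normalize(torch.randn(100, 1024), dim=-1)  # [100, 1024]

# For unit vectors, cosine similarity is just a dot product
scores = query @ gallery.T                             # [1, 100]
topk = torch.topk(scores.squeeze(0), k=5)              # 5 most similar styles
print(topk.indices.tolist())
```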
```python
# Stack multiple preprocessed images
# Note: all images in the batch must have the same H and W
batch = torch.cat([preprocess(p) for p in image_paths], dim=0)  # [B, 3, H, W]

with torch.no_grad():
    embeddings = model(pixel_values=batch).style_embedding  # [B, 1024]
```
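Since MaxResizeMod16 yields different dimensions for different aspect ratios, one simple way to batch a mixed set is to group tensors by spatial shape first. A sketch with dummy tensors (`bucket_by_shape` is a hypothetical helper, not part of the model's API):

```python
import torch
from collections import defaultdict

def bucket_by_shape(tensors):
    """Group [1, 3, H, W] tensors into batches of identical H x W."""
    buckets = defaultdict(list)
    for t in tensors:
        buckets[tuple(t.shape[-2:])].append(t)
    return {shape: torch.cat(group, dim=0) for shape, group in buckets.items()}

# Dummy preprocessed images: two 512x256 and one 256x512
tensors = [torch.zeros(1, 3, 512, 256),
           torch.zeros(1, 3, 256, 512),
           torch.zeros(1, 3, 512, 256)]
batches = bucket_by_shape(tensors)
for shape, batch in batches.items():
    print(shape, tuple(batch.shape))  # each batch can go through the model in one call
```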
Preprocessing uses MaxResizeMod16(512): the long edge is scaled to 512 px while preserving the aspect ratio, and both dimensions are snapped to multiples of 16.

An INT8 dynamically quantized ONNX model is available for browser-based inference:
```python
import onnxruntime as ort

session = ort.InferenceSession("onnx/egara_net_int8.onnx")
result = session.run(None, {"pixel_values": pixel_values_numpy})
embedding = result[0]  # [1, 1024], L2-normalized
```
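For ONNX Runtime the input must be a float32 NumPy array in NCHW layout with the same ImageNet normalization as the PyTorch path. A minimal sketch of that conversion (`to_pixel_values` is our name; resizing to multiples of 16 is assumed already done, e.g. with PIL as above):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_pixel_values(img_uint8):
    """img_uint8: [H, W, 3] uint8 RGB array, already resized to multiples of 16."""
    x = img_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # per-channel ImageNet normalization
    x = x.transpose(2, 0, 1)[None]            # HWC -> NCHW, add batch dim
    return np.ascontiguousarray(x)

pixel_values_numpy = to_pixel_values(np.zeros((256, 512, 3), dtype=np.uint8))
print(pixel_values_numpy.shape, pixel_values_numpy.dtype)  # (1, 3, 256, 512) float32
```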
The model returns an `EgaraNetOutput` dataclass with the following fields:
| Field | Shape | Description |
|---|---|---|
| `style_embedding` | [B, 1024] | L2-normalized style embedding vector |
| `backbone_features` | [B, N, 1024] | Raw backbone features (optional; set `output_backbone_features=True`) |
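The patch-level `backbone_features` can be aggregated into a single vector if a custom pooling is wanted. A sketch with a dummy tensor standing in for the real output (mean pooling is our choice here, not the model's own head):

```python
import torch
import torch.nn.functional as F

# Stand-in for output.backbone_features with B=2 images and N=1024 patch tokens
features = torch.randn(2, 1024, 1024)  # [B, N, 1024]

# Mean-pool over patch tokens, then L2-normalize like style_embedding
pooled = F.normalize(features.mean(dim=1), dim=-1)  # [B, 1024]
print(pooled.shape)
```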