EUPE ViT-T/16

kittn/eupe_vitt16 is a Hugging Face transformers DINOv3 ViT conversion of facebook/EUPE-ViT-T.

Usage

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
).convert("RGB")

processor = AutoImageProcessor.from_pretrained("kittn/eupe_vitt16")
model = AutoModel.from_pretrained("kittn/eupe_vitt16").eval().to("cuda")

inputs = processor(images=image, return_tensors="pt", size={"height": 512, "width": 512}).to("cuda")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    outputs = model(**inputs)

print("clstoken:", outputs.last_hidden_state[:, 0].shape)  # torch.Size([1, 192])
print("patchtokens:", outputs.last_hidden_state[:, 1 + model.config.num_register_tokens :].shape)  # torch.Size([1, 1024, 192])
print("pooler_output:", outputs.pooler_output.shape)  # torch.Size([1, 192])

last_hidden_state contains:

  • token 0: CLS token
  • tokens 1:5: 4 register tokens
  • remaining tokens: patch tokens
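The patch tokens form a 32×32 grid for a 512×512 input (512 / 16 = 32), so they can be reshaped into a dense spatial feature map. A minimal sketch of the slicing, using a dummy tensor with the same layout (the real tensor is outputs.last_hidden_state from the usage example):

```python
import torch

# Dummy stand-in for outputs.last_hidden_state from the usage example:
# 1 CLS token + 4 register tokens + 32*32 patch tokens, hidden size 192.
num_register_tokens = 4
hidden_size = 192
last_hidden_state = torch.randn(1, 1 + num_register_tokens + 32 * 32, hidden_size)

cls_token = last_hidden_state[:, 0]                              # (1, 192)
register_tokens = last_hidden_state[:, 1 : 1 + num_register_tokens]
patch_tokens = last_hidden_state[:, 1 + num_register_tokens :]   # (1, 1024, 192)

# A 512x512 input with patch size 16 gives a 32x32 grid of patch tokens,
# which can be viewed as a dense NCHW feature map.
feature_map = patch_tokens.reshape(1, 32, 32, hidden_size).permute(0, 3, 1, 2)
print(feature_map.shape)  # torch.Size([1, 192, 32, 32])
```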

Architecture

  • Architecture: ViT-T/16
  • Hidden size: 192
  • Layers: 12
  • Attention heads: 3
  • Register tokens: 4
  • Patch size: 16
  • Parameters: 5.49M (F32 safetensors)

Minimal-loss inference

If you want to minimize the discrepancy versus the original EUPE inference path, prefer running the Hugging Face model on CUDA under torch.autocast("cuda", dtype=torch.bfloat16) rather than hard-casting the full model to bfloat16.
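The difference between the two approaches can be sketched on CPU with a toy stand-in for the model (a hypothetical two-layer module, not the actual ViT): under autocast the fp32 master weights are kept and only selected ops are lowered to bf16, while a hard cast stores the weights in bf16 and runs every op in bf16.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for the ViT (hypothetical; the real model comes from
# AutoModel.from_pretrained in the usage example above).
model = nn.Sequential(nn.Linear(192, 192), nn.LayerNorm(192)).eval()
model_hard = copy.deepcopy(model).to(torch.bfloat16)  # hard-cast variant

x = torch.randn(1, 192)

with torch.inference_mode():
    ref = model(x)  # full fp32 reference

    # Preferred: fp32 weights, ops autocast to bf16 where safe.
    with torch.autocast("cpu", dtype=torch.bfloat16):
        out_autocast = model(x)

    # Hard cast: bf16 weights, every op in bf16.
    out_hard = model_hard(x.to(torch.bfloat16))

# Autocast leaves the fp32 master weights untouched.
print(next(model.parameters()).dtype)       # torch.float32
print(next(model_hard.parameters()).dtype)  # torch.bfloat16
print((out_autocast.float() - ref).abs().max().item())
print((out_hard.float() - ref).abs().max().item())
```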

RoPE note

The stock Hugging Face DINOv3 implementation is internally correct and self-consistent, but it does not match the DINOv3 / EUPE reference implementations bitwise. The reference code persists bf16-rounded RoPE periods in the checkpoint and computes angles as coords / periods; Hugging Face instead reconstructs fp32 inv_freq from rope_theta and computes coords * inv_freq.
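The rounding gap can be reproduced with plain floats. A sketch, assuming head_dim = 64 (hidden size 192 / 3 heads) and an illustrative rope_theta of 100 (in practice, read the real value from model.config.rope_theta):

```python
import struct

def round_to_bf16(x: float) -> float:
    """Round an fp32 value to bfloat16 (round-to-nearest-even) and back to fp32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

head_dim = 64        # hidden_size / num_attention_heads = 192 / 3
rope_theta = 100.0   # illustrative; the real value is model.config.rope_theta

max_diff = 0.0
for i in range(head_dim // 4):
    exponent = i * 4.0 / head_dim
    inv_freq = rope_theta ** -exponent                    # what HF reconstructs in fp32
    period_bf16 = round_to_bf16(rope_theta ** exponent)   # what the checkpoint stores
    max_diff = max(max_diff, abs(inv_freq - 1.0 / period_bf16))

print(max_diff)  # nonzero: the two angle computations cannot agree bitwise
```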

If you want bitwise equivalence with the DINOv3 / EUPE references, run the following after loading the model in the example above. It patches the already-loaded Hugging Face model to use the exact bf16-rounded periods and the reference RoPE forward:

import math
from types import MethodType

rope = model.rope_embeddings
head_dim = model.config.hidden_size // model.config.num_attention_heads
exponents = torch.arange(head_dim // 4, dtype=torch.float32, device=rope.inv_freq.device) * (4.0 / head_dim)
periods = (rope.base ** exponents).to(torch.bfloat16).to(torch.float32)  # bf16-round, then back to fp32
rope.register_buffer("periods", periods, persistent=False)


def forward(self, pixel_values):
    _, _, height, width = pixel_values.shape
    num_patches_h = height // self.config.patch_size
    num_patches_w = width // self.config.patch_size

    # Patch-center coordinates, normalized to [-1, 1] as in the reference code.
    coords_h = torch.arange(0.5, num_patches_h, device=pixel_values.device, dtype=torch.float32) / num_patches_h
    coords_w = torch.arange(0.5, num_patches_w, device=pixel_values.device, dtype=torch.float32) / num_patches_w
    coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1).flatten(0, 1)
    coords = 2.0 * coords - 1.0

    # Angles from the bf16-rounded periods; tiled so each frequency appears twice.
    angles = 2 * math.pi * coords[:, :, None] / self.periods[None, None, :]
    angles = angles.flatten(1, 2).tile(2)

    cos = torch.cos(angles).to(dtype=pixel_values.dtype)
    sin = torch.sin(angles).to(dtype=pixel_values.dtype)
    return cos, sin


rope.forward = MethodType(forward, rope)

This restores bitwise equivalence in both pure fp32 and bf16 autocast.
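As a shape-level sanity check, the patched forward's math can be exercised standalone on dummy inputs (a sketch detached from the model, using made-up periods and head_dim = 64): each patch gets one row of head_dim cos/sin values.

```python
import math
import torch

head_dim = 64
patch_size = 16
periods = torch.linspace(1.0, 100.0, head_dim // 4)  # dummy periods for the sketch

pixel_values = torch.zeros(1, 3, 2 * patch_size, 2 * patch_size)  # 2x2 patch grid
num_h = pixel_values.shape[2] // patch_size
num_w = pixel_values.shape[3] // patch_size

# Same math as the patched forward above.
coords_h = torch.arange(0.5, num_h, dtype=torch.float32) / num_h
coords_w = torch.arange(0.5, num_w, dtype=torch.float32) / num_w
coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1).flatten(0, 1)
coords = 2.0 * coords - 1.0

angles = 2 * math.pi * coords[:, :, None] / periods[None, None, :]
angles = angles.flatten(1, 2).tile(2)

cos, sin = torch.cos(angles), torch.sin(angles)
print(cos.shape, sin.shape)  # one row per patch, head_dim columns each
```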
