# EUPE ViT-T/16

`kittn/eupe_vitt16` is a Hugging Face `transformers` DINOv3 ViT conversion of `facebook/EUPE-ViT-T`.
## Usage

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
).convert("RGB")

processor = AutoImageProcessor.from_pretrained("kittn/eupe_vitt16")
model = AutoModel.from_pretrained("kittn/eupe_vitt16").eval().to("cuda")

inputs = processor(images=image, return_tensors="pt", size={"height": 512, "width": 512}).to("cuda")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    outputs = model(**inputs)

print("cls token:", outputs.last_hidden_state[:, 0].shape)  # torch.Size([1, 192])
print("patch tokens:", outputs.last_hidden_state[:, 1 + model.config.num_register_tokens :].shape)  # torch.Size([1, 1024, 192])
print("pooler_output:", outputs.pooler_output.shape)  # torch.Size([1, 192])
```
`last_hidden_state` contains:

- token 0: CLS token
- tokens 1:5: 4 register tokens
- remaining tokens: patch tokens
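The token split above can be sketched on a dummy tensor (shapes only; the 1024 patch tokens correspond to a 512×512 input at patch size 16):

```python
import torch

# Dummy last_hidden_state for a 512x512 input:
# 1 CLS + 4 register + 1024 patch tokens, hidden size 192
num_register_tokens = 4
last_hidden_state = torch.zeros(1, 1 + num_register_tokens + 1024, 192)

cls_token = last_hidden_state[:, 0]                                   # (1, 192)
register_tokens = last_hidden_state[:, 1 : 1 + num_register_tokens]   # (1, 4, 192)
patch_tokens = last_hidden_state[:, 1 + num_register_tokens :]        # (1, 1024, 192)

# Patch tokens can be reshaped into a 32x32 spatial feature map
feature_map = patch_tokens.reshape(1, 32, 32, 192)
```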
## Architecture

- Architecture: ViT-T/16
- Hidden size: 192
- Layers: 12
- Attention heads: 3
- Register tokens: 4
- Patch size: 16
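Given these numbers, the sequence length for a square input follows directly (a small sketch; `num_tokens` is a hypothetical helper, not part of the model API):

```python
# Token count for a square input of side `image_size`, per the architecture above
patch_size = 16
num_register_tokens = 4

def num_tokens(image_size: int) -> int:
    patches_per_side = image_size // patch_size
    # 1 CLS + registers + one token per patch
    return 1 + num_register_tokens + patches_per_side ** 2

print(num_tokens(512))  # 1 + 4 + 32*32 = 1029, matching the usage example's 1024 patch tokens
```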
## Minimal-loss inference

If you want to minimize the discrepancy versus the original EUPE inference path, prefer running the Hugging Face model on CUDA under `torch.autocast("cuda", dtype=torch.bfloat16)` rather than hard-casting the full model to bfloat16.
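The difference matters because hard-casting rounds the stored weights, while autocast keeps fp32 master weights and only runs the compute in bf16. A minimal CPU sketch with a toy layer standing in for the model (the same contrast holds on CUDA):

```python
import torch

layer = torch.nn.Linear(8, 8)
x = torch.randn(1, 8)

# Hard-casting permanently rounds the stored weights to bf16
hard = torch.nn.Linear(8, 8).to(torch.bfloat16)
print(hard.weight.dtype)  # torch.bfloat16

# Autocast leaves the fp32 weights intact and only runs the matmul in bf16
with torch.autocast("cpu", dtype=torch.bfloat16):
    y = layer(x)
print(layer.weight.dtype)  # torch.float32 -- master weights untouched
print(y.dtype)             # torch.bfloat16 -- compute happened in bf16
```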
## RoPE note

The stock Hugging Face DINOv3 implementation is internally correct and self-consistent, but it does not match the DINOv3 / EUPE reference implementations bitwise. The mismatch comes from the reference code persisting bf16-rounded RoPE periods in the checkpoint and computing angles as `coords / periods`, while Hugging Face reconstructs fp32 `inv_freq` from `rope_theta` and computes `coords * inv_freq`.
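The rounding gap is easy to see numerically. A self-contained sketch (the `base` value here is hypothetical for illustration; the real one comes from the checkpoint's `rope_theta`):

```python
import torch

head_dim = 64   # ViT-T/16: hidden size 192 / 3 heads
base = 100.0    # hypothetical rope_theta, for illustration only

exponents = torch.arange(head_dim // 4, dtype=torch.float32) * (4.0 / head_dim)

# Reference path: periods rounded to bf16 when persisted, angles = coords / periods
periods_bf16 = (base ** exponents).to(torch.bfloat16).to(torch.float32)

# Hugging Face path: fp32 inv_freq reconstructed from rope_theta, angles = coords * inv_freq
inv_freq_fp32 = 1.0 / (base ** exponents)

coords = torch.linspace(-1.0, 1.0, 8)
angles_ref = coords[:, None] / periods_bf16[None, :]
angles_hf = coords[:, None] * inv_freq_fp32[None, :]

# The bf16 rounding of the periods makes the two angle tables disagree
print((angles_ref - angles_hf).abs().max())
```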
If you want bitwise equivalence with the DINOv3 / EUPE references, run the following after loading `model` as in the usage example above. It patches the already-loaded Hugging Face model to use the exact bf16-rounded periods and the reference RoPE forward:
```python
import math
from types import MethodType

rope = model.rope_embeddings
head_dim = model.config.hidden_size // model.config.num_attention_heads

# Reproduce the reference checkpoint's bf16-rounded RoPE periods
periods = (
    rope.base
    ** (torch.arange(head_dim // 4, dtype=torch.float32, device=rope.inv_freq.device) * (4.0 / head_dim))
).to(torch.bfloat16).to(torch.float32)
rope.register_buffer("periods", periods, persistent=False)

def forward(self, pixel_values):
    _, _, height, width = pixel_values.shape
    num_patches_h = height // self.config.patch_size
    num_patches_w = width // self.config.patch_size
    # Patch-center coordinates, normalized to [-1, 1]
    coords_h = torch.arange(0.5, num_patches_h, device=pixel_values.device, dtype=torch.float32) / num_patches_h
    coords_w = torch.arange(0.5, num_patches_w, device=pixel_values.device, dtype=torch.float32) / num_patches_w
    coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1).flatten(0, 1)
    coords = 2.0 * coords - 1.0
    # Reference formulation: divide by the bf16-rounded periods
    angles = 2 * math.pi * coords[:, :, None] / self.periods[None, None, :]
    angles = angles.flatten(1, 2).tile(2)
    cos = torch.cos(angles).to(dtype=pixel_values.dtype)
    sin = torch.sin(angles).to(dtype=pixel_values.dtype)
    return cos, sin

rope.forward = MethodType(forward, rope)
```
This restores bitwise equivalence in both pure fp32 and bf16 autocast.