# EUPE ViT-B/16 ONNX Export
Corrected ONNX export of `facebook/EUPE-ViT-B` for use with ONNX Runtime and latent-inspector.
This bundle supersedes the earlier broken export. The current artifact is validated against the upstream PyTorch model on 5 sample images and also passes an input-independence gate.
## Model
EUPE (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder trained with a proxy-distillation pipeline: a compact 86M student distilled from a large proxy teacher that aggregates multiple expert perception models.
| Property | Value |
|---|---|
| Architecture | ViT-B/16 |
| Parameters | 86M |
| Embedding dimension | 768 |
| Layers / Heads | 12 / 12 |
| Patch size | 16 px |
| Input size | 224 x 224 |
| Output tokens | 197 (1 CLS + 196 patches) |
| Base checkpoint | facebook/EUPE-ViT-B |
| Paper | Zhu et al. 2026 |
| Upstream code | facebookresearch/eupe |
| License | FAIR Research License |
## Export method
The corrected export path is:

1. Download `EUPE-ViT-B.pt` from the upstream Hugging Face repo.
2. Load the model through the official `facebookresearch/eupe` torch.hub entrypoint `eupe_vitb16`.
3. Call `forward_features()` and concatenate:
   - `x_norm_clstoken` -> `[B, 1, 768]`
   - `x_norm_patchtokens` -> `[B, 196, 768]`
   - final output `last_hidden_state` -> `[B, 197, 768]`
4. Export with the legacy TorchScript ONNX path (`dynamo=False`).
5. Save as `model.onnx` + `model.onnx_data`.
The newer `torch.export`-based ONNX exporter (`dynamo=True`) currently fails on EUPE during decomposition, so this artifact intentionally uses the legacy exporter until the upstream exporter bug is fixed.
## Validation
Validation report: `export.validation.json`
The artifact was accepted with these gates:

- CLS cosine `>= 0.995`
- Patch cosine `>= 0.99`
- CLS mean abs diff `<= 0.03`
- Patch mean abs diff `<= 0.05`
- CLS max abs diff `<= 0.5`
- Patch max abs diff `<= 5.0`
- Input-independence cosine `< 0.85`
Observed export result:

- `validation_passed = true`
- Worst CLS cosine across 5 images: `0.998392`
- Worst patch cosine across 5 images: `0.994251`
- Worst CLS mean abs diff: `0.022487`
- Worst patch mean abs diff: `0.030653`
- Input-independence cosine: `0.744812`
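The gates above can be expressed as simple NumPy checks. This is an illustrative sketch (the function names are ours, not the validation script's), assuming each model produces a `[197, 768]` output per image:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_parity_gates(torch_out: np.ndarray, onnx_out: np.ndarray) -> bool:
    """Apply the per-image acceptance gates to a pair of [197, 768] outputs."""
    cls_t, cls_o = torch_out[0], onnx_out[0]    # token 0: CLS
    pat_t, pat_o = torch_out[1:], onnx_out[1:]  # tokens 1..196: patches
    return (cosine(cls_t, cls_o) >= 0.995
            and cosine(pat_t, pat_o) >= 0.99
            and np.abs(cls_t - cls_o).mean() <= 0.03
            and np.abs(pat_t - pat_o).mean() <= 0.05
            and np.abs(cls_t - cls_o).max() <= 0.5
            and np.abs(pat_t - pat_o).max() <= 5.0)

def passes_independence_gate(emb_a: np.ndarray, emb_b: np.ndarray) -> bool:
    """Outputs for two *different* images must not be near-identical;
    a degenerate export that ignores its input would fail this gate."""
    return cosine(emb_a, emb_b) < 0.85
```

The input-independence gate guards against a broken graph that returns (nearly) the same embedding for every input, which parity metrics alone can miss if the reference run is fed the same constant.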
## Files
| File | Description |
|---|---|
| `model.onnx` | ONNX graph |
| `model.onnx_data` | External tensor data |
| `export.validation.json` | PyTorch vs ONNX parity report for this export |
## Usage
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")

# Dummy input; replace with a preprocessed image of shape [B, 3, 224, 224].
pixel_values = np.zeros((1, 3, 224, 224), dtype=np.float32)
last_hidden_state = session.run(["last_hidden_state"], {"pixel_values": pixel_values})[0]  # [1, 197, 768]
```
Output layout: token 0 is the CLS embedding and tokens 1-196 are patch embeddings on a 14x14 grid.
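To make that layout concrete, the output can be split into the CLS vector and a spatial patch grid. A zero array stands in for the session output so the snippet runs standalone:

```python
import numpy as np

# Placeholder for the session.run(...)[0] result from the usage snippet.
last_hidden_state = np.zeros((1, 197, 768), dtype=np.float32)

cls_embedding = last_hidden_state[:, 0]                         # [B, 768]
patch_grid = last_hidden_state[:, 1:].reshape(-1, 14, 14, 768)  # [B, 14, 14, 768]
```

The `[B, 14, 14, 768]` grid follows row-major patch order, matching the 224/16 = 14 patches per side.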
## Citation
```bibtex
@misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
  year={2026},
  eprint={2603.22387},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.22387},
}
```