# SigLIP2 Giant 256px -- Pre-compiled for AWS Neuron (Inferentia2)

Pre-compiled SigLIP2 Giant vision encoder for AWS Inferentia2 / Trainium, traced with torch_neuronx.trace().

## Model Details

| Property | Value |
| --- | --- |
| Base model | `google/siglip2-giant-opt-patch16-256` |
| Parameters | ~1.1B |
| Input | (1, 3, 256, 256) -- batch size 1, 256x256 RGB |
| Output | dict with `pooler_output` (1, 1536) and `last_hidden_state` (1, 256, 1536) |
| Compiled size | 1.88 GB |
| SDK | Neuron SDK 2.28 (torch-neuronx 2.9.0) |
| Compiler flags | `--auto-cast matmult --optlevel 3 --model-type unet-inference` |

## Compiler Flags Explained

| Flag | Purpose |
| --- | --- |
| `--auto-cast matmult` | Cast matrix multiplications to BF16 (~50% smaller model; cosine similarity vs FP32 stays above 0.9999) |
| `--optlevel 3` | Maximum compiler optimization |
| `--model-type unet-inference` | Memory-layout optimizations that reduce SRAM spill (+6% throughput) |

Note: the flag is `matmult` (two t's), not `matmul`.

## Quick Start

```python
import torch
import torch_neuronx  # must be imported before torch.jit.load
from huggingface_hub import hf_hub_download

# Download the pre-compiled model
model_path = hf_hub_download(
    repo_id="jburtoft/siglip2-giant-256-neuron",
    filename="siglip2_giant_256_neuron.pt",
)

# Load (importing torch_neuronx first registers the Neuron runtime ops)
model = torch.jit.load(model_path)
model.eval()

# Inference
inp = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    output = model(inp)

print(output["pooler_output"].shape)      # (1, 1536)
print(output["last_hidden_state"].shape)  # (1, 256, 1536)
```
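The random tensor above stands in for a real image. SigLIP-style preprocessing rescales pixels to [0, 1] and then normalizes with mean 0.5 and std 0.5 per channel; a minimal NumPy sketch of that step (the constants are assumed from the base model's processor config -- verify against the checkpoint's `preprocessor_config.json`):

```python
import numpy as np

def preprocess(image_u8: np.ndarray) -> np.ndarray:
    """Convert a (256, 256, 3) uint8 RGB image to a (1, 3, 256, 256) float32 array.

    Assumes SigLIP-style normalization: rescale by 1/255, then (x - 0.5) / 0.5.
    Resizing to 256x256 is expected to happen beforehand.
    """
    x = image_u8.astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - 0.5) / 0.5                       # normalize to [-1, 1]
    x = x.transpose(2, 0, 1)[None, ...]       # HWC -> NCHW, add batch dim
    return x

img = np.zeros((256, 256, 3), dtype=np.uint8)
inp = preprocess(img)
print(inp.shape)  # (1, 3, 256, 256)
```

Convert the result with `torch.from_numpy(inp)` before passing it to the compiled model.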

## Performance (inf2.xlarge)

| Config | Throughput | Latency (avg) |
| --- | --- | --- |
| Single core | 70.12 img/s | 14.26 ms |
| DP=2 (both cores) | 117.76 img/s | 16.98 ms |
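Numbers like these can be reproduced with a plain timing loop. This sketch times an arbitrary callable; `run_inference` is a hypothetical stand-in for a call into the loaded Neuron model:

```python
import time

def benchmark(run_inference, n_iters=100, warmup=10):
    """Measure average latency (ms) and throughput (calls/s) of a callable."""
    for _ in range(warmup):               # warmup calls excluded from timing
        run_inference()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000.0
    throughput = n_iters / elapsed
    return latency_ms, throughput

# Trivial stand-in workload; replace with e.g. lambda: model(inp)
lat, thr = benchmark(lambda: sum(range(1000)))
print(f"{lat:.3f} ms/iter, {thr:.1f} iters/s")
```

For the DP=2 row, the model would typically be wrapped with `torch_neuronx.DataParallel` so that both NeuronCores serve requests concurrently.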

## Accuracy

Cosine similarity vs FP32 CPU reference (20 samples):

- Pooler output: 0.999924--0.999963
- Hidden states: 0.999943--0.999970
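A check like this needs nothing beyond NumPy; a sketch of the cosine-similarity comparison, where the two inputs would be the flattened Neuron output and the FP32 CPU reference:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened to vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score ~1.0; BF16 rounding pushes real outputs slightly below.
v = np.random.randn(1536)
print(cosine_similarity(v, v))  # ~1.0
```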

## Compile It Yourself

The pre-compiled model works on any inf2 or trn1 instance (both use NeuronCore-v2). To recompile (requires inf2.8xlarge or larger due to host RAM):

```python
import torch
import torch_neuronx
from transformers import SiglipVisionModel

model = SiglipVisionModel.from_pretrained("google/siglip2-giant-opt-patch16-256")
model.eval()

model_neuron = torch_neuronx.trace(
    model,
    torch.randn(1, 3, 256, 256),
    compiler_workdir="./compile_workdir",
    compiler_args=[
        "--auto-cast", "matmult",
        "--optlevel", "3",
        "--model-type", "unet-inference",
    ],
)
torch.jit.save(model_neuron, "siglip2_giant_256_neuron.pt")
```

Compilation takes ~15 minutes on inf2.8xlarge. The resulting .pt file is portable across all inf2 and trn1 instance sizes.

**inf2.xlarge users:** Compilation OOMs on inf2.xlarge (only 4 vCPUs / ~16 GB host RAM). Use this pre-compiled model or compile on a larger instance.

## SDK Compatibility

This model was compiled with Neuron SDK 2.28. It should work with SDK 2.28+. If you encounter loading errors on a different SDK version, recompile using the instructions above.
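To fail fast before `torch.jit.load`, a small version guard is enough. A minimal sketch with a hand-rolled comparison; in a real script the installed version would come from `importlib.metadata.version("torch-neuronx")` rather than a hardcoded string:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '2.10.0' >= '2.9.0'."""
    def parse(v: str):
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return parse(installed) >= parse(required)

# Hardcoded here for illustration only
print(version_at_least("2.9.0", "2.9.0"))   # True
print(version_at_least("2.8.1", "2.9.0"))   # False
```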

## License

Apache 2.0 (same as the base model).
