# SigLIP2 Giant 256px -- Pre-compiled for AWS Neuron (Inferentia2)

Pre-compiled SigLIP2 Giant vision encoder for AWS Inferentia2 / Trainium, traced with torch_neuronx.trace().

## Model Details

| Property | Value |
| --- | --- |
| Base model | `google/siglip2-giant-opt-patch16-256` |
| Parameters | ~1.1B |
| Input | (1, 3, 256, 256) -- batch size 1, 256x256 RGB |
| Output | dict with `pooler_output` (1, 1536) and `last_hidden_state` (1, 256, 1536) |
| Compiled size | 1.88 GB |
| SDK | Neuron SDK 2.28 (torch-neuronx 2.9.0) |
| Compiler flags | `--auto-cast matmult --optlevel 3 --model-type unet-inference` |

## Compiler Flags Explained

| Flag | Purpose |
| --- | --- |
| `--auto-cast matmult` | Cast matrix multiplications to BF16 (~50% smaller model; cosine similarity vs FP32 stays above 0.9999) |
| `--optlevel 3` | Maximum compiler optimization |
| `--model-type unet-inference` | Memory-layout optimizations that reduce SRAM spill (+6% throughput) |

Note: the flag is `matmult` (two t's), not `matmul`.

## Quick Start

```python
import torch
import torch_neuronx  # must be imported before torch.jit.load
from huggingface_hub import hf_hub_download

# Download the pre-compiled model
model_path = hf_hub_download(
    repo_id="jburtoft/siglip2-giant-256-neuron",
    filename="siglip2_giant_256_neuron.pt",
)

# Load (importing torch_neuronx first registers the Neuron runtime ops)
model = torch.jit.load(model_path)
model.eval()

# Inference
inp = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    output = model(inp)

print(output["pooler_output"].shape)      # (1, 1536)
print(output["last_hidden_state"].shape)  # (1, 256, 1536)
```
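The random tensor above stands in for a real image. SigLIP-style preprocessing rescales pixels to [0, 1] and then normalizes with mean 0.5 and std 0.5 per channel; a minimal NumPy sketch of that step (the constants are assumed from the base model's processor config -- verify against the checkpoint's `preprocessor_config.json`):

```python
import numpy as np

def preprocess(image_u8: np.ndarray) -> np.ndarray:
    """Convert a (256, 256, 3) uint8 RGB image to a (1, 3, 256, 256) float32 array.

    Assumes SigLIP-style normalization: rescale by 1/255, then (x - 0.5) / 0.5.
    Resizing to 256x256 is expected to happen beforehand.
    """
    x = image_u8.astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - 0.5) / 0.5                       # normalize to [-1, 1]
    x = x.transpose(2, 0, 1)[None, ...]       # HWC -> NCHW, add batch dim
    return x

img = np.zeros((256, 256, 3), dtype=np.uint8)
inp = preprocess(img)
print(inp.shape)  # (1, 3, 256, 256)
```

Convert the result with `torch.from_numpy(inp)` before passing it to the compiled model.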

## Performance (inf2.xlarge)

| Config | Throughput | Latency (avg) |
| --- | --- | --- |
| Single core | 70.12 img/s | 14.26 ms |
| DP=2 (both cores) | 117.76 img/s | 16.98 ms |
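Numbers like these can be reproduced with a plain timing loop. This sketch times an arbitrary callable; `run_inference` is a hypothetical stand-in for a call into the loaded Neuron model:

```python
import time

def benchmark(run_inference, n_iters=100, warmup=10):
    """Measure average latency (ms) and throughput (calls/s) of a callable."""
    for _ in range(warmup):               # warmup calls excluded from timing
        run_inference()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000.0
    throughput = n_iters / elapsed
    return latency_ms, throughput

# Trivial stand-in workload; replace with e.g. lambda: model(inp)
lat, thr = benchmark(lambda: sum(range(1000)))
print(f"{lat:.3f} ms/iter, {thr:.1f} iters/s")
```

For the DP=2 row, the model would typically be wrapped with `torch_neuronx.DataParallel` so that both NeuronCores serve requests concurrently.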

## Accuracy

Cosine similarity vs FP32 CPU reference (20 samples):

- Pooler output: 0.999924--0.999963
- Hidden states: 0.999943--0.999970
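A check like this needs nothing beyond NumPy; a sketch of the cosine-similarity comparison, where the two inputs would be the flattened Neuron output and the FP32 CPU reference:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened to vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score ~1.0; BF16 rounding pushes real outputs slightly below.
v = np.random.randn(1536)
print(cosine_similarity(v, v))  # ~1.0
```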

## Compile It Yourself

The pre-compiled model works on any inf2 or trn1 instance (both use NeuronCore-v2). To recompile (requires inf2.8xlarge or larger due to host RAM):

```python
import torch
import torch_neuronx
from transformers import SiglipVisionModel

model = SiglipVisionModel.from_pretrained("google/siglip2-giant-opt-patch16-256")
model.eval()

model_neuron = torch_neuronx.trace(
    model,
    torch.randn(1, 3, 256, 256),
    compiler_workdir="./compile_workdir",
    compiler_args=[
        "--auto-cast", "matmult",
        "--optlevel", "3",
        "--model-type", "unet-inference",
    ],
)
torch.jit.save(model_neuron, "siglip2_giant_256_neuron.pt")
```

Compilation takes ~15 minutes on inf2.8xlarge. The resulting .pt file is portable across all inf2 and trn1 instance sizes.

**inf2.xlarge users:** Compilation OOMs on inf2.xlarge (only 4 vCPUs / ~16 GB host RAM). Use this pre-compiled model or compile on a larger instance.

## SDK Compatibility

This model was compiled with Neuron SDK 2.28. It should work with SDK 2.28+. If you encounter loading errors on a different SDK version, recompile using the instructions above.
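To fail fast before `torch.jit.load`, a small version guard is enough. A minimal sketch with a hand-rolled comparison; in a real script the installed version would come from `importlib.metadata.version("torch-neuronx")` rather than a hardcoded string:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '2.10.0' >= '2.9.0'."""
    def parse(v: str):
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return parse(installed) >= parse(required)

# Hardcoded here for illustration only
print(version_at_least("2.9.0", "2.9.0"))   # True
print(version_at_least("2.8.1", "2.9.0"))   # False
```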

## License

Apache 2.0 (same as the base model).
