Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Paper: 2603.19209
Released inference checkpoints for the paper Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders.
These artifacts are intended to be used with the public inference code release in `vlm-ssm-vision-encoders`.
Each artifact contains exactly two files:

- `config.json`
- `checkpoints/latest-checkpoint.pt`

Usage remains subject to the licenses and terms of the underlying pretrained components, including the Vicuna base model and the released vision backbones.
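Given that fixed layout, a small helper can parse the config and locate the checkpoint after an artifact has been extracted. This is a sketch with a hypothetical function name (`load_artifact`), not part of the release; actually loading the `.pt` file would use the inference code's own loader (e.g. `torch.load`), omitted here.

```python
import json
from pathlib import Path


def load_artifact(root: str) -> tuple[dict, Path]:
    """Parse config.json and locate the checkpoint inside an
    extracted artifact directory (layout as listed above).

    Hypothetical helper; loading the .pt weights themselves is
    left to the inference code (e.g. torch.load).
    """
    root_path = Path(root)
    config = json.loads((root_path / "config.json").read_text())
    ckpt = root_path / "checkpoints" / "latest-checkpoint.pt"
    if not ckpt.is_file():
        raise FileNotFoundError(ckpt)
    return config, ckpt
```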
| Public ID | Family | Task | Artifact | Size (GiB) | SHA256 |
|---|---|---|---|---|---|
| vit-s-in1k-224 | vit | classification | vit-s-in1k-224.tar | 25.17 | 9359fc2e2bcd3a5afe5b801fcf99b964dad8becc398eac79ac118c9334340505 |
| maxvit-t-in1k-224-s3 | maxvit | classification | maxvit-t-in1k-224-s3.tar | 25.17 | 2b844750143b028f90bce4d96f00696713c15c0e332ccc7dac770fa485869f4e |
| mambavision-b-in1k-224-s3 | mambavision | classification | mambavision-b-in1k-224-s3.tar | 25.17 | c8cf2e870dbda8bd6d45b6388c4bc71e9ee8296b28cdf44c843d9e5916eeeeeb |
| vmamba-s-in1k-224-s3 | vmamba | classification | vmamba-s-in1k-224-s3.tar | 25.17 | 0b3febfb685975ea8b9e81ab3e9f2f9637b67b8cca92f52a7899ba5a1130108f |
| vitdet-b-coco-1024 | vitdet | detection | vitdet-b-coco-1024.tar | 25.18 | 15d4c2bc08c44c9fe731ca0fbdf7449833782fcc1daace1866765dd753fec1de |
| vit-adapter-deit-b-ade20k-512 | vit_adapter | segmentation | vit-adapter-deit-b-ade20k-512.tar | 25.18 | d0cbd0a1d698496bbed4f38ca05e50e8a0f2a07e265157b7053beb4c9f64a73f |
| vmamba-s-coco-1333x800 | vmamba | detection | vmamba-s-coco-1333x800.tar | 25.17 | e9629e577275c042b0a9c48e666c9af81fef7291b7d23ed2b47ca03b4b0fa118 |
| vmamba-s-ade20k-512 | vmamba | segmentation | vmamba-s-ade20k-512.tar | 25.17 | bb3eef2ba5f7b9abd2cdd0d459e88c06d05ab75d8250843a7c51c819a509b188 |
Training and evaluation code will be released separately.