# VisionEncoder Checkpoints
Final model checkpoints from the VisionEncoder research project.
Training code: https://github.com/xiaomoguhz/VisionEncoder
## Contents
Each directory corresponds to one training pipeline in the code repo:
| Directory | Training code |
|---|---|
| `declip_siglip2/spatial_align/` | `declip_siglip2/` – DeCLIP spatial alignment distillation on SigLIP2 using DINOv2 / DINOv3 as teacher |
| `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
| `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
| `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% of the data (`run_s2.sh`) |
| `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/` – register token injection + auto-calibrated GP threshold loss |
| `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with the SigLIP2 encoder |
| `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with the DeCLIP-SigLIP2 encoder |
| `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
| `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
| `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
| `ms-swift-data/` | Not a checkpoint: preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
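Since each directory above is an independent checkpoint, you can fetch just the one you need instead of the whole repo. A minimal sketch using `huggingface_hub` follows; the `repo_id` is a placeholder (this card does not state the repo id), so substitute the actual one from the Hub:

```python
def checkpoint_pattern(subdir: str) -> str:
    """Glob pattern matching every file under one checkpoint directory."""
    return subdir.rstrip("/") + "/**"


def download_checkpoint(subdir: str, repo_id: str) -> str:
    """Download only `subdir` from `repo_id`; returns the local snapshot path.

    Requires `pip install huggingface_hub` (imported lazily here so the
    helper above stays dependency-free).
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=[checkpoint_pattern(subdir)],
    )


if __name__ == "__main__":
    # Placeholder repo id -- replace with the real one for these checkpoints.
    # download_checkpoint("kd_mllm/s1_kd_pretrain/", "your-org/VisionEncoder-checkpoints")
    print(checkpoint_pattern("kd_mllm/s1_kd_pretrain/"))
```

Filtering with `allow_patterns` avoids pulling every pipeline's weights at once, which matters for a multi-directory repo of this size.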
## Related repositories
- Code: https://github.com/xiaomoguhz/VisionEncoder
- Evaluation data (~323 GB tarballs): https://huggingface.co/datasets/xiaomoguhzz/R3-Bench-data