---
license: apache-2.0
library_name: transformers
tags:
  - vision-encoder
  - distillation
  - video-language
  - siglip2
  - dinov3
---

# VisionEncoder Checkpoints

This repository hosts the final model checkpoints from the VisionEncoder research project.

Training code: https://github.com/xiaomoguhz/VisionEncoder

## Contents

Each directory corresponds to one training pipeline in the code repo:

| Directory | Training code |
| --- | --- |
| `declip_siglip2/spatial_align/` | `declip_siglip2/` — DeCLIP spatial alignment distillation on SigLIP2 using DINOv2 / DINOv3 as teacher |
| `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
| `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
| `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% data (`run_s2.sh`) |
| `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/` — register token injection + auto-calibrated GP threshold loss |
| `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with SigLIP2 encoder |
| `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with DeCLIP-SigLIP2 encoder |
| `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
| `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
| `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
| `ms-swift-data/` | Not a checkpoint — preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
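Because each directory above is an independent checkpoint, you usually want only one of them rather than the whole repository. A minimal sketch of a selective download using `huggingface_hub.snapshot_download` with `allow_patterns` follows; the repo id `your-namespace/VisionEncoder` is a placeholder (substitute the actual Hub repo id), and `checkpoint_patterns` is a small helper introduced here for illustration:

```python
def checkpoint_patterns(subdir: str) -> list[str]:
    """Glob patterns restricting a snapshot download to one checkpoint directory."""
    return [f"{subdir.rstrip('/')}/**"]


if __name__ == "__main__":
    from huggingface_hub import snapshot_download

    # Fetches only the stage-1 SigLIP2 + Qwen3-4B checkpoint, not the full repo.
    # "your-namespace/VisionEncoder" is a placeholder repo id.
    local_dir = snapshot_download(
        repo_id="your-namespace/VisionEncoder",
        allow_patterns=checkpoint_patterns("kd_mllm/s1_siglip2_qwen3_4b"),
    )
    print(local_dir)
```

The same pattern works for any row of the table; pass the directory name from the first column to `checkpoint_patterns`.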

## Related repositories