---
license: apache-2.0
library_name: transformers
tags:
- vision-encoder
- distillation
- video-language
- siglip2
- dinov3
---
# VisionEncoder Checkpoints
Final model checkpoints from the **VisionEncoder** research project.
**Training code**: https://github.com/xiaomoguhz/VisionEncoder
## Contents
Each directory corresponds to one training pipeline in the code repo:
| Directory | Training code |
|---|---|
| `declip_siglip2/spatial_align/` | `declip_siglip2/` — DeCLIP spatial alignment distillation on SigLIP2, using DINOv2 / DINOv3 as the teacher |
| `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
| `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
| `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% data (`run_s2.sh`) |
| `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/` — register token injection + auto-calibrated GP threshold loss |
| `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with SigLIP2 encoder |
| `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with DeCLIP-SigLIP2 encoder |
| `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
| `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
| `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
| `ms-swift-data/` | Not a checkpoint — preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
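
Since each checkpoint lives in its own subdirectory, you can pull just one pipeline's weights rather than the whole repo. A minimal sketch with `huggingface_hub` — the repo id below is an assumption (substitute this repo's actual Hub id), and the subdirectory is any row from the table above:

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

REPO_ID = "xiaomoguhzz/VisionEncoder"   # assumed Hub id of this checkpoint repo
SUBDIR = "kd_mllm/s1_kd_pretrain"       # any checkpoint directory from the table

def fetch_checkpoint(repo_id: str = REPO_ID, subdir: str = SUBDIR) -> str:
    """Download only the files under `subdir` and return the local cache path."""
    return snapshot_download(repo_id=repo_id, allow_patterns=[f"{subdir}/*"])
```

The returned path points at the cached snapshot root; load the encoder from `os.path.join(path, SUBDIR)` with the usual `transformers` `from_pretrained` call.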
## Related repositories
- **Code**: https://github.com/xiaomoguhz/VisionEncoder
- **Evaluation data (~323 GB tarballs)**: https://huggingface.co/datasets/xiaomoguhzz/R3-Bench-data