---
license: mit
library_name: open_clip
pipeline_tag: zero-shot-image-classification
tags:
- open_clip
- clip
- vision-language-model
- zero-shot-image-classification
- image-text-retrieval
- research
- long-tail
- datacomp
---
# DynamiCS ViT-B-16 on DataComp-DFN
## Model Details
This repository hosts two OpenCLIP-compatible PyTorch checkpoints for **DynamiCS**, a dynamic cluster-based data sampling method for efficient and long-tail-aware vision-language pre-training.
The checkpoints correspond to the `DataComp-DFN (130M)` results reported in the DynamiCS project repository and paper draft, using a **ViT-B/16** image encoder and the OpenCLIP text tower.
### Available checkpoints
| File | Samples Seen @ Resolution | Tokens | ImageNet-1K | Let It Wag! | GPU-hours |
| --- | --- | ---: | ---: | ---: | ---: |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | `1.28B@112 + 128M@224` | 81 | 71.3 | 50.2 | 163 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | `2.56B@112 + 128M@224` | 81 | 72.6 | 52.0 | 299 |
### Model sources
- Code: `https://github.com/MingliangLiang3/DynamiCS`
- Implementation base: `https://github.com/mlfoundations/open_clip`
- Paper title: `Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training`
## Intended Uses
These checkpoints are intended for:
- research on efficient vision-language model pre-training
- research on long-tail-aware data sampling and semantic balancing
- zero-shot image classification experiments
- image and text embedding extraction within the OpenCLIP framework
- benchmarking on long-tail evaluation datasets such as Let It Wag!
## How to Use
These files are stored as **training checkpoints**, not as Hub-native exported `open_clip_pytorch_model.bin` weights. They can be loaded with the DynamiCS/OpenCLIP codebase using `open_clip.load_checkpoint`, which extracts the `state_dict` automatically when needed.
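```python
import open_clip

# Build the ViT-B/16 architecture; the third return value is the eval-time image transform.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16')
# Load the DynamiCS weights from the training checkpoint into the model.
open_clip.load_checkpoint(model, '/path/to/DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()  # inference mode: disables dropout and similar training-only behavior
```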
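For readers who want to inspect the files directly, the following is a minimal sketch of what extracting the `state_dict` from a training checkpoint typically involves. It assumes the common OpenCLIP training-checkpoint layout (a dictionary with a `state_dict` entry, possibly with `module.` key prefixes left over from distributed training). The helper name `extract_state_dict` and the toy keys are hypothetical, and the actual contents of these files have not been verified here.

```python
def extract_state_dict(checkpoint):
    """Return bare model weights from a training checkpoint (assumed layout)."""
    # Training checkpoints usually bundle the weights with epoch and optimizer
    # state; exported weights are already a flat name -> tensor mapping.
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        state_dict = checkpoint["state_dict"]
    else:
        state_dict = checkpoint
    # DistributedDataParallel prefixes every parameter name with "module.".
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy stand-in for what torch.load(...) might return; real values are tensors.
toy_checkpoint = {
    "epoch": 10,
    "state_dict": {"module.logit_scale": 4.6, "module.visual.conv1.weight": [0.1]},
}
weights = extract_state_dict(toy_checkpoint)
print(sorted(weights))
```

In normal use `open_clip.load_checkpoint` handles this step, so the helper is only useful for inspection or for converting the file to another format.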
## Training Data
The checkpoints were trained on a **DataComp-DFN** subset derived from DataComp-Large and filtered with DFN. In the project paper, the accessible subset is described as approximately **130M** image-text pairs after accounting for unavailable or expired URLs.
DynamiCS computes per-sample sampling probabilities from semantic image clusters built with:
- DINOv2 ViT-B/16 image embeddings
- FAISS spherical k-means clustering
- post-clustering centroid refinement
- dynamic per-epoch cluster-based sampling
The exact web-scale training shards are not redistributed in this repository.
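As an illustration of the post-clustering centroid refinement step listed above, here is a dependency-free sketch of one plausible form it could take: merging centroids whose similarity exceeds a threshold. It assumes the threshold is a cosine similarity and that merging is transitive (handled with a union-find structure). Both are assumptions, the function names are hypothetical, and the DynamiCS repository remains the reference for the actual procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_centroids(centroids, threshold=0.70):
    """Group centroid indices whose cosine similarity is >= threshold."""
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is quadratic: fine for a toy, too slow for 50k centroids.
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine(centroids[i], centroids[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


toy_centroids = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
merged = merge_centroids(toy_centroids)
print(merged)
```

At the scale described here, a nearest-neighbor index would replace the quadratic loop; the sketch only shows the grouping logic.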
## Training Procedure
The training pipeline is based on OpenCLIP and the DynamiCS extensions in the GitHub repository.
### Core DynamiCS settings
- cluster count: `50k`
- centroid merge threshold: `0.70`
- cluster-scaling exponent: `alpha = 0.2`
- target sampling budget: `50%` of the accessible dataset per epoch
- image encoder: `ViT-B/16`
- maximum text length: `32`
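To give a feel for how a cluster-scaling exponent and a sampling budget interact, the sketch below computes a per-sample keep probability for each cluster. It assumes each cluster's share of the budget is proportional to `size ** alpha` and that a sample's keep probability is its cluster's share divided by the cluster size, capped at 1. This formula is an illustrative assumption, not the rule confirmed by the paper, and the function name is hypothetical.

```python
def cluster_keep_probabilities(cluster_sizes, alpha=0.2, budget_fraction=0.5):
    """Per-sample keep probability for each non-empty cluster (assumed rule)."""
    total = sum(cluster_sizes)
    budget = budget_fraction * total              # samples to draw this epoch
    weights = [n ** alpha for n in cluster_sizes]  # alpha < 1 flattens head clusters
    weight_sum = sum(weights)
    probs = []
    for n, w in zip(cluster_sizes, weights):
        quota = budget * w / weight_sum           # samples allotted to this cluster
        probs.append(min(1.0, quota / n))         # cannot keep more than all of it
    return probs


# One head cluster and two tail clusters.
sizes = [100_000, 1_000, 10]
probs = cluster_keep_probabilities(sizes)
for n, p in zip(sizes, probs):
    print(f"cluster size {n:>7}: keep probability {p:.3f}")
```

With a small exponent, large clusters are heavily subsampled while small clusters are kept almost entirely, which is the long-tail-aware behavior the method targets. Capped clusters leave part of the budget unused in this simplified version; a fuller implementation would redistribute it.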
### Optimization and hardware
- pre-training at `112x112`
- fine-tuning at `224x224`
- mixed precision: `amp_bf16`
- hardware: `2 nodes x 4 H100 GPUs` (8 GPUs total)
### Run variants in this repo
- `1.28B@112 + 128M@224`: lower-cost DynamiCS checkpoint
- `2.56B@112 + 128M@224`: longer-training DynamiCS checkpoint
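The run descriptions can be turned into rough epoch and wall-clock estimates with a few lines of arithmetic. The sketch below assumes the dataset is exactly 130M pairs, that the 50% budget applies throughout the `112x112` stage, and that the reported GPU-hours cover the whole run on all 8 GPUs. None of these are stated explicitly in this card, so treat the outputs as back-of-the-envelope figures.

```python
ACCESSIBLE_PAIRS = 130e6   # approximate accessible dataset size
BUDGET_FRACTION = 0.5      # target sampling budget per epoch
NUM_GPUS = 8               # 2 nodes x 4 H100 GPUs

# Samples seen at 112x112 and reported GPU-hours for each checkpoint.
runs = {
    "1.28B": {"pretrain_samples": 1.28e9, "gpu_hours": 163},
    "2.56B": {"pretrain_samples": 2.56e9, "gpu_hours": 299},
}


def summarize(run):
    """Estimate pre-training epochs and wall-clock hours for one run."""
    samples_per_epoch = ACCESSIBLE_PAIRS * BUDGET_FRACTION
    return {
        "pretrain_epochs": run["pretrain_samples"] / samples_per_epoch,
        "wall_clock_hours": run["gpu_hours"] / NUM_GPUS,
    }


summaries = {name: summarize(run) for name, run in runs.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 1) for key, value in summary.items()})
```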
## Evaluation
The primary reported metrics for these checkpoints are zero-shot top-1 classification on:
- **ImageNet-1K**
- **Let It Wag!** (a long-tail classification benchmark)
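For reference, zero-shot top-1 accuracy can be computed from embeddings as sketched below. The example is dependency-free and uses toy 2-D vectors; in practice the embeddings would come from `model.encode_image` and `model.encode_text` in OpenCLIP, with one text embedding per class (often averaged over several prompt templates). The function names and toy data are illustrative only and are not the evaluation script behind the reported numbers.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_top1(image_embeddings, class_text_embeddings, labels):
    """Top-1 accuracy (%) when each image is assigned its most similar class."""
    class_vecs = [l2_normalize(t) for t in class_text_embeddings]
    correct = 0
    for embedding, label in zip(image_embeddings, labels):
        image_vec = l2_normalize(embedding)
        # Cosine similarity between the image and every class embedding.
        scores = [sum(a * b for a, b in zip(image_vec, c)) for c in class_vecs]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy data: two classes, three images (the last one is misclassified).
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.5]]
labels = [0, 1, 1]
accuracy = zero_shot_top1(images, class_embeddings, labels)
print(f"top-1 accuracy: {accuracy:.1f}%")
```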
### Reported results
| Checkpoint | ImageNet-1K | Let It Wag! |
| --- | ---: | ---: |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | 71.3 | 50.2 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | 72.6 | 52.0 |
These results are taken from the project repository and accompanying paper draft.
## License
The underlying code repository is released under the MIT License. Model users are responsible for ensuring that their use and any redistribution of checkpoints comply with the terms, restrictions, and policies associated with the underlying training data and their deployment context.
## Citation
```bibtex
@article{liang2026dynamics,
title={Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training},
author={Mingliang Liang and Zhuoran Liu and Arjen P. de Vries and Martha Larson},
journal={arXiv preprint arXiv:2604.27932},
year={2026}
}
```