---
license: mit
library_name: open_clip
pipeline_tag: zero-shot-image-classification
tags:
- open_clip
- clip
- vision-language-model
- zero-shot-image-classification
- image-text-retrieval
- research
- long-tail
- datacomp
---

# DynamiCS ViT-B-16 on DataComp-DFN

## Model Details
This repository hosts two OpenCLIP-compatible PyTorch checkpoints for DynamiCS, a dynamic cluster-based data sampling method for efficient and long-tail-aware vision-language pre-training.
The checkpoints correspond to the DataComp-DFN (130M) results reported in the DynamiCS project repository and paper draft, using a ViT-B/16 image encoder and the OpenCLIP text tower.
### Available checkpoints

| File | Samples Seen @ Resolution | Tokens | ImageNet-1K | Let It Wag! | GPU-hours |
|---|---|---|---|---|---|
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | 1.28B@112 + 128M@224 | 81 | 71.3 | 50.2 | 163 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | 2.56B@112 + 128M@224 | 81 | 72.6 | 52.0 | 299 |
### Model sources

- Code: https://github.com/MingliangLiang3/DynamiCS
- Implementation base: https://github.com/mlfoundations/open_clip
- Paper title: Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training
## Intended Uses
These checkpoints are intended for:
- research on efficient vision-language model pre-training
- research on long-tail-aware data sampling and semantic balancing
- zero-shot image classification experiments
- image and text embedding extraction within the OpenCLIP framework
- benchmarking on long-tail evaluation datasets such as Let It Wag!
## How to Use

These files are stored as training checkpoints, not as Hub-native exported `open_clip_pytorch_model.bin` weights. They can be loaded with the DynamiCS/OpenCLIP codebase using `open_clip.load_checkpoint`, which extracts the `state_dict` automatically when needed.
```python
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16')
open_clip.load_checkpoint(model, '/path/to/DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()
```
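Once loaded, the model behaves like any other OpenCLIP model for embedding extraction and image-text similarity scoring. The snippet below is a minimal sketch; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image

# Placeholder inputs: substitute your own image and candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat", "a diagram"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each caption (usable for retrieval ranking).
similarity = image_features @ text_features.T
print(similarity)
```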
## Training Data

The checkpoints were trained on a DataComp-DFN subset derived from DataComp-Large and filtered with DFN (Data Filtering Networks). In the project paper, the accessible subset is described as approximately 130M image-text pairs after accounting for unavailable or expired URLs.

DynamiCS computes per-sample sampling probabilities from semantic image clusters built with the following pipeline (a sketch of the clustering step follows the list):
- DINOv2 ViT-B/16 image embeddings
- FAISS spherical k-means clustering
- post-clustering centroid refinement
- dynamic per-epoch cluster-based sampling
The exact web-scale training shards are not redistributed in this repository.
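As a rough illustration of the clustering step above, the sketch below runs FAISS spherical k-means over precomputed image embeddings. The embedding file name is a placeholder and the centroid-merge step is only indicated in a comment; the actual pipeline is implemented in the DynamiCS repository.

```python
import faiss
import numpy as np

# Placeholder: precomputed DINOv2 ViT-B/16 image embeddings, one row per image.
embeddings = np.load("dinov2_vitb16_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)  # unit-norm rows, so k-means runs on the sphere

# Spherical k-means with FAISS; 50k clusters matches the reported setting.
kmeans = faiss.Kmeans(embeddings.shape[1], 50_000, niter=20, spherical=True, verbose=True)
kmeans.train(embeddings)

# Assign every image to its nearest centroid.
_, cluster_ids = kmeans.index.search(embeddings, 1)

# Post-clustering refinement would then merge centroids whose cosine similarity
# exceeds the reported 0.70 threshold before per-epoch sampling.
```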
## Training Procedure
The training pipeline is based on OpenCLIP and the DynamiCS extensions in the GitHub repository.
### Core DynamiCS settings

- cluster count: 50k
- centroid merge threshold: 0.70
- cluster-scaling exponent: `alpha = 0.2` (see the sketch after this list)
- target sampling budget: 50% of the accessible dataset per epoch
- image encoder: ViT-B/16
- maximum text length: 32
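One common way to turn a cluster-scaling exponent into long-tail-aware sampling is a power law over cluster sizes. The sketch below uses that reading purely as an illustration of the settings above (the exact DynamiCS weighting is defined in the GitHub repository), with `alpha = 0.2` and the 50% per-epoch budget taken from the list.

```python
import numpy as np

def cluster_sampling_counts(cluster_sizes, alpha=0.2, budget_frac=0.5):
    """Illustrative power-law re-weighting over clusters (not the exact DynamiCS rule).

    A small alpha flattens the size distribution, so tail clusters are sampled
    relatively more often than their raw size would suggest.
    """
    sizes = np.asarray(cluster_sizes, dtype=np.float64)
    weights = sizes ** alpha
    weights /= weights.sum()
    budget = budget_frac * sizes.sum()  # e.g. 50% of the accessible data per epoch
    # Never request more samples from a cluster than it actually contains.
    return np.minimum(np.round(weights * budget), sizes).astype(int)

# Toy example: one huge head cluster and a few tail clusters.
print(cluster_sampling_counts([1_000_000, 5_000, 500, 50]))
```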
### Optimization and hardware

- pre-training at 112x112 (see the resolution snippet after this list)
- fine-tuning at 224x224
- mixed precision: amp_bf16
- hardware: 2 nodes x 4 H100 GPUs (8 GPUs total)
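OpenCLIP exposes the input resolution through the `force_image_size` argument of `create_model_and_transforms`; the snippet below only shows how the 112x112 pre-training resolution maps onto that option. The actual training runs are launched through the repository's scripts, which are not reproduced here.

```python
import open_clip

# Build the ViT-B/16 model and matching transforms at the 112x112
# pre-training resolution instead of the default 224x224.
model_112, _, preprocess_112 = open_clip.create_model_and_transforms(
    "ViT-B-16",
    force_image_size=112,
)
```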
### Run variants in this repo

- 1.28B@112 + 128M@224: lower-cost DynamiCS checkpoint
- 2.56B@112 + 128M@224: longer-training DynamiCS checkpoint
## Evaluation

The primary reported metrics for these checkpoints are zero-shot top-1 classification accuracy on the following benchmarks (a sketch of the protocol appears after the list):
- ImageNet-1K
- Let It Wag! (a long-tail classification benchmark)
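Zero-shot top-1 accuracy follows the usual CLIP recipe: class names are turned into text prompts, the prompt embeddings act as classifier weights, and each image is assigned to the highest-scoring class. Below is a minimal sketch reusing `model`, `preprocess`, and `tokenizer` from the loading snippet; the class names and the single prompt template are placeholders, not the templates behind the reported numbers.

```python
import torch
from PIL import Image

# Placeholder class names and a single template; the reported accuracies use
# the standard OpenCLIP ImageNet prompt templates and the full label set.
class_names = ["tench", "goldfish", "great white shark"]
text = tokenizer([f"a photo of a {name}" for name in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    text_features = model.encode_text(text)
    image_features = model.encode_image(image)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    pred = (image_features @ text_features.T).argmax(dim=-1)

print(class_names[pred.item()])
```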
### Reported results

| Checkpoint | ImageNet-1K (top-1) | Let It Wag! (top-1) |
|---|---|---|
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | 71.3 | 50.2 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | 72.6 | 52.0 |
These results are taken from the project repository and accompanying paper draft.
## License
The underlying code repository is released under the MIT License. Model users are responsible for ensuring that their use and any redistribution of checkpoints comply with the terms, restrictions, and policies associated with the underlying training data and their deployment context.
## Citation

```bibtex
@article{liang2026dynamics,
  title={Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training},
  author={Mingliang Liang and Zhuoran Liu and Arjen P. de Vries and Martha Larson},
  journal={arXiv preprint arXiv:2604.27932},
  year={2026}
}
```