---
license: mit
library_name: open_clip
pipeline_tag: zero-shot-image-classification
tags:
- open_clip
- clip
- vision-language-model
- zero-shot-image-classification
- image-text-retrieval
- research
- long-tail
- datacomp
---

# DynamiCS ViT-B-16 on DataComp-DFN

## Model Details

This repository hosts two OpenCLIP-compatible PyTorch checkpoints for **DynamiCS**, a dynamic cluster-based data sampling method for efficient and long-tail-aware vision-language pre-training.

The checkpoints correspond to the `DataComp-DFN (130M)` results reported in the DynamiCS project repository and paper draft, using a **ViT-B/16** image encoder and the OpenCLIP text tower.

### Available checkpoints

| File | Samples Seen @ Resolution | Tokens | ImageNet-1K (top-1 %) | Let It Wag! (top-1 %) | GPU-hours |
| --- | --- | ---: | ---: | ---: | ---: |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | `1.28B@112 + 128M@224` | 81 | 71.3 | 50.2 | 163 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | `2.56B@112 + 128M@224` | 81 | 72.6 | 52.0 | 299 |

### Model sources

- Code: `https://github.com/MingliangLiang3/DynamiCS`
- Implementation base: `https://github.com/mlfoundations/open_clip`
- Paper title: `Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training`

## Intended Uses

These checkpoints are intended for:

- research on efficient vision-language model pre-training
- research on long-tail-aware data sampling and semantic balancing
- zero-shot image classification experiments
- image and text embedding extraction within the OpenCLIP framework
- benchmarking on long-tail evaluation datasets such as Let It Wag!

### Out-of-scope use

These checkpoints are not intended for:

- safety-critical or high-risk decision making
- surveillance or biometric identification
- medical, legal, or financial decisions
- production use without additional evaluation, monitoring, and risk assessment

## How to Use

These files are stored as **training checkpoints** (`epoch_1.pt` after fine-tuning), not as Hub-native exported `open_clip_pytorch_model.bin` weights. They can be loaded with the DynamiCS/OpenCLIP codebase using `open_clip.load_checkpoint`, which extracts the `state_dict` from the checkpoint automatically when needed.

```python
import open_clip

# Build an uninitialized ViT-B/16 CLIP model and its preprocessing transforms.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16')

# Load the DynamiCS weights; open_clip extracts the state_dict from the
# training checkpoint automatically.
open_clip.load_checkpoint(model, '/path/to/DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt')

tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()
```
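
Once loaded, the model scores image-text similarity in the usual OpenCLIP way. The snippet below is a usage sketch continuing from the variables above; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image

# Placeholder inputs; substitute your own image and candidate captions.
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
texts = tokenizer(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # L2-normalize so cosine similarity reduces to a dot product.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-caption probabilities for the image
```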

## Training Data

The checkpoints were trained on a **DataComp-DFN** subset derived from DataComp-Large and filtered with DFN-2B. In the project paper, the accessible subset is described as approximately **130M** image-text pairs after accounting for unavailable or expired URLs.

DynamiCS computes per-sample sampling probabilities from semantic image clusters built with the following pipeline (a minimal clustering sketch follows the list):

- DINOv2 ViT-B/16 image embeddings
- FAISS spherical k-means clustering
- post-clustering centroid refinement
- dynamic per-epoch cluster-based sampling
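
The exact clustering code lives in the DynamiCS repository; the following is a minimal sketch of the spherical k-means step only, assuming the DINOv2 embeddings have already been computed offline and saved to a hypothetical `dinov2_vitb16_embeddings.npy` file.

```python
import faiss
import numpy as np

# Hypothetical file of precomputed (N, d) float32 DINOv2 image embeddings.
embeddings = np.load('dinov2_vitb16_embeddings.npy').astype(np.float32)
faiss.normalize_L2(embeddings)  # spherical k-means operates on the unit sphere

k = 50_000  # cluster count used by DynamiCS
kmeans = faiss.Kmeans(embeddings.shape[1], k, niter=20, spherical=True, verbose=True)
kmeans.train(embeddings)

# Assign every sample to its nearest centroid and tally cluster sizes.
_, assignments = kmeans.index.search(embeddings, 1)
cluster_ids = assignments.ravel()
cluster_sizes = np.bincount(cluster_ids, minlength=k)
```

Centroid refinement and the per-epoch sampling step (sketched under Training Procedure below) then operate on `cluster_ids` and `cluster_sizes`.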

The exact web-scale training shards are not redistributed in this repository.

## Training Procedure

The training pipeline is based on OpenCLIP and the DynamiCS extensions in the GitHub repository.

### Core DynamiCS settings

- cluster count: `50k`
- centroid merge threshold: `0.70`
- cluster-scaling exponent: `alpha = 0.2`
- target sampling budget: `50%` of the accessible dataset per epoch
- image encoder: `ViT-B/16`
- maximum text length: `32`
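
The exact weighting formula for the settings above is defined in the DynamiCS code; the sketch below assumes the common `size**alpha` cluster scaling, under which `alpha = 0.2 < 1` flattens the cluster-size distribution so tail clusters contribute proportionally more samples per epoch than head clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

def epoch_sample(cluster_ids, cluster_sizes, alpha=0.2, budget=0.5):
    """Draw one epoch's subset; illustrative, assumes weights scale as size**alpha."""
    sizes = np.maximum(cluster_sizes, 1).astype(np.float64)
    cluster_weight = sizes ** alpha
    # Spread each cluster's weight over its members: a sample in a small
    # cluster ends up with a higher probability than one in a huge cluster.
    per_sample = cluster_weight[cluster_ids] / sizes[cluster_ids]
    per_sample /= per_sample.sum()
    n_draw = int(budget * len(cluster_ids))  # e.g. 50% of the dataset
    return rng.choice(len(cluster_ids), size=n_draw, replace=False, p=per_sample)
```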

### Optimization and hardware

- pre-training at `112x112`
- fine-tuning at `224x224`
- mixed precision: `amp_bf16`
- hardware: `2 nodes x 4 H100 GPUs` (8 GPUs total)
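
In OpenCLIP, the two-stage resolution schedule above can be expressed with `force_image_size`, which builds the ViT for a non-default input size; the sketch below shows the idea, with checkpoint resumption details left to the DynamiCS repository.

```python
import open_clip

# Stage 1: pre-training at 112x112 (a 7x7 patch grid for ViT-B/16, so far
# fewer image tokens and much cheaper attention than at 224x224).
model_112, preprocess_train, _ = open_clip.create_model_and_transforms(
    'ViT-B-16', force_image_size=112)

# Stage 2: fine-tuning at the native 224x224 resolution. Loading the stage-1
# weights via open_clip.load_checkpoint resizes the positional embeddings
# when the patch grid changes.
model_224, _, preprocess_val = open_clip.create_model_and_transforms('ViT-B-16')
```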

### Run variants in this repo

- `1.28B@112 + 128M@224`: lower-cost DynamiCS checkpoint
- `2.56B@112 + 128M@224`: longer-training DynamiCS checkpoint

## Evaluation

The primary reported metric for these checkpoints is zero-shot top-1 classification accuracy on the following benchmarks (an evaluation sketch follows the list):

- **ImageNet-1K**
- **Let It Wag!** (a long-tail classification benchmark)
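
The sketch below shows how zero-shot top-1 accuracy is typically computed in the OpenCLIP style, continuing from the loaded `model` and `tokenizer` above; `classnames`, the prompt template, and `loader` are placeholders for the benchmark's labels and a batched image `DataLoader`.

```python
import torch

classnames = [...]  # placeholder: e.g. the 1000 ImageNet-1K class names
loader = ...        # placeholder: DataLoader yielding (preprocessed images, labels)

# Build the zero-shot classifier from one prompt per class.
prompts = tokenizer([f'a photo of a {c}' for c in classnames])
with torch.no_grad():
    classifier = model.encode_text(prompts)
    classifier = classifier / classifier.norm(dim=-1, keepdim=True)

    correct = total = 0
    for images, labels in loader:
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        pred = (feats @ classifier.T).argmax(dim=-1)
        correct += (pred == labels).sum().item()
        total += labels.numel()

print(f'zero-shot top-1: {100 * correct / total:.1f}%')
```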

### Reported results

| Checkpoint | ImageNet-1K (top-1 %) | Let It Wag! (top-1 %) |
| --- | ---: | ---: |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-1.28B.pt` | 71.3 | 50.2 |
| `DynamiCS-ViT-B-16-DataComp-DFN-130M-2.56B.pt` | 72.6 | 52.0 |

These results are taken from the project repository and accompanying paper draft.

## Limitations and Biases

Like other CLIP-style models trained on large-scale web data, these checkpoints may:

- reflect social, geographic, cultural, and language biases present in web-scale image-text corpora
- underperform on domains that differ substantially from the training distribution
- produce incorrect or overconfident predictions for rare, ambiguous, or sensitive concepts
- show improved long-tail benchmark performance without guaranteed fairness or robustness across all subpopulations

Users should evaluate the checkpoints carefully on their own tasks before any downstream use.

## License

The underlying code repository is released under the MIT License. Users are responsible for ensuring that their use and any redistribution of the checkpoints comply with the terms, restrictions, and policies associated with the underlying training data and their deployment context.

## Citation

If you use these checkpoints, please cite the DynamiCS project and OpenCLIP. Bibliographic metadata for the DynamiCS paper can be added here once the final publication details are available.