---
license: cc
datasets:
- sshao0516/CrowdHuman
- aveocr/Market-1501-v15.09.15.zip
language:
- en
base_model:
- lakeAGI/PersonViT
- Ultralytics/YOLO26
tags:
- person_search
- PRW
- Transformer
- PersonViT
- YOLO26
- ablation
---

# Detector Training, file: Detector_Training.ipynb

## Overview

This notebook presents a structured ablation study on the transferability of **YOLO26** to pedestrian detection on the **PRW** (Person Re-identification in the Wild) dataset.
It constitutes the **detection stage** of a two-stage person search pipeline, where high recall is a primary design objective since any missed detection propagates as an unrecoverable error to the downstream Re-ID module.

The study evaluates the impact of:
- CrowdHuman intermediate pre-training vs. direct COCO initialisation
- Full fine-tuning vs. partial fine-tuning (frozen backbone layers 0-9)
- Model scale: YOLO26-Small vs. YOLO26-Large
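
Partial fine-tuning of this kind typically maps to the Ultralytics `freeze` train argument, which freezes the first N layers. A minimal sketch of the two strategies' training arguments; everything except `freeze` is an illustrative placeholder, not the notebook's actual settings:

```python
# Sketch: train-argument dicts for the two fine-tuning strategies.
# `freeze=10` leaves backbone layers 0-9 frozen; the data path, epochs
# and image size below are placeholders, not taken from the notebook.
full_ft_args = dict(data="prw.yaml", epochs=50, imgsz=640)
partial_ft_args = dict(full_ft_args, freeze=10)  # backbone layers 0-9 frozen

# Usage (not executed here): YOLO("yolo26s.pt").train(**partial_ft_args)
```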

## Requirements

### Platform

The notebook was developed and executed on **Kaggle** with a **dual NVIDIA T4 GPU** configuration (16 GB VRAM each). Multi-GPU training is enabled by default via `cfg.use_multi_gpu = True`.
To run on a single GPU or CPU, set this flag to `False` in the `Config` class.
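
The flag resolves to a device selection at train time; a minimal sketch of that logic, assuming an Ultralytics-style `device` argument (the real `Config` class holds many more fields):

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Only the flag relevant here; the notebook's Config also holds
    # dataset paths and hyperparameters.
    use_multi_gpu: bool = True

def resolve_device(cfg: Config):
    """Map the multi-GPU flag to an Ultralytics-style `device` argument."""
    return [0, 1] if cfg.use_multi_gpu else 0  # or "cpu" on CPU-only hosts

print(resolve_device(Config()))                     # [0, 1] -> both T4s
print(resolve_device(Config(use_multi_gpu=False)))  # 0 -> first device only
```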

### Dependencies

All required packages are installed in the first cell:
- ultralytics, opencv-python-headless, scipy, pandas, tqdm

### Input Datasets

Before running the notebook, add the following two datasets as input sources in the Kaggle session (Notebook -> Input -> Add Input):

| Dataset | Kaggle URL | Version |
|---|---|---|
| CrowdHuman | https://www.kaggle.com/datasets/leducnhuan/crowdhuman | 1 |
| PRW -- Person Re-identification in the Wild | https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild | 1 |

The dataset paths are pre-configured in the `Config` class and match the default Kaggle mount locations. No manual path editing is required.

## Notebook Structure

| Phase | Description |
|---|---|
| 0 - Baseline | YOLO26-Small COCO zero-shot eval on PRW |
| 1 - CrowdHuman Pre-training | Fine-tune Small on CrowdHuman, then eval on PRW |
| 2 - Strategy Comparison | Full FT vs. Partial FT (freeze backbone layers 0-9) |
| 3 - Scale-up | Best strategy applied to YOLO26-Large |
| Final | Cross-model comparison: metrics, params, GFLOPs, speed |

## Outputs

All results are saved under `/kaggle/working/yolo_ablation/`:
- `all_results.json` -- incremental results registry, persisted after each run
- `all_results.csv` -- final summary table sorted by mAP@0.5
- `plots/` -- all generated figures (bar charts, radar charts, training curves)

Model checkpoints are saved under `/kaggle/working/yolo_runs/{run_name}/weights/`.
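
The incremental registry can be sketched as a small merge-and-persist helper (function and key names are assumptions, not the notebook's exact code):

```python
import json
from pathlib import Path

def save_result(name: str, metrics: dict, path: Path) -> dict:
    """Merge one run's metrics into the JSON registry and persist it.

    Reads the existing registry first, so results survive session
    restarts and each run only appends its own entry.
    """
    registry = json.loads(path.read_text()) if path.exists() else {}
    registry[name] = metrics
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(registry, indent=2))
    return registry
```

In the notebook the path would be `/kaggle/working/yolo_ablation/all_results.json`; the CSV summary is then just this registry sorted by mAP@0.5.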

## Key Results

| Model | Strategy | mAP@0.5 (%) | Recall (%) | Params (M) | Latency (ms) |
|---|---|---|---|---|---|
| YOLO26-Large | Full FT | 96.24 | 91.38 | 26.2 | 27.88 |
| YOLO26-Small | Full FT | 94.96 | 89.32 | 9.9 | 7.72 |
| YOLO26-Small | Partial FT | 94.91 | 89.33 | 9.9 | 7.65 |
| YOLO26-Small | Zero-shot (CH) | 88.23 | 83.41 | 9.9 | 6.85 |
| YOLO26-Small | Zero-shot (COCO) | 85.82 | 79.42 | 10.0 | 6.29 |

The Large model achieves the highest recall (91.38%) at the cost of 3.6x higher latency.
The Small model with full or partial fine-tuning offers a competitive alternative for latency-constrained deployments, with recall above 89% at under 8 ms per image.

## Reproducibility

A global seed (42) is applied to Python, NumPy, PyTorch, and CUDA. All training cells are idempotent: if a valid checkpoint already exists, training is skipped and the existing weights are used.
Results can be fully regenerated from saved checkpoints without re-running any training phase.
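
Global seeding of this kind usually fits in one helper; a sketch (the PyTorch part is guarded so the snippet also runs where torch is absent):

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy and, when available, PyTorch including CUDA."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # the notebook always has torch; the guard keeps this sketch portable

seed_everything(42)
```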

## Estimated Runtime

Full end-to-end execution (dataset conversion, all training phases, evaluation, and plotting) takes approximately **8-10 hours** on a Kaggle dual T4 session, depending on dataset I/O speed and early stopping behaviour.

# Re-ID Training, file: ReIdentificator_Training.ipynb

## Overview

This notebook presents a structured ablation study on the fine-tuning of **PersonViT** for person re-identification on the **Person Re-identification in the Wild** (PRW) dataset.
It constitutes the **Re-ID stage** of a two-stage person search pipeline, where the input is a set of pedestrian crops produced by the upstream YOLO26 detector and the output is an L2-normalised embedding used for cosine-similarity gallery retrieval.
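
After L2 normalisation, cosine similarity reduces to a dot product, so gallery retrieval is one matrix product plus a sort. A NumPy sketch (shapes assumed: one query vector, an `(N, D)` gallery):

```python
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalisation of an (N, D) embedding matrix."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def rank_gallery(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Gallery indices sorted by descending cosine similarity to `query`."""
    sims = l2_normalise(gallery) @ l2_normalise(query[None, :])[0]
    return np.argsort(-sims)

# Toy example: gallery item 0 points the same way as the query.
query = np.array([1.0, 0.0])
gallery = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(rank_gallery(query, gallery))  # [0 1 2]
```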

The study follows a **small-first, scale-up** design: the full ablation over fine-tuning strategies and loss functions is conducted on the lightweight **ViT-Small** (22 M parameters) to minimise GPU time, and only the winning configuration is then replicated on **ViT-Base** (86 M parameters). This reduces total compute by approximately 3x compared to running the ablation directly on ViT-Base, while preserving full causal interpretability of each experimental variable.

The study evaluates the impact of:
- Source domain selection (Duke, Market-1501, MSMT17, Occluded-Duke) in zero-shot evaluation
- Fine-tuning strategy: Full FT vs. Partial FT vs. Freeze+Retrain
- Metric learning loss function: TripletMarginLoss vs. ArcFaceLoss vs. NTXentLoss
- Model scale: ViT-Small vs. ViT-Base
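
Of the three losses, TripletMarginLoss is the simplest to state directly. A NumPy sketch of a single triplet term (the margin value here is an assumption, not the notebook's setting):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Single-triplet margin loss with Euclidean distances: the positive
    must sit closer to the anchor than the negative by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(d_ap - d_an + margin, 0.0))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # same identity, close:    d_ap = 0.1
n = np.array([1.0, 0.0])  # different identity, far: d_an = 1.0
print(triplet_margin_loss(a, p, n))  # 0.0 -- this triplet is already satisfied
```

In the notebook the losses come from `pytorch-metric-learning`, which applies the same idea over batches of mined triplets.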

## Requirements

### Platform

The notebook was developed and executed on **Kaggle** with a single **NVIDIA T4 GPU** (16 GB VRAM). Mixed-precision training (fp16) is enabled by default via `cfg.use_amp = True`.

### Dependencies

All required packages are installed in the first cell:
- albumentations, opencv-python-headless, scipy
- torchmetrics, timm, einops, yacs
- pytorch-metric-learning
- thop

### Input Datasets and Models

Before running the notebook, add the following dataset and model sources as inputs in the Kaggle session (Notebook -> Input -> Add Input):

| Resource | Type | Kaggle URL | Version |
|---|---|---|---|
| PRW -- Person Re-identification in the Wild | Dataset | [https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild](https://www.kaggle.com/datasets/edoardomerli/prw-person-re-identification-in-the-wild) | 1 |
| PersonViT Pre-trained Weights | Model | [https://www.kaggle.com/models/simonerimondi/personvit](https://www.kaggle.com/models/simonerimondi/personvit) | 4 |

The dataset and model paths are pre-configured in the `Config` class and match the default Kaggle mount locations. No manual path editing is required.

The PersonViT model code is cloned at runtime from a personal GitHub fork ([github.com/simoswish02/PersonViT](https://github.com/simoswish02/PersonViT)) that fixes an import error present in the original upstream repository.

## Notebook Structure

| Phase | Description |
|---|---|
| 0 - Pretrained Baselines | Zero-shot evaluation of all four ViT-Small checkpoints on PRW |
| 1 - Strategy Comparison | Full FT vs. Partial FT vs. Freeze+Retrain, ArcFace loss fixed (ViT-Small) |
| 2 - Loss Comparison | TripletMarginLoss vs. ArcFaceLoss vs. NTXentLoss, best strategy fixed (ViT-Small) |
| 3 - Scale-Up | Winning configuration replicated on ViT-Base (embedding dim 768) |
| Final | Cross-model comparison: metrics, params, GFLOPs, latency |

## Outputs

All results are saved under `./evaluation_results/`:
- `all_results.json` -- incremental results registry, persisted after each run
- `all_results.csv` -- final summary table sorted by mAP
- `plots/` -- all generated figures (bar charts, radar charts, training curves, Small vs. Base delta table)

Model checkpoints are saved under `/kaggle/working/personvit_finetuning/`.

## Key Results

| Run | Strategy | Loss | mAP (%) | Rank-1 (%) | Params (M) | Latency (ms) |
|---|---|---|---|---|---|---|
| vit_base_full_triplet | Full FT | Triplet | 85.65 | 94.51 | 86.5 | 11.80 |
| full_triplet | Full FT | Triplet | 81.50 | 93.44 | 22.0 | 7.02 |
| full_ntxent | Full FT | NTXent | 80.77 | 93.15 | 22.0 | 7.47 |
| full_arcface | Full FT | ArcFace | 78.10 | 93.39 | 22.0 | 7.34 |
| freeze_arcface | Freeze+Retrain | ArcFace | 75.64 | 93.19 | 22.0 | 7.32 |
| partial_arcface | Partial FT | ArcFace | 75.62 | 93.00 | 22.0 | 7.56 |
| market1501 (zero-shot) | Pretrained | -- | 75.26 | 92.90 | 21.6 | 7.62 |

ViT-Base with Full FT and TripletMarginLoss achieves the best mAP (85.65%) at the cost of 1.68x higher latency compared to the best ViT-Small run. ViT-Small with Full FT and TripletMarginLoss is the recommended alternative for throughput-constrained deployments, with mAP of 81.50% at 7.02 ms per image.

## Reproducibility

A global seed (42) is applied to Python, NumPy, PyTorch, and CUDA. All training cells are idempotent: if a run key is already present in `RESULTS`, training is skipped and the existing entry is used directly. Results can be fully restored from `all_results.json` after a kernel restart without re-running any training phase.
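
The run-key check can be sketched as follows (names mirror the description; the training callable is a placeholder):

```python
RESULTS: dict = {}  # in the notebook this is restored from all_results.json

def run_experiment(key: str, train_fn):
    """Execute `train_fn` only when `key` has no cached entry (idempotent)."""
    if key in RESULTS:
        print(f"[skip] {key}: reusing cached entry")
        return RESULTS[key]
    RESULTS[key] = train_fn()
    return RESULTS[key]

calls = []
run_experiment("full_triplet", lambda: calls.append(1) or {"mAP": 81.50})
metrics = run_experiment("full_triplet", lambda: calls.append(1) or {"mAP": 0.0})
print(len(calls), metrics)  # trained once; the repeat call hit the cache
```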

## Estimated Runtime

Full end-to-end execution (all four phases, evaluation, and plotting) takes approximately **15-20 hours** on a Kaggle single T4 session, depending on dataset I/O speed and Kaggle session availability.

# References

[1] Zheng, L., Zhang, H., Sun, S., Chandraker, M., Yang, Y., and Tian, Q.
"Person Re-identification in the Wild."
*IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2017.
[https://arxiv.org/abs/1604.02531](https://arxiv.org/abs/1604.02531)

[2] Hu, B., Wang, X., and Liu, W.
"PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification."
*Machine Vision and Applications*, 2025. DOI: 10.1007/s00138-025-01659-y
[https://arxiv.org/abs/2408.05398](https://arxiv.org/abs/2408.05398)

[3] hustvl.
"PersonViT -- Official GitHub Repository."
[https://github.com/hustvl/PersonViT](https://github.com/hustvl/PersonViT)

[4] lakeAGI.
"PersonViT Pre-trained Weights (ViT-Base and ViT-Small)."
*Hugging Face Model Hub*, 2024.
[https://huggingface.co/lakeAGI/PersonViTReID/tree/main](https://huggingface.co/lakeAGI/PersonViTReID/tree/main)

[5] He, S., Luo, H., Wang, P., Wang, F., Li, H., and Jiang, W.
"TransReID: Transformer-based Object Re-Identification."
*IEEE International Conference on Computer Vision (ICCV)*, 2021.
[https://arxiv.org/abs/2102.04378](https://arxiv.org/abs/2102.04378)

[6] Musgrave, K., Belongie, S., and Lim, S.-N.
"PyTorch Metric Learning."
*arXiv preprint arXiv:2008.09164*, 2020.
[https://arxiv.org/abs/2008.09164](https://arxiv.org/abs/2008.09164)