Elliot89 committed
Commit 25589b2 · verified · 1 Parent(s): 9b58add

Upload README.md

Files changed (1):
  README.md +185 -41
README.md CHANGED
@@ -10,44 +10,188 @@ pinned: false
 license: mit
 ---
 
- # Universal Cross-Domain Vision Model
-
- A BiomedCLIP-powered vision model that classifies images across **medical** and **sports** domains using multi-modal attention fusion.
-
- ## How to deploy to Hugging Face Spaces
-
- 1. Create a new Space at https://huggingface.co/new-space
-    - SDK: **Gradio**
-    - Visibility: Public or Private
-
- 2. Upload these files to the Space repository:
-    ```
-    app.py
-    requirements.txt
-    README_HF_SPACES.md   ← rename this to README.md in the Space
-    ```
-
- 3. Upload your checkpoint:
-    ```
-    universal_vision_checkpoints/best_model_phase1.pt
-    ```
-    > For large files (>1 GB) use Git LFS:
-    > ```bash
-    > git lfs install
-    > git lfs track "*.pt"
-    > git add .gitattributes
-    > ```
-
- 4. Set the environment variable in Space Settings → Variables:
-    ```
-    CHECKPOINT_PATH = universal_vision_checkpoints/best_model_phase1.pt
-    ```
-
- 5. The Space will build automatically. First build takes ~5 minutes.
-
- ## Classes
-
- | Domain | Classes |
- |----------|---------|
- | Medical | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
- | Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
+ # 🏥🎾 Universal Cross-Domain Vision Model
+
+ A multi-backbone vision model that classifies images across the **medical X-ray pathology** and **sports action** domains, using fine-tuned multi-modal attention fusion on top of four pretrained encoders.
+
+ [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ ---
+
+ ## 🧠 Model Architecture
+
+ The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
+
+ | Backbone | Source | Output Dim |
+ |---|---|---|
+ | BiomedCLIP ViT-B/16 | `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` | 512 |
+ | ViT-B/16 | `timm` (ImageNet pretrained) | 512 |
+ | ResNet-50 | `timm` (ImageNet pretrained) | 512 |
+ | EfficientNet-B0 | `timm` (ImageNet pretrained) | 512 |
+
+ Each backbone's features are projected into a shared 512-dim space, then fused by an 8-head attention transformer block. The final classifier head outputs probabilities over the 14 classes together with an uncertainty estimate.
+
+ ```
+ Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
+       → Projection Adapters (per backbone)
+       → 8-Head Attention Fusion
+       → Classifier → 14 classes + Uncertainty estimate
+ ```
+
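As a rough PyTorch sketch of the fusion path described above (the module names, mean-pooling step, and sigmoid uncertainty head are assumptions for illustration; the actual implementation in `app.py` may differ):

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Illustrative sketch: project 4 backbone feature vectors into a shared
    512-dim space, fuse them with 8-head self-attention, then classify."""

    def __init__(self, backbone_dims=(512, 512, 512, 512), dim=512, heads=8, n_classes=14):
        super().__init__()
        # One projection adapter per backbone into the shared space
        self.adapters = nn.ModuleList(nn.Linear(d, dim) for d in backbone_dims)
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)
        self.uncertainty = nn.Linear(dim, 1)

    def forward(self, feats):
        # feats: list of (B, d_i) feature tensors, one per backbone
        tokens = torch.stack([a(f) for a, f in zip(self.adapters, feats)], dim=1)  # (B, 4, 512)
        fused, _ = self.fusion(tokens, tokens, tokens)  # self-attention across backbones
        pooled = fused.mean(dim=1)                      # (B, 512)
        return self.classifier(pooled), torch.sigmoid(self.uncertainty(pooled))

head = AttentionFusionHead()
feats = [torch.randn(2, 512) for _ in range(4)]
logits, unc = head(feats)
print(logits.shape, unc.shape)  # torch.Size([2, 14]) torch.Size([2, 1])
```

The uncertainty output here is simply a sigmoid-squashed linear head; how the real checkpoint estimates uncertainty is not documented in this README.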
42
+ ---
+
+ ## 🏷️ Classes
+
+ | Domain | Classes |
+ |---|---|
+ | 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
+ | 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
+
+ ---
+
+ ## 🚀 Running the Demo
+
+ ### Option 1 – Hugging Face Spaces (live)
+
+ Visit the live demo – no setup needed:
+
+ 👉 **https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model**
+
+ Upload any image and click **Classify**.
+
+ ### Option 2 – Run locally
+
+ **Requirements:** Python 3.9+, ~4 GB RAM (CPU); GPU recommended
+
+ ```bash
+ # 1. Clone this repo
+ git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
+ cd Universal_Cross-Domain_Vision_Model
+
+ # 2. Install dependencies
+ pip install -r requirements.txt
+
+ # 3. Launch
+ python app.py
+ # Opens at http://localhost:7860
+ ```
+
+ ### Option 3 – REST API
+
+ ```bash
+ # Start the API server
+ uvicorn api:app --host 0.0.0.0 --port 8000
+
+ # Classify an image file
+ curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"
+
+ # Classify from a URL
+ curl -X POST http://localhost:8000/predict/url \
+   -H "Content-Type: application/json" \
+   -d '{"url": "https://example.com/xray.jpg"}'
+ ```
+
+ Interactive API docs at **http://localhost:8000/docs**
+
+ ### Option 4 – Google Colab
+
+ Open `colab_deploy.ipynb` in Colab, set the runtime to **T4 GPU**, and run all cells.
+
+ ---
+
103
+ ## 📦 Repository Structure
+
+ ```
+ ├── app.py               # Gradio web demo (main entry point)
+ ├── api.py               # FastAPI REST inference server
+ ├── requirements.txt     # Python dependencies
+ ├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
+ ├── extract_head.py      # Utility: extract head weights from full checkpoint
+ ├── colab_deploy.ipynb   # One-click Google Colab notebook
+ └── README.md            # This file
+ ```
+
+ > **Note on weights:** The four backbone encoders (~1 GB total) are downloaded
+ > automatically from the Hugging Face Hub at first startup and cached. Only the
+ > fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
+
+ ---
+
+ ## 🔧 Training Details
+
+ | Setting | Value |
+ |---|---|
+ | Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image–text pairs |
+ | Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via `timm`) |
+ | Medical data | Synthesized X-ray images across 8 pathology classes |
+ | Sports data | Stanford40 action recognition dataset |
+ | Fusion | 8-head multi-head attention, 512-dim embedding space |
+ | Optimizer | AdamW with cosine-annealing LR schedule |
+ | Regularization | Dropout (0.2), domain-adversarial training |
+
+ ---
+
135
+ ## 📋 API Response Format
+
+ ```json
+ {
+   "top_prediction": {
+     "label": "Pneumonia",
+     "confidence": 0.412
+   },
+   "predictions": [
+     { "label": "Pneumonia",        "confidence": 0.412 },
+     { "label": "Normal",           "confidence": 0.238 },
+     { "label": "COVID-19",         "confidence": 0.134 },
+     { "label": "Tuberculosis",     "confidence": 0.089 },
+     { "label": "Cardiomegaly",     "confidence": 0.061 },
+     { "label": "Running",          "confidence": 0.044 },
+     { "label": "Lung Mass",        "confidence": 0.031 },
+     { "label": "Pleural Effusion", "confidence": 0.021 }
+   ]
+ }
+ ```
+
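A client can pull the top prediction out of this payload with plain `json` handling; a minimal sketch using the field names documented above, on an abbreviated sample response:

```python
import json

# Abbreviated sample payload in the documented response format
payload = json.loads("""
{
  "top_prediction": {"label": "Pneumonia", "confidence": 0.412},
  "predictions": [
    {"label": "Pneumonia", "confidence": 0.412},
    {"label": "Normal", "confidence": 0.238}
  ]
}
""")

top = payload["top_prediction"]
print(f'{top["label"]}: {top["confidence"]:.1%}')  # Pneumonia: 41.2%
```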
156
+ ---
+
+ ## ⚙️ Environment Variables
+
+ | Variable | Default | Description |
+ |---|---|---|
+ | `PORT` | `7860` (Gradio) / `8000` (API) | Server port |
+
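Resolving `PORT` with a fallback is a one-liner in Python; a sketch of the usual pattern (the helper name is illustrative, and how `app.py`/`api.py` actually read the variable may differ):

```python
import os

def resolve_port(default: int = 7860) -> int:
    """Read the server port from the PORT env var, falling back to a default."""
    return int(os.environ.get("PORT", default))
```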
164
+ ---
+
+ ## 🛠️ Troubleshooting
+
+ **Slow first startup** – The four backbones (~1 GB total) are downloaded from the HF Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
+
+ **`head_weights.pt` not found** – The app still runs but uses random weights for the fusion and classifier layers, so predictions will not reflect the actual training. Upload `head_weights.pt` to the repo to enable real predictions.
+
+ **Out of memory** – The model runs on CPU if no GPU is detected. If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
+
+ **Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:
+
+ ```bash
+ python extract_head.py
+ ```
+
+ This strips the large backbone weights (which are loaded from the HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
+
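The core of that extraction step amounts to filtering the checkpoint's state dict by key; a rough sketch on a toy dictionary (the `backbones.` key prefix and function name are assumptions, not taken from `extract_head.py`):

```python
def strip_backbone_weights(state_dict, backbone_prefix="backbones."):
    """Keep only the fine-tuned fusion/classifier ('head') entries,
    dropping everything under the assumed backbone key prefix."""
    return {k: v for k, v in state_dict.items() if not k.startswith(backbone_prefix)}

# Toy checkpoint: two backbone tenskeys plus two fine-tuned head keys
checkpoint = {
    "backbones.vit.blocks.0.attn.qkv.weight": "...",
    "backbones.resnet.layer1.0.conv1.weight": "...",
    "fusion.attn.in_proj_weight": "...",
    "classifier.weight": "...",
}
head = strip_backbone_weights(checkpoint)
print(sorted(head))  # ['classifier.weight', 'fusion.attn.in_proj_weight']
```

The real script presumably loads `best_model_phase1.pt` with `torch.load` and saves the filtered dict with `torch.save`; only the filtering logic is sketched here.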
182
+ ---
+
+ ## 📄 License
+
+ MIT – see [https://opensource.org/licenses/MIT](https://opensource.org/licenses/MIT)
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ - [Microsoft BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224) – vision-language model pretrained on 15M medical image–text pairs from PubMed Central
+ - [Stanford40](http://vision.stanford.edu/Datasets/40actions.html) – sports and human action recognition dataset
+ - [timm](https://github.com/huggingface/pytorch-image-models) – PyTorch Image Models library
+ - [open_clip](https://github.com/mlfoundations/open_clip) – open-source CLIP implementation
+ - [Gradio](https://gradio.app) – web demo framework
+ - [FastAPI](https://fastapi.tiangolo.com) – REST API framework