---
title: Universal Cross-Domain Vision Model
emoji: 🏥🎾
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: mit
---
# 🏥🎾 Universal Cross-Domain Vision Model
A multi-backbone vision model that classifies images across the **medical X-ray pathology** and **sports action** domains, using multi-head attention fusion fine-tuned on top of four pretrained encoders.
[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model)
[License: MIT](https://opensource.org/licenses/MIT)
---
## 🧠 Model Architecture
The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
| Backbone | Source | Projected Dim |
|---|---|---|
| BiomedCLIP ViT-B/16 | `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` | 512 |
| ViT-B/16 | `timm` (ImageNet pretrained) | 512 |
| ResNet-50 | `timm` (ImageNet pretrained) | 512 |
| EfficientNet-B0 | `timm` (ImageNet pretrained) | 512 |
Each backbone's features are projected to a shared 512-dim space, then fused via an 8-head attention transformer block. The final classifier head outputs 14 class probabilities with an uncertainty estimate.
```
Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
            ↓ Projection Adapters (per backbone)
            ↓ 8-Head Attention Fusion
            ↓ Classifier → 14 classes + Uncertainty estimate
```
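In PyTorch terms, the fusion stage looks roughly like the sketch below. This is illustrative, not the repo's actual code: the layer names, the native backbone output dimensions (512 for BiomedCLIP, 768 for ViT-B/16, 2048 for ResNet-50, 1280 for EfficientNet-B0), and the scalar uncertainty head are assumptions.
```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative sketch of the multi-backbone attention fusion head."""

    def __init__(self, backbone_dims=(512, 768, 2048, 1280),
                 embed_dim=512, num_heads=8, num_classes=14, dropout=0.2):
        super().__init__()
        # One projection adapter per backbone into the shared 512-dim space.
        self.adapters = nn.ModuleList(
            [nn.Linear(d, embed_dim) for d in backbone_dims])
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.uncertainty = nn.Linear(embed_dim, 1)  # assumed scalar estimate

    def forward(self, feats):
        # feats: list of four tensors, each (batch, backbone_dim)
        tokens = torch.stack(
            [adapt(f) for adapt, f in zip(self.adapters, feats)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # (batch, 4, embed_dim)
        pooled = self.norm(fused).mean(dim=1)         # average over backbones
        return self.classifier(pooled), torch.sigmoid(self.uncertainty(pooled))
```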
---
## 🏷️ Classes
| Domain | Classes |
|---|---|
| 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
| 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
---
## 🚀 Running the Demo
### Option 1 – Hugging Face Spaces (live)
Visit the live demo – no setup needed:
👉 **https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model**
Upload any image and click **Classify**.
### Option 2 – Run locally
**Requirements:** Python 3.9+ and ~4 GB RAM; runs on CPU, but a GPU is recommended.
```bash
# 1. Clone this repo
git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
cd Universal_Cross-Domain_Vision_Model
# 2. Install dependencies
pip install -r requirements.txt
# 3. Launch
python app.py
# Opens at http://localhost:7860
```
### Option 3 – REST API
```bash
# Start the API server
uvicorn api:app --host 0.0.0.0 --port 8000
# Classify an image file
curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"
# Classify from URL
curl -X POST http://localhost:8000/predict/url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/xray.jpg"}'
```
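The same endpoints can also be called from Python with `requests`; this snippet assumes a server running locally and the response schema shown under **API Response Format** below:
```python
import requests

API = "http://localhost:8000"

# Multipart file upload: mirrors the first curl call above
with open("your_image.jpg", "rb") as f:
    resp = requests.post(f"{API}/predict", files={"file": f})
resp.raise_for_status()
top = resp.json()["top_prediction"]
print(f"{top['label']}: {top['confidence']:.3f}")

# JSON body with an image URL: mirrors the second curl call
resp = requests.post(f"{API}/predict/url",
                     json={"url": "https://example.com/xray.jpg"})
print(resp.json()["top_prediction"])
```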
Interactive API docs at **http://localhost:8000/docs**
### Option 4 – Google Colab
Open `colab_deploy.ipynb` in Colab, set runtime to **T4 GPU**, and run all cells.
---
## 📦 Repository Structure
```
├── app.py               # Gradio web demo (main entry point)
├── api.py               # FastAPI REST inference server
├── requirements.txt     # Python dependencies
├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
├── extract_head.py      # Utility: extract head weights from full checkpoint
├── colab_deploy.ipynb   # One-click Google Colab notebook
└── README.md            # This file
```
> **Note on weights:** The four backbone encoders (~1 GB total) are downloaded
> automatically from Hugging Face Hub at first startup and cached. Only the
> fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
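At startup the app can therefore assemble the full model and overlay only the head tensors. A minimal sketch of that pattern, where `build_model()` is a hypothetical helper that wires up the four backbones plus the fusion head:
```python
import torch

model = build_model()  # hypothetical: four HF-Hub backbones + fusion head

# head_weights.pt deliberately omits backbone keys, so load non-strictly.
head_state = torch.load("head_weights.pt", map_location="cpu")
missing, unexpected = model.load_state_dict(head_state, strict=False)
# `missing` lists the backbone keys left at their pretrained values.
```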
---
## 🧠 Training Details
| Setting | Value |
|---|---|
| Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image-text pairs |
| Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via timm) |
| Medical data | Synthesized X-ray images across 8 pathology classes |
| Sports data | Stanford40 action recognition dataset |
| Fusion | 8-head multi-head attention, 512-dim embedding space |
| Optimizer | AdamW with cosine annealing LR schedule |
| Regularization | Dropout (0.2), domain adversarial training |
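The optimizer row translates to standard PyTorch as sketched below; the learning rate, weight decay, and epoch count are placeholders rather than published values, and the domain-adversarial term is omitted for brevity:
```python
import torch
import torch.nn.functional as F

NUM_EPOCHS = 30  # placeholder; the actual schedule length is not stated here

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

for epoch in range(NUM_EPOCHS):
    for images, labels in train_loader:       # assumed loader over both domains
        logits, _uncertainty = model(images)  # fusion model as sketched above
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # cosine annealing stepped per epoch
```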
---
## 📋 API Response Format
```json
{
"top_prediction": {
"label": "Pneumonia",
"confidence": 0.412
},
"predictions": [
{ "label": "Pneumonia", "confidence": 0.412 },
{ "label": "Normal", "confidence": 0.238 },
{ "label": "COVID-19", "confidence": 0.134 },
{ "label": "Tuberculosis", "confidence": 0.089 },
{ "label": "Cardiomegaly", "confidence": 0.061 },
{ "label": "Running", "confidence": 0.044 },
{ "label": "Lung Mass", "confidence": 0.031 },
{ "label": "Pleural Effusion","confidence": 0.021 }
]
}
```
---
## ⚙️ Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | `7860` (Gradio) / `8000` (API) | Server port |
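For example, assuming `app.py` reads `PORT` as the table describes:
```bash
# Gradio demo on a custom port
PORT=8080 python app.py

# API server on a custom port (uvicorn's own flag)
uvicorn api:app --host 0.0.0.0 --port 9000
```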
---
## 🛠️ Troubleshooting
**Slow first startup** – The four backbones (~1 GB total) are downloaded from HF Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
**`head_weights.pt` not found** – The app still runs but uses random weights for the fusion and classifier layers, so predictions will not reflect actual training. Upload `head_weights.pt` to the repo to enable real predictions.
**Out of memory** – The model falls back to CPU if no GPU is detected (see the sketch below). If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
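A typical device-fallback pattern, shown as an illustrative sketch (the actual logic in `app.py` may differ; `model` and `inputs` are placeholders):
```python
import torch

# Prefer GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()  # placeholder: the assembled fusion model

with torch.inference_mode():     # disables autograd bookkeeping at inference
    logits, uncertainty = model(inputs.to(device))  # preprocessed input batch
```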
**Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:
```bash
python extract_head.py
```
This strips the large backbone weights (which are loaded from HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
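For reference, the core of that extraction can be sketched as follows; the backbone key prefixes are assumptions, so inspect the checkpoint for the real names:
```python
import torch

# Assumed key prefixes for backbone parameters; the real checkpoint may differ.
BACKBONE_PREFIXES = ("biomedclip.", "vit.", "resnet.", "effnet.")

full = torch.load("best_model_phase1.pt", map_location="cpu")
state = full.get("state_dict", full)  # tolerate raw or wrapped checkpoints

# Keep only fusion/classifier tensors; backbones reload from HF Hub at runtime.
head_only = {k: v for k, v in state.items()
             if not k.startswith(BACKBONE_PREFIXES)}
torch.save(head_only, "head_weights.pt")
```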
---
## 📄 License
MIT β see [https://opensource.org/licenses/MIT](https://opensource.org/licenses/MIT)
---
## 🙏 Acknowledgements
- [Microsoft BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224) – vision-language model pretrained on 15M medical image-text pairs from PubMed Central
- [Stanford40](http://vision.stanford.edu/Datasets/40actions.html) – sports and human action recognition dataset
- [timm](https://github.com/huggingface/pytorch-image-models) – PyTorch Image Models library
- [open_clip](https://github.com/mlfoundations/open_clip) – open-source CLIP implementation
- [Gradio](https://gradio.app) – web demo framework
- [FastAPI](https://fastapi.tiangolo.com) – REST API framework