---
title: Universal Cross-Domain Vision Model
emoji: 🏥🎾
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: mit
---
# 🏥🎾 Universal Cross-Domain Vision Model
A multi-backbone vision model that classifies images from two domains, medical X-ray pathologies and sports actions, using fine-tuned multi-head attention fusion on top of four pretrained encoders.
## 🧠 Model Architecture
The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
| Backbone | Source | Output Dim (after projection) |
|---|---|---|
| BiomedCLIP ViT-B/16 | microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 | 512 |
| ViT-B/16 | timm (ImageNet pretrained) | 512 |
| ResNet-50 | timm (ImageNet pretrained) | 512 |
| EfficientNet-B0 | timm (ImageNet pretrained) | 512 |
Each backbone's features are projected to a shared 512-dim space, then fused via an 8-head attention transformer block. The final classifier head outputs 14 class probabilities with an uncertainty estimate.
```
Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
      → Projection Adapters (per backbone)
      → 8-Head Attention Fusion
      → Classifier → 14 classes + uncertainty estimate
```
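The fusion head described above can be sketched in PyTorch. This is a hypothetical illustration, not the code in `app.py`: the class name, the raw backbone feature dimensions (shown here as typical values before projection), and the uncertainty head are all assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Illustrative sketch: per-backbone adapters -> 8-head attention -> classifier."""

    def __init__(self, backbone_dims=(512, 768, 2048, 1280), embed_dim=512,
                 num_heads=8, num_classes=14):
        super().__init__()
        # One projection adapter per backbone, mapping into the shared 512-dim space
        self.adapters = nn.ModuleList(nn.Linear(d, embed_dim) for d in backbone_dims)
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.uncertainty = nn.Linear(embed_dim, 1)  # scalar uncertainty estimate

    def forward(self, features):
        # features: list of 4 tensors, one per backbone, each of shape (B, dim_i)
        tokens = torch.stack(
            [adapter(f) for adapter, f in zip(self.adapters, features)], dim=1
        )  # (B, 4, embed_dim): one token per backbone
        fused, _ = self.fusion(tokens, tokens, tokens)  # self-attention over backbones
        pooled = fused.mean(dim=1)                      # average the 4 fused tokens
        return self.classifier(pooled), self.uncertainty(pooled)

head = AttentionFusionHead()
feats = [torch.randn(2, d) for d in (512, 768, 2048, 1280)]
logits, unc = head(feats)  # (2, 14) class logits and (2, 1) uncertainty
```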
## 🏷️ Classes
| Domain | Classes |
|---|---|
| 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
| 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
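The two domains form a flat 14-way class list for the classifier head. A minimal sketch, assuming the label strings match the table above (the exact strings in `app.py` may differ):

```python
# Hypothetical label lists mirroring the table above
MEDICAL = ["Normal", "Pneumonia", "COVID-19", "Tuberculosis",
           "Cardiomegaly", "Rib Fracture", "Lung Mass", "Pleural Effusion"]
SPORTS = ["Running", "Jumping", "Swimming", "Cycling", "Tennis", "Football"]
CLASSES = MEDICAL + SPORTS  # 14 classes, matching the classifier output size

def domain_of(label: str) -> str:
    """Map a predicted label back to its domain."""
    return "medical" if label in MEDICAL else "sports"
```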
## 🚀 Running the Demo
### Option 1 – Hugging Face Spaces (live)
Visit the live demo β no setup needed:
👉 https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
Upload any image and click Classify.
### Option 2 – Run locally
**Requirements:** Python 3.9+, ~4 GB RAM on CPU (GPU recommended)
```bash
# 1. Clone this repo
git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
cd Universal_Cross-Domain_Vision_Model

# 2. Install dependencies
pip install -r requirements.txt

# 3. Launch
python app.py
# Opens at http://localhost:7860
```
### Option 3 – REST API
```bash
# Start the API server
uvicorn api:app --host 0.0.0.0 --port 8000

# Classify an image file
curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"

# Classify from a URL
curl -X POST http://localhost:8000/predict/url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/xray.jpg"}'
```
Interactive API docs at http://localhost:8000/docs
### Option 4 – Google Colab
Open `colab_deploy.ipynb` in Colab, set the runtime to T4 GPU, and run all cells.
## 📦 Repository Structure
```
├── app.py               # Gradio web demo (main entry point)
├── api.py               # FastAPI REST inference server
├── requirements.txt     # Python dependencies
├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
├── extract_head.py      # Utility: extract head weights from full checkpoint
├── colab_deploy.ipynb   # One-click Google Colab notebook
└── README.md            # This file
```
**Note on weights:** The four backbone encoders (~1 GB total) are downloaded automatically from the Hugging Face Hub at first startup and cached. Only the fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
## 🧠 Training Details
| Setting | Value |
|---|---|
| Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image-text pairs |
| Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via timm) |
| Medical data | Synthesized X-ray images across 8 pathology classes |
| Sports data | Stanford40 action recognition dataset |
| Fusion | 8-head multi-head attention, 512-dim embedding space |
| Optimizer | AdamW with cosine annealing LR schedule |
| Regularization | Dropout (0.2), domain adversarial training |
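The optimizer and schedule from the table can be sketched as follows. The learning rate, weight decay, and step count here are illustrative assumptions, not the values actually used in training:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter standing in for the fusion + classifier head
param = torch.nn.Parameter(torch.zeros(10))
optimizer = AdamW([param], lr=1e-4, weight_decay=0.01)  # hypothetical hyperparameters
scheduler = CosineAnnealingLR(optimizer, T_max=100)     # cosine anneal over 100 steps

for step in range(100):
    # (in real training, loss.backward() would precede optimizer.step())
    optimizer.step()
    scheduler.step()

lr_end = optimizer.param_groups[0]["lr"]  # annealed to ~0 at the end of the cycle
```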
## 📊 API Response Format
```json
{
  "top_prediction": {
    "label": "Pneumonia",
    "confidence": 0.412
  },
  "predictions": [
    { "label": "Pneumonia",        "confidence": 0.412 },
    { "label": "Normal",           "confidence": 0.198 },
    { "label": "COVID-19",         "confidence": 0.134 },
    { "label": "Tuberculosis",     "confidence": 0.089 },
    { "label": "Cardiomegaly",     "confidence": 0.061 },
    { "label": "Running",          "confidence": 0.044 },
    { "label": "Lung Mass",        "confidence": 0.031 },
    { "label": "Pleural Effusion", "confidence": 0.021 }
  ]
}
```
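A response with this shape can be consumed from Python with the standard library alone. The sketch below uses a toy response that follows the documented fields (`top_prediction`, `predictions`, `label`, `confidence`); the threshold value is an arbitrary example:

```python
import json

# Toy response following the JSON shape documented above
response_text = """
{
  "top_prediction": {"label": "Pneumonia", "confidence": 0.412},
  "predictions": [
    {"label": "Pneumonia", "confidence": 0.412},
    {"label": "COVID-19", "confidence": 0.134}
  ]
}
"""

data = json.loads(response_text)
top = data["top_prediction"]["label"]
# Keep only predictions above a confidence threshold
confident = [p["label"] for p in data["predictions"] if p["confidence"] >= 0.2]
```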
## ⚙️ Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | 7860 (Gradio) / 8000 (API) | Server port |
## 🛠️ Troubleshooting
**Slow first startup** – The four backbones (~1 GB total) are downloaded from the Hugging Face Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
**`head_weights.pt` not found** – The app still runs, but uses random weights for the fusion and classifier layers, so predictions will not reflect actual training. Upload `head_weights.pt` to the repo to enable real predictions.
**Out of memory** – The model falls back to CPU when no GPU is detected. If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
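The CPU fallback mentioned above is typically a one-line device check in PyTorch; a minimal sketch (the variable name and placement in `app.py` are assumptions):

```python
import torch

# Pick the GPU only when one is actually available; otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)  # the head and backbones would then be moved to this device
```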
**Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:

```bash
python extract_head.py
```

This strips the large backbone weights (which are loaded from the HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
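The core of this extraction is filtering the checkpoint's state dict by key prefix. A sketch of that step with a toy dict standing in for the real checkpoint; the prefix names are hypothetical and may not match those in `extract_head.py`:

```python
# Hypothetical backbone key prefixes; the real checkpoint's naming may differ
BACKBONE_PREFIXES = ("biomedclip.", "vit.", "resnet.", "efficientnet.")

def strip_backbones(state_dict):
    """Keep only the fine-tuned fusion/classifier ('head') weights."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(BACKBONE_PREFIXES)}

# Toy state dict standing in for best_model_phase1.pt
full = {"vit.blocks.0.weight": 1, "fusion.attn.weight": 2, "classifier.weight": 3}
head_only = strip_backbones(full)  # backbone keys removed, head keys kept
```

In the real script, `torch.load` / `torch.save` would wrap this filtering to read the full checkpoint and write `head_weights.pt`.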
## 📄 License

MIT – see https://opensource.org/licenses/MIT
## 🙏 Acknowledgements
- Microsoft BiomedCLIP – vision-language model pretrained on 15M medical image-text pairs from PubMed Central
- Stanford40 – sports and human action recognition dataset
- timm – PyTorch Image Models library
- open_clip – open source CLIP implementation
- Gradio – web demo framework
- FastAPI – REST API framework