---
title: Universal Cross-Domain Vision Model
emoji: 🏥🎾
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: mit
---
# 🏥🎾 Universal Cross-Domain Vision Model
A multi-backbone vision model that classifies images across the **medical X-ray pathology** and **sports action** domains, using multi-head attention fusion fine-tuned on top of four pretrained encoders.
[Live demo on Hugging Face Spaces](https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model)
[License: MIT](https://opensource.org/licenses/MIT)
---
## 🧠 Model Architecture
The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
| Backbone | Source | Projected Dim |
|---|---|---|
| BiomedCLIP ViT-B/16 | `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` | 512 |
| ViT-B/16 | `timm` (ImageNet pretrained) | 512 |
| ResNet-50 | `timm` (ImageNet pretrained) | 512 |
| EfficientNet-B0 | `timm` (ImageNet pretrained) | 512 |
Each backbone's features are projected to a shared 512-dim space, then fused via an 8-head attention transformer block. The final classifier head outputs 14 class probabilities with an uncertainty estimate.
```
Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
            ↓ Projection Adapters (per backbone)
            ↓ 8-Head Attention Fusion
            ↓ Classifier → 14 classes + Uncertainty estimate
```
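In PyTorch terms, the fusion stage looks roughly like the sketch below. This is illustrative, not the repo's actual code: the layer names, the native backbone output dimensions (512 for BiomedCLIP, 768 for ViT-B/16, 2048 for ResNet-50, 1280 for EfficientNet-B0), and the scalar uncertainty head are assumptions.
```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative sketch of the multi-backbone attention fusion head."""

    def __init__(self, backbone_dims=(512, 768, 2048, 1280),
                 embed_dim=512, num_heads=8, num_classes=14, dropout=0.2):
        super().__init__()
        # One projection adapter per backbone into the shared 512-dim space.
        self.adapters = nn.ModuleList(
            [nn.Linear(d, embed_dim) for d in backbone_dims])
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.uncertainty = nn.Linear(embed_dim, 1)  # assumed scalar estimate

    def forward(self, feats):
        # feats: list of four tensors, each (batch, backbone_dim)
        tokens = torch.stack(
            [adapt(f) for adapt, f in zip(self.adapters, feats)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # (batch, 4, embed_dim)
        pooled = self.norm(fused).mean(dim=1)         # average over backbones
        return self.classifier(pooled), torch.sigmoid(self.uncertainty(pooled))
```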
---
## 🏷️ Classes
| Domain | Classes |
|---|---|
| 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
| 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
---
## 🚀 Running the Demo
### Option 1 – Hugging Face Spaces (live)
Visit the live demo – no setup needed:
👉 **https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model**
Upload any image and click **Classify**.
### Option 2 – Run locally
**Requirements:** Python 3.9+ and ~4 GB RAM; runs on CPU, but a GPU is recommended.
```bash
# 1. Clone this repo
git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
cd Universal_Cross-Domain_Vision_Model
# 2. Install dependencies
pip install -r requirements.txt
# 3. Launch
python app.py
# Opens at http://localhost:7860
```
### Option 3 – REST API
```bash
# Start the API server
uvicorn api:app --host 0.0.0.0 --port 8000
# Classify an image file
curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"
# Classify from URL
curl -X POST http://localhost:8000/predict/url \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/xray.jpg"}'
```
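The same endpoints can also be called from Python with `requests`; this snippet assumes a server running locally and the response schema shown under **API Response Format** below:
```python
import requests

API = "http://localhost:8000"

# Multipart file upload: mirrors the first curl call above
with open("your_image.jpg", "rb") as f:
    resp = requests.post(f"{API}/predict", files={"file": f})
resp.raise_for_status()
top = resp.json()["top_prediction"]
print(f"{top['label']}: {top['confidence']:.3f}")

# JSON body with an image URL: mirrors the second curl call
resp = requests.post(f"{API}/predict/url",
                     json={"url": "https://example.com/xray.jpg"})
print(resp.json()["top_prediction"])
```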
Interactive API docs at **http://localhost:8000/docs**
### Option 4 – Google Colab
Open `colab_deploy.ipynb` in Colab, set runtime to **T4 GPU**, and run all cells.
---
## 📦 Repository Structure
```
├── app.py               # Gradio web demo (main entry point)
├── api.py               # FastAPI REST inference server
├── requirements.txt     # Python dependencies
├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
├── extract_head.py      # Utility: extract head weights from full checkpoint
├── colab_deploy.ipynb   # One-click Google Colab notebook
└── README.md            # This file
```
> **Note on weights:** The four backbone encoders (~1 GB total) are downloaded
> automatically from Hugging Face Hub at first startup and cached. Only the
> fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
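At startup the app can therefore assemble the full model and overlay only the head tensors. A minimal sketch of that pattern, where `build_model()` is a hypothetical helper that wires up the four backbones plus the fusion head:
```python
import torch

model = build_model()  # hypothetical: four HF-Hub backbones + fusion head

# head_weights.pt deliberately omits backbone keys, so load non-strictly.
head_state = torch.load("head_weights.pt", map_location="cpu")
missing, unexpected = model.load_state_dict(head_state, strict=False)
# `missing` lists the backbone keys left at their pretrained values.
```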
---
## 🧠 Training Details
| Setting | Value |
|---|---|
| Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image-text pairs |
| Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via timm) |
| Medical data | Synthesized X-ray images across 8 pathology classes |
| Sports data | Stanford40 action recognition dataset |
| Fusion | 8-head multi-head attention, 512-dim embedding space |
| Optimizer | AdamW with cosine annealing LR schedule |
| Regularization | Dropout (0.2), domain adversarial training |
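The optimizer row translates to standard PyTorch as sketched below; the learning rate, weight decay, and epoch count are placeholders rather than published values, and the domain-adversarial term is omitted for brevity:
```python
import torch
import torch.nn.functional as F

NUM_EPOCHS = 30  # placeholder; the actual schedule length is not stated here

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

for epoch in range(NUM_EPOCHS):
    for images, labels in train_loader:       # assumed loader over both domains
        logits, _uncertainty = model(images)  # fusion model as sketched above
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # cosine annealing stepped per epoch
```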
---
## 📋 API Response Format
```json
{
"top_prediction": {
"label": "Pneumonia",
"confidence": 0.412
},
"predictions": [
{ "label": "Pneumonia", "confidence": 0.412 },
{ "label": "Normal", "confidence": 0.238 },
{ "label": "COVID-19", "confidence": 0.134 },
{ "label": "Tuberculosis", "confidence": 0.089 },
{ "label": "Cardiomegaly", "confidence": 0.061 },
{ "label": "Running", "confidence": 0.044 },
{ "label": "Lung Mass", "confidence": 0.031 },
{ "label": "Pleural Effusion","confidence": 0.021 }
]
}
```
---
## ⚙️ Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | `7860` (Gradio) / `8000` (API) | Server port |
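For example, assuming `app.py` reads `PORT` as the table describes:
```bash
# Gradio demo on a custom port
PORT=8080 python app.py

# API server on a custom port (uvicorn's own flag)
uvicorn api:app --host 0.0.0.0 --port 9000
```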
---
## 🛠️ Troubleshooting
**Slow first startup** – The four backbones (~1 GB total) are downloaded from HF Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
**`head_weights.pt` not found** – The app still runs but uses random weights for the fusion and classifier layers, so predictions will not reflect actual training. Upload `head_weights.pt` to the repo to enable real predictions.
**Out of memory** – The model falls back to CPU if no GPU is detected (see the sketch below). If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
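A typical device-fallback pattern, shown as an illustrative sketch (the actual logic in `app.py` may differ; `model` and `inputs` are placeholders):
```python
import torch

# Prefer GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()  # placeholder: the assembled fusion model

with torch.inference_mode():     # disables autograd bookkeeping at inference
    logits, uncertainty = model(inputs.to(device))  # preprocessed input batch
```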
**Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:
```bash
python extract_head.py
```
This strips the large backbone weights (which are loaded from HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
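For reference, the core of that extraction can be sketched as follows; the backbone key prefixes are assumptions, so inspect the checkpoint for the real names:
```python
import torch

# Assumed key prefixes for backbone parameters; the real checkpoint may differ.
BACKBONE_PREFIXES = ("biomedclip.", "vit.", "resnet.", "effnet.")

full = torch.load("best_model_phase1.pt", map_location="cpu")
state = full.get("state_dict", full)  # tolerate raw or wrapped checkpoints

# Keep only fusion/classifier tensors; backbones reload from HF Hub at runtime.
head_only = {k: v for k, v in state.items()
             if not k.startswith(BACKBONE_PREFIXES)}
torch.save(head_only, "head_weights.pt")
```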
---
## 📄 License
MIT β see [https://opensource.org/licenses/MIT](https://opensource.org/licenses/MIT)
---
## 🙏 Acknowledgements
- [Microsoft BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224) – vision-language model pretrained on 15M medical image-text pairs from PubMed Central
- [Stanford40](http://vision.stanford.edu/Datasets/40actions.html) – sports and human action recognition dataset
- [timm](https://github.com/huggingface/pytorch-image-models) – PyTorch Image Models library
- [open_clip](https://github.com/mlfoundations/open_clip) – open-source CLIP implementation
- [Gradio](https://gradio.app) – web demo framework
- [FastAPI](https://fastapi.tiangolo.com) – REST API framework