---
title: Universal Cross-Domain Vision Model
emoji: 🏥🎾
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: mit
---
# 🏥🎾 Universal Cross-Domain Vision Model
A multi-backbone vision model that classifies images from two domains, medical X-ray pathologies and sports actions, using fine-tuned multi-head attention fusion on top of four pretrained encoders.
## 🧠 Model Architecture
The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
| Backbone | Source | Output Dim (after projection) |
|---|---|---|
| BiomedCLIP ViT-B/16 | microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 | 512 |
| ViT-B/16 | timm (ImageNet pretrained) | 512 |
| ResNet-50 | timm (ImageNet pretrained) | 512 |
| EfficientNet-B0 | timm (ImageNet pretrained) | 512 |
Each backbone's features are projected to a shared 512-dim space, then fused via an 8-head attention transformer block. The final classifier head outputs 14 class probabilities with an uncertainty estimate.
```
Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
      → Projection Adapters (per backbone)
      → 8-Head Attention Fusion
      → Classifier → 14 classes + uncertainty estimate
```
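The fusion head described above can be sketched in PyTorch. This is a hypothetical illustration, not the code in `app.py`: the class name, the raw backbone feature dimensions (shown here as typical values before projection), and the uncertainty head are all assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Illustrative sketch: per-backbone adapters -> 8-head attention -> classifier."""

    def __init__(self, backbone_dims=(512, 768, 2048, 1280), embed_dim=512,
                 num_heads=8, num_classes=14):
        super().__init__()
        # One projection adapter per backbone, mapping into the shared 512-dim space
        self.adapters = nn.ModuleList(nn.Linear(d, embed_dim) for d in backbone_dims)
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.uncertainty = nn.Linear(embed_dim, 1)  # scalar uncertainty estimate

    def forward(self, features):
        # features: list of 4 tensors, one per backbone, each of shape (B, dim_i)
        tokens = torch.stack(
            [adapter(f) for adapter, f in zip(self.adapters, features)], dim=1
        )  # (B, 4, embed_dim): one token per backbone
        fused, _ = self.fusion(tokens, tokens, tokens)  # self-attention over backbones
        pooled = fused.mean(dim=1)                      # average the 4 fused tokens
        return self.classifier(pooled), self.uncertainty(pooled)

head = AttentionFusionHead()
feats = [torch.randn(2, d) for d in (512, 768, 2048, 1280)]
logits, unc = head(feats)  # (2, 14) class logits and (2, 1) uncertainty
```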
## 🏷️ Classes
| Domain | Classes |
|---|---|
| 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
| 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
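The two domains form a flat 14-way class list for the classifier head. A minimal sketch, assuming the label strings match the table above (the exact strings in `app.py` may differ):

```python
# Hypothetical label lists mirroring the table above
MEDICAL = ["Normal", "Pneumonia", "COVID-19", "Tuberculosis",
           "Cardiomegaly", "Rib Fracture", "Lung Mass", "Pleural Effusion"]
SPORTS = ["Running", "Jumping", "Swimming", "Cycling", "Tennis", "Football"]
CLASSES = MEDICAL + SPORTS  # 14 classes, matching the classifier output size

def domain_of(label: str) -> str:
    """Map a predicted label back to its domain."""
    return "medical" if label in MEDICAL else "sports"
```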
## 🚀 Running the Demo
### Option 1 – Hugging Face Spaces (live)
Visit the live demo β no setup needed:
👉 https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
Upload any image and click Classify.
### Option 2 – Run locally
**Requirements:** Python 3.9+, ~4 GB RAM on CPU (GPU recommended)
```bash
# 1. Clone this repo
git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
cd Universal_Cross-Domain_Vision_Model

# 2. Install dependencies
pip install -r requirements.txt

# 3. Launch
python app.py
# Opens at http://localhost:7860
```
### Option 3 – REST API
```bash
# Start the API server
uvicorn api:app --host 0.0.0.0 --port 8000

# Classify an image file
curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"

# Classify from a URL
curl -X POST http://localhost:8000/predict/url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/xray.jpg"}'
```
Interactive API docs at http://localhost:8000/docs
### Option 4 – Google Colab
Open `colab_deploy.ipynb` in Colab, set the runtime to T4 GPU, and run all cells.
## 📦 Repository Structure
```
├── app.py               # Gradio web demo (main entry point)
├── api.py               # FastAPI REST inference server
├── requirements.txt     # Python dependencies
├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
├── extract_head.py      # Utility: extract head weights from full checkpoint
├── colab_deploy.ipynb   # One-click Google Colab notebook
└── README.md            # This file
```
**Note on weights:** The four backbone encoders (~1 GB total) are downloaded automatically from the Hugging Face Hub at first startup and cached. Only the fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
## 🧠 Training Details
| Setting | Value |
|---|---|
| Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image-text pairs |
| Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via timm) |
| Medical data | Synthesized X-ray images across 8 pathology classes |
| Sports data | Stanford40 action recognition dataset |
| Fusion | 8-head multi-head attention, 512-dim embedding space |
| Optimizer | AdamW with cosine annealing LR schedule |
| Regularization | Dropout (0.2), domain adversarial training |
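The optimizer and schedule from the table can be sketched as follows. The learning rate, weight decay, and step count here are illustrative assumptions, not the values actually used in training:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter standing in for the fusion + classifier head
param = torch.nn.Parameter(torch.zeros(10))
optimizer = AdamW([param], lr=1e-4, weight_decay=0.01)  # hypothetical hyperparameters
scheduler = CosineAnnealingLR(optimizer, T_max=100)     # cosine anneal over 100 steps

for step in range(100):
    # (in real training, loss.backward() would precede optimizer.step())
    optimizer.step()
    scheduler.step()

lr_end = optimizer.param_groups[0]["lr"]  # annealed to ~0 at the end of the cycle
```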
## 📊 API Response Format
```json
{
  "top_prediction": {
    "label": "Pneumonia",
    "confidence": 0.412
  },
  "predictions": [
    { "label": "Pneumonia",        "confidence": 0.412 },
    { "label": "Normal",           "confidence": 0.198 },
    { "label": "COVID-19",         "confidence": 0.134 },
    { "label": "Tuberculosis",     "confidence": 0.089 },
    { "label": "Cardiomegaly",     "confidence": 0.061 },
    { "label": "Running",          "confidence": 0.044 },
    { "label": "Lung Mass",        "confidence": 0.031 },
    { "label": "Pleural Effusion", "confidence": 0.021 }
  ]
}
```
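A response with this shape can be consumed from Python with the standard library alone. The sketch below uses a toy response that follows the documented fields (`top_prediction`, `predictions`, `label`, `confidence`); the threshold value is an arbitrary example:

```python
import json

# Toy response following the JSON shape documented above
response_text = """
{
  "top_prediction": {"label": "Pneumonia", "confidence": 0.412},
  "predictions": [
    {"label": "Pneumonia", "confidence": 0.412},
    {"label": "COVID-19", "confidence": 0.134}
  ]
}
"""

data = json.loads(response_text)
top = data["top_prediction"]["label"]
# Keep only predictions above a confidence threshold
confident = [p["label"] for p in data["predictions"] if p["confidence"] >= 0.2]
```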
## ⚙️ Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | 7860 (Gradio) / 8000 (API) | Server port |
## 🛠️ Troubleshooting
**Slow first startup** – The four backbones (~1 GB total) are downloaded from the Hugging Face Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
**`head_weights.pt` not found** – The app still runs, but uses random weights for the fusion and classifier layers, so predictions will not reflect actual training. Upload `head_weights.pt` to the repo to enable real predictions.
**Out of memory** – The model falls back to CPU when no GPU is detected. If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
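The CPU fallback mentioned above is typically a one-line device check in PyTorch; a minimal sketch (the variable name and placement in `app.py` are assumptions):

```python
import torch

# Pick the GPU only when one is actually available; otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)  # the head and backbones would then be moved to this device
```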
**Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:

```bash
python extract_head.py
```

This strips the large backbone weights (which are loaded from the HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
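The core of this extraction is filtering the checkpoint's state dict by key prefix. A sketch of that step with a toy dict standing in for the real checkpoint; the prefix names are hypothetical and may not match those in `extract_head.py`:

```python
# Hypothetical backbone key prefixes; the real checkpoint's naming may differ
BACKBONE_PREFIXES = ("biomedclip.", "vit.", "resnet.", "efficientnet.")

def strip_backbones(state_dict):
    """Keep only the fine-tuned fusion/classifier ('head') weights."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(BACKBONE_PREFIXES)}

# Toy state dict standing in for best_model_phase1.pt
full = {"vit.blocks.0.weight": 1, "fusion.attn.weight": 2, "classifier.weight": 3}
head_only = strip_backbones(full)  # backbone keys removed, head keys kept
```

In the real script, `torch.load` / `torch.save` would wrap this filtering to read the full checkpoint and write `head_weights.pt`.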
## 📄 License

MIT – see https://opensource.org/licenses/MIT
## 🙏 Acknowledgements
- Microsoft BiomedCLIP – vision-language model pretrained on 15M medical image-text pairs from PubMed Central
- Stanford40 – sports and human action recognition dataset
- timm – PyTorch Image Models library
- open_clip – open source CLIP implementation
- Gradio – web demo framework
- FastAPI – REST API framework