Elliot89 committed
Commit 25589b2 · verified · 1 Parent(s): 9b58add

Upload README.md

Files changed (1):
  README.md +185 -41
README.md CHANGED
@@ -10,44 +10,188 @@ pinned: false
 license: mit
 ---
 
- # Universal Cross-Domain Vision Model
-
- A BiomedCLIP-powered vision model that classifies images across **medical** and **sports** domains using multi-modal attention fusion.
-
- ## How to deploy to Hugging Face Spaces
-
- 1. Create a new Space at https://huggingface.co/new-space
-    - SDK: **Gradio**
-    - Visibility: Public or Private
-
- 2. Upload these files to the Space repository:
-    ```
-    app.py
-    requirements.txt
-    README_HF_SPACES.md   ← rename this to README.md in the Space
-    ```
-
- 3. Upload your checkpoint:
-    ```
-    universal_vision_checkpoints/best_model_phase1.pt
-    ```
-    > For large files (>1 GB) use Git LFS:
-    > ```bash
-    > git lfs install
-    > git lfs track "*.pt"
-    > git add .gitattributes
-    > ```
-
- 4. Set the environment variable in Space Settings → Variables:
-    ```
-    CHECKPOINT_PATH = universal_vision_checkpoints/best_model_phase1.pt
-    ```
-
- 5. The Space will build automatically. First build takes ~5 minutes.
-
- ## Classes
-
- | Domain | Classes |
- |----------|---------|
- | Medical | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
- | Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
+ # 🏥🎾 Universal Cross-Domain Vision Model
+
+ A multi-backbone vision model that classifies images across the **medical X-ray pathology** and **sports action** domains, using fine-tuned multi-modal attention fusion on top of four pretrained encoders.
+
+ [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ ---
+
+ ## 🧠 Model Architecture
+
+ The model fuses features from four pretrained backbone encoders through a learned multi-head attention fusion layer:
+
+ | Backbone | Source | Output Dim |
+ |---|---|---|
+ | BiomedCLIP ViT-B/16 | `microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224` | 512 |
+ | ViT-B/16 | `timm` (ImageNet pretrained) | 512 |
+ | ResNet-50 | `timm` (ImageNet pretrained) | 512 |
+ | EfficientNet-B0 | `timm` (ImageNet pretrained) | 512 |
+
+ Each backbone's features are projected into a shared 512-dim space, then fused by an 8-head attention transformer block. The final classifier head outputs probabilities over the 14 classes together with an uncertainty estimate.
+
+ ```
+ Image → [BiomedCLIP, ViT-B/16, ResNet-50, EfficientNet-B0]
+       → Projection Adapters (per backbone)
+       → 8-Head Attention Fusion
+       → Classifier → 14 classes + Uncertainty estimate
+ ```
+
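As a rough PyTorch sketch of the fusion path described above (the module names, mean-pooling step, and sigmoid uncertainty head are assumptions for illustration; the actual implementation in `app.py` may differ):

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Illustrative sketch: project 4 backbone feature vectors into a shared
    512-dim space, fuse them with 8-head self-attention, then classify."""

    def __init__(self, backbone_dims=(512, 512, 512, 512), dim=512, heads=8, n_classes=14):
        super().__init__()
        # One projection adapter per backbone into the shared space
        self.adapters = nn.ModuleList(nn.Linear(d, dim) for d in backbone_dims)
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)
        self.uncertainty = nn.Linear(dim, 1)

    def forward(self, feats):
        # feats: list of (B, d_i) feature tensors, one per backbone
        tokens = torch.stack([a(f) for a, f in zip(self.adapters, feats)], dim=1)  # (B, 4, 512)
        fused, _ = self.fusion(tokens, tokens, tokens)  # self-attention across backbones
        pooled = fused.mean(dim=1)                      # (B, 512)
        return self.classifier(pooled), torch.sigmoid(self.uncertainty(pooled))

head = AttentionFusionHead()
feats = [torch.randn(2, 512) for _ in range(4)]
logits, unc = head(feats)
print(logits.shape, unc.shape)  # torch.Size([2, 14]) torch.Size([2, 1])
```

The uncertainty output here is simply a sigmoid-squashed linear head; how the real checkpoint estimates uncertainty is not documented in this README.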
42
+ ---
+
+ ## 🏷️ Classes
+
+ | Domain | Classes |
+ |---|---|
+ | 🏥 Medical (X-ray) | Normal, Pneumonia, COVID-19, Tuberculosis, Cardiomegaly, Rib Fracture, Lung Mass, Pleural Effusion |
+ | 🎾 Sports | Running, Jumping, Swimming, Cycling, Tennis, Football |
+
+ ---
+
+ ## 🚀 Running the Demo
+
+ ### Option 1 – Hugging Face Spaces (live)
+
+ Visit the live demo – no setup needed:
+
+ 👉 **https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model**
+
+ Upload any image and click **Classify**.
+
+ ### Option 2 – Run locally
+
+ **Requirements:** Python 3.9+, ~4 GB RAM (CPU); GPU recommended
+
+ ```bash
+ # 1. Clone this repo
+ git clone https://huggingface.co/spaces/Elliot89/Universal_Cross-Domain_Vision_Model
+ cd Universal_Cross-Domain_Vision_Model
+
+ # 2. Install dependencies
+ pip install -r requirements.txt
+
+ # 3. Launch
+ python app.py
+ # Opens at http://localhost:7860
+ ```
+
+ ### Option 3 – REST API
+
+ ```bash
+ # Start the API server
+ uvicorn api:app --host 0.0.0.0 --port 8000
+
+ # Classify an image file
+ curl -X POST http://localhost:8000/predict -F "file=@your_image.jpg"
+
+ # Classify from a URL
+ curl -X POST http://localhost:8000/predict/url \
+   -H "Content-Type: application/json" \
+   -d '{"url": "https://example.com/xray.jpg"}'
+ ```
+
+ Interactive API docs at **http://localhost:8000/docs**
+
+ ### Option 4 – Google Colab
+
+ Open `colab_deploy.ipynb` in Colab, set the runtime to **T4 GPU**, and run all cells.
+
+ ---
+
103
+ ## 📦 Repository Structure
+
+ ```
+ ├── app.py               # Gradio web demo (main entry point)
+ ├── api.py               # FastAPI REST inference server
+ ├── requirements.txt     # Python dependencies
+ ├── head_weights.pt      # Fine-tuned fusion + classifier weights (~25 MB)
+ ├── extract_head.py      # Utility: extract head weights from full checkpoint
+ ├── colab_deploy.ipynb   # One-click Google Colab notebook
+ └── README.md            # This file
+ ```
+
+ > **Note on weights:** The four backbone encoders (~1 GB total) are downloaded
+ > automatically from the Hugging Face Hub at first startup and cached. Only the
+ > fine-tuned head (`head_weights.pt`, ~25 MB) is stored in this repo.
+
+ ---
+
+ ## 🔧 Training Details
+
+ | Setting | Value |
+ |---|---|
+ | Base model | BiomedCLIP (Microsoft), pretrained on PMC-15M medical image–text pairs |
+ | Additional backbones | ViT-B/16, ResNet-50, EfficientNet-B0 (ImageNet pretrained via `timm`) |
+ | Medical data | Synthesized X-ray images across 8 pathology classes |
+ | Sports data | Stanford40 action recognition dataset |
+ | Fusion | 8-head multi-head attention, 512-dim embedding space |
+ | Optimizer | AdamW with cosine-annealing LR schedule |
+ | Regularization | Dropout (0.2), domain-adversarial training |
+
+ ---
+
135
+ ## 📋 API Response Format
+
+ ```json
+ {
+   "top_prediction": {
+     "label": "Pneumonia",
+     "confidence": 0.412
+   },
+   "predictions": [
+     { "label": "Pneumonia",        "confidence": 0.412 },
+     { "label": "Normal",           "confidence": 0.238 },
+     { "label": "COVID-19",         "confidence": 0.134 },
+     { "label": "Tuberculosis",     "confidence": 0.089 },
+     { "label": "Cardiomegaly",     "confidence": 0.061 },
+     { "label": "Running",          "confidence": 0.044 },
+     { "label": "Lung Mass",        "confidence": 0.031 },
+     { "label": "Pleural Effusion", "confidence": 0.021 }
+   ]
+ }
+ ```
+
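A client can pull the top prediction out of this payload with plain `json` handling; a minimal sketch using the field names documented above, on an abbreviated sample response:

```python
import json

# Abbreviated sample payload in the documented response format
payload = json.loads("""
{
  "top_prediction": {"label": "Pneumonia", "confidence": 0.412},
  "predictions": [
    {"label": "Pneumonia", "confidence": 0.412},
    {"label": "Normal", "confidence": 0.238}
  ]
}
""")

top = payload["top_prediction"]
print(f'{top["label"]}: {top["confidence"]:.1%}')  # Pneumonia: 41.2%
```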
156
+ ---
+
+ ## ⚙️ Environment Variables
+
+ | Variable | Default | Description |
+ |---|---|---|
+ | `PORT` | `7860` (Gradio) / `8000` (API) | Server port |
+
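Resolving `PORT` with a fallback is a one-liner in Python; a sketch of the usual pattern (the helper name is illustrative, and how `app.py`/`api.py` actually read the variable may differ):

```python
import os

def resolve_port(default: int = 7860) -> int:
    """Read the server port from the PORT env var, falling back to a default."""
    return int(os.environ.get("PORT", default))
```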
164
+ ---
+
+ ## 🛠️ Troubleshooting
+
+ **Slow first startup** – The four backbones (~1 GB total) are downloaded from the HF Hub on first run and cached. On HF Spaces this happens automatically during the build phase.
+
+ **`head_weights.pt` not found** – The app still runs but uses random weights for the fusion and classifier layers, so predictions will not reflect the actual training. Upload `head_weights.pt` to the repo to enable real predictions.
+
+ **Out of memory** – The model runs on CPU if no GPU is detected. If memory is tight, reduce the image resolution or comment out extra backbones in `app.py`.
+
+ **Regenerating `head_weights.pt` from the original checkpoint** – If you have `best_model_phase1.pt`, run:
+
+ ```bash
+ python extract_head.py
+ ```
+
+ This strips the large backbone weights (which are loaded from the HF Hub) and saves only the fine-tuned layers (~25 MB) as `head_weights.pt`.
+
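The core of that extraction step amounts to filtering the checkpoint's state dict by key; a rough sketch on a toy dictionary (the `backbones.` key prefix and function name are assumptions, not taken from `extract_head.py`):

```python
def strip_backbone_weights(state_dict, backbone_prefix="backbones."):
    """Keep only the fine-tuned fusion/classifier ('head') entries,
    dropping everything under the assumed backbone key prefix."""
    return {k: v for k, v in state_dict.items() if not k.startswith(backbone_prefix)}

# Toy checkpoint: two backbone tenskeys plus two fine-tuned head keys
checkpoint = {
    "backbones.vit.blocks.0.attn.qkv.weight": "...",
    "backbones.resnet.layer1.0.conv1.weight": "...",
    "fusion.attn.in_proj_weight": "...",
    "classifier.weight": "...",
}
head = strip_backbone_weights(checkpoint)
print(sorted(head))  # ['classifier.weight', 'fusion.attn.in_proj_weight']
```

The real script presumably loads `best_model_phase1.pt` with `torch.load` and saves the filtered dict with `torch.save`; only the filtering logic is sketched here.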
182
+ ---
+
+ ## 📄 License
+
+ MIT – see [https://opensource.org/licenses/MIT](https://opensource.org/licenses/MIT)
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ - [Microsoft BiomedCLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224) – vision-language model pretrained on 15M medical image–text pairs from PubMed Central
+ - [Stanford40](http://vision.stanford.edu/Datasets/40actions.html) – sports and human action recognition dataset
+ - [timm](https://github.com/huggingface/pytorch-image-models) – PyTorch Image Models library
+ - [open_clip](https://github.com/mlfoundations/open_clip) – open-source CLIP implementation
+ - [Gradio](https://gradio.app) – web demo framework
+ - [FastAPI](https://fastapi.tiangolo.com) – REST API framework