---
library_name: transformers
pipeline_tag: feature-extraction
tags:
  - medical
  - vision-language
  - clip
  - various modalities
  - conceptclip
  - explainable-ai
  - multimodal
license: mit
language:
  - en
extra_gated_heading: Responsible access request for ConceptCLIP
extra_gated_description: >-
  Please acknowledge the intended-use and safety notice below before accessing
  the model files.
extra_gated_button_content: Acknowledge and access
extra_gated_prompt: >-
  By requesting access to this repository, you acknowledge that ConceptCLIP is a
  research model for medical image understanding and explainability. It is not a
  medical device and must not be used as the sole basis for diagnosis,
  treatment, triage, or other clinical decision-making. You are responsible for
  validating performance, safety, and regulatory compliance before any
  real-world deployment.
extra_gated_fields:
  Affiliation: text
  Country: country
  Intended use:
    type: select
    options:
      - Research
      - Education
      - Benchmarking / evaluation
      - Internal product evaluation
      - Other
  I understand this model is not a medical device: checkbox
  I will not use this model as the sole basis for diagnosis or treatment: checkbox
  I will keep appropriate human oversight for any medical use case: checkbox
  I will comply with the MIT license and applicable laws / institutional policies: checkbox
---

Model Card for ConceptCLIP

Model Details

Model Description

ConceptCLIP is a concept-enhanced biomedical vision-language foundation model designed for diverse medical image modalities. It is built to support both strong predictive performance and clinically meaningful concept-level explainability.

  • Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
  • Model type: Vision-Language Pre-trained Model (medical specialized)
  • Language(s): English (text), multi-modal (medical imaging)
  • License: MIT
  • Base architecture: SigLIP-ViT-400M-16 image encoder + PubMedBERT text encoder
  • Training recipe: Global image-text alignment (IT-Align) + local region-concept alignment (RC-Align)

ConceptCLIP is intended to help researchers study trustworthy medical AI, especially settings where performance and interpretability both matter.

Model Sources

Gated Access Notice

This repository uses gated access with contact sharing and an additional acknowledgment form.

The goal of the gate is not to restrict normal academic or industrial research usage under the repository license. Instead, it is meant to make users explicitly acknowledge the safety-sensitive nature of medical AI and the need for responsible use.

By requesting access, users confirm that they understand:

  • this repository is for research, benchmarking, education, and responsible model development
  • the model is not a medical device
  • the model must not be used as the sole basis for diagnosis, treatment, or triage
  • any real-world deployment requires local validation, human oversight, and regulatory / institutional compliance

Intended Uses

Direct Use

  • Zero-shot medical image classification
  • Cross-modal retrieval
  • Zero-shot concept annotation
  • Feature extraction for pathology whole-slide image analysis
  • Feature extraction for medical report generation research
  • Feature extraction for medical visual question answering pipelines
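
The zero-shot classification, retrieval, and concept-annotation uses above all reduce to the same primitive: comparing L2-normalized image and text embeddings by cosine similarity and ranking the results. The sketch below illustrates that ranking step on dummy feature arrays; the arrays stand in for the `image_features` / `text_features` the model returns, and their shapes and values are illustrative only.

```python
import numpy as np

def rank_by_cosine(image_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Return, for each image, text indices sorted from most to least similar."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sims = img @ txt.T                 # (n_images, n_texts) similarity matrix
    return np.argsort(-sims, axis=-1)  # indices in descending similarity order

# Toy 4-dim embeddings: image 0 is aligned with text 1, image 1 with text 0
image_feats = np.array([[0.0, 1.0, 0.0, 0.1],
                        [1.0, 0.0, 0.1, 0.0]])
text_feats = np.array([[0.9, 0.1, 0.0, 0.0],
                       [0.1, 0.9, 0.0, 0.0]])
print(rank_by_cosine(image_feats, text_feats))  # first column: best text per image
```

For retrieval, the first k columns of the returned ranking are the top-k candidates; for zero-shot classification, the texts are label prompts and only the top-1 column matters.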

Downstream Use

  • Fine-tuning for medical imaging tasks such as CT, MRI, X-ray, ultrasound, pathology, dermatology, and fundus analysis
  • Building concept bottleneck models or other interpretable pipelines
  • Studying concept-grounded medical explainability
  • Internal evaluation of multimodal medical foundation models
  • Medical AI education, training, and benchmarking
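
As one illustration of the concept-bottleneck idea listed above: the image embedding is first scored against a bank of named concept (text) embeddings, and only those interpretable concept scores feed the final classifier, so each prediction can be attributed to individual concepts. This is a minimal numpy sketch with made-up features and weights; in practice the embeddings would come from ConceptCLIP's encoders and the linear head would be learned.

```python
import numpy as np

def concept_bottleneck_logits(image_feat, concept_embs, head_weights):
    """Classify via interpretable concept scores instead of raw features."""
    # Step 1: cosine similarity between the image and each named concept
    img = image_feat / np.linalg.norm(image_feat)
    con = concept_embs / np.linalg.norm(concept_embs, axis=-1, keepdims=True)
    concept_scores = con @ img            # (n_concepts,) -- the "bottleneck"
    # Step 2: the head sees ONLY concept scores, never the raw image features
    return head_weights @ concept_scores  # (n_classes,)

rng = np.random.default_rng(0)
image_feat = rng.normal(size=8)           # stand-in for an image embedding
concept_embs = rng.normal(size=(5, 8))    # e.g. 5 clinical concept embeddings
head_weights = rng.normal(size=(2, 5))    # 2 hypothetical diagnostic classes
logits = concept_bottleneck_logits(image_feat, concept_embs, head_weights)
```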

Out-of-Scope Use

  • Direct clinical diagnosis without task-specific validation
  • Autonomous patient-care decision making
  • Emergency or high-stakes use without qualified human oversight
  • General-purpose non-medical computer vision claims without further validation
  • Replacing clinician judgment

Bias, Risks, and Limitations

  • Training data may contain demographic, institutional, publication, and modality biases.
  • Performance can vary across hospitals, scanners, acquisition protocols, disease prevalence, and patient populations.
  • Concept-level explanations can be useful, but they do not guarantee causal correctness.
  • Local region-concept alignment may sometimes focus on artifacts or spurious visual cues.
  • The model should not be treated as a validated clinical system merely because it performs strongly on benchmarks.

Recommendations

  • Validate outputs with clinical experts before any medical deployment.
  • Perform task-specific validation on in-domain data.
  • Audit subgroup performance and failure modes before real-world use.
  • Keep a qualified human in the loop for any medical workflow.
  • Document all local adaptations, fine-tuning, and evaluation settings.

How to Get Started with the Model

from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor; trust_remote_code is required because the
# repository ships a custom model class
model = AutoModel.from_pretrained("JerrryNie/ConceptCLIP", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("JerrryNie/ConceptCLIP", trust_remote_code=True)

# One query image and a set of candidate label prompts
image = Image.open("example_data/chest_X-ray.jpg").convert("RGB")
labels = ["chest X-ray", "brain MRI", "skin lesion"]
texts = [f"a medical image of {label}" for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(model.device)

# Score the image against each prompt and normalize to probabilities
with torch.no_grad():
    outputs = model(**inputs)
    probs = (
        outputs["logit_scale"]
        * outputs["image_features"]
        @ outputs["text_features"].t()
    ).softmax(dim=-1)[0]

print({label: f"{prob:.2%}" for label, prob in zip(labels, probs)})

Training Details

Training Data

ConceptCLIP is trained on large-scale biomedical image-text-concept triplets from the MedConcept-23M dataset.

Training Procedure

  • Built on a CLIP-style biomedical vision-language framework
  • Uses IT-Align for global image-text alignment
  • Uses RC-Align for fine-grained region-concept alignment
  • Designed to learn both transferable visual features and interpretable concept-level grounding
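
IT-Align is described as global image-text alignment in a CLIP-style framework; the usual form of such an objective is a symmetric InfoNCE loss over a batch, where the matched image-text pairs sit on the diagonal of the similarity matrix. The numpy sketch below shows that generic loss, not the authors' exact implementation (a SigLIP-based model may use a pairwise sigmoid objective instead).

```python
import numpy as np

def symmetric_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """CLIP-style loss: pair (i, i) is the positive for both directions."""
    img = img_feats / np.linalg.norm(img_feats, axis=-1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch)

    def xent_diag(m):
        # cross-entropy of each row's softmax against the diagonal target
        m = m - m.max(axis=-1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
loss = symmetric_contrastive_loss(img, img.copy())  # perfectly aligned pairs
```

With perfectly aligned pairs the diagonal dominates and the loss is near zero; mismatched batches push it toward log(batch_size).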

Training Hyperparameters

  • Image encoder: SigLIP-ViT-400M-16
  • Text encoder: PubMedBERT
  • Training regime: Mixed precision training
  • Batch size: 12,288 without RC-Align, 6,144 with RC-Align
  • Learning rate: 5e-4 without RC-Align, 3e-4 with RC-Align

Evaluation

Testing Data & Metrics

ConceptCLIP has been evaluated on a broad set of medical image understanding tasks, including:

  • medical image diagnosis
  • cross-modal retrieval
  • medical visual question answering
  • medical report generation
  • pathology whole-slide image analysis
  • concept annotation and explainable AI

Representative evaluation metrics include AUC, Recall@k, accuracy-based metrics, report-generation metrics, and C-index.
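
For reference, Recall@k in the retrieval setting is the fraction of queries whose ground-truth match appears among the top-k ranked results. A small pure-Python sketch (the ranked lists here are hypothetical, standing in for similarity-sorted retrieval output):

```python
def recall_at_k(ranked_indices, ground_truth, k):
    """Fraction of queries whose true match is in the top-k retrieved items."""
    hits = sum(1 for ranks, gt in zip(ranked_indices, ground_truth)
               if gt in ranks[:k])
    return hits / len(ground_truth)

# 3 queries; each inner list is item indices sorted by similarity (best first)
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
truth = [2, 0, 1]                       # the correct item for each query
print(recall_at_k(ranked, truth, k=1))  # only query 0 hits at rank 1
```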

Responsible Release Notes

This release is meant to support open research on trustworthy and interpretable biomedical foundation models.

Although the repository license is MIT, users remain responsible for ensuring that any downstream use complies with local law, institutional policy, data governance requirements, and medical-device or clinical-deployment regulations where applicable.

Citation

BibTeX:

@article{nie2025conceptclip,
  title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}

APA:

Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.

Contact

  • Primary contact: Yuxiang Nie — ynieae@connect.ust.hk
  • For code issues and reproducibility questions, please use the GitHub repository's issue tracker.