---
library_name: transformers
pipeline_tag: feature-extraction
tags:
  - medical
  - vision-language
  - clip
  - various modalities
  - conceptclip
  - explainable-ai
  - multimodal
license: mit
language:
  - en
extra_gated_heading: Responsible access request for ConceptCLIP
extra_gated_description: >-
  Please acknowledge the intended-use and safety notice below before accessing
  the model files.
extra_gated_button_content: Acknowledge and access
extra_gated_prompt: >-
  By requesting access to this repository, you acknowledge that ConceptCLIP is a
  research model for medical image understanding and explainability. It is not a
  medical device and must not be used as the sole basis for diagnosis,
  treatment, triage, or other clinical decision-making. You are responsible for
  validating performance, safety, and regulatory compliance before any
  real-world deployment.
extra_gated_fields:
  Affiliation: text
  Country: country
  Intended use:
    type: select
    options:
      - Research
      - Education
      - Benchmarking / evaluation
      - Internal product evaluation
      - Other
  I understand this model is not a medical device: checkbox
  I will not use this model as the sole basis for diagnosis or treatment: checkbox
  I will keep appropriate human oversight for any medical use case: checkbox
  I will comply with the MIT license and applicable laws / institutional policies: checkbox
---

Model Card for ConceptCLIP

Model Details

Model Description

ConceptCLIP is a concept-enhanced biomedical vision-language foundation model designed for diverse medical image modalities. It is built to support both strong predictive performance and clinically meaningful concept-level explainability.

  • Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
  • Model type: Vision-Language Pre-trained Model (medical specialized)
  • Language(s): English (text), multi-modal (medical imaging)
  • License: MIT
  • Base architecture: SigLIP-ViT-400M-16 image encoder + PubMedBERT text encoder
  • Training recipe: Global image-text alignment (IT-Align) + local region-concept alignment (RC-Align)

ConceptCLIP is intended to help researchers study trustworthy medical AI, especially settings where performance and interpretability both matter.

Model Sources

Gated Access Notice

This repository uses gated access with contact sharing and an additional acknowledgment form.

The goal of the gate is not to restrict normal academic or industrial research usage under the repository license. Instead, it is meant to make users explicitly acknowledge the safety-sensitive nature of medical AI and the need for responsible use.

By requesting access, users confirm that they understand:

  • this repository is for research, benchmarking, education, and responsible model development
  • the model is not a medical device
  • the model must not be used as the sole basis for diagnosis, treatment, or triage
  • any real-world deployment requires local validation, human oversight, and regulatory / institutional compliance

Intended Uses

Direct Use

  • Zero-shot medical image classification
  • Cross-modal retrieval
  • Zero-shot concept annotation
  • Feature extraction for pathology whole-slide image analysis
  • Feature extraction for medical report generation research
  • Feature extraction for medical visual question answering pipelines
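
The zero-shot classification, retrieval, and concept-annotation uses above all reduce to the same primitive: comparing L2-normalized image and text embeddings by cosine similarity and ranking the results. The sketch below illustrates that ranking step on dummy feature arrays; the arrays stand in for the `image_features` / `text_features` the model returns, and their shapes and values are illustrative only.

```python
import numpy as np

def rank_by_cosine(image_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Return, for each image, text indices sorted from most to least similar."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sims = img @ txt.T                 # (n_images, n_texts) similarity matrix
    return np.argsort(-sims, axis=-1)  # indices in descending similarity order

# Toy 4-dim embeddings: image 0 is aligned with text 1, image 1 with text 0
image_feats = np.array([[0.0, 1.0, 0.0, 0.1],
                        [1.0, 0.0, 0.1, 0.0]])
text_feats = np.array([[0.9, 0.1, 0.0, 0.0],
                       [0.1, 0.9, 0.0, 0.0]])
print(rank_by_cosine(image_feats, text_feats))  # first column: best text per image
```

For retrieval, the first k columns of the returned ranking are the top-k candidates; for zero-shot classification, the texts are label prompts and only the top-1 column matters.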

Downstream Use

  • Fine-tuning for medical imaging tasks such as CT, MRI, X-ray, ultrasound, pathology, dermatology, and fundus analysis
  • Building concept bottleneck models or other interpretable pipelines
  • Studying concept-grounded medical explainability
  • Internal evaluation of multimodal medical foundation models
  • Medical AI education, training, and benchmarking
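
As one illustration of the concept-bottleneck idea listed above: the image embedding is first scored against a bank of named concept (text) embeddings, and only those interpretable concept scores feed the final classifier, so each prediction can be attributed to individual concepts. This is a minimal numpy sketch with made-up features and weights; in practice the embeddings would come from ConceptCLIP's encoders and the linear head would be learned.

```python
import numpy as np

def concept_bottleneck_logits(image_feat, concept_embs, head_weights):
    """Classify via interpretable concept scores instead of raw features."""
    # Step 1: cosine similarity between the image and each named concept
    img = image_feat / np.linalg.norm(image_feat)
    con = concept_embs / np.linalg.norm(concept_embs, axis=-1, keepdims=True)
    concept_scores = con @ img            # (n_concepts,) -- the "bottleneck"
    # Step 2: the head sees ONLY concept scores, never the raw image features
    return head_weights @ concept_scores  # (n_classes,)

rng = np.random.default_rng(0)
image_feat = rng.normal(size=8)           # stand-in for an image embedding
concept_embs = rng.normal(size=(5, 8))    # e.g. 5 clinical concept embeddings
head_weights = rng.normal(size=(2, 5))    # 2 hypothetical diagnostic classes
logits = concept_bottleneck_logits(image_feat, concept_embs, head_weights)
```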

Out-of-Scope Use

  • Direct clinical diagnosis without task-specific validation
  • Autonomous patient-care decision making
  • Emergency or high-stakes use without qualified human oversight
  • General-purpose non-medical computer vision claims without further validation
  • Replacing clinician judgment

Bias, Risks, and Limitations

  • Training data may contain demographic, institutional, publication, and modality biases.
  • Performance can vary across hospitals, scanners, acquisition protocols, disease prevalence, and patient populations.
  • Concept-level explanations can be useful, but they do not guarantee causal correctness.
  • Local region-concept alignment may sometimes focus on artifacts or spurious visual cues.
  • The model should not be treated as a validated clinical system merely because it performs strongly on benchmarks.

Recommendations

  • Validate outputs with clinical experts before any medical deployment.
  • Perform task-specific validation on in-domain data.
  • Audit subgroup performance and failure modes before real-world use.
  • Keep a qualified human in the loop for any medical workflow.
  • Document all local adaptations, fine-tuning, and evaluation settings.

How to Get Started with the Model

from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor; trust_remote_code is required because the
# repository ships a custom model class
model = AutoModel.from_pretrained("JerrryNie/ConceptCLIP", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("JerrryNie/ConceptCLIP", trust_remote_code=True)

# One query image and a set of candidate label prompts
image = Image.open("example_data/chest_X-ray.jpg").convert("RGB")
labels = ["chest X-ray", "brain MRI", "skin lesion"]
texts = [f"a medical image of {label}" for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(model.device)

# Score the image against each prompt and normalize to probabilities
with torch.no_grad():
    outputs = model(**inputs)
    probs = (
        outputs["logit_scale"]
        * outputs["image_features"]
        @ outputs["text_features"].t()
    ).softmax(dim=-1)[0]

print({label: f"{prob:.2%}" for label, prob in zip(labels, probs)})

Training Details

Training Data

ConceptCLIP is trained on large-scale biomedical image-text-concept triplets from the MedConcept-23M dataset.

Training Procedure

  • Built on a CLIP-style biomedical vision-language framework
  • Uses IT-Align for global image-text alignment
  • Uses RC-Align for fine-grained region-concept alignment
  • Designed to learn both transferable visual features and interpretable concept-level grounding
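
IT-Align is described as global image-text alignment in a CLIP-style framework; the usual form of such an objective is a symmetric InfoNCE loss over a batch, where the matched image-text pairs sit on the diagonal of the similarity matrix. The numpy sketch below shows that generic loss, not the authors' exact implementation (a SigLIP-based model may use a pairwise sigmoid objective instead).

```python
import numpy as np

def symmetric_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """CLIP-style loss: pair (i, i) is the positive for both directions."""
    img = img_feats / np.linalg.norm(img_feats, axis=-1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch)

    def xent_diag(m):
        # cross-entropy of each row's softmax against the diagonal target
        m = m - m.max(axis=-1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
loss = symmetric_contrastive_loss(img, img.copy())  # perfectly aligned pairs
```

With perfectly aligned pairs the diagonal dominates and the loss is near zero; mismatched batches push it toward log(batch_size).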

Training Hyperparameters

  • Image encoder: SigLIP-ViT-400M-16
  • Text encoder: PubMedBERT
  • Training regime: Mixed precision training
  • Batch size: 12,288 without RC-Align, 6,144 with RC-Align
  • Learning rate: 5e-4 without RC-Align, 3e-4 with RC-Align

Evaluation

Testing Data & Metrics

ConceptCLIP has been evaluated on a broad set of medical image understanding tasks, including:

  • medical image diagnosis
  • cross-modal retrieval
  • medical visual question answering
  • medical report generation
  • pathology whole-slide image analysis
  • concept annotation and explainable AI

Representative evaluation metrics include AUC, Recall@k, accuracy-based metrics, report-generation metrics, and C-index.
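
For reference, Recall@k in the retrieval setting is the fraction of queries whose ground-truth match appears among the top-k ranked results. A small pure-Python sketch (the ranked lists here are hypothetical, standing in for similarity-sorted retrieval output):

```python
def recall_at_k(ranked_indices, ground_truth, k):
    """Fraction of queries whose true match is in the top-k retrieved items."""
    hits = sum(1 for ranks, gt in zip(ranked_indices, ground_truth)
               if gt in ranks[:k])
    return hits / len(ground_truth)

# 3 queries; each inner list is item indices sorted by similarity (best first)
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
truth = [2, 0, 1]                       # the correct item for each query
print(recall_at_k(ranked, truth, k=1))  # only query 0 hits at rank 1
```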

Responsible Release Notes

This release is meant to support open research on trustworthy and interpretable biomedical foundation models.

Although the repository license is MIT, users remain responsible for ensuring that any downstream use complies with local law, institutional policy, data governance requirements, and medical-device or clinical-deployment regulations where applicable.

Citation

BibTeX:

@article{nie2025conceptclip,
  title={An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Cai, Zhiyuan and Wang, Hongmei and Wang, Xi and Luo, Luyang and Wu, Mingxiang and Wu, Xian and Chan, Ronald Cheong Kin and Lau, Yuk Ming and Zheng, Yefeng and Rajpurkar, Pranav and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}

APA:

Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., Cai, Z., Wang, H., Wang, X., Luo, L., Wu, M., Wu, X., Chan, R. C. K., Lau, Y. M., Zheng, Y., Rajpurkar, P., & Chen, H. (2025). An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training. arXiv preprint arXiv:2501.15579.

Contact

  • Primary contact: Yuxiang Nie — ynieae@connect.ust.hk
  • For code issues and reproducibility questions, please use the GitHub repository's issue tracker.