BUILT WITH LLAMA!
# Model Card for Llama-3.2-11B-CXR

Llama-3.2-11B-CXR is my first attempt at fine-tuning a general-purpose open-weights vision-language model for chest X-ray structured report generation. The model has been fine-tuned to generate radiological reports in a structured JSON format:
```json
{
  "Support devices": "None.",
  "Cardiomediastinum": "Within normal limits.",
  "Lungs": "Lungs are clear.",
  "Pleura": "No pleural effusion or pneumothorax.",
  "Skeleton": "No acute findings.",
  "Upper abdomen": "No acute findings."
}
```
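Because the output follows a fixed schema, downstream code can parse and sanity-check it. A minimal sketch (field names are taken from the example above; the validation helper itself is illustrative and not part of the released code):

```python
import json

# The six fields the model is trained to emit (from the example report above).
REQUIRED_FIELDS = [
    "Support devices",
    "Cardiomediastinum",
    "Lungs",
    "Pleura",
    "Skeleton",
    "Upper abdomen",
]

def parse_report(raw: str) -> dict:
    """Parse a generated report and check that all required fields are present."""
    report = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if f not in report]
    if missing:
        raise ValueError(f"Report is missing fields: {missing}")
    return report

example = """{
  "Support devices": "None.",
  "Cardiomediastinum": "Within normal limits.",
  "Lungs": "Lungs are clear.",
  "Pleura": "No pleural effusion or pneumothorax.",
  "Skeleton": "No acute findings.",
  "Upper abdomen": "No acute findings."
}"""
report = parse_report(example)
print(report["Lungs"])  # -> Lungs are clear.
```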
## Model Details

### Model Description

These are adapters for meta-llama/Llama-3.2-11B-Vision-Instruct, obtained through supervised fine-tuning (SFT) with low-rank adapters (LoRA) on a custom subset of publicly available frontal chest X-rays from the romprr/CXR_BioXAi_Hackathon_2024 dataset.

- Developed, funded and shared by: Nakul Gupta
- Model type: Multi-modal Large Language Model
- Language(s) (NLP): SFT was done in English, although the base model supports additional languages.
- License: Llama 3.2 Community License (a custom, commercial license agreement).
- Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct
## Uses

This model is SOLELY intended for research and development purposes. It is by no means ready or meant for clinical use, nor has it been validated in a clinical setting.

### Out-of-Scope Use

This model has NOT been validated for clinical use or evaluated by any regulatory bodies, and it may hallucinate findings as well as miss them. It is intended for research and development use ONLY. The model's outputs are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All model outputs require independent verification and further investigation through established scientific research and development methodologies.
## Bias, Risks, and Limitations

Results and model outputs are heavily dependent on the specific prompt/instruction as well as on sampling parameters (temperature, top_p, min_p, etc.). The model has been optimized only for single-turn, single-image evaluation. The model may also suffer from data contamination/leakage, where it may have been exposed to evaluation data during pre-training or fine-tuning, which may lead to overestimation of its true capabilities. The model therefore requires validation on datasets specific to each individual's/institution's use case.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

base_model = "meta-llama/Llama-3.2-11B-Vision-Instruct"
adapter_id = "DeepRadiology/Llama-3.2-11B-CXR"

model = AutoModelForVision2Seq.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Load and activate the LoRA adapter (requires the peft package).
model.load_adapter(adapter_id)

processor = AutoProcessor.from_pretrained(base_model)

image = Image.open("cxr.jpeg")  # replace with your own example image

instruction = """You are an expert chest radiologist. Describe accurately what you see in this image. Use a \
structured report template with fields for: Support devices, Cardiomediastinum, Lungs, Pleura, Skeleton, and Upper \
abdomen. If there are no support devices, then report "None." for that field, if there are no pertinent \
Cardiomediastinal findings, report "Within normal limits." for that field. If there are no abnormal lung findings \
report "Lungs are clear." If there are no pertinent pleural findings, report "No pleural effusion or pneumothorax." \
For all other fields, if there are no pertinent findings, report "No acute findings." You must always generate a report \
with the required fields."""

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]}
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256, temperature=0.7, min_p=0.1)
print(processor.decode(output[0]))
```
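Decoding the full output sequence returns the chat template and prompt along with the reply. One hedged post-processing sketch (the regex approach and helper name are assumptions, not part of the released code) pulls the JSON report out of the decoded text:

```python
import json
import re

def extract_report(decoded: str) -> dict:
    """Find the JSON object in decoded model output and parse it.
    Greedy matching assumes the output contains a single JSON object."""
    match = re.search(r"\{.*\}", decoded, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

# Stand-in for a decoded Llama-3.2 generation, including special tokens.
decoded = (
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    '{"Lungs": "Lungs are clear.", "Pleura": "No pleural effusion or pneumothorax."}'
    "<|eot_id|>"
)
print(extract_report(decoded)["Lungs"])  # -> Lungs are clear.
```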
## Training Details

### Training Data

romprr/CXR_BioXAi_Hackathon_2024.

### Training Procedure

#### Preprocessing

The dataset was filtered using meta-llama/Llama-3.3-70B-Instruct to remove reports with references to prior studies (although this was not 100% successful). The remaining free-text reports were then converted into a structured report format, again using meta-llama/Llama-3.3-70B-Instruct. The final training set was approximately 33k X-rays.
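The released pipeline used Llama-3.3-70B-Instruct for this filtering. As a rough illustration of the idea only, a simple keyword screen for prior-study references might look like the following (the keyword list is an assumption and far less robust than the LLM-based filter actually used):

```python
# Crude stand-in for the LLM-based filter: flag reports that reference priors.
PRIOR_MARKERS = (
    "compared to prior",
    "comparison to prior",
    "previous study",
    "prior exam",
    "interval change",
    "unchanged from",
)

def references_prior(report: str) -> bool:
    """Return True if the report appears to reference a prior study."""
    text = report.lower()
    return any(marker in text for marker in PRIOR_MARKERS)

reports = [
    "Lungs are clear. No pleural effusion.",
    "Unchanged from prior exam; stable cardiomegaly.",
]
kept = [r for r in reports if not references_prior(r)]
print(len(kept))  # only the report without prior references is kept
```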
#### Training Hyperparameters

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=False,
    optim="adamw_torch_fused",
    learning_rate=2e-5,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "fc1", "fc2"],
    modules_to_save=["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
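With these settings, the effective batch size and approximate step count can be worked out directly (GPU count taken from the Training Time section, and the ~33k example count from Preprocessing):

```python
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 8           # gradient_accumulation_steps
num_gpus = 3             # 3x RTX 3090 (see Training Time)
dataset_size = 33_000    # approximate training set size (see Preprocessing)
epochs = 3               # num_train_epochs

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = dataset_size // effective_batch
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 96 343 1029
```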
#### Training Time

4d 20h 53s with 3x Nvidia RTX 3090s.
## Evaluation

### Testing Data

Evaluation was performed using the publicly available IU-Xray and MIMIC-CXR datasets, using the 'test' splits and frontal X-rays only, as defined by RexRank.

### Metrics

BLEU-2, BLEU-4, ROUGE-L, METEOR, and RadGraph F1 (simple, partial, and complete) metrics were assessed and compared against the baseline model as well as medgemma-4b-it. The same user prompt/instruction was used as during SFT. Llama-based models were run with temperature=0.1 and min_p=0.1; medgemma was run with greedy decoding (do_sample=False).
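The n-gram overlap metrics above can be illustrated with a minimal sentence-level BLEU-2 sketch. This is a simplified assumption (single reference, no smoothing); the actual evaluation presumably used standard metric implementations:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-2: geometric mean of clipped 1/2-gram precision
    times a brevity penalty. Simplified: single reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped counts
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

print(round(bleu2("lungs are clear", "lungs are clear"), 3))  # identical -> 1.0
```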
### Results

Llama-3.2-11B-CXR performs similarly to, or slightly better than, the medical-specific MedGemma-4b-it, besting it on the RadGraph (RG) metrics on both datasets.
| Model | Dataset | BLEU-2 | BLEU-4 | ROUGE-L | METEOR | RG-Simple | RG-Partial | RG-Complete |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-11B-Vision-Instruct | IU-Xray | 0.0474 | 0.0137 | 0.1787 | 0.2027 | 0.2065 | 0.1867 | 0.1095 |
| Llama-3.2-11B-CXR | IU-Xray | 0.0485 | 0.0099 | 0.2464 | 0.1132 | 0.2555 | 0.2438 | 0.1743 |
| MedGemma-4b-it | IU-Xray | 0.0627 | 0.0184 | 0.2024 | 0.2172 | 0.2327 | 0.2196 | 0.1831 |
| Llama-3.2-11B-Vision-Instruct | MIMIC-CXR | 0.0631 | 0.0159 | 0.1460 | 0.1574 | 0.1340 | 0.1201 | 0.0800 |
| Llama-3.2-11B-CXR | MIMIC-CXR | 0.0890 | 0.0242 | 0.1896 | 0.1568 | 0.1853 | 0.1677 | 0.1130 |
| MedGemma-4b-it | MIMIC-CXR | 0.0794 | 0.0218 | 0.1724 | 0.1887 | 0.1772 | 0.1597 | 0.1141 |
### Summary

Presenting Llama-3.2-11B-CXR, a multi-modal open-weights vision-language model (VLM) fine-tuned for chest X-ray report generation! The primary goal of this exercise was to demonstrate the potential for general-purpose VLMs to be re-purposed for medical imaging tasks on consumer-grade hardware with publicly available datasets.
## Model Card Contact

## Data Citations
MIMIC-CXR:
Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/C2JT1Q
Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). https://doi.org/10.1038/s41597-019-0322-0
IU-Xray:
Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016 Mar;23(2):304-10. doi: 10.1093/jamia/ocv080. Epub 2015 Jul 1. PMID: 26133894; PMCID: PMC5009925.