| # ViX-Ray — Fine-tuned Medical Vision-Language Models |
|
|
| Fine-tuned weights for Vietnamese chest X-ray report generation across 3 clinical tasks and 6 model architectures. |
|
|
| > **Best overall performance**: Qwen2-VL-7B across all 3 tasks. |
|
|
| --- |
|
|
| ## Tasks |
|
|
| | # | Task | Description | |
| |---|------|-------------| |
| | 1 | `finding` | Generate radiology **findings** from a chest X-ray image | |
| | 2 | `impression` | Generate the clinical **impression** (final diagnosis) from a chest X-ray image | |
| | 3 | `multi` | **Multi-turn dialogue** — findings → impression via conversation history | |
|
|
| --- |
|
|
| ## Models |
|
|
| | Key | Base model | Size | |
| |-----|-----------|------| |
| | `Intern` | InternVL2.5-1B | 1B | |
| | `Vintern` | Vintern-1B-v3.5 | 1B | |
| | `Qwen2B` | Qwen2-VL-2B-Instruct | 2B | |
| | `Qwen7B` | Qwen2-VL-7B-Instruct ⭐ | 7B | |
| | `MiniCPM` | MiniCPM-V-2_6 | 8B | |
| | `LaVy` | LaVy-Instruct | 7B | |
| |
| --- |
| |
| ## Quick Start |
| |
| ### 1. Install |
| |
| ```bash |
| pip install huggingface_hub transformers torch torchvision pillow |
| ``` |
| |
| For Qwen models, also install: |
| ```bash |
| pip install qwen-vl-utils |
| ``` |
| |
| For Intern / Vintern models, also install: |
| ```bash |
| pip install decord |
| ``` |
| |
For MiniCPM, pin versions (the `transformers==4.40.0` pin predates Qwen2-VL support and may conflict with the other families, so consider a separate virtual environment):
| ```bash |
| pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99 decord |
| ``` |
| |
| --- |
| |
| ### 2. Download a model zip |
| |
| ```bash |
| # task : finding | impression | multi |
# model : Intern | Vintern | Qwen2B | Qwen7B | MiniCPM | LaVy

huggingface-cli download presencesw/ViX-Ray <task>/<Model>.zip \
| --repo-type model --local-dir ./ |
| ``` |
| |
Example — download the best model for the `finding` task:
| ```bash |
| huggingface-cli download presencesw/ViX-Ray finding/Qwen7B.zip \ |
| --repo-type model --local-dir ./ |
| ``` |
|
|
| Download all models at once: |
| ```bash |
| huggingface-cli download presencesw/ViX-Ray \ |
| --repo-type model --local-dir ./vix_ray_models |
| ``` |
|
|
| --- |
|
|
| ### 3. Unzip |
|
|
| ```bash |
| unzip <task>/<Model>.zip -d ./models/<task>/ |
| # result: ./models/<task>/<Model>/ |
| ``` |
|
|
| Or in Python: |
| ```python |
| import zipfile |
| with zipfile.ZipFile("<task>/<Model>.zip") as zf: |
| zf.extractall("./models/<task>/") |
| ``` |
|
|
| --- |
|
|
| ### 4. Load & infer |
|
|
Set `model_path = "./models/<task>/<Model>"`, then use the snippet for your model family.
|
|
| #### Qwen2-VL (Qwen2B / Qwen7B) |
|
|
| ```python |
| from transformers import Qwen2VLForConditionalGeneration, AutoProcessor |
| from qwen_vl_utils import process_vision_info |
| import torch |
| |
| model_path = "./models/<task>/<Model>" |
| |
| model = Qwen2VLForConditionalGeneration.from_pretrained( |
| model_path, torch_dtype="auto", device_map="auto" |
| ) |
| processor = AutoProcessor.from_pretrained(model_path) |
| |
| messages = [ |
| { |
| "role": "user", |
| "content": [ |
| {"type": "image", "image": "your_image.jpg"}, |
| {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."}, |
| ], |
| } |
| ] |
| |
| text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| image_inputs, video_inputs = process_vision_info(messages) |
| inputs = processor(text=[text], images=image_inputs, videos=video_inputs, |
| padding=True, return_tensors="pt").to("cuda") |
| |
| generated_ids = model.generate(**inputs, max_new_tokens=512) |
| generated_ids_trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)] |
| print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]) |
| ``` |
|
|
| #### InternVL / Vintern (Intern / Vintern) |
|
|
| ```python |
| import torch |
| import torchvision.transforms as T |
| from PIL import Image |
| from torchvision.transforms.functional import InterpolationMode |
| from transformers import AutoModel, AutoTokenizer |
| |
| model_path = "./models/<task>/<Model>" |
| |
| model = AutoModel.from_pretrained( |
| model_path, torch_dtype=torch.bfloat16, |
| low_cpu_mem_usage=True, use_flash_attn=True, trust_remote_code=True |
| ).eval().cuda() |
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False) |
| |
MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)  # ImageNet normalization statistics
| transform = T.Compose([ |
| T.Lambda(lambda img: img.convert("RGB")), |
| T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC), |
| T.ToTensor(), |
| T.Normalize(mean=MEAN, std=STD), |
| ]) |
| |
| pixel_values = transform(Image.open("your_image.jpg")).unsqueeze(0).to(torch.bfloat16).cuda() |
| response = model.chat(tokenizer, pixel_values, "<image>\nMô tả hình ảnh X-quang ngực này.", |
| dict(max_new_tokens=512, do_sample=True)) |
| print(response) |
| ``` |
|
|
| #### MiniCPM-V |
|
|
| ```python |
| import torch |
| from PIL import Image |
| from transformers import AutoModel, AutoTokenizer |
| |
| model_path = "./models/<task>/<Model>" |
| |
| model = AutoModel.from_pretrained( |
| model_path, trust_remote_code=True, |
| attn_implementation="sdpa", torch_dtype=torch.bfloat16 |
| ).eval().cuda() |
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
| |
| image = Image.open("your_image.jpg").convert("RGB") |
| msgs = [{"role": "user", "content": [image, "Mô tả hình ảnh X-quang ngực này."]}] |
| print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer)) |
| ``` |
|
|
| #### LaVy |
|
|
| ```python |
| import torch |
| from PIL import Image |
| from transformers import AutoModelForCausalLM, AutoProcessor |
| |
| model_path = "./models/<task>/<Model>" |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True |
| ) |
| processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) |
| |
| inputs = processor( |
| images=Image.open("your_image.jpg").convert("RGB"), |
| text="Mô tả hình ảnh X-quang ngực này.", |
| return_tensors="pt" |
| ).to("cuda") |
| |
| outputs = model.generate(**inputs, max_new_tokens=512) |
| print(processor.batch_decode(outputs, skip_special_tokens=True)[0]) |
| ``` |
|
|
| --- |
|
|
| ## Multi-turn (Task 3) |
|
|
| For the `multi` task, pass conversation history between turns: |
|
|
| ```python |
| # Turn 1 — findings |
| response1 = ... # run inference as above |
| |
| # Turn 2 — impression (append assistant turn then ask) |
| messages.append({"role": "assistant", "content": [{"type": "text", "text": response1}]}) |
| messages.append({"role": "user", "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]}) |
| response2 = ... # run inference again with updated messages |
| ``` |
|
|
| See `readme/<task>_<Model>.md` for the full per-model multi-turn example. |
|
|
| --- |
|
|
| ## Full Model Table |
|
|
| | Task | Model | Base | Zip path | |
| |------|-------|------|----------| |
| | finding | Intern | InternVL2.5-1B | `finding/Intern.zip` | |
| | finding | Vintern | Vintern-1B-v3.5 | `finding/Vintern.zip` | |
| | finding | Qwen2B | Qwen2-VL-2B | `finding/Qwen2B.zip` | |
| | finding | Qwen7B ⭐ | Qwen2-VL-7B | `finding/Qwen7B.zip` | |
| | finding | MiniCPM | MiniCPM-V-2_6 | `finding/MiniCPM.zip` | |
| | finding | LaVy | LaVy-Instruct | `finding/LaVy.zip` | |
| | impression | Intern | InternVL2.5-1B | `impression/Intern.zip` | |
| | impression | Vintern | Vintern-1B-v3.5 | `impression/Vintern.zip` | |
| | impression | Qwen2B | Qwen2-VL-2B | `impression/Qwen2B.zip` | |
| | impression | Qwen7B ⭐ | Qwen2-VL-7B | `impression/Qwen7B.zip` | |
| | impression | MiniCPM | MiniCPM-V-2_6 | `impression/MiniCPM.zip` | |
| | impression | LaVy | LaVy-Instruct | `impression/LaVy.zip` | |
| | multi | Intern | InternVL2.5-1B | `multi/Intern.zip` | |
| | multi | Vintern | Vintern-1B-v3.5 | `multi/Vintern.zip` | |
| | multi | Qwen2B | Qwen2-VL-2B | `multi/Qwen2B.zip` | |
| | multi | Qwen7B ⭐ | Qwen2-VL-7B | `multi/Qwen7B.zip` | |
| | multi | MiniCPM | MiniCPM-V-2_6 | `multi/MiniCPM.zip` | |
| | multi | LaVy | LaVy-Instruct | `multi/LaVy.zip` | |
| |
| Per-model details (installation, full inference code) are in `readme/<task>_<Model>.md`. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use these models or the ViX-Ray dataset in your research, please cite: |
|
|
| ```bibtex |
| @article{nguyen2026vix, |
| title={ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models}, |
| author={Nguyen, Duy Vu Minh and Truong, Chinh Thanh and Tran, Phuc Hoang and Le, Hung Tuan and Dat, Nguyen Van-Thanh and Pham, Trung Hieu and Van Nguyen, Kiet}, |
| journal={arXiv preprint arXiv:2603.15513}, |
| year={2026} |
| } |
| ``` |
|
|