HIKARI-Vega-8B-SkinCaption-Fused
Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference
Named after Vega, the brightest star in Lyra, shining above all others.
Model Type: Merged Full Model
This is a fully merged model: the LoRA adapter weights have been merged directly into the base model weights.
No adapter loading needed. Load and run directly with transformers, vLLM, or SGLang.
Size: ~17 GB (4 safetensors shards)
Lightweight adapter version: E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA (~1.1 GB)
Overview
HIKARI-Vega is our best clinical caption generation model, producing detailed multi-sentence descriptions of skin lesion images, including morphological features, differential diagnosis, and recommended examinations.
The key insight is the Fused (Merged-Init) training strategy: before Stage 3 caption training begins, the Stage 2 disease classifier weights are permanently merged into the base model. Fresh LoRA adapters are then added for caption learning. This ensures disease knowledge cannot be overwritten: the model knows what disease it is describing before it learns how to describe it.
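The arithmetic behind Merged-Init can be sketched in a few lines. This is an illustrative toy (NumPy, small random matrices), not the actual training code: a LoRA adapter parameterizes a weight update as `(alpha / r) * B @ A`, merging folds that update into the base weight, and a fresh adapter's zero-initialized `B` means Stage 3 starts exactly from the merged weights.

```python
import numpy as np

# Toy illustration of the Merged-Init strategy (not the training code).
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16

W_base = rng.normal(size=(d, d))      # base model weight
A = rng.normal(size=(r, d)) * 0.01    # Stage 2 LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01    # Stage 2 LoRA up-projection

# Merge the Stage 2 adapter into the base weights permanently
# (conceptually what peft's merge_and_unload does).
W_merged = W_base + (alpha / r) * (B @ A)

# Fresh Stage 3 adapter: B is zero-initialized by convention, so the
# merged disease knowledge is the starting point and is never undone.
A3 = rng.normal(size=(r, d)) * 0.01
B3 = np.zeros((d, r))
W_stage3_init = W_merged + (alpha / r) * (B3 @ A3)

assert np.allclose(W_stage3_init, W_merged)  # Stage 3 starts at the merged weights
```

By contrast, checkpoint-init keeps the Stage 2 knowledge only inside the adapter matrices, where further caption training can overwrite it.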
| Property | Value |
|---|---|
| Task | Clinical skin lesion caption generation (Stage 3 of HIKARI pipeline) |
| Base model | Qwen/Qwen3-VL-8B-Thinking |
| Init strategy | Merged-Init: Stage 2 merged → fresh LoRA for Stage 3 |
| BLEU-4 | 29.33 |
| ROUGE-1 | 53.55 |
| BERTScore-F | 91.12 (roberta-large, layer 17) |
| Disease mention rate | 62.63% (noguide) / 78.79% (guided) |
| Model type | Merged full model |
| Hardware tested | RTX 5070 Ti (16 GB VRAM) |
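The disease mention rate above is the fraction of generated captions that explicitly contain the gold disease name. A minimal sketch of how such a metric could be computed (the actual evaluation script is not part of this card; the captions below are illustrative):

```python
def disease_mention_rate(captions: list[str], gold_diseases: list[str]) -> float:
    """Percent of captions that mention their gold disease name (case-insensitive)."""
    hits = sum(
        disease.lower() in caption.lower()
        for caption, disease in zip(captions, gold_diseases)
    )
    return 100.0 * hits / len(captions)

captions = [
    "Features are consistent with plaque-type psoriasis.",
    "Findings suggest atopic dermatitis with lichenification.",
    "A well-demarcated erythematous plaque of uncertain etiology.",
]
gold = ["psoriasis", "eczema", "melanoma"]
print(f"{disease_mention_rate(captions, gold):.2f}%")  # 33.33%
```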
Why Fused (Merged-Init) Works: Ablation Results
| Experiment | Init Strategy | STS | BLEU-4 | ROUGE-1 | BERTScore-F |
|---|---|---|---|---|---|
| Way 1 (HIKARI-Rigel) | LoRA checkpoint | ❌ | 9.82 | 38.90 | 88.12 |
| Way 2 (HIKARI-Vega, this model) | Merged weights | ❌ | 29.33 | 53.55 | 91.12 |
| Way 1 + STS | LoRA checkpoint | ✅ | 0.00 | 5.03 | – |
| Way 2 + STS (HIKARI-Antares) | Merged weights | ✅ | 0.61 | 15.68 | – |
Checkpoint-init (Rigel) is 3× worse: the caption LoRA training overwrites disease knowledge that was only stored in LoRA adapters. Merged-init (Vega) freezes disease knowledge permanently in the base weights, where it cannot be overwritten.
Example Output
Input: [skin lesion image showing erythematous plaques with silvery scales]
Output:
"The image shows well-defined erythematous plaques covered by thick silvery-white scales
located on the extensor surface. The lesion demonstrates the characteristic features of
psoriasis, including Auspitz sign potential and sharp borders. The distribution and
morphology are consistent with plaque-type psoriasis. Recommended examinations include
dermoscopy to evaluate the vascular pattern, skin biopsy for histopathological confirmation
(epidermal thickening, parakeratosis, Munro microabscesses), and PASI scoring for disease
severity assessment before initiating treatment."
Usage
Stage 3 in the Full HIKARI Pipeline
Image
  │
  ▼
[Stage 1] HIKARI-Subaru-8B-SkinGroup ───► group label
  │
  ▼
[Stage 2] HIKARI-Sirius-8B-SkinDx-RAG ───► disease label
  │
  ▼
[Stage 3] HIKARI-Vega-8B-SkinCaption-Fused ───► clinical caption  ← YOU ARE HERE
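The cascade above can be sketched as three composed calls. The stage functions here are hypothetical stubs standing in for the actual model invocations (Subaru, Sirius, Vega); the point is only how each stage's output feeds the next:

```python
# Hypothetical stubs for the three HIKARI stages; in practice each would
# run the corresponding checkpoint on the image.
def stage1_group(image) -> str:
    return "papulosquamous"        # illustrative group label

def stage2_disease(image, group: str) -> str:
    return "psoriasis"             # illustrative disease label

def stage3_caption(image, disease: str) -> str:
    return f"Stage 2 diagnosis: {disease}. <generated clinical caption>"

def hikari_pipeline(image) -> str:
    group = stage1_group(image)              # Stage 1: 4-class group
    disease = stage2_disease(image, group)   # Stage 2: 10-class diagnosis
    return stage3_caption(image, disease)    # Stage 3: guided caption

print(hikari_pipeline(None))
```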
Mode A – Standard (noguide): best BLEU-4 / BERTScore
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
image = Image.open("skin_lesion.jpg").convert("RGB")
PROMPT_NOGUIDE = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_NOGUIDE},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; passing temperature alongside do_sample=False is redundant
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
caption = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(caption)
Mode B – Guided: highest disease-name correctness (78.79%)
Pass the Stage 2 disease label into the caption prompt:
PROMPT_GUIDED = (
    "Stage 2 diagnosis: {disease}.\n\n"
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
disease = "psoriasis"  # from HIKARI-Sirius (Stage 2)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_GUIDED.format(disease=disease)},
]}]
# same generation code as above
| Mode | BLEU-4 | BERTScore-F | Disease Correctness |
|---|---|---|---|
| Standard (noguide) | 29.33 | 91.12 | 62.63% |
| Guided | 13.11 | 88.57 | 78.79% |
Use noguide for best overall quality. Use guided when the disease name must appear explicitly in the caption.
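A convenient pattern is a single prompt builder that switches between the two modes depending on whether a Stage 2 label is available. This is a small sketch using the prompt strings shown above:

```python
from typing import Optional

PROMPT_BODY = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

def build_prompt(disease: Optional[str] = None) -> str:
    """Guided mode when a Stage 2 disease label is given, noguide otherwise."""
    if disease:
        return f"Stage 2 diagnosis: {disease}.\n\n{PROMPT_BODY}"
    return PROMPT_BODY

assert build_prompt().startswith("Describe")            # noguide mode
assert build_prompt("psoriasis").startswith("Stage 2")  # guided mode
```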
Production: vLLM BnB-4bit
Throughput: 0.91 img/s at batch=4 (256-token output)
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
llm = LLM(
    model=model_id,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.88,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
sp = SamplingParams(max_tokens=256, temperature=0.0)
def generate_caption(image: Image.Image, disease: str | None = None) -> str:
    if disease:
        prompt = f"Stage 2 diagnosis: {disease}.\n\nDescribe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    else:
        prompt = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    n = max(text.count("<|vision_start|>"), 1)  # one image entry per vision placeholder
    out = llm.generate({"prompt": text, "multi_modal_data": {"image": [image] * n}}, sp)
    return out[0].outputs[0].text.strip()
img = Image.open("skin_lesion.jpg").convert("RGB")
print(generate_caption(img))
# or: print(generate_caption(img, disease="psoriasis"))
Production: SGLang FP8
Throughput: 1.71 img/s at batch=4 (256-token output)
import sglang as sgl
from transformers import AutoProcessor
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
engine = sgl.Engine(
    model_path=model_id,
    dtype="bfloat16",
    quantization="fp8",
    context_length=2048,
    mem_fraction_static=0.88,
    trust_remote_code=True,
    disable_cuda_graph=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
def generate_caption_sglang(image: Image.Image) -> str:
    PROMPT = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    out = engine.generate(
        prompt=text, image_data=image,
        sampling_params={"max_new_tokens": 256, "temperature": 0.0},
    )
    return (out["text"] if isinstance(out, dict) else out[0]["text"]).strip()
# engine.shutdown()
Speed Benchmark (RTX 5070 Ti, 16 GB VRAM; Stage 3, 256-token output)
| Engine | Batch 1 | Batch 4 | vs Unsloth bs=1 |
|---|---|---|---|
| Unsloth 4-bit | 6,699 ms/img | 3,003 ms/img | baseline |
| vLLM BnB-4bit | 2,957 ms/img | 1,094 ms/img | 6.1× faster |
| SGLang FP8 | 1,695 ms/img | 584 ms/img | 11.5× faster |
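The per-image latencies in this table convert directly into the throughput figures quoted in the sections above (img/s = 1000 / ms-per-image); a quick sanity check:

```python
# Batch-4 per-image latencies (ms) from the benchmark table above.
latency_ms = {"Unsloth 4-bit": 3003, "vLLM BnB-4bit": 1094, "SGLang FP8": 584}

for engine, ms in latency_ms.items():
    print(f"{engine}: {1000 / ms:.2f} img/s")
# vLLM BnB-4bit -> 0.91 img/s and SGLang FP8 -> 1.71 img/s,
# matching the throughput numbers quoted in the production sections.
```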
LoRA Adapter Version
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch
# Load the MERGED Stage 2 model as base (disease knowledge in base weights)
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "E27085921/HIKARI-Sirius-8B-SkinDx-RAG",  # merged Stage 2 weights
    torch_dtype=torch.bfloat16, device_map="auto"
)
# Apply Stage 3 caption LoRA on top
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA")
→ E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA
HIKARI Model Family
| Model | Task | Metric | Type |
|---|---|---|---|
| HIKARI-Subaru-8B-SkinGroup | 4-class group classifier (Stage 1) | 88.68% | Merged |
| → HIKARI-Sirius-8B-SkinDx-RAG | 10-class disease dx + RAG (Stage 2) | 85.86% | Merged + LoRA |
| → HIKARI-Vega-8B-SkinCaption-Fused (this model) | Clinical caption, merged init (Stage 3) | BLEU-4: 29.33 | Merged + LoRA |
Citation
@misc{hikari2026,
  title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
                 with Cascaded Vision-Language Models},
  author      = {Watin Promfiy and Pawitra Boonprasart},
  year        = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}
Made with ❤️ at King Mongkut's Institute of Technology Ladkrabang (KMITL)
Department of Information Technology