HIKARI-Vega-8B-SkinCaption-Fused
Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference
Named after Vega, the brightest star in Lyra, shining above all others.
Model Type: Merged Full Model
This is a fully merged model: the LoRA adapter weights have been merged directly into the base model weights.
No adapter loading needed. Load and run directly with transformers, vLLM, or SGLang.
Size: ~17 GB (4 safetensors shards)
Lightweight adapter version: E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA (~1.1 GB)
Overview
HIKARI-Vega is our best clinical caption generation model, producing detailed multi-sentence descriptions of skin lesion images, including morphological features, differential diagnosis, and recommended examinations.
The key insight is the Fused (Merged-Init) training strategy: before Stage 3 caption training begins, the Stage 2 disease classifier weights are permanently merged into the base model. Fresh LoRA adapters are then added for caption learning. This ensures disease knowledge cannot be overwritten: the model knows what disease it is describing before it learns how to describe it.
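The arithmetic behind Merged-Init can be sketched in a few lines. This is an illustrative toy (NumPy, small random matrices), not the actual training code: a LoRA adapter parameterizes a weight update as `(alpha / r) * B @ A`, merging folds that update into the base weight, and a fresh adapter's zero-initialized `B` means Stage 3 starts exactly from the merged weights.

```python
import numpy as np

# Toy illustration of the Merged-Init strategy (not the training code).
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16

W_base = rng.normal(size=(d, d))      # base model weight
A = rng.normal(size=(r, d)) * 0.01    # Stage 2 LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01    # Stage 2 LoRA up-projection

# Merge the Stage 2 adapter into the base weights permanently
# (conceptually what peft's merge_and_unload does).
W_merged = W_base + (alpha / r) * (B @ A)

# Fresh Stage 3 adapter: B is zero-initialized by convention, so the
# merged disease knowledge is the starting point and is never undone.
A3 = rng.normal(size=(r, d)) * 0.01
B3 = np.zeros((d, r))
W_stage3_init = W_merged + (alpha / r) * (B3 @ A3)

assert np.allclose(W_stage3_init, W_merged)  # Stage 3 starts at the merged weights
```

By contrast, checkpoint-init keeps the Stage 2 knowledge only inside the adapter matrices, where further caption training can overwrite it.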
| Property | Value |
|---|---|
| Task | Clinical skin lesion caption generation (Stage 3 of HIKARI pipeline) |
| Base model | Qwen/Qwen3-VL-8B-Thinking |
| Init strategy | Merged-Init: Stage 2 merged → fresh LoRA for Stage 3 |
| BLEU-4 | 29.33 |
| ROUGE-1 | 53.55 |
| BERTScore-F | 91.12 (roberta-large, layer 17) |
| Disease mention rate | 62.63% (noguide) / 78.79% (guided) |
| Model type | Merged full model |
| Hardware tested | RTX 5070 Ti (16 GB VRAM) |
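The disease mention rate above is the fraction of generated captions that explicitly contain the gold disease name. A minimal sketch of how such a metric could be computed (the actual evaluation script is not part of this card; the captions below are illustrative):

```python
def disease_mention_rate(captions: list[str], gold_diseases: list[str]) -> float:
    """Percent of captions that mention their gold disease name (case-insensitive)."""
    hits = sum(
        disease.lower() in caption.lower()
        for caption, disease in zip(captions, gold_diseases)
    )
    return 100.0 * hits / len(captions)

captions = [
    "Features are consistent with plaque-type psoriasis.",
    "Findings suggest atopic dermatitis with lichenification.",
    "A well-demarcated erythematous plaque of uncertain etiology.",
]
gold = ["psoriasis", "eczema", "melanoma"]
print(f"{disease_mention_rate(captions, gold):.2f}%")  # 33.33%
```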
Why Fused (Merged-Init) Works: Ablation Results
| Experiment | Init Strategy | STS | BLEU-4 | ROUGE-1 | BERTScore-F |
|---|---|---|---|---|---|
| Way 1 (HIKARI-Rigel) | LoRA checkpoint | ❌ | 9.82 | 38.90 | 88.12 |
| Way 2 (HIKARI-Vega, this model) | Merged weights | ❌ | 29.33 | 53.55 | 91.12 |
| Way 1 + STS | LoRA checkpoint | ✅ | 0.00 | 5.03 | – |
| Way 2 + STS (HIKARI-Antares) | Merged weights | ✅ | 0.61 | 15.68 | – |
Checkpoint-init (Rigel) is 3× worse: the caption LoRA training overwrites disease knowledge that was only stored in LoRA adapters. Merged-init (Vega) freezes disease knowledge permanently in the base weights, where it cannot be overwritten.
Example Output
Input: [skin lesion image showing erythematous plaques with silvery scales]
Output:
"The image shows well-defined erythematous plaques covered by thick silvery-white scales
located on the extensor surface. The lesion demonstrates the characteristic features of
psoriasis, including Auspitz sign potential and sharp borders. The distribution and
morphology are consistent with plaque-type psoriasis. Recommended examinations include
dermoscopy to evaluate the vascular pattern, skin biopsy for histopathological confirmation
(epidermal thickening, parakeratosis, Munro microabscesses), and PASI scoring for disease
severity assessment before initiating treatment."
Usage
Stage 3 in the Full HIKARI Pipeline
Image
  │
  ▼
[Stage 1] HIKARI-Subaru-8B-SkinGroup ───► group label
  │
  ▼
[Stage 2] HIKARI-Sirius-8B-SkinDx-RAG ───► disease label
  │
  ▼
[Stage 3] HIKARI-Vega-8B-SkinCaption-Fused ───► clinical caption  ← YOU ARE HERE
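The cascade above can be sketched as three composed calls. The stage functions here are hypothetical stubs standing in for the actual model invocations (Subaru, Sirius, Vega); the point is only how each stage's output feeds the next:

```python
# Hypothetical stubs for the three HIKARI stages; in practice each would
# run the corresponding checkpoint on the image.
def stage1_group(image) -> str:
    return "papulosquamous"        # illustrative group label

def stage2_disease(image, group: str) -> str:
    return "psoriasis"             # illustrative disease label

def stage3_caption(image, disease: str) -> str:
    return f"Stage 2 diagnosis: {disease}. <generated clinical caption>"

def hikari_pipeline(image) -> str:
    group = stage1_group(image)              # Stage 1: 4-class group
    disease = stage2_disease(image, group)   # Stage 2: 10-class diagnosis
    return stage3_caption(image, disease)    # Stage 3: guided caption

print(hikari_pipeline(None))
```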
Mode A – Standard (noguide): best BLEU-4 / BERTScore
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
image = Image.open("skin_lesion.jpg").convert("RGB")
PROMPT_NOGUIDE = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_NOGUIDE},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; passing temperature alongside do_sample=False is redundant
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
caption = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(caption)
Mode B – Guided: highest disease-name correctness (78.79%)
Pass the Stage 2 disease label into the caption prompt:
PROMPT_GUIDED = (
    "Stage 2 diagnosis: {disease}.\n\n"
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
disease = "psoriasis"  # from HIKARI-Sirius (Stage 2)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_GUIDED.format(disease=disease)},
]}]
# same generation code as above
| Mode | BLEU-4 | BERTScore-F | Disease Correctness |
|---|---|---|---|
| Standard (noguide) | 29.33 | 91.12 | 62.63% |
| Guided | 13.11 | 88.57 | 78.79% |
Use noguide for best overall quality. Use guided when the disease name must appear explicitly in the caption.
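A convenient pattern is a single prompt builder that switches between the two modes depending on whether a Stage 2 label is available. This is a small sketch using the prompt strings shown above:

```python
from typing import Optional

PROMPT_BODY = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

def build_prompt(disease: Optional[str] = None) -> str:
    """Guided mode when a Stage 2 disease label is given, noguide otherwise."""
    if disease:
        return f"Stage 2 diagnosis: {disease}.\n\n{PROMPT_BODY}"
    return PROMPT_BODY

assert build_prompt().startswith("Describe")            # noguide mode
assert build_prompt("psoriasis").startswith("Stage 2")  # guided mode
```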
Production: vLLM BnB-4bit
Throughput: 0.91 img/s at batch=4 (256-token output)
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
llm = LLM(
    model=model_id,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.88,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
sp = SamplingParams(max_tokens=256, temperature=0.0)
def generate_caption(image: Image.Image, disease: str | None = None) -> str:
    if disease:
        prompt = f"Stage 2 diagnosis: {disease}.\n\nDescribe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    else:
        prompt = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    n = max(text.count("<|vision_start|>"), 1)  # one image entry per vision placeholder
    out = llm.generate({"prompt": text, "multi_modal_data": {"image": [image] * n}}, sp)
    return out[0].outputs[0].text.strip()
img = Image.open("skin_lesion.jpg").convert("RGB")
print(generate_caption(img))
# or: print(generate_caption(img, disease="psoriasis"))
Production: SGLang FP8
Throughput: 1.71 img/s at batch=4 (256-token output)
import sglang as sgl
from transformers import AutoProcessor
from PIL import Image
model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"
engine = sgl.Engine(
    model_path=model_id,
    dtype="bfloat16",
    quantization="fp8",
    context_length=2048,
    mem_fraction_static=0.88,
    trust_remote_code=True,
    disable_cuda_graph=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
def generate_caption_sglang(image: Image.Image) -> str:
    PROMPT = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    out = engine.generate(
        prompt=text, image_data=image,
        sampling_params={"max_new_tokens": 256, "temperature": 0.0},
    )
    return (out["text"] if isinstance(out, dict) else out[0]["text"]).strip()
# engine.shutdown()
Speed Benchmark (RTX 5070 Ti, 16 GB VRAM; Stage 3, 256-token output)
| Engine | Batch 1 | Batch 4 | vs Unsloth bs=1 |
|---|---|---|---|
| Unsloth 4-bit | 6,699 ms/img | 3,003 ms/img | baseline |
| vLLM BnB-4bit | 2,957 ms/img | 1,094 ms/img | 6.1× faster |
| SGLang FP8 | 1,695 ms/img | 584 ms/img | 11.5× faster |
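The per-image latencies in this table convert directly into the throughput figures quoted in the sections above (img/s = 1000 / ms-per-image); a quick sanity check:

```python
# Batch-4 per-image latencies (ms) from the benchmark table above.
latency_ms = {"Unsloth 4-bit": 3003, "vLLM BnB-4bit": 1094, "SGLang FP8": 584}

for engine, ms in latency_ms.items():
    print(f"{engine}: {1000 / ms:.2f} img/s")
# vLLM BnB-4bit -> 0.91 img/s and SGLang FP8 -> 1.71 img/s,
# matching the throughput numbers quoted in the production sections.
```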
LoRA Adapter Version
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch
# Load the MERGED Stage 2 model as base (disease knowledge in base weights)
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "E27085921/HIKARI-Sirius-8B-SkinDx-RAG",  # merged Stage 2 weights
    torch_dtype=torch.bfloat16, device_map="auto"
)
# Apply Stage 3 caption LoRA on top
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA")
→ E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA
HIKARI Model Family
| Model | Task | Metric | Type |
|---|---|---|---|
| HIKARI-Subaru-8B-SkinGroup | 4-class group classifier (Stage 1) | 88.68% | Merged |
| → HIKARI-Sirius-8B-SkinDx-RAG | 10-class disease dx + RAG (Stage 2) | 85.86% | Merged + LoRA |
| → HIKARI-Vega-8B-SkinCaption-Fused (this model) | Clinical caption, merged init (Stage 3) | BLEU-4: 29.33 | Merged + LoRA |
Citation
@misc{hikari2026,
  title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
                 with Cascaded Vision-Language Models},
  author      = {Watin Promfiy and Pawitra Boonprasart},
  year        = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}
Made with ❤️ at King Mongkut's Institute of Technology Ladkrabang (KMITL)
Department of Information Technology