HIKARI – Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference

HIKARI-Vega-8B-SkinCaption-Fused ⭐

Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference
Named after Vega – the brightest star in Lyra, shining above all others


📦 Model Type: Merged Full Model

This is a fully merged model – the LoRA adapter weights have been merged directly into the base model weights.

✅ No adapter loading needed. Load and run directly with transformers, vLLM, or SGLang.

💾 Size: ~17 GB (4 safetensors shards)

🔌 Lightweight adapter version: E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA (~1.1 GB)


Overview

HIKARI-Vega is our best clinical caption generation model, producing detailed multi-sentence descriptions of skin lesion images – including morphological features, differential diagnosis, and recommended examinations.

The key insight is the Fused (Merged-Init) training strategy: before Stage 3 caption training begins, the Stage 2 disease classifier weights are permanently merged into the base model. Fresh LoRA adapters are then added for caption learning. This ensures disease knowledge cannot be overwritten – the model knows what disease it is describing before it learns how to describe it.
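The merge itself is plain weight arithmetic: W_merged = W + (alpha/r) · B·A. A toy PyTorch sketch of the idea (illustrative shapes and ranks, not the actual training code):

```python
import torch

torch.manual_seed(0)
d, r, alpha = 8, 2, 16

W = torch.randn(d, d)        # base weight (frozen during LoRA training)
A = torch.randn(r, d) * 0.1  # Stage 2 LoRA down-projection
B = torch.randn(d, r) * 0.1  # Stage 2 LoRA up-projection

# Merged-Init: fold the Stage 2 adapter into the base weights for good.
W_merged = W + (alpha / r) * (B @ A)

# Stage 3 then starts from W_merged with a FRESH adapter. With the standard
# LoRA init (up-projection B2 = 0), the first forward pass is exactly the
# merged model:
A2 = torch.randn(r, d) * 0.1
B2 = torch.zeros(d, r)
x = torch.randn(d)
y = (W_merged + (alpha / r) * (B2 @ A2)) @ x
assert torch.allclose(y, W_merged @ x)

# Stage 3 gradients touch only A2/B2; W_merged receives none, so the
# Stage 2 disease knowledge stays baked into the base weights.
```

In checkpoint-init, by contrast, Stage 3 keeps training the same adapter matrices that hold the Stage 2 knowledge, so that knowledge can be overwritten.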

| Property | Value |
|---|---|
| Task | Clinical skin lesion caption generation (Stage 3 of HIKARI pipeline) |
| Base model | Qwen/Qwen3-VL-8B-Thinking |
| Init strategy | Merged-Init: Stage 2 merged → fresh LoRA for Stage 3 |
| BLEU-4 | 29.33 |
| ROUGE-1 | 53.55 |
| BERTScore-F | 91.12 (roberta-large, layer 17) |
| Disease mention rate | 62.63% (noguide) / 78.79% (guided) |
| Model type | Merged full model |
| Hardware tested | RTX 5070 Ti (16 GB VRAM) |

📊 Why Fused (Merged-Init) Works – Ablation Results

| Experiment | Init Strategy | STS | BLEU-4 | ROUGE-1 | BERTScore-F |
|---|---|---|---|---|---|
| Way 1 – HIKARI-Rigel | LoRA checkpoint | ✗ | 9.82 | 38.90 | 88.12 |
| Way 2 – HIKARI-Vega (this model) | Merged weights | ✗ | 29.33 | 53.55 | 91.12 |
| Way 1 + STS | LoRA checkpoint | ✓ | 0.00 | 5.03 | – |
| Way 2 + STS – HIKARI-Antares | Merged weights | ✓ | 0.61 | 15.68 | – |

Checkpoint-init (Rigel) is 3× worse: the caption LoRA training overwrites disease knowledge that was only stored in LoRA adapters. Merged-init (Vega) freezes disease knowledge permanently in the base weights – it cannot be overwritten.


🗒️ Example Output

Input: [skin lesion image showing erythematous plaques with silvery scales]

Output:
"The image shows well-defined erythematous plaques covered by thick silvery-white scales
located on the extensor surface. The lesion demonstrates the characteristic features of
psoriasis, including Auspitz sign potential and sharp borders. The distribution and
morphology are consistent with plaque-type psoriasis. Recommended examinations include
dermoscopy to evaluate the vascular pattern, skin biopsy for histopathological confirmation
(epidermal thickening, parakeratosis, Munro microabscesses), and PASI scoring for disease
severity assessment before initiating treatment."

🔧 Usage

Stage 3 in the Full HIKARI Pipeline

📷 Image
   │
   ▼
[Stage 1] HIKARI-Subaru-8B-SkinGroup ──► group label
   │
   ▼
[Stage 2] HIKARI-Sirius-8B-SkinDx-RAG ──► disease label
   │
   ▼
[Stage 3] HIKARI-Vega-8B-SkinCaption-Fused ──► clinical caption  ← YOU ARE HERE
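In code, the cascade is simple function composition. A hypothetical orchestrator sketch – the `stage1`/`stage2`/`stage3` callables are placeholders you would implement around the corresponding checkpoints:

```python
def run_pipeline(image, stage1, stage2, stage3):
    """Chain the three HIKARI stages; each stage is a user-supplied callable."""
    group = stage1(image)             # Stage 1: coarse group label
    disease = stage2(image, group)    # Stage 2: disease label (RAG)
    caption = stage3(image, disease)  # Stage 3: guided clinical caption
    return {"group": group, "disease": disease, "caption": caption}

# Stub usage (real stages would wrap the HIKARI models shown in this card):
result = run_pipeline(
    "skin_lesion.jpg",
    stage1=lambda img: "inflammatory",
    stage2=lambda img, g: "psoriasis",
    stage3=lambda img, d: f"Caption guided by diagnosis: {d}",
)
print(result["disease"])  # psoriasis
```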

Mode A – Standard (noguide): best BLEU-4 / BERTScore

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("skin_lesion.jpg").convert("RGB")

PROMPT_NOGUIDE = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_NOGUIDE},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding

caption = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(caption)

Mode B – Guided: highest disease-name correctness (78.79%)

Pass the Stage 2 disease label into the caption prompt:

PROMPT_GUIDED = (
    "Stage 2 diagnosis: {disease}.\n\n"
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

disease = "psoriasis"  # from HIKARI-Sirius (Stage 2)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT_GUIDED.format(disease=disease)},
]}]
# same generation code as above

| Mode | BLEU-4 | BERTScore-F | Disease Correctness |
|---|---|---|---|
| Standard (noguide) | 29.33 | 91.12 | 62.63% |
| Guided | 13.11 | 88.57 | 78.79% |

Use noguide for best overall quality. Use guided when the disease name must appear explicitly in the caption.
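That decision can be captured in a small helper (a sketch: the prompt strings are the ones used above, the function name is ours):

```python
PROMPT_NOGUIDE = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
PROMPT_GUIDED = "Stage 2 diagnosis: {disease}.\n\n" + PROMPT_NOGUIDE

def build_prompt(disease=None, require_disease_name=False):
    """Use guided mode only when the caption must name the disease and a
    Stage 2 label is available; otherwise fall back to noguide (best BLEU-4)."""
    if require_disease_name and disease:
        return PROMPT_GUIDED.format(disease=disease)
    return PROMPT_NOGUIDE

print(build_prompt("psoriasis", require_disease_name=True).splitlines()[0])
# Stage 2 diagnosis: psoriasis.
```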


Production – vLLM BnB-4bit ⚡

Throughput: 0.91 img/s at batch=4 (256-token output)

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"

llm = LLM(
    model=model_id,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.88,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
sp = SamplingParams(max_tokens=256, temperature=0.0)

def generate_caption(image: Image.Image, disease: str | None = None) -> str:
    if disease:
        prompt = f"Stage 2 diagnosis: {disease}.\n\nDescribe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    else:
        prompt = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."

    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    n = max(text.count("<|vision_start|>"), 1)
    out = llm.generate({"prompt": text, "multi_modal_data": {"image": [image] * n}}, sp)
    return out[0].outputs[0].text.strip()

img = Image.open("skin_lesion.jpg").convert("RGB")
print(generate_caption(img))
# or: print(generate_caption(img, disease="psoriasis"))

Production – SGLang FP8 🚀

Throughput: 1.71 img/s at batch=4 (256-token output)

import sglang as sgl
from transformers import AutoProcessor
from PIL import Image

model_id = "E27085921/HIKARI-Vega-8B-SkinCaption-Fused"

engine = sgl.Engine(
    model_path=model_id,
    dtype="bfloat16",
    quantization="fp8",
    context_length=2048,
    mem_fraction_static=0.88,
    trust_remote_code=True,
    disable_cuda_graph=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def generate_caption_sglang(image: Image.Image) -> str:
    PROMPT = "Describe this skin lesion image in detail. Include information about its appearance, possible diagnosis, and recommended examinations."
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    out = engine.generate(
        prompt=text, image_data=image,
        sampling_params={"max_new_tokens": 256, "temperature": 0.0},
    )
    return (out["text"] if isinstance(out, dict) else out[0]["text"]).strip()

# engine.shutdown()

⚡ Speed Benchmark (RTX 5070 Ti, 16 GB VRAM – Stage 3, 256-token output)

| Engine | Batch 1 | Batch 4 | vs Unsloth bs=1 |
|---|---|---|---|
| Unsloth 4-bit | 6,699 ms/img | 3,003 ms/img | baseline |
| vLLM BnB-4bit | 2,957 ms/img | 1,094 ms/img | 6.1× faster |
| SGLang FP8 | 1,695 ms/img | 584 ms/img | ⚡ 11.5× faster |
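To get comparable ms/img numbers on your own hardware, a minimal engine-agnostic timing probe may help (a sketch; `generate_batch` is any callable that captions a list of images, e.g. a wrapper around the vLLM or SGLang code above):

```python
import time

def ms_per_image(generate_batch, images, batch_size=4, warmup=1):
    """Average wall-clock milliseconds per image for a batch-caption callable."""
    for _ in range(warmup):                 # warm up weights/caches before timing
        generate_batch(images[:batch_size])
    start = time.perf_counter()
    done = 0
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        generate_batch(batch)
        done += len(batch)
    return (time.perf_counter() - start) * 1000.0 / done

# Stub usage; replace the lambda with a real engine call to benchmark it.
latency = ms_per_image(lambda batch: [len(x) for x in batch], ["img"] * 8)
print(f"{latency:.2f} ms/img")
```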

🔌 LoRA Adapter Version

from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

# Load the MERGED Stage 2 model as base (disease knowledge in base weights)
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "E27085921/HIKARI-Sirius-8B-SkinDx-RAG",   # merged Stage 2 weights
    torch_dtype=torch.bfloat16, device_map="auto"
)
# Apply Stage 3 caption LoRA on top
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA")
# Optionally fold the adapter in to reproduce this merged checkpoint:
# model = model.merge_and_unload()

→ E27085921/HIKARI-Vega-8B-SkinCaption-Fused-LoRA


🌟 HIKARI Model Family

| Model | Task | Metric | Type |
|---|---|---|---|
| HIKARI-Subaru-8B-SkinGroup | 4-class group classifier (Stage 1) | 88.68% | Merged |
| ⭐ HIKARI-Sirius-8B-SkinDx-RAG | 10-class disease dx – RAG (Stage 2) | 85.86% | Merged + LoRA |
| ⭐ HIKARI-Vega-8B-SkinCaption-Fused (this model) | Clinical caption – merged init (Stage 3) | BLEU-4: 29.33 | Merged + LoRA |

📄 Citation

@misc{hikari2026,
  title  = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
            with Cascaded Vision-Language Models},
  author = {Watin Promfiy and Pawitra Boonprasart},
  year   = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}

Made with ❤️ at King Mongkut's Institute of Technology Ladkrabang (KMITL)
Department of Information Technology
