Delta-LLaVA Hermes-4.3-36B

Delta-LLaVA Hermes-4.3-36B is a multimodal vision-language assistant trained with the Delta-LLaVA codebase on top of NousResearch/Hermes-4.3-36B.

Project Home

GitHub repository: mzamini92/Delta-LLaVA

Model Details

  • Base language model: NousResearch/Hermes-4.3-36B
  • Model family: LLaVA-style multimodal instruction model
  • Conversation template: hermes_43
  • Vision encoder: CLIP ViT-L/14 336px style vision tower
  • Multimodal projector: deltallava
  • Context length used in training: 144 tokens
  • Precision during training: bf16
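
The multimodal settings above typically surface as fields in the checkpoint's config.json. A minimal sketch of those fields as a Python dict, following LLaVA-codebase naming conventions; the vision-tower identifier is an assumption based on "CLIP ViT-L/14 336px", not copied from the released config:

```python
# Illustrative LLaVA-style multimodal config fields.
# Values mirror the Model Details list above; mm_vision_tower is an
# ASSUMED identifier for a CLIP ViT-L/14 336px tower.
mm_config = {
    "mm_projector_type": "deltallava",          # multimodal projector
    "mm_vision_select_layer": -2,               # vision layer selection
    "mm_vision_tower": "openai/clip-vit-large-patch14-336",  # assumed
    "image_aspect_ratio": "pad",                # from the instruction-tuning stage
    "torch_dtype": "bfloat16",                  # training precision
}

print(mm_config["mm_projector_type"])
```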

Training Summary

This checkpoint follows a two-stage LLaVA-style training recipe:

  1. Vision-language projector pretraining

    • Data: blip_laion_cc_sbu_558k.json
    • Trainable module: multimodal projector (tune_mm_mlp_adapter=True)
    • Learning rate: 1e-3
    • Epochs: 1
    • Scheduler: cosine with warmup_ratio=0.03
    • Vision layer selection: mm_vision_select_layer=-2
  2. Instruction tuning

    • Data: eagle-1-sft-1_8M.json
    • Initialized from the pretrained projector checkpoint
    • Learning rate: 2e-5
    • Epochs: 1
    • Image aspect ratio handling: pad
    • Grouped by modality length: True
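
The two stages above can be transcribed into a compact reference table in code; this is a plain-Python summary of the hyperparameters already listed, not a training script:

```python
# Hyperparameters transcribed from the two-stage recipe above.
stages = {
    "projector_pretraining": {
        "data": "blip_laion_cc_sbu_558k.json",
        "trainable": "mm_projector",   # tune_mm_mlp_adapter=True
        "learning_rate": 1e-3,
        "epochs": 1,
        "scheduler": "cosine",
        "warmup_ratio": 0.03,
    },
    "instruction_tuning": {
        "data": "eagle-1-sft-1_8M.json",
        "init_from": "pretrained projector checkpoint",
        "learning_rate": 2e-5,
        "epochs": 1,
        "image_aspect_ratio": "pad",
        "group_by_modality_length": True,
    },
}

# Stage 1 uses a learning rate roughly 50x higher than stage 2, which is
# typical for projector-only pretraining vs. full instruction tuning.
ratio = (stages["projector_pretraining"]["learning_rate"]
         / stages["instruction_tuning"]["learning_rate"])
assert abs(ratio - 50) < 1e-9
```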

Intended Use

This model is intended for research and prototyping on image-grounded chat, visual question answering, image description, and multimodal reasoning.

Limitations and Safety

  • The model may hallucinate visual details or produce incorrect reasoning, especially for fine-grained OCR, counting, medical, legal, or safety-critical use cases.
  • The model inherits biases and limitations from both the Hermes-4.3-36B base model and the multimodal training data.
  • For high-stakes settings, verify outputs with domain experts and independent tools.

How to Use

Clone the project code and point PYTHONPATH to the repository before loading the model:

git clone https://github.com/mzamini92/Delta-LLaVA.git
cd Delta-LLaVA
export PYTHONPATH=$(pwd):$PYTHONPATH


Example inference script:

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

model_path = "YOUR_HF_USERNAME/YOUR_MODEL_REPO"
image_path = "example.jpg"
prompt_text = "Describe this image in detail."

disable_torch_init()
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device="cuda",
)

messages = [
    {
        "role": "user",
        "content": f"{DEFAULT_IMAGE_TOKEN}\n{prompt_text}",
    }
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image = Image.open(image_path).convert("RGB")
image_tensor = process_images([image], image_processor, model.config)[0].unsqueeze(0)

input_ids = tokenizer_image_token(
    prompt,
    tokenizer,
    IMAGE_TOKEN_INDEX,
    return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.to(dtype=model.dtype, device="cuda", non_blocking=True),  # match training precision (bf16)
        do_sample=False,
        max_new_tokens=256,
        use_cache=True,
    )

response = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
print(response.strip())

Training Configuration Reference

The Hermes-4.3-specific training was run with a batch size of 64 due to resource limitations; extending it to 1024 may yield better performance.

Data

We sincerely thank EAGLE for the dataset used during finetuning.
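
To reach a larger effective batch size (such as 1024) on the same hardware, gradient accumulation is the usual mechanism: the optimizer sees per-device batch × accumulation steps × number of GPUs per update. The device counts below are illustrative assumptions, not the actual training setup:

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_gpus: int) -> int:
    # Global batch size seen by the optimizer per parameter update.
    return per_device * grad_accum_steps * num_gpus

# e.g. reaching the batch size of 64 used here (illustrative device counts):
assert effective_batch_size(per_device=4, grad_accum_steps=4, num_gpus=4) == 64
# scaling to 1024 by raising accumulation steps 16x:
assert effective_batch_size(per_device=4, grad_accum_steps=64, num_gpus=4) == 1024
```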

Citation

If you use this model, please cite the Delta-LLaVA project and the upstream Hermes and LLaVA work.

@inproceedings{zamini2026delta,
  title={Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models},
  author={Zamini, Mohamad and Shukla, Diksha},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={3648--3657},
  year={2026}
}