CoE-SlideVQA-8B

CoE-SlideVQA-8B is an 8B vision-language checkpoint fine-tuned for Chain-of-Evidence question answering over presentation slide screenshots. Given a user question and candidate slide images, the model is trained to answer using visual evidence from the slides.

This checkpoint is intended for research and prototyping on slide-based visual QA, evidence selection, and grounded multimodal reasoning.

Expected input and output

The model expects:

  • a natural-language question about a presentation
  • candidate slide screenshots selected by a retrieval system or provided by the user

The expected output is a JSON-style response with two fields:

  • evidence_chain: the supporting slides selected by the model, together with the localized evidence within them
  • answer: the final answer to the question

For exact prompt formatting and evaluation scripts, see the project code.
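As a rough illustration of how such a response might be consumed, the sketch below parses a hypothetical model output. The key names and the sample content are assumptions for illustration only; the authoritative schema is defined in the CoE repository.

```python
import json

# Hypothetical raw model output; real key names and values may differ —
# consult the CoE repository for the authoritative format.
raw = """```json
{"evidence_chain": [{"slide_index": 4, "region": "chart, top-right"}],
 "answer": "Q3 2023"}
```"""

def parse_response(text: str) -> dict:
    """Strip an optional Markdown code fence and decode the JSON body."""
    body = text.strip()
    if body.startswith("```"):
        body = body.split("\n", 1)[1]    # drop the opening fence line
        body = body.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(body)

result = parse_response(raw)
print(result["answer"])  # "Q3 2023"
```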

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "PeiyangLiu/CoE-SlideVQA-8B"

# trust_remote_code lets any custom processing/model code shipped with the
# checkpoint be loaded alongside the weights.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 weights, matching the checkpoint
    device_map="auto",           # place layers across available devices
    trust_remote_code=True,
)

To reproduce reported results, use the same image preprocessing and prompt format as the CoE repository.
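A minimal sketch of assembling a multi-image request in the chat-message layout that `AutoProcessor.apply_chat_template` accepts for vision-language models. The question text, image paths, and message structure here are illustrative assumptions, not the repository's exact prompt format:

```python
from typing import Any, Dict, List

def build_messages(question: str, image_paths: List[str]) -> List[Dict[str, Any]]:
    """Assemble a chat-format request: candidate slide images followed by
    the question text, as one user turn."""
    content: List[Dict[str, Any]] = [
        {"type": "image", "image": path} for path in image_paths
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Hypothetical question and slide files for illustration.
messages = build_messages(
    "Which quarter shows the highest revenue?",
    ["slide_03.png", "slide_07.png"],
)

# Downstream use (requires the loaded model and processor from above):
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# ).to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=512)
```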

Related resources

Model size: 9B parameters · Tensor type: BF16 · Format: Safetensors
