---
datasets:
- letxbe/BoundingDocs
language:
- en
pipeline_tag: visual-question-answering
tags:
- Visual-Question-Answering
- Question-Answering
- Document
license: apache-2.0
---

<div align="center">

<h1>DocExplainer: Document VQA with Bounding Box Localization</h1>

</div>

DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** apache-2.0
- **Paper:** ["Towards Reliable and Interpretable Document Question Answering via VLMs"](https://arxiv.org/abs/2509.10129) by Alessio Chen et al.

<div align="center">
<img src="https://cdn.prod.website-files.com/655f447668b4ad1dd3d4b3d9/664cc272c3e176608bc14a4c_LOGO%20v0%20-%20LetXBebicolore.svg" alt="letxbe ai logo" width="200">
<img src="https://www.dinfo.unifi.it/upload/notizie/Logo_Dinfo_web%20(1).png" alt="Logo Unifi" width="200">
</div>

## Model Details

DocExplainer is a fine-tuned [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384)-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process (sketched below):

1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer given a document image and a question as input.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
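
A minimal sketch of this decoupled flow is shown here; `generate_answer` is a hypothetical stand-in for whatever VLM wrapper you already use (the Quick Start section gives a full, runnable example):

```python
from transformers import AutoModel

# Stage 2 model: DocExplainer, loaded once and reused across questions.
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)

def answer_with_evidence(image, question, generate_answer):
    answer = generate_answer(image, question)   # stage 1: any black-box VLM produces the text answer
    bbox = explainer.predict(image, answer)      # stage 2: DocExplainer grounds it with a bounding box
    return answer, bbox                          # bbox is normalized [x1, y1, x2, y2]
```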
|
|
|
|
| ## Model Architecture |
| DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings. |
|
|
|  |
|
|
| ## Training Procedure |
| - Visual and textual embeddings from SigLiP2 are projected into a shared latent space, fused via fully connected layers. |
| - A regression head outputs normalized coordinates `[x1, y1, x2, y2]`. |
| - **Backbone**: SigLiP2 Giant (frozen). |
| - **Loss Function**: Smooth L1 (Huber loss) applied to normalized coordinates in [0,1]. |
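
The exact layer sizes of the released checkpoint are not documented here, so the following is only a schematic sketch (hypothetical dimensions, layer names, and the final sigmoid are assumptions) of the projection, fusion, and regression steps with the Smooth L1 objective:

```python
import torch
import torch.nn as nn

class BBoxRegressorSketch(nn.Module):
    """Schematic only; not the released DocExplainer implementation."""
    def __init__(self, embed_dim=1536, hidden_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(embed_dim, hidden_dim)   # project frozen SigLIP2 image embedding
        self.txt_proj = nn.Linear(embed_dim, hidden_dim)   # project frozen SigLIP2 text embedding
        self.fusion = nn.Sequential(                        # fuse via fully connected layers
            nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
        )
        self.head = nn.Linear(hidden_dim, 4)                # regression head -> [x1, y1, x2, y2]

    def forward(self, img_emb, txt_emb):
        fused = self.fusion(torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1))
        return torch.sigmoid(self.head(fused))              # keep coordinates in [0, 1]

# Smooth L1 (Huber) loss on normalized ground-truth boxes.
model, criterion = BBoxRegressorSketch(), nn.SmoothL1Loss()
loss = criterion(model(torch.randn(2, 1536), torch.randn(2, 1536)), torch.rand(2, 4))
```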

### Training Setup
- **Dataset**: [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs)
- **Epochs**: 20
- **Optimizer**: AdamW
- **Hardware**: 1 × NVIDIA L40S-1-48G GPU
- **Model Selection**: Best checkpoint chosen by highest mean IoU on the validation split (see the IoU sketch below).
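
For reference, mean IoU is the average of the standard intersection-over-union between predicted and ground-truth boxes. A minimal, generic implementation over normalized `[x1, y1, x2, y2]` boxes (not the paper's evaluation script) looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two example boxes with partial overlap:
print(iou([0.63, 0.03, 0.86, 0.06], [0.60, 0.03, 0.85, 0.06]))
```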

## Quick Start

Here is a simple example of how to use `DocExplainer` to get an answer and its corresponding bounding box from a document image.

```python
from PIL import Image
import requests
import torch
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
import json

url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
"content": "answer"
}}
"""

prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

input_length = inputs['input_ids'].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)

output_ids = generated_ids[:, input_length:]
generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)['content']

print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")
```
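
The returned box is normalized to `[0, 1]`, so to visualize it (as in the example below) scale the coordinates by the image size and draw the rectangle. A small Pillow sketch, assuming `image` and `bbox` come from the snippet above and `bbox` is a list of four floats:

```python
from PIL import ImageDraw

# Convert normalized [x1, y1, x2, y2] to pixel coordinates.
w, h = image.size
x1, y1, x2, y2 = bbox
pixel_box = [x1 * w, y1 * h, x2 * w, y2 * h]

# Draw the predicted evidence region on a copy of the document image.
annotated = image.copy()
ImageDraw.Draw(annotated).rectangle(pixel_box, outline="red", width=3)
annotated.save("invoice_explained.png")
```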

<table>
<tr>
<td width="50%" valign="top">
Example Output:

**Question**: What is the invoice number? <br>
**Answer**: 3Y8M2d-846<br><br>
**Predicted BBox**: [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606] <br>
</td>
<td width="50%" valign="top">
Visualized Answer Location:
<img src="https://i.postimg.cc/0NmBM0b1/invoice-explained.png" alt="Invoice with predicted bounding box" width="100%">
</td>
</tr>
</table>

## Performance

| Architecture                  | Prompting | ANLS             | MeanIoU          |
|-------------------------------|-----------|------------------|------------------|
| SmolVLM2-2.2B                 | Zero-shot | 0.527            | 0.011            |
|                               | Anchors   | 0.543            | 0.026            |
|                               | CoT       | 0.561            | 0.011            |
| Qwen2-VL-7B                   | Zero-shot | 0.691            | 0.048            |
|                               | Anchors   | 0.694            | 0.051            |
|                               | CoT       | <ins>0.720</ins> | 0.038            |
| Claude Sonnet 4               | Zero-shot | **0.737**        | 0.031            |
| SmolVLM2-2.2B + DocExplainer  | Zero-shot | 0.572            | 0.175            |
| Qwen2-VL-7B + DocExplainer    | Zero-shot | 0.689            | 0.188            |
| SmolVLM2 + Naive OCR          | Zero-shot | 0.556            | <ins>0.405</ins> |
| Qwen2-VL + Naive OCR          | Zero-shot | 0.690            | **0.494**        |

Document VQA performance of different models and prompting strategies on the [BoundingDocs v2.0 dataset](https://huggingface.co/datasets/letxbe/BoundingDocs). <br>
The best value is shown in **bold**; the second-best value is <ins>underlined</ins>.
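
ANLS above is the Average Normalized Levenshtein Similarity commonly used for Document VQA. A compact, generic reference implementation for one prediction against one gold answer (not the paper's evaluation script) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)  # substitution
                            ))
        prev = curr
    return prev[-1]

def anls(prediction: str, gold: str, threshold: float = 0.5) -> float:
    """1 - normalized edit distance, zeroed when the distance exceeds the usual 0.5 threshold."""
    pred, gold = prediction.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred, gold) / max(len(pred), len(gold))
    return 1.0 - nl if nl < threshold else 0.0

print(anls("3Y8M2d-846", "3Y8M2d-846"))  # 1.0 for an exact match
```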

## Citation

If you use `DocExplainer`, please cite:

```bibtex
@misc{chen2025reliableinterpretabledocumentquestion,
      title={Towards Reliable and Interpretable Document Question Answering via VLMs},
      author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
      year={2025},
      eprint={2509.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10129},
}
```

## Limitations
- **Prototype only**: Intended as a first approach, not a production-ready solution.
- **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.