base_model:
- google/pix2struct-docvqa-base
pipeline_tag: visual-question-answering
---

# CoExVQA

Document Visual Question Answering (DocVQA) requires vision–language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Despite strong predictive performance, existing DocVQA systems entangle these two aspects and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that enforces a grounded reasoning process through a chain-of-explanation design. The model first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. By making both evidence selection and spatial grounding intrinsic to prediction, CoExVQA enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence yields competitive performance while providing transparent and verifiable predictions.
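
The sketch below is only a toy illustration of that last step (decoding restricted to the grounded region): it crops a page tensor to a predicted answer box. The tensor shapes, the pixel-coordinate box format, and the function name are assumptions made for illustration; the released model may operate differently.

```python
# Illustrative sketch only: restrict what the answer decoder may see to the
# grounded region. Assumes an image tensor of shape (C, H, W) and a box given
# as absolute pixel coordinates (x1, y1, x2, y2).
import torch

def crop_to_grounded_region(image: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """Return only the localized evidence that the decoder is allowed to use."""
    x1, y1, x2, y2 = box.round().long().tolist()
    return image[:, y1:y2, x1:x2]

page = torch.rand(3, 224, 224)                        # toy page image
answer_box = torch.tensor([40.0, 60.0, 120.0, 90.0])  # toy predicted box
region = crop_to_grounded_region(page, answer_box)
print(region.shape)  # torch.Size([3, 30, 80])
```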

> [!NOTE]
> This work requires you to download the repository from GitHub:
> - https://github.com/KjetilIN/CoExVQA

## Usage

1. Clone the repo:

```terminal
git clone git@github.com:KjetilIN/CoExVQA.git
```

2. Download the pretrained model:

```python
import torch

from src.model.model import CoExVQA

# Any writable directory works as the local cache for the downloaded weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CoExVQA.from_hf(cache_dir="./model_cache").to(device)
model.eval()
```

3. Use it to predict and generate labels:

```python
# `batch` is a batch from your dataloader: raw page images and question strings
images = batch["images"]
questions = batch["questions"]

# Forward pass
outputs = model(
    images=images,
    questions=questions,
)

# Get the predicted answer box and (if returned) the evidence mask
box_pred = outputs["box"]
q_mask = outputs.get("q_mask", None)

# Or just generate the text answers directly
gen_kwargs = {"max_new_tokens": 32}  # illustrative generation settings
preds = model.generate_preds(
    images=images,
    questions=questions,
    gen_kwargs=gen_kwargs,
    gt_boxes=None,
)
```
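
Since the box is part of the prediction, the grounding can be inspected directly. The snippet below is a small inspection sketch continuing from the code above; it assumes `images` holds `PIL.Image` pages and `box_pred` contains one `(x1, y1, x2, y2)` box per image (absolute pixels or normalized to [0, 1]), so adjust it to your batch format.

```python
# Inspection sketch: draw the predicted answer box on the first page so the
# grounded region can be compared against the generated answer.
from PIL import ImageDraw

page = images[0].copy()
w, h = page.size
x1, y1, x2, y2 = [float(v) for v in box_pred[0]]
if max(x1, y1, x2, y2) <= 1.0:  # assumption: coordinates normalized to [0, 1]
    x1, x2 = x1 * w, x2 * w
    y1, y2 = y1 * h, y2 * h

ImageDraw.Draw(page).rectangle([x1, y1, x2, y2], outline="red", width=3)
page.save("grounded_region.png")
print(preds[0])  # generated answer for the first page
```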

## Citation

If you use this code or dataset in your research, please cite appropriately:

```bibtex
# TBA
```

This work uses the DocVQA dataset:

```bibtex
@misc{mathew2021docvqadatasetvqadocument,
  title={DocVQA: A Dataset for VQA on Document Images},
  author={Minesh Mathew and Dimosthenis Karatzas and C. V. Jawahar},
  year={2021},
  eprint={2007.00398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2007.00398},
}
```