indrehus committed
Commit 17bbe90 · verified · 1 Parent(s): 067bf7b

Update README.md

Files changed (1): README.md (+87 −1)

base_model:
- google/pix2struct-docvqa-base
pipeline_tag: visual-question-answering
---

# CoExVQA

![image](https://cdn-uploads.huggingface.co/production/uploads/67936c027244d9f625d18afe/KvP8_1xgFL_9QNCdl9cUa.png)

Document Visual Question Answering (DocVQA) requires vision–language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Despite strong predictive performance, existing DocVQA systems entangle these two aspects and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework that enforces a grounded reasoning process through a chain-of-explanation design. The model first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. By making both evidence selection and spatial grounding intrinsic to prediction, CoExVQA enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence yields competitive performance while providing transparent and verifiable predictions.

> [!NOTE]
> This work requires you to download the repository from GitHub:
> - https://github.com/KjetilIN/CoExVQA

## Usage

1. Clone the repo

```bash
git clone git@github.com:KjetilIN/CoExVQA.git
```

2. Download the pretrained model

```python
import torch

from src.model.model import CoExVQA

# Use a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the pretrained weights into a local cache directory of your choice
model = CoExVQA.from_hf(cache_dir="./model_cache").to(device)
model.eval()
```
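
Before running inference you need a batch of document images and questions. A minimal sketch of building one by hand, assuming (not confirmed by this card) that the model accepts PIL images and plain question strings; check the repository's data loading code for the exact expected format, and replace the placeholder file name with one of your own document images:

```python
from PIL import Image

# Assumption: a batch is a dict of PIL images and question strings.
# "invoice_page.png" is only a placeholder for a real document image.
images = [Image.open("invoice_page.png").convert("RGB")]
questions = ["What is the invoice date?"]
batch = {"images": images, "questions": questions}
```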

3. Use it to predict and generate labels

```python
# Get a batch of images and their queries
images = batch["images"]
questions = batch["questions"]

# Forward pass
outputs = model(
    images=images,
    questions=questions,
)

# Get the predicted box and mask
box_pred = outputs["box"]
q_mask = outputs.get("q_mask", None)

# Or just generate the text answers
gen_kwargs = {}  # optional generation settings (see the repository for supported options)
preds = model.generate_preds(
    images=images,
    questions=questions,
    gen_kwargs=gen_kwargs,
    gt_boxes=None,
)
```
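
Because the answer is decoded only from the grounded region, the predicted box can be inspected directly. A minimal sketch of cropping a page to the predicted region for manual verification, assuming normalized `[x1, y1, x2, y2]` coordinates in `[0, 1]` (check the repository for the actual box convention):

```python
from PIL import Image


def crop_to_box(image: Image.Image, box) -> Image.Image:
    """Crop a page image to a predicted box.

    Assumes `box` is [x1, y1, x2, y2] in normalized [0, 1] coordinates;
    see the repository for the convention CoExVQA actually uses.
    """
    w, h = image.size
    x1, y1, x2, y2 = [float(v) for v in box]
    return image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))


# Save the grounded region of the first page for visual inspection
region = crop_to_box(images[0], box_pred[0].tolist())
region.save("grounded_region.png")
```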

## Citation

If you use this code or dataset in your research, please cite it appropriately:

```bibtex
# TBA
```

This work uses the DocVQA dataset:

```bibtex
@misc{mathew2021docvqadatasetvqadocument,
  title={DocVQA: A Dataset for VQA on Document Images},
  author={Minesh Mathew and Dimosthenis Karatzas and C. V. Jawahar},
  year={2021},
  eprint={2007.00398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2007.00398},
}
```