Sanket17 committed
Commit 55f5a27 · verified · 1 Parent(s): 5483ea1

Upload 5 files

Files changed (5)
  1. LICENSE +395 -0
  2. SECURITY.md +41 -0
  3. demo.ipynb +0 -0
  4. omniparser.py +60 -0
  5. utils.py +417 -0
LICENSE ADDED
@@ -0,0 +1,395 @@
+ Attribution 4.0 International
+
+ =======================================================================
+
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
+ does not provide legal services or legal advice. Distribution of
+ Creative Commons public licenses does not create a lawyer-client or
+ other relationship. Creative Commons makes its licenses and related
+ information available on an "as-is" basis. Creative Commons gives no
+ warranties regarding its licenses, any material licensed under their
+ terms and conditions, or any related information. Creative Commons
+ disclaims all liability for damages resulting from their use to the
+ fullest extent possible.
+
+ Using Creative Commons Public Licenses
+
+ Creative Commons public licenses provide a standard set of terms and
+ conditions that creators and other rights holders may use to share
+ original works of authorship and other material subject to copyright
+ and certain other rights specified in the public license below. The
+ following considerations are for informational purposes only, are not
+ exhaustive, and do not form part of our licenses.
+
+ Considerations for licensors: Our public licenses are
+ intended for use by those authorized to give the public
+ permission to use material in ways otherwise restricted by
+ copyright and certain other rights. Our licenses are
+ irrevocable. Licensors should read and understand the terms
+ and conditions of the license they choose before applying it.
+ Licensors should also secure all rights necessary before
+ applying our licenses so that the public can reuse the
+ material as expected. Licensors should clearly mark any
+ material not subject to the license. This includes other CC-
+ licensed material, or material used under an exception or
+ limitation to copyright. More considerations for licensors:
+ wiki.creativecommons.org/Considerations_for_licensors
+
+ Considerations for the public: By using one of our public
+ licenses, a licensor grants the public permission to use the
+ licensed material under specified terms and conditions. If
+ the licensor's permission is not necessary for any reason--for
+ example, because of any applicable exception or limitation to
+ copyright--then that use is not regulated by the license. Our
+ licenses grant only permissions under copyright and certain
+ other rights that a licensor has authority to grant. Use of
+ the licensed material may still be restricted for other
+ reasons, including because others have copyright or other
+ rights in the material. A licensor may make special requests,
+ such as asking that all changes be marked or described.
+ Although not required by our licenses, you are encouraged to
+ respect those requests where reasonable. More_considerations
+ for the public:
+ wiki.creativecommons.org/Considerations_for_licensees
+
+ =======================================================================
+
+ Creative Commons Attribution 4.0 International Public License
+
+ By exercising the Licensed Rights (defined below), You accept and agree
+ to be bound by the terms and conditions of this Creative Commons
+ Attribution 4.0 International Public License ("Public License"). To the
+ extent this Public License may be interpreted as a contract, You are
+ granted the Licensed Rights in consideration of Your acceptance of
+ these terms and conditions, and the Licensor grants You such rights in
+ consideration of benefits the Licensor receives from making the
+ Licensed Material available under these terms and conditions.
+
+
+ Section 1 -- Definitions.
+
+ a. Adapted Material means material subject to Copyright and Similar
+ Rights that is derived from or based upon the Licensed Material
+ and in which the Licensed Material is translated, altered,
+ arranged, transformed, or otherwise modified in a manner requiring
+ permission under the Copyright and Similar Rights held by the
+ Licensor. For purposes of this Public License, where the Licensed
+ Material is a musical work, performance, or sound recording,
+ Adapted Material is always produced where the Licensed Material is
+ synched in timed relation with a moving image.
+
+ b. Adapter's License means the license You apply to Your Copyright
+ and Similar Rights in Your contributions to Adapted Material in
+ accordance with the terms and conditions of this Public License.
+
+ c. Copyright and Similar Rights means copyright and/or similar rights
+ closely related to copyright including, without limitation,
+ performance, broadcast, sound recording, and Sui Generis Database
+ Rights, without regard to how the rights are labeled or
+ categorized. For purposes of this Public License, the rights
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
+ Rights.
+
+ d. Effective Technological Measures means those measures that, in the
+ absence of proper authority, may not be circumvented under laws
+ fulfilling obligations under Article 11 of the WIPO Copyright
+ Treaty adopted on December 20, 1996, and/or similar international
+ agreements.
+
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
+ any other exception or limitation to Copyright and Similar Rights
+ that applies to Your use of the Licensed Material.
+
+ f. Licensed Material means the artistic or literary work, database,
+ or other material to which the Licensor applied this Public
+ License.
+
+ g. Licensed Rights means the rights granted to You subject to the
+ terms and conditions of this Public License, which are limited to
+ all Copyright and Similar Rights that apply to Your use of the
+ Licensed Material and that the Licensor has authority to license.
+
+ h. Licensor means the individual(s) or entity(ies) granting rights
+ under this Public License.
+
+ i. Share means to provide material to the public by any means or
+ process that requires permission under the Licensed Rights, such
+ as reproduction, public display, public performance, distribution,
+ dissemination, communication, or importation, and to make material
+ available to the public including in ways that members of the
+ public may access the material from a place and at a time
+ individually chosen by them.
+
+ j. Sui Generis Database Rights means rights other than copyright
+ resulting from Directive 96/9/EC of the European Parliament and of
+ the Council of 11 March 1996 on the legal protection of databases,
+ as amended and/or succeeded, as well as other essentially
+ equivalent rights anywhere in the world.
+
+ k. You means the individual or entity exercising the Licensed Rights
+ under this Public License. Your has a corresponding meaning.
+
+
+ Section 2 -- Scope.
+
+ a. License grant.
+
+ 1. Subject to the terms and conditions of this Public License,
+ the Licensor hereby grants You a worldwide, royalty-free,
+ non-sublicensable, non-exclusive, irrevocable license to
+ exercise the Licensed Rights in the Licensed Material to:
+
+ a. reproduce and Share the Licensed Material, in whole or
+ in part; and
+
+ b. produce, reproduce, and Share Adapted Material.
+
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
+ Exceptions and Limitations apply to Your use, this Public
+ License does not apply, and You do not need to comply with
+ its terms and conditions.
+
+ 3. Term. The term of this Public License is specified in Section
+ 6(a).
+
+ 4. Media and formats; technical modifications allowed. The
+ Licensor authorizes You to exercise the Licensed Rights in
+ all media and formats whether now known or hereafter created,
+ and to make technical modifications necessary to do so. The
+ Licensor waives and/or agrees not to assert any right or
+ authority to forbid You from making technical modifications
+ necessary to exercise the Licensed Rights, including
+ technical modifications necessary to circumvent Effective
+ Technological Measures. For purposes of this Public License,
+ simply making modifications authorized by this Section 2(a)
+ (4) never produces Adapted Material.
+
+ 5. Downstream recipients.
+
+ a. Offer from the Licensor -- Licensed Material. Every
+ recipient of the Licensed Material automatically
+ receives an offer from the Licensor to exercise the
+ Licensed Rights under the terms and conditions of this
+ Public License.
+
+ b. No downstream restrictions. You may not offer or impose
+ any additional or different terms or conditions on, or
+ apply any Effective Technological Measures to, the
+ Licensed Material if doing so restricts exercise of the
+ Licensed Rights by any recipient of the Licensed
+ Material.
+
+ 6. No endorsement. Nothing in this Public License constitutes or
+ may be construed as permission to assert or imply that You
+ are, or that Your use of the Licensed Material is, connected
+ with, or sponsored, endorsed, or granted official status by,
+ the Licensor or others designated to receive attribution as
+ provided in Section 3(a)(1)(A)(i).
+
+ b. Other rights.
+
+ 1. Moral rights, such as the right of integrity, are not
+ licensed under this Public License, nor are publicity,
+ privacy, and/or other similar personality rights; however, to
+ the extent possible, the Licensor waives and/or agrees not to
+ assert any such rights held by the Licensor to the limited
+ extent necessary to allow You to exercise the Licensed
+ Rights, but not otherwise.
+
+ 2. Patent and trademark rights are not licensed under this
+ Public License.
+
+ 3. To the extent possible, the Licensor waives any right to
+ collect royalties from You for the exercise of the Licensed
+ Rights, whether directly or through a collecting society
+ under any voluntary or waivable statutory or compulsory
+ licensing scheme. In all other cases the Licensor expressly
+ reserves any right to collect such royalties.
+
+
+ Section 3 -- License Conditions.
+
+ Your exercise of the Licensed Rights is expressly made subject to the
+ following conditions.
+
+ a. Attribution.
+
+ 1. If You Share the Licensed Material (including in modified
+ form), You must:
+
+ a. retain the following if it is supplied by the Licensor
+ with the Licensed Material:
+
+ i. identification of the creator(s) of the Licensed
+ Material and any others designated to receive
+ attribution, in any reasonable manner requested by
+ the Licensor (including by pseudonym if
+ designated);
+
+ ii. a copyright notice;
+
+ iii. a notice that refers to this Public License;
+
+ iv. a notice that refers to the disclaimer of
+ warranties;
+
+ v. a URI or hyperlink to the Licensed Material to the
+ extent reasonably practicable;
+
+ b. indicate if You modified the Licensed Material and
+ retain an indication of any previous modifications; and
+
+ c. indicate the Licensed Material is licensed under this
+ Public License, and include the text of, or the URI or
+ hyperlink to, this Public License.
+
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
+ reasonable manner based on the medium, means, and context in
+ which You Share the Licensed Material. For example, it may be
+ reasonable to satisfy the conditions by providing a URI or
+ hyperlink to a resource that includes the required
+ information.
+
+ 3. If requested by the Licensor, You must remove any of the
+ information required by Section 3(a)(1)(A) to the extent
+ reasonably practicable.
+
+ 4. If You Share Adapted Material You produce, the Adapter's
+ License You apply must not prevent recipients of the Adapted
+ Material from complying with this Public License.
+
+
+ Section 4 -- Sui Generis Database Rights.
+
+ Where the Licensed Rights include Sui Generis Database Rights that
+ apply to Your use of the Licensed Material:
+
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
+ to extract, reuse, reproduce, and Share all or a substantial
+ portion of the contents of the database;
+
+ b. if You include all or a substantial portion of the database
+ contents in a database in which You have Sui Generis Database
+ Rights, then the database in which You have Sui Generis Database
+ Rights (but not its individual contents) is Adapted Material; and
+
+ c. You must comply with the conditions in Section 3(a) if You Share
+ all or a substantial portion of the contents of the database.
+
+ For the avoidance of doubt, this Section 4 supplements and does not
+ replace Your obligations under this Public License where the Licensed
+ Rights include other Copyright and Similar Rights.
+
+
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
+
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
+
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
+
+ c. The disclaimer of warranties and limitation of liability provided
+ above shall be interpreted in a manner that, to the extent
+ possible, most closely approximates an absolute disclaimer and
+ waiver of all liability.
+
+
+ Section 6 -- Term and Termination.
+
+ a. This Public License applies for the term of the Copyright and
+ Similar Rights licensed here. However, if You fail to comply with
+ this Public License, then Your rights under this Public License
+ terminate automatically.
+
+ b. Where Your right to use the Licensed Material has terminated under
+ Section 6(a), it reinstates:
+
+ 1. automatically as of the date the violation is cured, provided
+ it is cured within 30 days of Your discovery of the
+ violation; or
+
+ 2. upon express reinstatement by the Licensor.
+
+ For the avoidance of doubt, this Section 6(b) does not affect any
+ right the Licensor may have to seek remedies for Your violations
+ of this Public License.
+
+ c. For the avoidance of doubt, the Licensor may also offer the
+ Licensed Material under separate terms or conditions or stop
+ distributing the Licensed Material at any time; however, doing so
+ will not terminate this Public License.
+
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
+ License.
+
+
+ Section 7 -- Other Terms and Conditions.
+
+ a. The Licensor shall not be bound by any additional or different
+ terms or conditions communicated by You unless expressly agreed.
+
+ b. Any arrangements, understandings, or agreements regarding the
+ Licensed Material not stated herein are separate from and
+ independent of the terms and conditions of this Public License.
+
+
+ Section 8 -- Interpretation.
+
+ a. For the avoidance of doubt, this Public License does not, and
+ shall not be interpreted to, reduce, limit, restrict, or impose
+ conditions on any use of the Licensed Material that could lawfully
+ be made without permission under this Public License.
+
+ b. To the extent possible, if any provision of this Public License is
+ deemed unenforceable, it shall be automatically reformed to the
+ minimum extent necessary to make it enforceable. If the provision
+ cannot be reformed, it shall be severed from this Public License
+ without affecting the enforceability of the remaining terms and
+ conditions.
+
+ c. No term or condition of this Public License will be waived and no
+ failure to comply consented to unless expressly agreed to by the
+ Licensor.
+
+ d. Nothing in this Public License constitutes or may be interpreted
+ as a limitation upon, or waiver of, any privileges and immunities
+ that apply to the Licensor or You, including from the legal
+ processes of any jurisdiction or authority.
+
+
+ =======================================================================
+
+ Creative Commons is not a party to its public
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
+ its public licenses to material it publishes and in those instances
+ will be considered the "Licensor." The text of the Creative Commons
+ public licenses is dedicated to the public domain under the CC0 Public
+ Domain Dedication. Except for the limited purpose of indicating that
+ material is shared under a Creative Commons public license or as
+ otherwise permitted by the Creative Commons policies published at
+ creativecommons.org/policies, Creative Commons does not authorize the
+ use of the trademark "Creative Commons" or any other trademark or logo
+ of Creative Commons without its prior written consent including,
+ without limitation, in connection with any unauthorized modifications
+ to any of its public licenses or any other arrangements,
+ understandings, or agreements concerning use of licensed material. For
+ the avoidance of doubt, this paragraph does not form part of the
+ public licenses.
+
+ Creative Commons may be contacted at creativecommons.org.
SECURITY.md ADDED
@@ -0,0 +1,41 @@
+ <!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
+
+ ## Security
+
+ Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
+
+ If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
+
+ ## Reporting Security Issues
+
+ **Please do not report security vulnerabilities through public GitHub issues.**
+
+ Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
+
+ If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
+
+ You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
+
+ Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
+
+ * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
+ * Full paths of source file(s) related to the manifestation of the issue
+ * The location of the affected source code (tag/branch/commit or direct URL)
+ * Any special configuration required to reproduce the issue
+ * Step-by-step instructions to reproduce the issue
+ * Proof-of-concept or exploit code (if possible)
+ * Impact of the issue, including how an attacker might exploit the issue
+
+ This information will help us triage your report more quickly.
+
+ If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
+
+ ## Preferred Languages
+
+ We prefer all communications to be in English.
+
+ ## Policy
+
+ Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
+
+ <!-- END MICROSOFT SECURITY.MD BLOCK -->
demo.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
omniparser.py ADDED
@@ -0,0 +1,60 @@
+ from utils import get_som_labeled_img, check_ocr_box, get_caption_model_processor, get_dino_model, get_yolo_model
+ import torch
+ from PIL import Image
+ from typing import Dict
+ import io
+ import base64
+ import time
+
+
+ config = {
+     'som_model_path': 'finetuned_icon_detect.pt',
+     'device': 'cpu',
+     'caption_model_path': 'Salesforce/blip2-opt-2.7b',
+     'draw_bbox_config': {
+         'text_scale': 0.8,
+         'text_thickness': 2,
+         'text_padding': 3,
+         'thickness': 3,
+     },
+     'BOX_TRESHOLD': 0.05  # spelling matches the get_som_labeled_img keyword in utils.py
+ }
+
+
+ class Omniparser(object):
+     def __init__(self, config: Dict):
+         self.config = config
+
+         self.som_model = get_yolo_model(model_path=config['som_model_path'])
+         # self.caption_model_processor = get_caption_model_processor(config['caption_model_path'], device=config['device'])
+         # self.caption_model_processor['model'].to(torch.float32)
+
+     def parse(self, image_path: str):
+         print('Parsing image:', image_path)
+         ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_path, display_img=False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold': 0.9})
+         text, ocr_bbox = ocr_bbox_rslt
+
+         draw_bbox_config = self.config['draw_bbox_config']
+         BOX_TRESHOLD = self.config['BOX_TRESHOLD']
+         dino_labeled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_path, self.som_model, BOX_TRESHOLD=BOX_TRESHOLD, output_coord_in_ratio=False, ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config, caption_model_processor=None, ocr_text=text, use_local_semantics=False)
+
+         image = Image.open(io.BytesIO(base64.b64decode(dino_labeled_img)))
+         # format the output: boxes with a parsed caption become 'text' entries, the rest 'icon' entries
+         return_list = [{'from': 'omniparser', 'shape': {'x': coord[0], 'y': coord[1], 'width': coord[2], 'height': coord[3]},
+                         'text': parsed_content_list[i].split(': ', 1)[1], 'type': 'text'}
+                        for i, (k, coord) in enumerate(label_coordinates.items()) if i < len(parsed_content_list)]
+         return_list.extend(
+             [{'from': 'omniparser', 'shape': {'x': coord[0], 'y': coord[1], 'width': coord[2], 'height': coord[3]},
+               'text': 'None', 'type': 'icon'}
+              for i, (k, coord) in enumerate(label_coordinates.items()) if i >= len(parsed_content_list)]
+         )
+
+         return [image, return_list]
+
+ parser = Omniparser(config)
+ image_path = 'examples/pc_1.png'
+
+ # time the parser
+ s = time.time()
+ image, parsed_content_list = parser.parse(image_path)
+ device = config['device']
+ print(f'Time taken for Omniparser on {device}:', time.time() - s)
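The output-formatting step in `parse` can be exercised on its own: each labeled box that has a matching caption becomes a `text` entry, and any leftover box becomes an `icon` entry. A minimal, model-free sketch of that pairing logic with made-up coordinates (`format_boxes` is a hypothetical helper, not part of the repo; the `Text Box ID 0: …` caption format follows the entries shown in `parsed_content_list`):

```python
def format_boxes(label_coordinates, parsed_content_list):
    """Pair each labeled box with its caption; unmatched boxes become icons."""
    results = []
    for i, (_, coord) in enumerate(label_coordinates.items()):
        shape = {'x': coord[0], 'y': coord[1], 'width': coord[2], 'height': coord[3]}
        if i < len(parsed_content_list):
            # captions look like "Text Box ID 0: Submit"; split once to keep any ': ' in the text
            text = parsed_content_list[i].split(': ', 1)[1]
            results.append({'from': 'omniparser', 'shape': shape, 'text': text, 'type': 'text'})
        else:
            results.append({'from': 'omniparser', 'shape': shape, 'text': 'None', 'type': 'icon'})
    return results

demo = format_boxes({'0': (10, 20, 100, 30), '1': (50, 60, 40, 40)},
                    ['Text Box ID 0: Submit'])
print(demo)
```

The first box pairs with the caption and comes out as a `text` entry with `'Submit'`; the second has no caption and falls back to an `icon` entry.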
utils.py ADDED
@@ -0,0 +1,417 @@
1
+ # from ultralytics import YOLO
2
+ import os
3
+ import io
4
+ import base64
5
+ import time
6
+ from PIL import Image, ImageDraw, ImageFont
7
+ import json
8
+ import requests
9
+ # utility function
10
+ import os
11
+ from openai import AzureOpenAI
12
+
13
+ import json
14
+ import sys
15
+ import os
16
+ import cv2
17
+ import numpy as np
18
+ # %matplotlib inline
19
+ from matplotlib import pyplot as plt
20
+ import easyocr
21
+ from paddleocr import PaddleOCR
22
+ reader = easyocr.Reader(['en'])
23
+ paddle_ocr = PaddleOCR(
24
+ lang='en', # other lang also available
25
+ use_angle_cls=False,
26
+ use_gpu=False, # using cuda will conflict with pytorch in the same process
27
+ show_log=False,
28
+ max_batch_size=1024,
29
+ use_dilation=True, # improves accuracy
30
+ det_db_score_mode='slow', # improves accuracy
31
+ rec_batch_num=1024)
32
+ import time
33
+ import base64
34
+
35
+ import os
36
+ import ast
37
+ import torch
38
+ from typing import Tuple, List
39
+ from torchvision.ops import box_convert
40
+ import re
41
+ from torchvision.transforms import ToPILImage
42
+ import supervision as sv
43
+ import torchvision.transforms as T
44
+
45
+
46
+ def get_caption_model_processor(model_name, model_name_or_path="Salesforce/blip2-opt-2.7b", device=None):
47
+ if not device:
48
+ device = "cuda" if torch.cuda.is_available() else "cpu"
49
+ if model_name == "blip2":
50
+ from transformers import Blip2Processor, Blip2ForConditionalGeneration
51
+ processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
52
+ if device == 'cpu':
53
+ model = Blip2ForConditionalGeneration.from_pretrained(
54
+ model_name_or_path, device_map=None, torch_dtype=torch.float32
55
+ )
56
+ else:
57
+ model = Blip2ForConditionalGeneration.from_pretrained(
58
+ model_name_or_path, device_map=None, torch_dtype=torch.float16
59
+ ).to(device)
60
+ elif model_name == "florence2":
61
+ from transformers import AutoProcessor, AutoModelForCausalLM
62
+ processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
63
+ if device == 'cpu':
64
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float32, trust_remote_code=True)
65
+ else:
66
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16, trust_remote_code=True).to(device)
67
+ return {'model': model.to(device), 'processor': processor}
68
+
69
+
70
+ def get_yolo_model(model_path):
71
+ from ultralytics import YOLO
72
+ # Load the model.
73
+ model = YOLO(model_path)
74
+ return model
75
+
76
+
77
+ @torch.inference_mode()
78
+ def get_parsed_content_icon(filtered_boxes, ocr_bbox, image_source, caption_model_processor, prompt=None):
79
+ to_pil = ToPILImage()
80
+ if ocr_bbox:
81
+ non_ocr_boxes = filtered_boxes[len(ocr_bbox):]
82
+ else:
83
+ non_ocr_boxes = filtered_boxes
84
+ croped_pil_image = []
85
+ for i, coord in enumerate(non_ocr_boxes):
86
+ xmin, xmax = int(coord[0]*image_source.shape[1]), int(coord[2]*image_source.shape[1])
87
+ ymin, ymax = int(coord[1]*image_source.shape[0]), int(coord[3]*image_source.shape[0])
88
+ cropped_image = image_source[ymin:ymax, xmin:xmax, :]
89
+ croped_pil_image.append(to_pil(cropped_image))
90
+
91
+ model, processor = caption_model_processor['model'], caption_model_processor['processor']
92
+ if not prompt:
93
+ if 'florence' in model.config.name_or_path:
94
+ prompt = "<CAPTION>"
95
+ else:
96
+ prompt = "The image shows"
97
+
98
+ batch_size = 10 # Number of samples per batch
99
+ generated_texts = []
100
+ device = model.device
101
+
102
+ for i in range(0, len(croped_pil_image), batch_size):
103
+ batch = croped_pil_image[i:i+batch_size]
104
+ if model.device.type == 'cuda':
105
+ inputs = processor(images=batch, text=[prompt]*len(batch), return_tensors="pt").to(device=device, dtype=torch.float16)
106
+ else:
107
+ inputs = processor(images=batch, text=[prompt]*len(batch), return_tensors="pt").to(device=device)
108
+ if 'florence' in model.config.name_or_path:
109
+ generated_ids = model.generate(input_ids=inputs["input_ids"],pixel_values=inputs["pixel_values"],max_new_tokens=1024,num_beams=3, do_sample=False)
110
+ else:
111
+ generated_ids = model.generate(**inputs, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True, num_return_sequences=1) # temperature=0.01, do_sample=True,
112
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
113
+ generated_text = [gen.strip() for gen in generated_text]
114
+ generated_texts.extend(generated_text)
115
+
116
+ return generated_texts
+
+
+ def get_parsed_content_icon_phi3v(filtered_boxes, ocr_bbox, image_source, caption_model_processor):
+     to_pil = ToPILImage()
+     if ocr_bbox:
+         non_ocr_boxes = filtered_boxes[len(ocr_bbox):]
+     else:
+         non_ocr_boxes = filtered_boxes
+     croped_pil_image = []
+     for i, coord in enumerate(non_ocr_boxes):
+         xmin, xmax = int(coord[0]*image_source.shape[1]), int(coord[2]*image_source.shape[1])
+         ymin, ymax = int(coord[1]*image_source.shape[0]), int(coord[3]*image_source.shape[0])
+         cropped_image = image_source[ymin:ymax, xmin:xmax, :]
+         croped_pil_image.append(to_pil(cropped_image))
+
+     model, processor = caption_model_processor['model'], caption_model_processor['processor']
+     device = model.device
+     messages = [{"role": "user", "content": "<|image_1|>\ndescribe the icon in one sentence"}]
+     prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+     batch_size = 5  # number of samples per batch
+     generated_texts = []
+
+     for i in range(0, len(croped_pil_image), batch_size):
+         images = croped_pil_image[i:i+batch_size]
+         image_inputs = [processor.image_processor(x, return_tensors="pt") for x in images]
+         inputs = {'input_ids': [], 'attention_mask': [], 'pixel_values': [], 'image_sizes': []}
+         texts = [prompt] * len(images)
+         for i, txt in enumerate(texts):
+             input = processor._convert_images_texts_to_inputs(image_inputs[i], txt, return_tensors="pt")
+             inputs['input_ids'].append(input['input_ids'])
+             inputs['attention_mask'].append(input['attention_mask'])
+             inputs['pixel_values'].append(input['pixel_values'])
+             inputs['image_sizes'].append(input['image_sizes'])
+         max_len = max([x.shape[1] for x in inputs['input_ids']])
+         for i, v in enumerate(inputs['input_ids']):
+             inputs['input_ids'][i] = torch.cat([processor.tokenizer.pad_token_id * torch.ones(1, max_len - v.shape[1], dtype=torch.long), v], dim=1)
+             inputs['attention_mask'][i] = torch.cat([torch.zeros(1, max_len - v.shape[1], dtype=torch.long), inputs['attention_mask'][i]], dim=1)
+         inputs_cat = {k: torch.concatenate(v).to(device) for k, v in inputs.items()}
+
+         generation_args = {
+             "max_new_tokens": 25,
+             "temperature": 0.01,
+             "do_sample": False,
+         }
+         generate_ids = model.generate(**inputs_cat, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
+         # remove input tokens
+         generate_ids = generate_ids[:, inputs_cat['input_ids'].shape[1]:]
+         response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+         response = [res.strip('\n').strip() for res in response]
+         generated_texts.extend(response)
+
+     return generated_texts
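The left-padding step in the loop above can be sketched without torch: pad each id sequence to the batch max length on the left with pad tokens, and zero the attention mask over the padded positions (`pad_id` here is a hypothetical pad token id):

```python
# Torch-free sketch of left-padding a ragged batch to a common length.
pad_id = 0
ids = [[5, 6, 7], [8, 9]]
masks = [[1, 1, 1], [1, 1]]
max_len = max(len(s) for s in ids)
padded_ids = [[pad_id] * (max_len - len(s)) + s for s in ids]       # pad on the left
padded_masks = [[0] * (max_len - len(m)) + m for m in masks]        # mask out the pads
print(padded_ids)    # [[5, 6, 7], [0, 8, 9]]
print(padded_masks)  # [[1, 1, 1], [0, 1, 1]]
```

Left padding keeps all real tokens right-aligned, which is what decoder-only generation expects.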
+
+
+ def remove_overlap(boxes, iou_threshold, ocr_bbox=None):
+     assert ocr_bbox is None or isinstance(ocr_bbox, List)
+
+     def box_area(box):
+         return (box[2] - box[0]) * (box[3] - box[1])
+
+     def intersection_area(box1, box2):
+         x1 = max(box1[0], box2[0])
+         y1 = max(box1[1], box2[1])
+         x2 = min(box1[2], box2[2])
+         y2 = min(box1[3], box2[3])
+         return max(0, x2 - x1) * max(0, y2 - y1)
+
+     def IoU(box1, box2):
+         intersection = intersection_area(box1, box2)
+         union = box_area(box1) + box_area(box2) - intersection + 1e-6
+         if box_area(box1) > 0 and box_area(box2) > 0:
+             ratio1 = intersection / box_area(box1)
+             ratio2 = intersection / box_area(box2)
+         else:
+             ratio1, ratio2 = 0, 0
+         return max(intersection / union, ratio1, ratio2)
+
+     boxes = boxes.tolist()
+     filtered_boxes = []
+     if ocr_bbox:
+         filtered_boxes.extend(ocr_bbox)
+     for i, box1 in enumerate(boxes):
+         # drop a box if it overlaps a smaller box above the threshold
+         is_valid_box = True
+         for j, box2 in enumerate(boxes):
+             if i != j and IoU(box1, box2) > iou_threshold and box_area(box1) > box_area(box2):
+                 is_valid_box = False
+                 break
+         if is_valid_box:
+             # keep the box only if it does not overlap any OCR bbox
+             if ocr_bbox:
+                 if not any(IoU(box1, box3) > iou_threshold for box3 in ocr_bbox):
+                     filtered_boxes.append(box1)
+             else:
+                 filtered_boxes.append(box1)
+     return torch.tensor(filtered_boxes)
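The overlap metric used by `remove_overlap` is not plain IoU: it takes the max of IoU and the two containment ratios, so a box fully nested inside another scores 1.0 regardless of the size difference. A torch-free sketch with hypothetical boxes:

```python
# Sketch of the overlap score in remove_overlap: max of IoU and the
# two intersection/area containment ratios.
def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def overlap_score(b1, b2):
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = box_area(b1) + box_area(b2) - inter + 1e-6
    r1 = inter / box_area(b1) if box_area(b1) > 0 else 0
    r2 = inter / box_area(b2) if box_area(b2) > 0 else 0
    return max(inter / union, r1, r2)

outer = [0.0, 0.0, 1.0, 1.0]
inner = [0.25, 0.25, 0.75, 0.75]
print(overlap_score(outer, inner))  # 1.0 (containment ratio dominates)
```

Plain IoU here would only be 0.25, below most thresholds; the containment ratio is what lets nested duplicates be caught.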
+
+ def load_image(image_path: str) -> Tuple[np.ndarray, torch.Tensor]:
+     transform = T.Compose(
+         [
+             T.RandomResize([800], max_size=1333),
+             T.ToTensor(),
+             T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
+         ]
+     )
+     image_source = Image.open(image_path).convert("RGB")
+     image = np.asarray(image_source)
+     image_transformed, _ = transform(image_source, None)
+     return image, image_transformed
+
+
+ def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor, phrases: List[str], text_scale: float,
+              text_padding=5, text_thickness=2, thickness=3) -> Tuple[np.ndarray, dict]:
+     """
+     Annotate an image with bounding boxes and labels.
+
+     Parameters:
+         image_source (np.ndarray): The source image to be annotated.
+         boxes (torch.Tensor): Bounding box coordinates in cxcywh format, normalized to [0, 1].
+         logits (torch.Tensor): Confidence scores for each bounding box.
+         phrases (List[str]): A label for each bounding box.
+         text_scale (float): Scale of the label text (0.8 for mobile/web, 0.3 for desktop, 0.4 for mind2web).
+
+     Returns:
+         Tuple[np.ndarray, dict]: The annotated image and a dict mapping each label to its xywh box.
+     """
+     h, w, _ = image_source.shape
+     boxes = boxes * torch.Tensor([w, h, w, h])
+     xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
+     xywh = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xywh").numpy()
+     detections = sv.Detections(xyxy=xyxy)
+
+     labels = [f"{phrase}" for phrase in phrases]
+
+     from util.box_annotator import BoxAnnotator
+     box_annotator = BoxAnnotator(text_scale=text_scale, text_padding=text_padding, text_thickness=text_thickness, thickness=thickness)
+     annotated_frame = image_source.copy()
+     annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels, image_size=(w, h))
+
+     label_coordinates = {f"{phrase}": v for phrase, v in zip(phrases, xywh)}
+     return annotated_frame, label_coordinates
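For one box, what `annotate` does with `box_convert` reduces to a few lines of arithmetic: scale the normalized cxcywh box to pixels, then convert center format to corner format. A plain-Python sketch with hypothetical values:

```python
# Sketch of normalized cxcywh -> pixel xyxy for a single box.
cx, cy, bw, bh = 0.5, 0.5, 0.2, 0.1   # normalized center-format box
W, H = 1000, 500                       # hypothetical image size
cx, cy, bw, bh = cx * W, cy * H, bw * W, bh * H   # scale to pixels
xyxy = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
print(xyxy)  # (400.0, 225.0, 600.0, 275.0)
```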
+
+
+ def predict(model, image, caption, box_threshold, text_threshold):
+     """Use a Hugging Face grounding model in place of the original model."""
+     model, processor = model['model'], model['processor']
+     device = model.device
+
+     inputs = processor(images=image, text=caption, return_tensors="pt").to(device)
+     with torch.no_grad():
+         outputs = model(**inputs)
+
+     results = processor.post_process_grounded_object_detection(
+         outputs,
+         inputs.input_ids,
+         box_threshold=box_threshold,  # e.g. 0.4
+         text_threshold=text_threshold,  # e.g. 0.3
+         target_sizes=[image.size[::-1]]
+     )[0]
+     boxes, logits, phrases = results["boxes"], results["scores"], results["labels"]
+     return boxes, logits, phrases
+
+
+ def predict_yolo(model, image_path, box_threshold):
+     """Run a YOLO detector; returns pixel-space xyxy boxes, confidences, and index labels."""
+     result = model.predict(
+         source=image_path,
+         conf=box_threshold,
+         # iou=0.5,  # default 0.7
+     )
+     boxes = result[0].boxes.xyxy  # in pixel space
+     conf = result[0].boxes.conf
+     phrases = [str(i) for i in range(len(boxes))]
+
+     return boxes, conf, phrases
+
+
+ def get_som_labeled_img(img_path, model=None, BOX_TRESHOLD=0.01, output_coord_in_ratio=False, ocr_bbox=None, text_scale=0.4, text_padding=5, draw_bbox_config=None, caption_model_processor=None, ocr_text=[], use_local_semantics=True, iou_threshold=0.9, prompt=None):
+     """ocr_bbox: list of bounding boxes in xyxy format."""
+     TEXT_PROMPT = "clickable buttons on the screen"
+     # BOX_TRESHOLD = 0.02  # 0.05/0.02 for web, 0.1 for mobile
+     TEXT_TRESHOLD = 0.01
+     image_source = Image.open(img_path).convert("RGB")
+     w, h = image_source.size
+     if False:  # TODO: switch between the grounding model and YOLO
+         xyxy, logits, phrases = predict(model=model, image=image_source, caption=TEXT_PROMPT, box_threshold=BOX_TRESHOLD, text_threshold=TEXT_TRESHOLD)
+     else:
+         xyxy, logits, phrases = predict_yolo(model=model, image_path=img_path, box_threshold=BOX_TRESHOLD)
+     xyxy = xyxy / torch.Tensor([w, h, w, h]).to(xyxy.device)
+     image_source = np.asarray(image_source)
+     phrases = [str(i) for i in range(len(phrases))]
+
+     # annotate the image with labels
+     h, w, _ = image_source.shape
+     if ocr_bbox:
+         ocr_bbox = torch.tensor(ocr_bbox) / torch.Tensor([w, h, w, h])
+         ocr_bbox = ocr_bbox.tolist()
+     else:
+         print('no ocr bbox!!!')
+         ocr_bbox = None
+     filtered_boxes = remove_overlap(boxes=xyxy, iou_threshold=iou_threshold, ocr_bbox=ocr_bbox)
+
+     # get parsed icon local semantics
+     if use_local_semantics:
+         caption_model = caption_model_processor['model']
+         if 'phi3_v' in caption_model.config.model_type:
+             parsed_content_icon = get_parsed_content_icon_phi3v(filtered_boxes, ocr_bbox, image_source, caption_model_processor)
+         else:
+             parsed_content_icon = get_parsed_content_icon(filtered_boxes, ocr_bbox, image_source, caption_model_processor, prompt=prompt)
+         ocr_text = [f"Text Box ID {i}: {txt}" for i, txt in enumerate(ocr_text)]
+         icon_start = len(ocr_text)
+         parsed_content_icon_ls = []
+         for i, txt in enumerate(parsed_content_icon):
+             parsed_content_icon_ls.append(f"Icon Box ID {i + icon_start}: {txt}")
+         parsed_content_merged = ocr_text + parsed_content_icon_ls
+     else:
+         ocr_text = [f"Text Box ID {i}: {txt}" for i, txt in enumerate(ocr_text)]
+         parsed_content_merged = ocr_text
+
+     filtered_boxes = box_convert(boxes=filtered_boxes, in_fmt="xyxy", out_fmt="cxcywh")
+
+     phrases = [i for i in range(len(filtered_boxes))]
+
+     # draw boxes
+     if draw_bbox_config:
+         annotated_frame, label_coordinates = annotate(image_source=image_source, boxes=filtered_boxes, logits=logits, phrases=phrases, **draw_bbox_config)
+     else:
+         annotated_frame, label_coordinates = annotate(image_source=image_source, boxes=filtered_boxes, logits=logits, phrases=phrases, text_scale=text_scale, text_padding=text_padding)
+
+     pil_img = Image.fromarray(annotated_frame)
+     buffered = io.BytesIO()
+     pil_img.save(buffered, format="PNG")
+     encoded_image = base64.b64encode(buffered.getvalue()).decode('ascii')
+     if output_coord_in_ratio:
+         label_coordinates = {k: [v[0] / w, v[1] / h, v[2] / w, v[3] / h] for k, v in label_coordinates.items()}
+         assert w == annotated_frame.shape[1] and h == annotated_frame.shape[0]
+
+     return encoded_image, label_coordinates, parsed_content_merged
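The pixel-to-ratio normalization that `get_som_labeled_img` applies to `ocr_bbox` (and to `label_coordinates` when `output_coord_in_ratio` is set) can be sketched without torch; image size and box values here are hypothetical:

```python
# Divide each coordinate by the matching image dimension to get ratios.
w, h = 1920, 1080
ocr_bbox_px = [(96, 54, 288, 108)]    # xyxy in pixels
ocr_bbox_ratio = [(x1 / w, y1 / h, x2 / w, y2 / h)
                  for x1, y1, x2, y2 in ocr_bbox_px]
print(ocr_bbox_ratio)  # [(0.05, 0.05, 0.15, 0.1)]
```

Ratio coordinates make the output resolution-independent, which is why the detector boxes are normalized the same way before `remove_overlap`.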
+
+
+ def get_xywh(input):
+     x, y, w, h = input[0][0], input[0][1], input[2][0] - input[0][0], input[2][1] - input[0][1]
+     x, y, w, h = int(x), int(y), int(w), int(h)
+     return x, y, w, h
+
+ def get_xyxy(input):
+     x, y, xp, yp = input[0][0], input[0][1], input[2][0], input[2][1]
+     x, y, xp, yp = int(x), int(y), int(xp), int(yp)
+     return x, y, xp, yp
+
+ def get_xywh_yolo(input):
+     x, y, w, h = input[0], input[1], input[2] - input[0], input[3] - input[1]
+     x, y, w, h = int(x), int(y), int(w), int(h)
+     return x, y, w, h
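These helpers assume an OCR quadrilateral of four (x, y) corners ordered from the top-left, so that corner 0 is the min corner and corner 2 the max corner (an axis-aligned quad; rotated text boxes would violate this assumption). A sketch with a hypothetical detection:

```python
# Convert a 4-corner OCR quad to xywh and xyxy, as get_xywh/get_xyxy do.
quad = [[10, 20], [110, 20], [110, 60], [10, 60]]  # hypothetical corners
xywh = (quad[0][0], quad[0][1], quad[2][0] - quad[0][0], quad[2][1] - quad[0][1])
xyxy = (quad[0][0], quad[0][1], quad[2][0], quad[2][1])
print(xywh)  # (10, 20, 100, 40)
print(xyxy)  # (10, 20, 110, 60)
```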
+
+
+
+ def check_ocr_box(image_path, display_img=True, output_bb_format='xywh', goal_filtering=None, easyocr_args=None, use_paddleocr=False):
+     if use_paddleocr:
+         result = paddle_ocr.ocr(image_path, cls=False)[0]
+         coord = [item[0] for item in result]
+         text = [item[1][0] for item in result]
+     else:  # EasyOCR
+         if easyocr_args is None:
+             easyocr_args = {}
+         result = reader.readtext(image_path, **easyocr_args)
+         coord = [item[0] for item in result]
+         text = [item[1] for item in result]
+     if display_img:
+         # read the image with cv2 (BGR) and convert to RGB for matplotlib
+         opencv_img = cv2.imread(image_path)
+         opencv_img = cv2.cvtColor(opencv_img, cv2.COLOR_BGR2RGB)
+         bb = []
+         for item in coord:
+             x, y, a, b = get_xywh(item)
+             bb.append((x, y, a, b))
+             cv2.rectangle(opencv_img, (x, y), (x+a, y+b), (0, 255, 0), 2)
+
+         # display the image
+         plt.imshow(opencv_img)
+     else:
+         if output_bb_format == 'xywh':
+             bb = [get_xywh(item) for item in coord]
+         elif output_bb_format == 'xyxy':
+             bb = [get_xyxy(item) for item in coord]
+     return (text, bb), goal_filtering
+
+
+