lzzzzy committed
Commit 03f64ed · 1 Parent(s): 340f695

update README.md

Files changed (2)
  1. README.md +99 -38
  2. data/test.jsonl +0 -0
README.md CHANGED
@@ -1,48 +1,109 @@
  ---
- title: BizGenEval
- emoji: 🥇
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- app_file: app.py
- pinned: true
  license: mit
- short_description: Duplicate this leaderboard to initialize your own!
- sdk_version: 5.43.1
  tags:
- - leaderboard
  ---
 
- # Start the configuration
-
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-
- Results files should have the following format and be stored as JSON files:
- ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
-     },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
- }
  ```
 
- Request files are created automatically by this tool.
 
- If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.
 
- # Code logic for more complex edits
 
- You'll find:
- - the main table's column names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
  ---
+ language:
+ - en
  license: mit
+ task_categories:
+ - text-to-image
  tags:
+ - benchmark
+ - image-generation
+ - evaluation
+ - commercial-design
+ pretty_name: BizGenEval
+ size_categories:
+ - n<1K
+ configs:
+ - config_name: default
+   data_files:
+   - split: test
+     path: data/test*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: domain
+     dtype: string
+   - name: dimension
+     dtype: string
+   - name: aspect_ratio
+     dtype: string
+   - name: reference_image_wh
+     dtype: string
+   - name: prompt
+     dtype: string
+   - name: questions
+     sequence: string
+   - name: eval_tag
+     dtype: string
+   - name: easy_qidxs
+     sequence: int64
+   - name: hard_qidxs
+     sequence: int64
+   splits:
+   - name: test
+     num_examples: 400
  ---
 
+ # BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
+
+ BizGenEval is a benchmark for evaluating image generation models on real-world commercial design tasks. It covers **5 document types** × **4 capability dimensions** = **20 evaluation tasks**, with **400** curated prompts and **8,000** checklist questions (4,000 easy + 4,000 hard).
50
+
51
+ ## Overview
52
+
53
+ ### Document Types
54
+
55
+ | Domain | Description | Count |
56
+ |---|---|---|
57
+ | **Slides** | Presentation slides used in reports, lectures, and business scenarios, characterized by hierarchical structure, bullet lists, and aligned visual elements. | 80 |
58
+ | **Webpages** | Webpage designs that integrate structured layouts, textual content, and functional interface elements such as headers, sections, buttons, etc. | 80 |
59
+ | **Posters** | Promotional or informational posters combining typography, graphics, and layout design, often emphasizing visual hierarchy and balanced composition. | 80 |
60
+ | **Charts** | Data visualization graphics such as bar charts, line charts, and multi-series plots, involving precise rendering of numeric values, axes, and legends. | 80 |
61
+ | **Scientific Figures** | Figures in academic papers including diagrams, pipelines, and illustrations, where components, arrows, and annotations form clear structure. | 80 |
62
+
63
+ ### Capability Dimensions
64
+
65
+ | Dimension | What It Tests | Count |
66
+ |---|---|---|
67
+ | **Text Rendering** | Textual content rendering, including short titles, long paragraphs, tables, and integration with other components. | 100 |
68
+ | **Layout Control** | Spatial and structural organization, including overall layout, complex flows, arrows, blocks, and hierarchical arrangement of elements. | 100 |
69
+ | **Attribute Binding** | Visual attributes such as color, shape, style, icons, and quantity, emphasizing fine-grained control. | 100 |
70
+ | **Knowledge Reasoning** | Reasoning and domain knowledge, including applying world knowledge across diverse domains such as physics, chemistry, arts, history, etc. | 100 |
71
+
72
+ Every domain × dimension combination contains **20 samples**, each with **20 checklist questions** (10 easy + 10 hard).
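The headline numbers follow directly from this structure; a quick arithmetic check:

```python
domains = 5                 # document types
dimensions = 4              # capability dimensions
samples_per_cell = 20       # samples per domain x dimension combination
questions_per_sample = 20   # 10 easy + 10 hard

tasks = domains * dimensions
prompts = tasks * samples_per_cell
questions = prompts * questions_per_sample
print(tasks, prompts, questions)  # 20 400 8000
```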
73
+
74
+ ## Dataset Schema
75
+
76
+ Each example contains:
77
+
78
+ | Field | Type | Description |
79
+ |---|---|---|
80
+ | `id` | int | Unique prompt identifier (0–399) |
81
+ | `domain` | string | One of: `slides`, `webpage`, `chart`, `poster`, `scientific_figure` |
82
+ | `dimension` | string | One of: `layout`, `attribute`, `text`, `knowledge` |
83
+ | `aspect_ratio` | string | Target aspect ratio (e.g., `16:9`, `4:3`, `1:1`) |
84
+ | `reference_image_wh` | string | Reference resolution as `WxH` (e.g., `2400x1800`) |
85
+ | `prompt` | string | The full generation prompt describing the desired image |
86
+ | `questions` | list[string] | 20 yes/no checklist questions for evaluation |
87
+ | `eval_tag` | string | Evaluation prompt template key |
88
+ | `easy_qidxs` | list[int] | Indices of the 10 easy questions |
89
+ | `hard_qidxs` | list[int] | Indices of the 10 hard questions |
90
+
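The `easy_qidxs` and `hard_qidxs` fields index into `questions`; a minimal sketch of splitting one example into its easy and hard subsets (the example dict below is illustrative, shaped like a dataset row but not a real entry):

```python
# Hypothetical example shaped like one dataset row (values are illustrative).
example = {
    "questions": [f"Question {i}?" for i in range(20)],
    "easy_qidxs": list(range(0, 10)),
    "hard_qidxs": list(range(10, 20)),
}

easy_questions = [example["questions"][i] for i in example["easy_qidxs"]]
hard_questions = [example["questions"][i] for i in example["hard_qidxs"]]
print(len(easy_questions), len(hard_questions))  # 10 10
```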
+ ## Usage
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("microsoft/BizGenEval", split="test")
+ print(ds[0])
  ```
 
+ ## Evaluation Pipeline
+
+ The evaluation uses a **checklist-based** methodology: each generated image is assessed on its set of yes/no questions by a judge model.
+
+ ```
+ Prompts ──► Image Generation ──► Checklist Evaluation ──► Summary CSVs
+             (any model)          (per-question True/False) (by domain/dimension)
+ ```
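A minimal sketch of the per-image scoring step, assuming the judge model returns one True/False answer per checklist question (the function and variable names here are illustrative, not taken from the official pipeline):

```python
def checklist_scores(answers, easy_qidxs, hard_qidxs):
    """Aggregate per-question True/False judgments into accuracies.

    answers: list of 20 booleans, one per checklist question.
    easy_qidxs / hard_qidxs: index lists from the dataset example.
    """
    easy = [answers[i] for i in easy_qidxs]
    hard = [answers[i] for i in hard_qidxs]
    return {
        "easy_acc": sum(easy) / len(easy),
        "hard_acc": sum(hard) / len(hard),
        "overall_acc": sum(answers) / len(answers),
    }

# Illustrative judgments: all easy questions pass, half of the hard ones do.
answers = [True] * 10 + [True, False] * 5
scores = checklist_scores(answers, list(range(10)), list(range(10, 20)))
print(scores)  # {'easy_acc': 1.0, 'hard_acc': 0.5, 'overall_acc': 0.75}
```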
 
+ See the [GitHub repository](https://github.com/microsoft/BizGenEval) for the evaluation pipeline.
data/test.jsonl ADDED
The diff for this file is too large to render. See raw diff