lzzzzy committed
Commit 03f64ed · 1 Parent(s): 340f695

update README.md

Files changed (2)
  1. README.md +99 -38
  2. data/test.jsonl +0 -0
README.md CHANGED
@@ -1,48 +1,109 @@
  ---
- title: BizGenEval
- emoji: 🥇
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- app_file: app.py
- pinned: true
  license: mit
- short_description: Duplicate this leaderboard to initialize your own!
- sdk_version: 5.43.1
  tags:
- - leaderboard
  ---
 
- # Start the configuration
-
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-
- Results files should have the following format and be stored as JSON files:
- ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
-     },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
- }
  ```
 
- Request files are created automatically by this tool.
 
- If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.
 
- # Code logic for more complex edits
 
- You'll find:
- - the main table's column names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
  ---
+ language:
+ - en
  license: mit
+ task_categories:
+ - text-to-image
  tags:
+ - benchmark
+ - image-generation
+ - evaluation
+ - commercial-design
+ pretty_name: BizGenEval
+ size_categories:
+ - n<1K
+ configs:
+ - config_name: default
+   data_files:
+   - split: test
+     path: data/test*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: domain
+     dtype: string
+   - name: dimension
+     dtype: string
+   - name: aspect_ratio
+     dtype: string
+   - name: reference_image_wh
+     dtype: string
+   - name: prompt
+     dtype: string
+   - name: questions
+     sequence: string
+   - name: eval_tag
+     dtype: string
+   - name: easy_qidxs
+     sequence: int64
+   - name: hard_qidxs
+     sequence: int64
+   splits:
+   - name: test
+     num_examples: 400
  ---
 
+ # BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
+
+ BizGenEval is a benchmark for evaluating image generation models on real-world commercial design tasks. It covers **5 document types** × **4 capability dimensions** = **20 evaluation tasks**, with **400** curated prompts and **8,000** checklist questions (4,000 easy + 4,000 hard).
50
+
51
+ ## Overview
52
+
53
+ ### Document Types
54
+
55
+ | Domain | Description | Count |
56
+ |---|---|---|
57
+ | **Slides** | Presentation slides used in reports, lectures, and business scenarios, characterized by hierarchical structure, bullet lists, and aligned visual elements. | 80 |
58
+ | **Webpages** | Webpage designs that integrate structured layouts, textual content, and functional interface elements such as headers, sections, buttons, etc. | 80 |
59
+ | **Posters** | Promotional or informational posters combining typography, graphics, and layout design, often emphasizing visual hierarchy and balanced composition. | 80 |
60
+ | **Charts** | Data visualization graphics such as bar charts, line charts, and multi-series plots, involving precise rendering of numeric values, axes, and legends. | 80 |
61
+ | **Scientific Figures** | Figures in academic papers including diagrams, pipelines, and illustrations, where components, arrows, and annotations form clear structure. | 80 |
62
+
63
+ ### Capability Dimensions
64
+
65
+ | Dimension | What It Tests | Count |
66
+ |---|---|---|
67
+ | **Text Rendering** | Textual content rendering, including short titles, long paragraphs, tables, and integration with other components. | 100 |
68
+ | **Layout Control** | Spatial and structural organization, including overall layout, complex flows, arrows, blocks, and hierarchical arrangement of elements. | 100 |
69
+ | **Attribute Binding** | Visual attributes such as color, shape, style, icons, and quantity, emphasizing fine-grained control. | 100 |
70
+ | **Knowledge Reasoning** | Reasoning and domain knowledge, including applying world knowledge across diverse domains such as physics, chemistry, arts, history, etc. | 100 |
71
+
72
+ Every domain × dimension combination contains **20 samples**, each with **20 checklist questions** (10 easy + 10 hard).
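The headline numbers follow directly from this structure; a quick arithmetic check:

```python
domains = 5                 # document types
dimensions = 4              # capability dimensions
samples_per_cell = 20       # samples per domain x dimension combination
questions_per_sample = 20   # 10 easy + 10 hard

tasks = domains * dimensions
prompts = tasks * samples_per_cell
questions = prompts * questions_per_sample
print(tasks, prompts, questions)  # 20 400 8000
```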
73
+
74
+ ## Dataset Schema
75
+
76
+ Each example contains:
77
+
78
+ | Field | Type | Description |
79
+ |---|---|---|
80
+ | `id` | int | Unique prompt identifier (0–399) |
81
+ | `domain` | string | One of: `slides`, `webpage`, `chart`, `poster`, `scientific_figure` |
82
+ | `dimension` | string | One of: `layout`, `attribute`, `text`, `knowledge` |
83
+ | `aspect_ratio` | string | Target aspect ratio (e.g., `16:9`, `4:3`, `1:1`) |
84
+ | `reference_image_wh` | string | Reference resolution as `WxH` (e.g., `2400x1800`) |
85
+ | `prompt` | string | The full generation prompt describing the desired image |
86
+ | `questions` | list[string] | 20 yes/no checklist questions for evaluation |
87
+ | `eval_tag` | string | Evaluation prompt template key |
88
+ | `easy_qidxs` | list[int] | Indices of the 10 easy questions |
89
+ | `hard_qidxs` | list[int] | Indices of the 10 hard questions |
90
+
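The `easy_qidxs` and `hard_qidxs` fields index into `questions`; a minimal sketch of splitting one example into its easy and hard subsets (the example dict below is illustrative, shaped like a dataset row but not a real entry):

```python
# Hypothetical example shaped like one dataset row (values are illustrative).
example = {
    "questions": [f"Question {i}?" for i in range(20)],
    "easy_qidxs": list(range(0, 10)),
    "hard_qidxs": list(range(10, 20)),
}

easy_questions = [example["questions"][i] for i in example["easy_qidxs"]]
hard_questions = [example["questions"][i] for i in example["hard_qidxs"]]
print(len(easy_questions), len(hard_questions))  # 10 10
```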
+ ## Usage
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("microsoft/BizGenEval", split="test")
+ print(ds[0])
  ```
 
+ ## Evaluation Pipeline
+
+ The evaluation uses a **checklist-based** methodology: each generated image is assessed on its set of yes/no questions by a judge model.
+
+ ```
+ Prompts ──► Image Generation ──► Checklist Evaluation ──► Summary CSVs
+             (any model)          (per-question True/False) (by domain/dimension)
+ ```
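A minimal sketch of the per-image scoring step, assuming the judge model returns one True/False answer per checklist question (the function and variable names here are illustrative, not taken from the official pipeline):

```python
def checklist_scores(answers, easy_qidxs, hard_qidxs):
    """Aggregate per-question True/False judgments into accuracies.

    answers: list of 20 booleans, one per checklist question.
    easy_qidxs / hard_qidxs: index lists from the dataset example.
    """
    easy = [answers[i] for i in easy_qidxs]
    hard = [answers[i] for i in hard_qidxs]
    return {
        "easy_acc": sum(easy) / len(easy),
        "hard_acc": sum(hard) / len(hard),
        "overall_acc": sum(answers) / len(answers),
    }

# Illustrative judgments: all easy questions pass, half of the hard ones do.
answers = [True] * 10 + [True, False] * 5
scores = checklist_scores(answers, list(range(10)), list(range(10, 20)))
print(scores)  # {'easy_acc': 1.0, 'hard_acc': 0.5, 'overall_acc': 0.75}
```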
 
+ See the [GitHub repository](https://github.com/microsoft/BizGenEval) for the evaluation pipeline.
data/test.jsonl ADDED
The diff for this file is too large to render. See raw diff