---
title: BizGenEval Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
short_description: Official BizGenEval leaderboard on Hugging Face.
sdk_version: 5.50.0
tags:
  - leaderboard
---

# BizGenEval Leaderboard

This repository hosts the Hugging Face leaderboard for BizGenEval, the benchmark introduced in [*BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation*](https://arxiv.org/abs/2603.25732).

Primary project resources:

- Project page: `https://aka.ms/BizGenEval`
- GitHub: `https://github.com/microsoft/BizGenEval`
- Dataset: `https://huggingface.co/datasets/microsoft/BizGenEval`

The codebase supports:

1. **LOCAL_DEV mode** (no HF permission required): reads/writes local namespaced paths under `eval-queue/` and `eval-results/`.
2. **HF mode** (with permission): syncs datasets from the Hub and uploads queue requests.

## 1) Local development quick start (no HF permission)

### Step 1. Create and activate a virtualenv

```bash
cd /Users/clarencestark/code/BizGenEval-Leaderboard
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Step 2. Bootstrap local demo data

```bash
python3 scripts/bootstrap_local_dev.py
```

This will create:

- `eval-queue/bizgeneval/requests/microsoft/Phi-4o-mini_eval_request_False_float16_Original.json`
- `eval-results/bizgeneval/results/microsoft/Phi-4o-mini/summary.json`

### Step 3. Launch in local mode

```bash
export LOCAL_DEV=1
python3 app.py
```

In LOCAL_DEV mode:

- `snapshot_download` is skipped.
- Model-card/tokenizer checks are skipped during submission.
- New submissions are written to the local `eval-queue/bizgeneval/requests/` directory only (no upload).
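The `LOCAL_DEV` toggle is read from the environment; a minimal sketch of such a check, assuming the accepted values are `1/true/on` as listed in the config notes below (the helper name `is_local_dev` is illustrative, not the actual function in `src/envs.py`):

```python
import os

def is_local_dev() -> bool:
    # Treat "1", "true", or "on" (case-insensitive) as enabled;
    # anything else, including an unset variable, means HF mode.
    return os.environ.get("LOCAL_DEV", "").strip().lower() in {"1", "true", "on"}
```

Checking a normalized set of truthy strings keeps `LOCAL_DEV=TRUE` and `LOCAL_DEV=1` equivalent, so shell habits don't silently leave the app in HF mode.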
## 2) Supported result file formats

The leaderboard parser currently supports two formats:

### A) BizGenEval summary format (recommended)

Put a `summary.json` under:

`eval-results/bizgeneval/results/<org>/<model>/summary.json`

Example:

```json
{
  "model_name": "microsoft/Phi-4o-mini",
  "model_sha": "main",
  "by_domain": {
    "slides": {"error_score": 0.8125},
    "webpage": {"error_score": 0.845},
    "poster": {"error_score": 0.7875},
    "chart": {"error_score": 0.8025},
    "scientific_figure": {"error_score": 0.77}
  },
  "by_dimension": {
    "layout": {"error_score": 0.835},
    "attribute": {"error_score": 0.805},
    "text": {"error_score": 0.79},
    "knowledge": {"error_score": 0.775}
  }
}
```

`error_score` may be given on either a `0~1` or a `0~100` scale; both are accepted and normalized to a displayed `0~100` scale.

### B) Legacy template format

Legacy `config/results` JSON is still accepted for compatibility.

## 3) Queue file format

Queue entries are JSON files in:

`eval-queue/bizgeneval/requests/<org>/*.json`

A typical file contains:

- `model`
- `revision`
- `precision`
- `weight_type`
- `status` (`PENDING`, `RUNNING`, `FINISHED*`)
- metadata (`license`, `params`, `likes`, ...)

## 4) Config knobs

Main config file: `src/envs.py`

- `LOCAL_DEV` (env): `1/true/on` to enable local mode
- `HF_OWNER` (env, optional): owner fallback
- `PROJECT_NAMESPACE` (env, optional): defaults to `bizgeneval`
- `HF_SPACE_REPO` (env, optional)
- `HF_QUEUE_REPO` (env, optional)
- `HF_RESULTS_REPO` (env, optional)
- `HF_TOKEN` (env): required only for Hub sync/upload

Default repo names:

- Space: `microsoft/BizGenEval-Leaderboard`
- Queue dataset: `demo-leaderboard-backend/requests`
- Results dataset: `demo-leaderboard-backend/results`

## 5) Key code locations

- Columns and UI display fields: `src/display/utils.py`
- Result parser: `src/leaderboard/read_evals.py`
- DataFrame build logic: `src/populate.py`
- Submission validation/upload behavior: `src/submission/submit.py`
- Task definitions and page text: `src/about.py`
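As a rough illustration of what the result parser in `src/leaderboard/read_evals.py` has to handle, here is a minimal sketch that reads a `summary.json` payload and normalizes `error_score` values to the displayed `0~100` scale. The function names and the "values ≤ 1.0 are fractions" heuristic are assumptions for this sketch, not the actual implementation:

```python
import json

def normalize_score(value: float) -> float:
    # Scores may arrive on a 0~1 or 0~100 scale; values <= 1.0 are
    # assumed to be fractions and scaled up for display (assumption).
    return value * 100.0 if value <= 1.0 else value

def summarize_domains(summary: dict) -> dict:
    # Pull per-domain error scores out of a parsed summary.json payload
    # and normalize each one to the displayed 0~100 scale.
    return {
        domain: normalize_score(entry["error_score"])
        for domain, entry in summary.get("by_domain", {}).items()
    }

def load_summary(path: str) -> dict:
    # Read a summary.json file from disk and normalize its scores.
    with open(path, encoding="utf-8") as f:
        return summarize_domains(json.load(f))
```

With the example payload above, `summarize_domains` would report `slides` as `81.25`; a legacy file that already stores `80.25` passes through unchanged.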