---
title: BizGenEval Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
short_description: Official BizGenEval leaderboard on Hugging Face.
sdk_version: 5.50.0
tags:
  - leaderboard
---

# BizGenEval Leaderboard

This repository hosts the Hugging Face leaderboard for BizGenEval, the benchmark introduced in [*BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation*](https://arxiv.org/abs/2603.25732).

Primary project resources:

- Project page: `https://aka.ms/BizGenEval`
- GitHub: `https://github.com/microsoft/BizGenEval`
- Dataset: `https://huggingface.co/datasets/microsoft/BizGenEval`

The codebase supports:

1. **LOCAL_DEV mode** (no HF permission required): reads/writes local namespaced paths under `eval-queue/` and `eval-results/`.
2. **HF mode** (with permission): syncs datasets from the Hub and uploads queue requests.

## 1) Local development quick start (no HF permission)

### Step 1. Create and activate a virtualenv

```bash
cd /Users/clarencestark/code/BizGenEval-Leaderboard
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Step 2. Bootstrap local demo data

```bash
python3 scripts/bootstrap_local_dev.py
```

This will create:

- `eval-queue/bizgeneval/requests/microsoft/Phi-4o-mini_eval_request_False_float16_Original.json`
- `eval-results/bizgeneval/results/microsoft/Phi-4o-mini/summary.json`

### Step 3. Launch in local mode

```bash
export LOCAL_DEV=1
python3 app.py
```

In LOCAL_DEV mode:

- `snapshot_download` is skipped.
- Model-card/tokenizer checks are skipped during submission.
- New submissions are written to the local `eval-queue/bizgeneval/requests/` directory only (no upload).
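The `LOCAL_DEV` toggle is read from the environment; a minimal sketch of such a check, assuming the accepted values are `1/true/on` as listed in the config notes below (the helper name `is_local_dev` is illustrative, not the actual function in `src/envs.py`):

```python
import os

def is_local_dev() -> bool:
    # Treat "1", "true", or "on" (case-insensitive) as enabled;
    # anything else, including an unset variable, means HF mode.
    return os.environ.get("LOCAL_DEV", "").strip().lower() in {"1", "true", "on"}
```

Checking a normalized set of truthy strings keeps `LOCAL_DEV=TRUE` and `LOCAL_DEV=1` equivalent, so shell habits don't silently leave the app in HF mode.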
## 2) Supported result file formats

The leaderboard parser currently supports two formats:

### A) BizGenEval summary format (recommended)

Put a `summary.json` under:

`eval-results/bizgeneval/results/<org>/<model>/summary.json`

Example:

```json
{
  "model_name": "microsoft/Phi-4o-mini",
  "model_sha": "main",
  "by_domain": {
    "slides": {"error_score": 0.8125},
    "webpage": {"error_score": 0.845},
    "poster": {"error_score": 0.7875},
    "chart": {"error_score": 0.8025},
    "scientific_figure": {"error_score": 0.77}
  },
  "by_dimension": {
    "layout": {"error_score": 0.835},
    "attribute": {"error_score": 0.805},
    "text": {"error_score": 0.79},
    "knowledge": {"error_score": 0.775}
  }
}
```

`error_score` may be given on either a `0~1` or a `0~100` scale; both are accepted and normalized to a displayed `0~100` scale.

### B) Legacy template format

Legacy `config/results` JSON is still accepted for compatibility.

## 3) Queue file format

Queue entries are JSON files in:

`eval-queue/bizgeneval/requests/<org>/*.json`

A typical file contains:

- `model`
- `revision`
- `precision`
- `weight_type`
- `status` (`PENDING`, `RUNNING`, `FINISHED*`)
- metadata (`license`, `params`, `likes`, ...)

## 4) Config knobs

Main config file: `src/envs.py`

- `LOCAL_DEV` (env): `1/true/on` to enable local mode
- `HF_OWNER` (env, optional): owner fallback
- `PROJECT_NAMESPACE` (env, optional): defaults to `bizgeneval`
- `HF_SPACE_REPO` (env, optional)
- `HF_QUEUE_REPO` (env, optional)
- `HF_RESULTS_REPO` (env, optional)
- `HF_TOKEN` (env): required only for Hub sync/upload

Default repo names:

- Space: `microsoft/BizGenEval-Leaderboard`
- Queue dataset: `demo-leaderboard-backend/requests`
- Results dataset: `demo-leaderboard-backend/results`

## 5) Key code locations

- Columns and UI display fields: `src/display/utils.py`
- Result parser: `src/leaderboard/read_evals.py`
- DataFrame build logic: `src/populate.py`
- Submission validation/upload behavior: `src/submission/submit.py`
- Task definitions and page text: `src/about.py`
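As a rough illustration of what the result parser in `src/leaderboard/read_evals.py` has to handle, here is a minimal sketch that reads a `summary.json` payload and normalizes `error_score` values to the displayed `0~100` scale. The function names and the "values ≤ 1.0 are fractions" heuristic are assumptions for this sketch, not the actual implementation:

```python
import json

def normalize_score(value: float) -> float:
    # Scores may arrive on a 0~1 or 0~100 scale; values <= 1.0 are
    # assumed to be fractions and scaled up for display (assumption).
    return value * 100.0 if value <= 1.0 else value

def summarize_domains(summary: dict) -> dict:
    # Pull per-domain error scores out of a parsed summary.json payload
    # and normalize each one to the displayed 0~100 scale.
    return {
        domain: normalize_score(entry["error_score"])
        for domain, entry in summary.get("by_domain", {}).items()
    }

def load_summary(path: str) -> dict:
    # Read a summary.json file from disk and normalize its scores.
    with open(path, encoding="utf-8") as f:
        return summarize_domains(json.load(f))
```

With the example payload above, `summarize_domains` would report `slides` as `81.25`; a legacy file that already stores `80.25` passes through unchanged.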