# Portable 10-Benchmark Eval Bundle

This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

## Included

- `agent_eval_api/`: the top-level 10-benchmark runners, plus bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, BFCL, and the local task configs
- `AgentBench/`: the code and config needed for `DBBench`
- `lm-evaluation-harness/`: used for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
- `env.example.sh`: cluster-specific path placeholders

## Not Included

- model checkpoints or HF snapshots
- `hf_cache/`, `LMUData/`, and downloaded benchmark data
- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, and BFCL `result/`
- `AgentBench/data/` and other benchmark payload data

## Directory Layout

```
portable_10bench_eval_bundle_20260330_0140/
  README.md
  env.example.sh
  agent_eval_api/
  AgentBench/
  lm-evaluation-harness/
```

The patched scripts in this bundle use relative paths where possible.
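
A quick way to confirm the layout unpacked intact is to check for the three code directories from the tree above. A minimal sketch, run from the bundle root:

```shell
# Check that the three bundled code directories are present.
for d in agent_eval_api AgentBench lm-evaluation-harness; do
  if [ -d "$d" ]; then
    echo "found: $d"
  else
    echo "missing: $d"
  fi
done
```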

## Environment Setup

You still need working Python environments on the target cluster. A practical split is:

- `vllm` env: model serving
- `vsibench` env: `VSI-Bench` and the current `VLMEvalKit` wrappers
- `BFCL` env: BFCL
- base Python env: the text-benchmark wrappers and orchestration

A simple starting point is:

```bash
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
```

Then edit `env.example.sh` so its placeholder paths match the new cluster, and source it again.
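
Before launching anything, it can help to fail fast on variables that are still unset. A minimal sketch (the variable names are the ones used by the commands later in this README; the `require_vars` helper is illustrative, not part of the bundle):

```shell
# Report any required environment variables that are still unset.
require_vars() {
  local missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars FIXED_HF_CACHE VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON || true
```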

## Data / Cache Prep

This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

1. Model weights or HF snapshots
   - For local models, download them to a local path and pass that path as the `--tokenizer` / `--model` inputs.
   - For the parallel round launcher, set:
     - `SNAPSHOT_R5=/path/to/round5_snapshot`
     - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`
2. Hugging Face cache for the multimodal benchmarks
   - `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
   - Create a cache directory such as `/path/to/hf_cache`, then export:

     ```bash
     export FIXED_HF_CACHE=/path/to/hf_cache
     ```

   - Typical subpaths used by the runners are:
     - `$FIXED_HF_CACHE`
     - `$FIXED_HF_CACHE/hub`
     - `$FIXED_HF_CACHE/datasets`
     - `$FIXED_HF_CACHE/LMUData` (for `MMBench`)
3. Docker for DBBench
   - `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
   - Make sure Docker and the Compose plugin are available on the cluster node.
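
The cache layout from step 2 can be created up front. A sketch, assuming the usual Hugging Face cache variables (`HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE`); the default path here is only a placeholder:

```shell
# Create the shared cache layout the runners expect under one root.
export FIXED_HF_CACHE="${FIXED_HF_CACHE:-$PWD/hf_cache}"
mkdir -p "$FIXED_HF_CACHE/hub" "$FIXED_HF_CACHE/datasets" "$FIXED_HF_CACHE/LMUData"

# Optional: point the standard Hugging Face variables at the same root,
# so manual downloads also land in the shared cache.
export HF_HOME="$FIXED_HF_CACHE"
export HF_HUB_CACHE="$FIXED_HF_CACHE/hub"
export HF_DATASETS_CACHE="$FIXED_HF_CACHE/datasets"
```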

## Core Entry Points

### 1. All 10 benchmarks against one API

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_all_10bench_db_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --full \
  --canonical-hf-home "$FIXED_HF_CACHE" \
  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
  --vsibench-python "$VSIBENCH_PYTHON" \
  --vlmeval-python "$VLMEVAL_PYTHON" \
  --bfcl-python "$BFCL_PYTHON" \
  --base-python "$BASE_PYTHON" \
  --aptbench-python "$APTBENCH_PYTHON" \
  --dbbench-python "$DBBENCH_PYTHON" \
  --output-root ./manual_runs/your_model_run/benchmarks \
  --tag your_model_run
```

### 2. Single text benchmark

For `arc`, `ruler`, `hh_rlhf`, or `advbench`:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_eval_task_api.sh arc \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --lm-eval-dir ../lm-evaluation-harness \
  --include-path ./tasks \
  --full
```

### 3. Individual multimodal benchmarks

- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
- `APTBench`: `agent_eval_api/run_aptbench_api.sh`

Example for `VideoMME` in infer mode:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_videomme_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/videomme_your_model
```

### 4. Parallel round launcher

Before running this helper, you must set:

```bash
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```

Then run:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
```
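
Since the launcher reads these variables rather than taking flags, a quick pre-flight check can save a failed round. A sketch that only verifies the paths resolve to directories (the `preflight` helper is illustrative, not part of the bundle):

```shell
# Verify each snapshot/cache variable points at an existing directory.
preflight() {
  local bad=0
  for path in "$@"; do
    if [ ! -d "$path" ]; then
      echo "not a directory: $path" >&2
      bad=1
    fi
  done
  return "$bad"
}

preflight "$SNAPSHOT_R5" "$SNAPSHOT_R102025" "$FIXED_HF_CACHE" || true
```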

## Benchmark Notes

- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`: use `lm-evaluation-harness` plus the local task configs under `agent_eval_api/tasks/`
- `BFCL`: uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
- `MMBench`, `VideoMME`: use `VLMEvalKit`; these wrappers are often run in `infer` mode for leaderboard-submission workflows
- `VSI-Bench`: uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
- `APTBench`: uses `agent_eval_api/APTBench/code/`
- `DBBench`: uses the bundled `AgentBench/` code and Docker services

## Sanity Checks on a New Cluster

Before running a full evaluation, check:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh

curl -fsS http://127.0.0.1:8100/v1/models
```

If the API responds and these scripts parse, the bundle layout is consistent.
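
The same syntax checks can be rolled into a loop with a per-script verdict. A sketch (`bash -n` only parses a script, it does not execute anything; the `check_syntax` helper is illustrative):

```shell
# Syntax-check each wrapper and print one verdict line per script.
check_syntax() {
  local status=0
  for script in "$@"; do
    if bash -n "$script" 2>/dev/null; then
      echo "ok: $script"
    else
      echo "FAIL: $script"
      status=1
    fi
  done
  return "$status"
}

check_syntax run_all_10bench_db_api.sh run_eval_task_api.sh run_mmbench_api.sh \
             run_videomme_api.sh run_vsibench_api.sh run_aptbench_api.sh || true
```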

## Scope

This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths

## Portability Notes

- `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
- The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle.
- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
- `VideoMME` and `MMBench` are commonly run in `infer` mode when you plan to upload predictions to an external leaderboard.
- `APTBench`, BFCL, and `VSI-Bench` benchmark payloads are also not bundled. Populate their expected data locations after moving the bundle to the new cluster.