# Portable 10-Benchmark Eval Bundle

This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

## Included

- `agent_eval_api/` with the top-level 10-benchmark runners and bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, BFCL, and local task configs
- `AgentBench/` code and config needed for `DBBench`
- `lm-evaluation-harness/` for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
- `env.example.sh` with cluster-specific path placeholders

## Not Included

- model checkpoints or HF snapshots
- `hf_cache/`, `LMUData/`, or downloaded benchmark data
- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, or BFCL `result/`
- `AgentBench/data/` and other benchmark payload data

## Directory Layout

```
portable_10bench_eval_bundle_20260330_0140/
  README.md
  env.example.sh
  agent_eval_api/
  AgentBench/
  lm-evaluation-harness/
```

The patched scripts in this bundle use relative paths where possible.

## Environment Setup

You still need working Python environments on the target cluster. A practical split is:

- `vllm` env: model serving
- `vsibench` env: `VSI-Bench` and the current `VLMEvalKit` wrappers
- `BFCL` env: BFCL
- base Python env: text-benchmark wrappers and orchestration

A simple starting point is:

```bash
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
```

Then edit `env.example.sh` to match the new cluster and source it again.
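
The file shipped in the bundle is authoritative, but as a rough sketch of what such a file typically holds, it centralizes the interpreter and cache variables referenced throughout this README. Every path below is a placeholder:

```shell
# Hypothetical sketch of env.example.sh; all paths are placeholders.
# Variable names mirror the flags and exports used elsewhere in this README.
export FIXED_HF_CACHE=/path/to/hf_cache

export BASE_PYTHON=/path/to/envs/base/bin/python
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export APTBENCH_PYTHON=/path/to/envs/aptbench/bin/python
export DBBENCH_PYTHON=/path/to/envs/dbbench/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```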

## Data / Cache Prep

This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

1. Model weights or HF snapshots
   - For local models, download them to a local path and pass that path to the `--tokenizer` / `--model` inputs.
   - For the parallel round launcher, set:
     - `SNAPSHOT_R5=/path/to/round5_snapshot`
     - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`
2. Hugging Face cache for multimodal benchmarks
   - `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
   - Create a cache directory such as `/path/to/hf_cache`, then export:

     ```bash
     export FIXED_HF_CACHE=/path/to/hf_cache
     ```

   - Typical subpaths used by the runners are:
     - `$FIXED_HF_CACHE`
     - `$FIXED_HF_CACHE/hub`
     - `$FIXED_HF_CACHE/datasets`
     - `$FIXED_HF_CACHE/LMUData` for `MMBench`
3. Docker for DBBench
   - `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
   - Make sure Docker and Compose are available on the cluster node.

## Core Entry Points

### 1. All 10 benchmarks against one API

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_all_10bench_db_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --full \
  --canonical-hf-home "$FIXED_HF_CACHE" \
  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
  --vsibench-python "$VSIBENCH_PYTHON" \
  --vlmeval-python "$VLMEVAL_PYTHON" \
  --bfcl-python "$BFCL_PYTHON" \
  --base-python "$BASE_PYTHON" \
  --aptbench-python "$APTBENCH_PYTHON" \
  --dbbench-python "$DBBENCH_PYTHON" \
  --output-root ./manual_runs/your_model_run/benchmarks \
  --tag your_model_run
```

### 2. Single text benchmark

For `arc`, `ruler`, `hh_rlhf`, or `advbench`:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_eval_task_api.sh arc \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --lm-eval-dir ../lm-evaluation-harness \
  --include-path ./tasks \
  --full
```

### 3. Individual multimodal benchmarks

- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
- `APTBench`: `agent_eval_api/run_aptbench_api.sh`

Example for `VideoMME` in infer mode:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_videomme_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/videomme_your_model
```

### 4. Parallel round launcher

Before running this helper, you must set:

```bash
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```

Then run:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
```

## Benchmark Notes

- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`: use `lm-evaluation-harness` plus local task configs under `agent_eval_api/tasks/`
- `BFCL`: uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
- `MMBench`, `VideoMME`: use `VLMEvalKit`; these wrappers are often run in `infer` mode for leaderboard submission workflows
- `VSI-Bench`: uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
- `APTBench`: uses `agent_eval_api/APTBench/code/`
- `DBBench`: uses the bundled `AgentBench/` code and Docker services

## Sanity Checks on a New Cluster

Before running a full evaluation, check:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh

curl -fsS http://127.0.0.1:8100/v1/models
```

If the API responds and these scripts parse, the bundle layout is consistent.

## Scope

This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths

## Portability Notes

- HH-RLHF and AdvBench data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
- The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle.
- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
- `VideoMME` and `MMBench` are commonly run in `infer` mode when you plan to upload predictions to an external leaderboard.
- `APTBench`, BFCL, and `VSI-Bench` benchmark payloads are also not bundled. Populate their expected data locations after moving the bundle to the new cluster.
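
For the first portability note, the environment-variable route might look like this sketch (both paths are placeholders; adjust them to wherever you stage the data):

```shell
# Hypothetical placeholders for the unbundled HH-RLHF / AdvBench data paths.
export HH_RLHF_DATASET_PATH=/path/to/datasets/hh_rlhf
export ADVBENCH_DATASET_PATH=/path/to/datasets/advbench
```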