# Portable 10-Benchmark Eval Bundle

This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

## Included

- `agent_eval_api/`: the top-level 10-benchmark runners, plus bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, BFCL, and the local task configs
- `AgentBench/`: the code and config needed for `DBBench`
- `lm-evaluation-harness/`: used for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
- `env.example.sh`: cluster-specific path placeholders

## Not Included

- model checkpoints or HF snapshots
- `hf_cache/`, `LMUData/`, and downloaded benchmark data
- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, and BFCL `result/`
- `AgentBench/data/` and other benchmark payload data

## Directory Layout

```
portable_10bench_eval_bundle_20260330_0140/
  README.md
  env.example.sh
  agent_eval_api/
  AgentBench/
  lm-evaluation-harness/
```

The patched scripts in this bundle use relative paths where possible.
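
A quick way to confirm the layout unpacked intact is to check for the three code directories from the tree above. A minimal sketch, run from the bundle root:

```shell
# Check that the three bundled code directories are present.
for d in agent_eval_api AgentBench lm-evaluation-harness; do
  if [ -d "$d" ]; then
    echo "found: $d"
  else
    echo "missing: $d"
  fi
done
```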

## Environment Setup

You still need working Python environments on the target cluster. A practical split is:

- `vllm` env: model serving
- `vsibench` env: `VSI-Bench` and the current `VLMEvalKit` wrappers
- `BFCL` env: BFCL
- base Python env: the text-benchmark wrappers and orchestration

A simple starting point is:

```bash
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
```

Then edit `env.example.sh` so its placeholder paths match the new cluster, and source it again.
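
Before launching anything, it can help to fail fast on variables that are still unset. A minimal sketch (the variable names are the ones used by the commands later in this README; the `require_vars` helper is illustrative, not part of the bundle):

```shell
# Report any required environment variables that are still unset.
require_vars() {
  local missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_vars FIXED_HF_CACHE VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON || true
```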

## Data / Cache Prep

This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

1. Model weights or HF snapshots
   - For local models, download them to a local path and pass that path as the `--tokenizer` / `--model` inputs.
   - For the parallel round launcher, set:
     - `SNAPSHOT_R5=/path/to/round5_snapshot`
     - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`
2. Hugging Face cache for the multimodal benchmarks
   - `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
   - Create a cache directory such as `/path/to/hf_cache`, then export:

     ```bash
     export FIXED_HF_CACHE=/path/to/hf_cache
     ```

   - Typical subpaths used by the runners are:
     - `$FIXED_HF_CACHE`
     - `$FIXED_HF_CACHE/hub`
     - `$FIXED_HF_CACHE/datasets`
     - `$FIXED_HF_CACHE/LMUData` (for `MMBench`)
3. Docker for DBBench
   - `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
   - Make sure Docker and the Compose plugin are available on the cluster node.
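
The cache layout from step 2 can be created up front. A sketch, assuming the usual Hugging Face cache variables (`HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE`); the default path here is only a placeholder:

```shell
# Create the shared cache layout the runners expect under one root.
export FIXED_HF_CACHE="${FIXED_HF_CACHE:-$PWD/hf_cache}"
mkdir -p "$FIXED_HF_CACHE/hub" "$FIXED_HF_CACHE/datasets" "$FIXED_HF_CACHE/LMUData"

# Optional: point the standard Hugging Face variables at the same root,
# so manual downloads also land in the shared cache.
export HF_HOME="$FIXED_HF_CACHE"
export HF_HUB_CACHE="$FIXED_HF_CACHE/hub"
export HF_DATASETS_CACHE="$FIXED_HF_CACHE/datasets"
```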

## Core Entry Points

### 1. All 10 benchmarks against one API

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_all_10bench_db_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --full \
  --canonical-hf-home "$FIXED_HF_CACHE" \
  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
  --vsibench-python "$VSIBENCH_PYTHON" \
  --vlmeval-python "$VLMEVAL_PYTHON" \
  --bfcl-python "$BFCL_PYTHON" \
  --base-python "$BASE_PYTHON" \
  --aptbench-python "$APTBENCH_PYTHON" \
  --dbbench-python "$DBBENCH_PYTHON" \
  --output-root ./manual_runs/your_model_run/benchmarks \
  --tag your_model_run
```

### 2. Single text benchmark

For `arc`, `ruler`, `hh_rlhf`, or `advbench`:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_eval_task_api.sh arc \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --lm-eval-dir ../lm-evaluation-harness \
  --include-path ./tasks \
  --full
```

### 3. Individual multimodal benchmarks

- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
- `APTBench`: `agent_eval_api/run_aptbench_api.sh`

Example for `VideoMME` in infer mode:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_videomme_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/videomme_your_model
```

### 4. Parallel round launcher

Before running this helper, you must set:

```bash
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```

Then run:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
```
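
Since the launcher reads these variables rather than taking flags, a quick pre-flight check can save a failed round. A sketch that only verifies the paths resolve to directories (the `preflight` helper is illustrative, not part of the bundle):

```shell
# Verify each snapshot/cache variable points at an existing directory.
preflight() {
  local bad=0
  for path in "$@"; do
    if [ ! -d "$path" ]; then
      echo "not a directory: $path" >&2
      bad=1
    fi
  done
  return "$bad"
}

preflight "$SNAPSHOT_R5" "$SNAPSHOT_R102025" "$FIXED_HF_CACHE" || true
```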

## Benchmark Notes

- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`: use `lm-evaluation-harness` plus the local task configs under `agent_eval_api/tasks/`
- `BFCL`: uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
- `MMBench`, `VideoMME`: use `VLMEvalKit`; these wrappers are often run in `infer` mode for leaderboard-submission workflows
- `VSI-Bench`: uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
- `APTBench`: uses `agent_eval_api/APTBench/code/`
- `DBBench`: uses the bundled `AgentBench/` code and Docker services

## Sanity Checks on a New Cluster

Before running a full evaluation, check:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh

curl -fsS http://127.0.0.1:8100/v1/models
```

If the API responds and these scripts parse, the bundle layout is consistent.
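
The same syntax checks can be rolled into a loop with a per-script verdict. A sketch (`bash -n` only parses a script, it does not execute anything; the `check_syntax` helper is illustrative):

```shell
# Syntax-check each wrapper and print one verdict line per script.
check_syntax() {
  local status=0
  for script in "$@"; do
    if bash -n "$script" 2>/dev/null; then
      echo "ok: $script"
    else
      echo "FAIL: $script"
      status=1
    fi
  done
  return "$status"
}

check_syntax run_all_10bench_db_api.sh run_eval_task_api.sh run_mmbench_api.sh \
             run_videomme_api.sh run_vsibench_api.sh run_aptbench_api.sh || true
```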

## Scope

This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths

## Portability Notes

- `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
- The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle.
- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
- `VideoMME` and `MMBench` are commonly run in `infer` mode when you plan to upload predictions to an external leaderboard.
- `APTBench`, BFCL, and `VSI-Bench` benchmark payloads are also not bundled. Populate their expected data locations after moving the bundle to the new cluster.