Upload folder using huggingface_hub
- README.md +82 -0
- run_all.sh +151 -0
- run_video_rerun.sh +94 -0
- scripts/_common.py +66 -0
- scripts/eval_daily_omni.py +197 -0
- scripts/eval_dpo_sync.py +205 -0
- scripts/eval_lvbench.py +168 -0
- scripts/eval_vggsoundsync.py +195 -0
- scripts/eval_videomme.py +173 -0
- scripts/eval_worldsense.py +175 -0
- scripts/merge_shards.py +128 -0
- scripts/minicpmo_inference.py +264 -0
- scripts/patch_minicpmo.py +255 -0
- scripts/test_minicpmo.py +62 -0
- scripts/upload_to_hf_model.py +84 -0
- setup_env.sh +80 -0
README.md
ADDED
@@ -0,0 +1,82 @@
# MiniCPM-o 4.5 Evaluation

Evaluation scripts for `openbmb/MiniCPM-o-4_5` on the same 6 benchmarks as `CleverHans-Evaluation`:

- Sync (DPO test set: synced / delay / early)
- VGGSoundSync (3k freetext)
- VideoMME (MCQ A/B/C/D)
- LVBench (MCQ)
- WorldSense (MCQ)
- Daily-Omni (MCQ)

## Why a separate folder

MiniCPM-o 4.5 has a completely different architecture (SigLip2 + Whisper + Qwen3-8B, 9B params) and API (`model.chat(msgs=...)` style) vs. Qwen3-Omni (`generate()` + `qwen_omni_utils`). Sharing the inference code is impractical, but data loading and metrics can still be reused from the other repo.
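
For orientation, the call shape looks roughly like the sketch below. This follows the published MiniCPM-o / MiniCPM-V `model.chat` pattern and is *not* this repo's actual wrapper (that lives in `scripts/minicpmo_inference.py`); the frame file and question are placeholders, and the exact kwargs for 4.5 may differ.

```python
# Hedged sketch of the `model.chat(msgs=...)` style; see
# scripts/minicpmo_inference.py for the wrapper this repo actually uses.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-4_5"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

frames = [Image.open("frame_000.jpg")]  # video frames sampled upstream (placeholder)
msgs = [{"role": "user", "content": frames + ["Is the audio in sync with the video?"]}]
print(model.chat(msgs=msgs, tokenizer=tokenizer, max_new_tokens=64))
```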

## Setup

```bash
bash setup_env.sh   # install MiniCPM-o dependencies in conda env 'minicpmo'
```

## Layout

```
MiniCPM-Evaluation/
├── README.md
├── setup_env.sh
└── scripts/
    ├── minicpmo_inference.py   # common inference wrapper
    ├── test_minicpmo.py        # quick sanity check (single sample)
    ├── eval_videomme.py        # per-benchmark evaluators
    ├── eval_lvbench.py
    ├── eval_worldsense.py
    ├── eval_daily_omni.py
    ├── eval_vggsoundsync.py
    └── eval_dpo_sync.py
```

## Quick Start

```bash
conda activate minicpmo
cd /home/ubuntu/MiniCPM-Evaluation

# 1. Sanity check: single-sample inference
python scripts/test_minicpmo.py

# 2. Run a full benchmark (e.g. Daily-Omni)
python scripts/eval_daily_omni.py \
    --data-dir /opt/dlami/nvme/daily_omni \
    --output-dir /home/ubuntu/eval_results/daily_omni \
    --label do_minicpmo_45
```
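
To sweep all six benchmarks with data-parallel sharding across GPUs, use `run_all.sh`: set `export OPENAI_API_KEY=sk-...` for the two GPT-judged sync benches, then `bash run_all.sh` (GPUs, label, and shard count are overridable via env vars, e.g. `NUM_SHARDS=2 CUDA_VISIBLE_DEVICES=6,7 bash run_all.sh`). To re-run only the four video benches after the inference fixes, use `bash run_video_rerun.sh`.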

## Publish to Hugging Face (model repo)

This tree is **evaluation code only** (no model weights). You can still host it
under a Hugging Face **model** repo as a snapshot (e.g. next to weight releases).

```bash
pip install huggingface_hub
export HF_TOKEN=hf_...   # or: huggingface-cli login
cd MiniCPM-Evaluation
python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation
```

Private repo:

```bash
python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
```
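
For reference, the core of such an upload only needs the stock `huggingface_hub` API. A minimal stand-in is sketched below (the repo id is a placeholder; the actual `scripts/upload_to_hf_model.py` may add path filtering or other options):

```python
# Minimal sketch of what an upload script like scripts/upload_to_hf_model.py
# can do with the public huggingface_hub API; not the script's actual code.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.create_repo("YourUsername/MiniCPM-Evaluation", repo_type="model",
                private=False, exist_ok=True)
api.upload_folder(
    folder_path=".",  # run from inside MiniCPM-Evaluation/
    repo_id="YourUsername/MiniCPM-Evaluation",
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```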

## Data paths (reused from CleverHans-Evaluation)

| Benchmark | Path |
|---|---|
| Sync videos | `/opt/dlami/nvme/video_source/{original,random_shift_video,extracted_audio}` |
| VGGSoundSync | `/opt/dlami/nvme/vggsoundsync_test/` |
| VideoMME | `/opt/dlami/nvme/videomme/data/data/` |
| LVBench | `/opt/dlami/nvme/lvbench/` |
| WorldSense | `/opt/dlami/nvme/worldsense/` |
| Daily-Omni | `/opt/dlami/nvme/daily_omni/` |
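Each path can be overridden per run through the matching env var in `run_all.sh` (`DATA_ROOT`, `SYNC_TEST_JSONL`, `VGG_TEST_JSONL`, `WORLDSENSE_DIR`, `DAILY_OMNI_DIR`, `VIDEOMME_DIR`, `LVBENCH_DIR`), plus `EVAL_ROOT` for where results land.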
run_all.sh
ADDED
@@ -0,0 +1,151 @@
#!/usr/bin/env bash
# Run all 6 benchmarks for MiniCPM-o 4.5 with 4-GPU data parallelism.
#
# For each bench, launches NUM_SHARDS python workers simultaneously (one per
# GPU); each processes 1/NUM_SHARDS of the samples. After all shards finish,
# merge_shards.py aggregates the per-shard jsonls and computes metrics.
# Only ONE bench runs at a time; benches run sequentially.
#
# Two sync benches use freetext + GPT judge (matches Qwen3-Omni reference).
#
# Usage:
#   export OPENAI_API_KEY=sk-...
#   bash run_all.sh
#
# Override via env vars, e.g.:
#   CUDA_VISIBLE_DEVICES=4,5,6,7 LABEL=minicpmo_ckpt200 bash run_all.sh
#   NUM_SHARDS=2 CUDA_VISIBLE_DEVICES=6,7 bash run_all.sh

set -uo pipefail  # no -e: one bench failure shouldn't block the rest

# ── Config ─────────────────────────────────────────────────────────────────────
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
LABEL="${LABEL:-minicpmo_4_5}"
SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
CONDA_ENV="${CONDA_ENV:-minicpmo}"

# Data parallel: how many shards (= number of GPUs to use)
IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"

# Data paths (match Qwen3-Omni reference config)
DATA_ROOT="${DATA_ROOT:-/opt/dlami/nvme/video_source}"
SYNC_TEST_JSONL="${SYNC_TEST_JSONL:-/home/ubuntu/CleverHans-Evaluation/data/kto_training_data_v2_test.jsonl}"
VGG_TEST_JSONL="${VGG_TEST_JSONL:-/opt/dlami/nvme/vggsoundsync_test/test_3k.jsonl}"
WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"

EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"

# ── Conda ──────────────────────────────────────────────────────────────────────
if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
  source "${HOME}/anaconda3/etc/profile.d/conda.sh"
fi
conda activate "${CONDA_ENV}"

echo "=== Model: $MODEL | Label: $LABEL"
echo "=== GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"

# ── Helper: run one bench with data-parallel sharding ─────────────────────
# $1  -- bench short name (matches merge_shards.py --bench)
# $2  -- eval script path
# $3  -- full label (e.g. sync_minicpmo_4_5)
# $4  -- output-dir root (e.g. $EVAL_ROOT/sync)
# $5+ -- extra args passed to each eval script
run_bench_dp() {
  local bench="$1"; shift
  local script="$1"; shift
  local full_label="$1"; shift
  local out_root="$1"; shift
  local label_dir="${out_root}/${full_label}"
  mkdir -p "${label_dir}/logs"

  echo ""
  echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
  local pids=()
  for (( i=0; i<NUM_SHARDS; i++ )); do
    local gpu="${GPU_ARR[$i]}"
    local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
    echo "  → shard $i on GPU $gpu (log: $log)"
    CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
      "$@" \
      --output-dir "$out_root" \
      --label "$full_label" \
      --shard "$i" --num-shards "$NUM_SHARDS" \
      > "$log" 2>&1 &
    pids+=($!)
  done

  # Wait for all shard workers
  local fail=0
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=$((fail+1))
  done
  if (( fail > 0 )); then
    echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
  fi

  # Merge
  echo "  → merging shards ..."
  python "$SCRIPTS/merge_shards.py" \
    --bench "$bench" \
    --label-dir "$label_dir" || echo "  !! merge failed"
}

# ── 1/6 Sync (in-domain) — freetext + GPT judge ────────────────────────────
run_bench_dp dpo_sync "$SCRIPTS/eval_dpo_sync.py" \
  "sync_${LABEL}" "$EVAL_ROOT/sync" \
  --model-id "$MODEL" \
  --data-root "$DATA_ROOT" \
  --test-jsonl "$SYNC_TEST_JSONL" \
  --gpt-judge

# ── 2/6 VGGSoundSync — freetext + GPT judge ────────────────────────────────
run_bench_dp vggsoundsync "$SCRIPTS/eval_vggsoundsync.py" \
  "vggsync_freetext_${LABEL}_3k" "$EVAL_ROOT/vggsoundsync" \
  --model-id "$MODEL" \
  --test-jsonl "$VGG_TEST_JSONL" \
  --mode freetext --gpt-judge

# ── 3/6 WorldSense ───────────────────────────────────────────────────────────
run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
  "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
  --model-id "$MODEL" \
  --data-dir "$WORLDSENSE_DIR" \
  --max-samples -1

# ── 4/6 Daily-Omni ───────────────────────────────────────────────────────────
run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
  "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
  --model-id "$MODEL" \
  --data-dir "$DAILY_OMNI_DIR" \
  --max-samples -1

# ── 5/6 Video-MME ────────────────────────────────────────────────────────────
run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
  "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
  --model-id "$MODEL" \
  --video-dir "$VIDEOMME_DIR" \
  --max-samples -1

# ── 6/6 LVBench ──────────────────────────────────────────────────────────────
run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
  "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
  --model-id "$MODEL" \
  --video-dir "$LVBENCH_DIR" \
  --max-samples -1

echo ""
echo "=== All done: $LABEL ==="
for b_out in \
  "$EVAL_ROOT/sync/sync_${LABEL}" \
  "$EVAL_ROOT/vggsoundsync/vggsync_freetext_${LABEL}_3k" \
  "$EVAL_ROOT/worldsense/ws_${LABEL}" \
  "$EVAL_ROOT/daily_omni/do_${LABEL}" \
  "$EVAL_ROOT/videomme/vmme_${LABEL}" \
  "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
  echo "  ${b_out}/metrics.json"
done
run_video_rerun.sh
ADDED
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
# Re-run the four benchmarks that were broken by the missing video-chat flags
# (`use_image_id=False, max_slice_nums=1`) and the audio -> TTS template
# force-enable. After patches in `patch_minicpmo.py` + `minicpmo_inference.py`
# these should now produce real MCQ answers (smoke-test-verified).
#
# Usage: CUDA_VISIBLE_DEVICES=4,5,6,7 bash run_video_rerun.sh

set -uo pipefail

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
LABEL="${LABEL:-minicpmo_4_5}"
SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
CONDA_ENV="${CONDA_ENV:-minicpmo}"

IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"

WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"
EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"

if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
  source "${HOME}/anaconda3/etc/profile.d/conda.sh"
fi
conda activate "${CONDA_ENV}"

echo "=== Re-running video benches with fixed inference"
echo "=== Model: $MODEL | Label: $LABEL | GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"

run_bench_dp() {
  local bench="$1"; shift
  local script="$1"; shift
  local full_label="$1"; shift
  local out_root="$1"; shift
  local label_dir="${out_root}/${full_label}"
  mkdir -p "${label_dir}/logs"

  echo ""
  echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
  local pids=()
  for (( i=0; i<NUM_SHARDS; i++ )); do
    local gpu="${GPU_ARR[$i]}"
    local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
    echo "  -> shard $i on GPU $gpu (log: $log)"
    CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
      "$@" \
      --output-dir "$out_root" \
      --label "$full_label" \
      --shard "$i" --num-shards "$NUM_SHARDS" \
      > "$log" 2>&1 &
    pids+=($!)
  done
  local fail=0
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=$((fail+1))
  done
  if (( fail > 0 )); then
    echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
  fi
  echo "  -> merging shards ..."
  python "$SCRIPTS/merge_shards.py" \
    --bench "$bench" \
    --label-dir "$label_dir" || echo "  !! merge failed"
}

run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
  "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
  --model-id "$MODEL" --data-dir "$WORLDSENSE_DIR" --max-samples -1

run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
  "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
  --model-id "$MODEL" --data-dir "$DAILY_OMNI_DIR" --max-samples -1

run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
  "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
  --model-id "$MODEL" --video-dir "$VIDEOMME_DIR" --max-samples -1

run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
  "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
  --model-id "$MODEL" --video-dir "$LVBENCH_DIR" --max-samples -1

echo ""
echo "=== Rerun done: $LABEL ==="
for d in \
  "$EVAL_ROOT/worldsense/ws_${LABEL}" \
  "$EVAL_ROOT/daily_omni/do_${LABEL}" \
  "$EVAL_ROOT/videomme/vmme_${LABEL}" \
  "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
  echo "  ${d}/metrics.json"
done
scripts/_common.py
ADDED
@@ -0,0 +1,66 @@
"""Shared glue for all MiniCPM-o eval scripts.

Loads the CleverHans-Evaluation counterpart scripts under aliased module
names (prefixed with `ch_`), so the MiniCPM-o eval scripts can import their
data loaders / metric functions without filename shadowing.

Usage in an eval script:

    import _common  # noqa: F401
    from _common import ch  # namespace holding ch_eval_videomme etc.

    ch_videomme = ch("videomme")
    data = ch_videomme.load_videomme(...)
"""
from __future__ import annotations

import importlib.util
import os
import sys
import types
from pathlib import Path


_HERE = Path(__file__).resolve().parent
_CLEVERHANS_SCRIPTS = Path(
    os.environ.get(
        "CLEVERHANS_SCRIPTS",
        "/home/ubuntu/CleverHans-Evaluation/scripts",
    )
).resolve()

# Make local (MiniCPM-o) modules importable without package setup.
if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))


_CACHE: dict[str, types.ModuleType] = {}


def ch(short_name: str) -> types.ModuleType:
    """Load a CleverHans-Evaluation script by short name (e.g., 'videomme',
    'lvbench', 'dpo_sync'). Returns the module object.

    Loaded under an aliased module name `ch_eval_<short_name>` so it doesn't
    collide with same-named files in this directory.
    """
    cache_key = short_name
    if cache_key in _CACHE:
        return _CACHE[cache_key]

    script_path = _CLEVERHANS_SCRIPTS / f"eval_{short_name}.py"
    if not script_path.is_file():
        raise FileNotFoundError(
            f"CleverHans-Evaluation script not found: {script_path}\n"
            f"Set CLEVERHANS_SCRIPTS env var to the correct directory."
        )

    alias = f"ch_eval_{short_name}"
    spec = importlib.util.spec_from_file_location(alias, str(script_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not create spec for {script_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module
    spec.loader.exec_module(module)
    _CACHE[cache_key] = module
    return module
scripts/eval_daily_omni.py
ADDED
@@ -0,0 +1,197 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on Daily-Omni.

Daily-Omni videos include embedded audio; we extract it and feed both frames
and waveform to MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("daily_omni")
load_daily_omni = ch.load_daily_omni
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Daily-Omni.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/daily_omni_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_daily_omni")
    p.add_argument("--max-frames", type=int, default=64)
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    p.add_argument("--no-audio", action="store_true",
                   help="Video-only mode (skip audio extraction).")
    p.add_argument(
        "--skip-audio-durations",
        type=str,
        default="",
        help=(
            "Comma-separated `video_duration` values from the dataset for which "
            "audio is omitted (video-only for those clips). Useful when "
            "MiniCPM-o forward fails on some lengths with audio+vision "
            '(e.g. empty `raw_output` and log errors like "Expected size 122 '
            'but got size 121"). Example: --skip-audio-durations 60s'
        ),
    )
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading Daily-Omni dataset...")
    test_data = load_daily_omni(args.data_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(
        args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
    )

    skip_audio_durs = {
        x.strip()
        for x in args.skip_audio_durations.split(",")
        if x.strip()
    }

    for item in tqdm(test_data, desc="Daily-Omni", unit="q"):
        if item["question_id"] in processed:
            continue
        use_audio = not args.no_audio and (
            item.get("video_duration", "") not in skip_audio_durs
        )
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
                use_audio_from_video=use_audio,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "question_type": item.get("question_type", ""),
            "content_parent_category": item.get("content_parent_category", ""),
            "content_fine_category": item.get("content_fine_category", ""),
            "video_category": item.get("video_category", ""),
            "video_duration": item.get("video_duration", ""),
            "question": item["question"],
            "choices": item["choices"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench daily_omni --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_dir": str(args.data_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "no_audio": args.no_audio,
        "skip_audio_durations": sorted(skip_audio_durs),
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_dpo_sync.py
ADDED
@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on the in-domain DPO sync test set.

Reuses the CleverHans-Evaluation dpo_sync eval_dpo_sync.py for data loading,
GT parsing, regex prediction extractor, optional GPT judge, and metrics.
Only the inference path is replaced with MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("dpo_sync")
EVAL_PROMPT = ch.EVAL_PROMPT
load_test_data = ch.load_test_data
set_data_root = ch.set_data_root
extract_prediction = ch.extract_prediction
gpt_extract_prediction = ch.gpt_extract_prediction
_get_openai_client = ch._get_openai_client
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on DPO sync test set.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-root", type=Path,
                   default=Path("/opt/dlami/nvme/video_source"))
    p.add_argument("--test-jsonl", type=Path, default=None,
                   help="Default: <data-root>/kto_training_data_v2_test.jsonl")
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/sync_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=256)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_sync")
    p.add_argument("--max-frames", type=int, default=32,
                   help="Sync clips are short (<30s); 32 frames is plenty.")
    p.add_argument("--fps", type=float, default=2.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: accepted for CLI parity with Qwen3-Omni. MiniCPM-o 4.5
    # multimodal vLLM support is not yet available upstream, so these are
    # currently a no-op (we always run transformers). Kept so the same
    # run_*.sh scripts work across the two models.
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--gpt-judge", action="store_true", default=False)
    p.add_argument("--openai-api-key", type=str, default=None)
    p.add_argument("--gpt-model", type=str, default="gpt-5.4")
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    set_data_root(args.data_root)
    test_jsonl = args.test_jsonl or (args.data_root / "kto_training_data_v2_test.jsonl")

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    if args.gpt_judge:
        if _get_openai_client(args.openai_api_key) is None:
            print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
            raise SystemExit(1)

    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    test_data = load_test_data(test_jsonl, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples")
    else:
        print(f"[data] {len(test_data)} test samples")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["video"])
        print(f"[resume] {len(processed)} already processed")

    def _do_extract(raw_output: str):
        if args.gpt_judge and raw_output:
            gpt_pred = gpt_extract_prediction(
                raw_output, api_key=args.openai_api_key, model=args.gpt_model,
            )
            if gpt_pred is not None:
                return gpt_pred
        return extract_prediction(raw_output)

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=True)

    for item in tqdm(test_data, desc="Sync", unit="sample"):
        if item["video"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=item["audio_path"],
                prompt=EVAL_PROMPT,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['video']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = _do_extract(raw_output)
        result = {
            "video": item["video"],
            "video_path": item["video_path"],
            "gt_synced": item["gt_synced"],
            "gt_direction": item["gt_direction"],
            "gt_offset_sec": item["gt_offset_sec"],
            "gt_t_v": item["gt_t_v"],
            "gt_t_a": item["gt_t_a"],
            "pred_synced": pred["pred_synced"],
            "pred_direction": pred["pred_direction"],
            "pred_offset_sec": pred["pred_offset_sec"],
            "pred_t_v": pred.get("pred_t_v"),
            "pred_t_a": pred.get("pred_t_a"),
            "pred_explanation": pred.get("pred_explanation", ""),
            "parse_method": pred["parse_method"],
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["video"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench dpo_sync --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_root": str(args.data_root),
        "test_jsonl": str(test_jsonl),
        "total_test_samples": len(test_data),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "gpt_judge": args.gpt_judge,
        "gpt_model": args.gpt_model if args.gpt_judge else None,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_lvbench.py
ADDED
@@ -0,0 +1,168 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on LVBench.

Reuses data loader and metrics from CleverHans-Evaluation's eval_lvbench.py.
LVBench is video-only (long video QA); no audio is passed.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("lvbench")
load_lvbench = ch.load_lvbench
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on LVBench.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/lvbench_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_lvbench")
    p.add_argument("--max-frames", type=int, default=96,
                   help="LVBench has long videos; larger frame budget helps.")
    p.add_argument("--fps", type=float, default=0.5)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading LVBench dataset...")
    test_data = load_lvbench(args.video_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["uid"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=False)

    for item in tqdm(test_data, desc="LVBench", unit="q"):
        if item["uid"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['uid']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "uid": item["uid"],
            "video_id": item["video_id"],
            "video_type": item["video_type"],
            "question_type": item["question_type"],
            "question": item["question"],
            "gt_answer": item["gt_answer"],
            "time_reference": item.get("time_reference", ""),
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["uid"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench lvbench --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "video_dir": str(args.video_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_vggsoundsync.py
ADDED
|
@@ -0,0 +1,195 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Evaluate MiniCPM-o 4.5 on VGG-Sound Sync (out-of-domain sync).
|
| 3 |
+
|
| 4 |
+
Reuses the data loader, MCQ / freetext prompts, answer parsers, GPT judge,
|
| 5 |
+
and metrics from CleverHans-Evaluation's eval_vggsoundsync.py. Only the
|
| 6 |
+
inference path is replaced with MiniCPM-o.
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import _common
|
| 11 |
+
|
| 12 |
+
import argparse
|
| 13 |
+
import gc
|
| 14 |
+
import io
|
| 15 |
+
import contextlib
|
| 16 |
+
import json
|
| 17 |
+
from pathlib import Path
|
| 18 |
+
|
| 19 |
+
import torch
|
| 20 |
+
from tqdm import tqdm
|
| 21 |
+
|
| 22 |
+
ch = _common.ch("vggsoundsync")
|
| 23 |
+
MCQ_PROMPT = ch.MCQ_PROMPT
|
| 24 |
+
FREETEXT_PROMPT = ch.FREETEXT_PROMPT
|
| 25 |
+
load_test_data = ch.load_test_data
|
| 26 |
+
extract_mcq_answer = ch.extract_mcq_answer
|
| 27 |
+
extract_freetext_prediction = ch.extract_freetext_prediction
|
| 28 |
+
gpt_extract_prediction = ch.gpt_extract_prediction
|
| 29 |
+
_get_openai_client = ch._get_openai_client
|
| 30 |
+
compute_metrics = ch.compute_metrics
|
| 31 |
+
print_summary = ch.print_summary
|
| 32 |
+
_build_result = ch._build_result
|
| 33 |
+
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
|
| 34 |
+
|
| 35 |
+
from minicpmo_inference import load_model, run_inference
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def parse_args() -> argparse.Namespace:
|
| 39 |
+
p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on VGG-Sound Sync.")
|
| 40 |
+
p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
|
| 41 |
+
p.add_argument("--test-jsonl", type=Path, required=True,
|
| 42 |
+
help="test.jsonl from prepare_vggsoundsync.py")
|
| 43 |
+
p.add_argument("--output-dir", type=Path,
|
| 44 |
+
default=Path("/home/ubuntu/eval_results/vggsoundsync_minicpmo"))
|
| 45 |
+
p.add_argument("--mode", choices=["mcq", "freetext"], default="mcq")
|
| 46 |
+
p.add_argument("--max-samples", type=int, default=-1)
|
| 47 |
+
p.add_argument("--max-new-tokens", type=int, default=64)
|
| 48 |
+
p.add_argument("--temperature", type=float, default=0.0)
|
| 49 |
+
p.add_argument("--label", type=str, default="minicpmo_vggsync")
|
| 50 |
+
p.add_argument("--max-frames", type=int, default=32)
|
| 51 |
+
p.add_argument("--fps", type=float, default=2.0)
|
| 52 |
+
p.add_argument("--attn", type=str, default="flash_attention_2",
|
| 53 |
+
choices=["sdpa", "flash_attention_2", "eager"])
|
| 54 |
+
# vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
|
| 55 |
+
p.add_argument("--vllm", action="store_true", default=False,
|
| 56 |
+
help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
|
| 57 |
+
p.add_argument("--tp", type=int, default=None)
|
| 58 |
+
p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
|
| 59 |
+
p.add_argument("--max-model-len", type=int, default=65536)
|
| 60 |
+
p.add_argument("--batch-size", type=int, default=16)
|
| 61 |
+
p.add_argument("--gpt-judge", action="store_true", default=False)
|
| 62 |
+
p.add_argument("--openai-api-key", type=str, default=None)
|
| 63 |
+
p.add_argument("--gpt-model", type=str, default="gpt-5.4")
|
| 64 |
+
# Data-parallel sharding
|
| 65 |
+
p.add_argument("--shard", type=int, default=0)
|
| 66 |
+
p.add_argument("--num-shards", type=int, default=1)
|
| 67 |
+
return p.parse_args()
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def _extract_pred(raw_output, mode, gpt_judge, api_key, gpt_model, answer_map=None):
|
| 71 |
+
if mode == "mcq":
|
| 72 |
+
return extract_mcq_answer(raw_output, answer_map=answer_map)
|
| 73 |
+
if gpt_judge and raw_output:
|
| 74 |
+
        gpt_pred = gpt_extract_prediction(raw_output, api_key=api_key, model=gpt_model)
        if gpt_pred is not None:
            return gpt_pred
    return extract_freetext_prediction(raw_output)


def main() -> None:
    args = parse_args()
    default_prompt = MCQ_PROMPT if args.mode == "mcq" else FREETEXT_PROMPT

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")

    if args.gpt_judge and args.mode == "freetext":
        if _get_openai_client(args.openai_api_key) is None:
            print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
            raise SystemExit(1)

    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    test_data = load_test_data(args.test_jsonl, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples (mode={args.mode})")
    else:
        print(f"[data] {len(test_data)} samples loaded (mode={args.mode})")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["uid"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=True)

    for item in tqdm(test_data, desc="VGGSync", unit="sample"):
        if item["uid"] in processed:
            continue

        item_prompt = item.get("mcq_prompt", default_prompt) if args.mode == "mcq" else default_prompt
        item_answer_map = item.get("mcq_answer_map") if args.mode == "mcq" else None

        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=item["audio_path"],
                prompt=item_prompt,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['uid']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = _extract_pred(raw_output, args.mode, args.gpt_judge,
                             args.openai_api_key, args.gpt_model,
                             answer_map=item_answer_map)
        result = _build_result(item, pred, raw_output, args.mode)

        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["uid"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench vggsoundsync --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "mode": args.mode,
        "test_jsonl": str(args.test_jsonl),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "gpt_judge": args.gpt_judge,
    }

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
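
The `--shard`/`--num-shards` pair above is a plain round-robin split (`i % num_shards == shard`), so K concurrent processes cover the test set exactly once, and each shard resumes independently from its own `eval_results.shardKofN.jsonl`. A minimal launcher sketch, assuming one shard per GPU and the shard count / label below (both hypothetical):

```python
#!/usr/bin/env python3
# Illustrative launcher only; shard count, label, and paths are assumptions.
import os
import subprocess

NUM_SHARDS = 4  # hypothetical: one shard per visible GPU

procs = []
for shard in range(NUM_SHARDS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(shard))  # pin one GPU per process
    procs.append(subprocess.Popen(
        ["python", "scripts/eval_vggsoundsync.py",
         "--shard", str(shard), "--num-shards", str(NUM_SHARDS),
         "--label", "vgg_minicpmo_45"],
        env=env,
    ))
for p in procs:
    p.wait()
# Then: python scripts/merge_shards.py --bench vggsoundsync --label-dir <output-dir>/vgg_minicpmo_45
```
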
scripts/eval_videomme.py
ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on Video-MME.

Reuses the data loader and metrics from CleverHans-Evaluation's Qwen3-Omni
eval_videomme.py and swaps out the inference with MiniCPM-o. Video-MME is
video-only (no audio), so we do NOT pass audio in.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("videomme")
load_videomme = ch.load_videomme
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Video-MME.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/videomme_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_videomme")
    p.add_argument("--max-frames", type=int, default=64,
                   help="Max frames sampled from each video (MiniCPM-o uses "
                        "PIL images).")
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding: split test set into K slices, process slice N
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading Video-MME dataset...")
    test_data = load_videomme(args.video_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed, skipping")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=False)

    for item in tqdm(test_data, desc="Video-MME", unit="q"):
        if item["question_id"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "duration": item["duration"],
            "domain": item["domain"],
            "sub_category": item["sub_category"],
            "task_type": item["task_type"],
            "question": item["question"],
            "options": item["options"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench videomme --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "video_dir": str(args.video_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
    }

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
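
`extract_answer` above comes from the shared CleverHans-Evaluation code and is not reproduced in this repo. As a rough illustration of what MCQ letter extraction typically has to handle (this sketch is not the shared implementation):

```python
import re
from typing import Optional


def extract_answer_sketch(raw_output: str) -> Optional[str]:
    """Illustrative stand-in only: pick the first standalone A-D token.

    Covers outputs such as "B", "Answer: C.", or "The answer is (D)".
    A bare article "A" can false-positive; a real extractor is stricter.
    """
    m = re.search(r"\b([A-D])\b", raw_output.upper())
    return m.group(1) if m else None
```
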
scripts/eval_worldsense.py
ADDED
@@ -0,0 +1,175 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on WorldSense.

WorldSense videos have embedded audio; we extract it via ffmpeg and feed
both the video frames and the audio waveform to MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("worldsense")
load_worldsense = ch.load_worldsense
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on WorldSense.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/worldsense_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_worldsense")
    p.add_argument("--max-frames", type=int, default=64)
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    p.add_argument("--no-audio", action="store_true",
                   help="Video-only mode (skip audio extraction).")
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading WorldSense dataset...")
    test_data = load_worldsense(args.data_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(
        args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
    )

    for item in tqdm(test_data, desc="WorldSense", unit="q"):
        if item["question_id"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
                use_audio_from_video=not args.no_audio,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "duration": item["duration"],
            "domain": item["domain"],
            "sub_category": item["sub_category"],
            "task_domain": item["task_domain"],
            "task_type": item["task_type"],
            "question": item["question"],
            "candidates": item["candidates"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench worldsense --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_dir": str(args.data_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "no_audio": args.no_audio,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
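
WorldSense relies on the embedded audio track, but not every mp4 necessarily carries one; `extract_audio_from_video` in minicpmo_inference.py (further down) simply returns None in that case. To pre-screen a directory instead, a small ffprobe check could look like this (assumes ffprobe is on PATH):

```python
import json
import subprocess


def has_audio_stream(video_path: str) -> bool:
    """True if ffprobe reports at least one audio stream in the container."""
    proc = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=index", "-of", "json", video_path],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode != 0:
        return False
    return bool(json.loads(proc.stdout or "{}").get("streams"))
```
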
scripts/merge_shards.py
ADDED
@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""Merge sharded eval_results.shard*.jsonl files and recompute metrics.

Usage:
    python merge_shards.py --bench videomme \
        --label-dir /home/ubuntu/eval_results/videomme/vmme_minicpmo_4_5

The script finds all `eval_results.shard*.jsonl` under `--label-dir`,
concatenates them into `eval_results.jsonl` (deduping by a bench-specific
primary key), then re-runs the bench's `compute_metrics` + `print_summary`.
Final outputs: `eval_results.jsonl`, `metrics.json`, `summary.txt`.
"""
from __future__ import annotations

import _common

import argparse
import contextlib
import io
import json
import sys
from pathlib import Path


# Primary key per bench (must match the field written by each eval script).
PK = {
    "videomme": "question_id",
    "lvbench": "uid",
    "worldsense": "question_id",
    "daily_omni": "question_id",
    "dpo_sync": "video",
    "vggsoundsync": "uid",
}

# Extra label used when printing the summary
LABEL_HINT = {
    "videomme": "Video-MME",
    "lvbench": "LVBench",
    "worldsense": "WorldSense",
    "daily_omni": "Daily-Omni",
    "dpo_sync": "Sync",
    "vggsoundsync": "VGGSoundSync",
}


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--bench", required=True,
                   choices=list(PK.keys()),
                   help="Which benchmark this label-dir belongs to.")
    p.add_argument("--label-dir", type=Path, required=True,
                   help="Eval output dir containing eval_results.shard*.jsonl.")
    args = p.parse_args()

    ch = _common.ch(args.bench)
    pk = PK[args.bench]

    shard_files = sorted(args.label_dir.glob("eval_results.shard*.jsonl"))
    if not shard_files:
        print(f"[merge] ERROR: no eval_results.shard*.jsonl in {args.label_dir}",
              file=sys.stderr)
        return 1

    print(f"[merge] Found {len(shard_files)} shard file(s):")
    for sf in shard_files:
        print(f" - {sf.name}")

    merged_path = args.label_dir / "eval_results.jsonl"
    all_results = []
    seen: set = set()
    n_dup = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for sf in shard_files:
            with open(sf) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    obj = json.loads(line)
                    key = obj.get(pk)
                    if key in seen:
                        n_dup += 1
                        continue
                    seen.add(key)
                    out.write(line + "\n")
                    all_results.append(obj)

    print(f"[merge] Merged {len(all_results)} unique results "
          f"({n_dup} duplicates skipped) -> {merged_path}")

    metrics = ch.compute_metrics(all_results)
    # Preserve eval_config from any shard if present
    for sf in shard_files:
        try:
            with open(sf) as f:
                first = f.readline().strip()
                if first:
                    obj = json.loads(first)
                    if "eval_config" in obj:
                        metrics["eval_config"] = obj["eval_config"]
                        break
        except Exception:
            pass

    metrics_json = args.label_dir / "metrics.json"
    summary_txt = args.label_dir / "summary.txt"

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    label = args.label_dir.name
    ch.print_summary(metrics, label)

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        ch.print_summary(metrics, label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        f.write(buf.getvalue())

    print("\n[merge] Done.")
    print(f" Results: {merged_path}")
    print(f" Metrics: {metrics_json}")
    print(f" Summary: {summary_txt}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
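
After merging, it is worth checking that the shards actually covered the full test set; the dedupe above only removes overlap, it cannot detect a shard that died early. A sketch, where `expected_total` is whatever the unsharded loader reports:

```python
import json
from pathlib import Path


def check_merged(merged_path: Path, expected_total: int, pk: str = "question_id") -> None:
    """Verify one row per primary key and report how many items are missing."""
    keys = [json.loads(line)[pk]
            for line in merged_path.read_text().splitlines() if line.strip()]
    assert len(keys) == len(set(keys)), "duplicate primary keys survived the merge"
    print(f"{len(keys)}/{expected_total} results present "
          f"({expected_total - len(keys)} missing)")
```
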
scripts/minicpmo_inference.py
ADDED
@@ -0,0 +1,264 @@
"""
Common inference wrapper for MiniCPM-o 4.5.

MiniCPM-o's API is `model.chat(msgs=[...], tokenizer=...)` where `msgs` is a
list of `{"role": ..., "content": [image, audio, ..., text]}`. This module
hides that detail behind `run_inference(model, tokenizer, video, audio,
prompt)` so the 6 benchmark eval scripts can share one inference code path.

Also runs the compatibility patcher on import so users who haven't run
`setup_env.sh` still get a working model.
"""

from __future__ import annotations

import os
import subprocess
import tempfile
from pathlib import Path
from typing import Any, List, Optional, Tuple

import numpy as np


# ---------------------------------------------------------------------------
# Apply transformers>=4.52 compatibility patches lazily on import.
# Safe to call multiple times; idempotent.
# ---------------------------------------------------------------------------
def _maybe_patch_once() -> None:
    try:
        from patch_minicpmo import (
            _find_modeling_file,
            _find_processing_file,
            patch_file,
            patch_processing_file,
        )
    except ImportError:
        return
    path = _find_modeling_file()
    if path is not None:
        try:
            patch_file(path)
        except Exception as exc:  # pragma: no cover
            print(f"[minicpmo] (warn) patch failed: {exc}")
    proc = _find_processing_file()
    if proc is not None:
        try:
            patch_processing_file(proc)
        except Exception as exc:  # pragma: no cover
            print(f"[minicpmo] (warn) processing patch failed: {exc}")


_maybe_patch_once()


def _max_inp_length_for_chat(model: Any, max_new_tokens: int) -> int:
    """Upper bound for ``model.chat(..., max_inp_length=...)`` (defaults to 8192).

    Many frames × per-frame image placeholders can exceed 8k text tokens; the
    processor then truncates ``input_ids`` and image start/end counts diverge,
    causing ``RuntimeError`` in ``processing_minicpmo._convert``.
    """
    reserve = int(max_new_tokens) + 1024
    best = 32768
    for cfg in (
        getattr(model, "config", None),
        getattr(getattr(model, "llm", None), "config", None),
    ):
        if cfg is None:
            continue
        npos = getattr(cfg, "max_position_embeddings", None)
        if isinstance(npos, int) and npos > 8192:
            best = min(best, max(npos - reserve, 16384))
    return best


# ---------------------------------------------------------------------------
# Frame / audio loaders
# ---------------------------------------------------------------------------
def load_video_frames(video_path: str, max_frames: int = 32,
                      fps: float = 1.0) -> List:
    """Sample PIL RGB frames uniformly from a video.

    MiniCPM-o expects a list of PIL Images (not a tensor). `fps=1.0,
    max_frames=32` covers ~32s; longer videos get sparser sampling.
    """
    from PIL import Image
    import decord

    vr = decord.VideoReader(video_path, num_threads=1)
    total_frames = len(vr)
    video_fps = vr.get_avg_fps()
    duration = total_frames / max(video_fps, 1e-6)

    target = max(int(round(fps * duration)), 2)
    target = min(target, max_frames)
    target = min(target, total_frames)

    idx = np.linspace(0, total_frames - 1, target).round().astype(int).tolist()
    frames = vr.get_batch(idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]


def load_audio_waveform(audio_path: str, target_sr: int = 16000) -> np.ndarray:
    """Load audio as float32 numpy in [-1, 1] at `target_sr`."""
    import librosa
    y, _ = librosa.load(audio_path, sr=target_sr, mono=True)
    return y.astype(np.float32)


def extract_audio_from_video(video_path: str, target_sr: int = 16000,
                             tmp_dir: Optional[str] = None) -> Optional[str]:
    """Extract the audio track from a video file to a temp .wav via ffmpeg.

    Returns the path to the .wav file, or None if the video has no audio
    track or extraction fails. Caller is responsible for cleanup.
    """
    tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="mo_audio_")
    out = os.path.join(tmp_dir, "audio.wav")
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "error", "-i", video_path,
             "-vn", "-ac", "1", "-ar", str(target_sr), out],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=120,
        )
    except Exception:
        return None
    if not os.path.isfile(out) or os.path.getsize(out) < 64:
        return None
    return out


# ---------------------------------------------------------------------------
# Model loading
# ---------------------------------------------------------------------------
def load_model(model_id: str = "openbmb/MiniCPM-o-4_5",
               device: str = "cuda",
               dtype: str = "bfloat16",
               init_audio: bool = True,
               attn_implementation: str = "flash_attention_2"):
    """Load MiniCPM-o model + tokenizer. Returns (model, tokenizer).

    Tries `attn_implementation` first; if flash_attention_2 isn't installed or
    the backbone doesn't support it, falls back to sdpa automatically.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    torch_dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16,
                   "float32": torch.float32}[dtype]

    def _try_load(attn: str):
        print(f"[minicpmo] Loading {model_id} (dtype={dtype}, device={device}, "
              f"init_audio={init_audio}, attn={attn})...")
        return AutoModel.from_pretrained(
            model_id,
            trust_remote_code=True,
            attn_implementation=attn,
            torch_dtype=torch_dtype,
            init_vision=True,
            init_audio=init_audio,
            init_tts=False,
        )

    try:
        model = _try_load(attn_implementation)
    except Exception as exc:
        if attn_implementation != "sdpa":
            print(f"[minicpmo] (warn) {attn_implementation} failed ({exc}); falling back to sdpa.")
            model = _try_load("sdpa")
        else:
            raise

    model = model.eval().to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print("[minicpmo] Model ready.")
    return model, tokenizer


# ---------------------------------------------------------------------------
# Inference
# ---------------------------------------------------------------------------
def run_inference(
    model,
    tokenizer,
    video_path: Optional[str],
    audio_path: Optional[str],
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.0,
    max_frames: int = 32,
    fps: float = 1.0,
    use_audio_from_video: bool = False,
) -> str:
    """Run MiniCPM-o chat inference.

    Args:
        video_path: optional path to an mp4/etc. file.
        audio_path: optional path to a wav file. If `use_audio_from_video` is
            True and `audio_path` is None, we extract audio from the video.
        prompt: user instruction text.
        temperature: 0 means greedy.
        use_audio_from_video: if True, extract audio from the video automatically
            (useful for WorldSense / Daily-Omni where video has embedded audio but
            no separate wav is provided).
    """
    content: List[Any] = []
    tmp_audio_dir: Optional[str] = None

    if video_path is not None:
        frames = load_video_frames(video_path, max_frames=max_frames, fps=fps)
        content.extend(frames)

    if audio_path is None and use_audio_from_video and video_path is not None:
        tmp_audio_dir = tempfile.mkdtemp(prefix="mo_audio_")
        audio_path = extract_audio_from_video(video_path, tmp_dir=tmp_audio_dir)

    if audio_path is not None:
        try:
            audio = load_audio_waveform(audio_path, target_sr=16000)
            if audio.size > 0:
                content.append(audio)
        except Exception as exc:
            print(f" [minicpmo] (warn) audio load failed: {exc}")

    content.append(prompt)

    msgs = [{"role": "user", "content": content}]

    # Critical defaults for video understanding (see MiniCPM-o 4.5 HF README
    # "Chat with Video"): without ``use_image_id=False, max_slice_nums=1`` the
    # processor treats each frame as an independent HD image, slicing it into
    # multiple sub-images with per-image ID tokens. That token distribution is
    # OOD for the video-trained model and produces degenerate output (repeated
    # training-data fragments, e.g. "the image description of the first image
    # you see as a brief description ...").
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else 1.0,
        top_p=0.9 if temperature > 0 else 1.0,
        max_inp_length=_max_inp_length_for_chat(model, max_new_tokens),
        use_tts_template=False,
        enable_thinking=False,
    )
    if video_path is not None:
        gen_kwargs["use_image_id"] = False
        gen_kwargs["max_slice_nums"] = 1
    if use_audio_from_video and video_path is not None:
        gen_kwargs.setdefault("omni_mode", True)
    try:
        res = model.chat(msgs=msgs, tokenizer=tokenizer, **gen_kwargs)
    except TypeError:
        res = model.chat(msgs=msgs, tokenizer=tokenizer)

    if tmp_audio_dir is not None:
        import shutil
        shutil.rmtree(tmp_audio_dir, ignore_errors=True)

    if isinstance(res, tuple):
        res = res[0]
    return str(res).strip()
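
Typical interactive use of this wrapper (the media path below is a placeholder):

```python
from minicpmo_inference import load_model, run_inference

model, tokenizer = load_model(init_audio=True)  # first call downloads the weights
answer = run_inference(
    model, tokenizer,
    video_path="/path/to/clip.mp4",   # placeholder
    audio_path=None,
    prompt="Describe what happens in this video in one sentence.",
    max_new_tokens=64,
    temperature=0.0,                  # greedy decoding
    max_frames=32,
    use_audio_from_video=True,        # feed the embedded audio track too
)
print(answer)
```
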
scripts/patch_minicpmo.py
ADDED
@@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""Patch MiniCPM-o 4.5 custom code in the Hugging Face modules cache.

``modeling_minicpmo.py`` (transformers >= 4.52):

1. `WhisperEncoderLayer.forward` unpacks 3 values from `self.self_attn(...)`,
   but the new `WhisperAttention.forward` returns 2 values.
2. `prepare_inputs_for_generation` reads `past_key_values.seen_tokens`, which
   was removed from `DynamicCache`.
3. `chat()` force-sets ``use_tts_template = True`` whenever audio is in the
   ``content`` list. That appends ``<|tts_bos|>`` to the assistant prefix
   and the model then generates **audio (TTS codec) ids**; decoded as text
   they look like ``<think>`` floods / gibberish. We want audio-in +
   **text-out** for benchmark eval, so respect the caller's kwarg instead.

``processing_minicpmo.py``:

4. `_convert` used ``max(len(image_start_idx), len(image_end_idx))`` when
   building ``image_bounds``; after ``max_length`` truncation start/end counts
   can differ by one and ``torch.hstack`` raises (common with many video
   frames under the default ``chat(..., max_inp_length=8192)``). Use ``min``.

Idempotent. Also downloads model code on demand so files exist before patching.
"""
from __future__ import annotations

import os
import sys
from pathlib import Path

MODEL_ID = "openbmb/MiniCPM-o-4_5"


def _find_modeling_file() -> Path | None:
    """Locate the cached modeling_minicpmo.py (matches HF's module dir naming)."""
    home = Path(os.path.expanduser("~"))
    candidates = [
        home / ".cache" / "huggingface" / "modules" / "transformers_modules",
    ]
    hits: list[Path] = []
    for root in candidates:
        if not root.exists():
            continue
        for p in root.rglob("modeling_minicpmo.py"):
            hits.append(p)
    if not hits:
        return None
    # Prefer the deepest (snapshot-hashed) one.
    hits.sort(key=lambda p: len(p.parts), reverse=True)
    return hits[0]


def _find_processing_file() -> Path | None:
    """``processing_minicpmo.py`` lives next to the cached ``modeling_minicpmo.py``."""
    modeling = _find_modeling_file()
    if modeling is None:
        return None
    proc = modeling.parent / "processing_minicpmo.py"
    return proc if proc.is_file() else None


def _download_model_code() -> None:
    """Force HF to download MiniCPM-o's custom code so the file is cached.

    We only need the Python files + config (not weights) for patching. We use
    `hf_hub_download` for the individual code files to avoid fetching the
    multi-GB safetensors shards just to patch a .py file.
    """
    try:
        from huggingface_hub import hf_hub_download
    except ImportError:
        print("[patch] huggingface_hub not installed; skipping auto-download.")
        return

    for fn in [
        "config.json",
        "configuration_minicpm.py",
        "modeling_minicpmo.py",
        "modeling_navit_siglip.py",
        "processing_minicpmo.py",
        "resampler.py",
        "utils.py",
    ]:
        try:
            hf_hub_download(repo_id=MODEL_ID, filename=fn)
        except Exception as exc:
            # Some files may not exist in every revision; that's fine.
            print(f"[patch] (warn) could not fetch {fn}: {exc}")


def patch_whisper_unpack(text: str) -> tuple[str, bool]:
    """Fix #1: WhisperAttention now returns 2 values, not 3."""
    # NOTE: the literals below must byte-match the upstream source, including
    # indentation, or the patch no-ops (handled gracefully below).
    OLD = (
        "        hidden_states, attn_weights, past_key_values = self.self_attn(\n"
        "            hidden_states=hidden_states,\n"
        "            attention_mask=attention_mask,\n"
        "            layer_head_mask=layer_head_mask,\n"
        "            output_attentions=output_attentions,\n"
        "            past_key_value=past_key_values,\n"
        "        )"
    )
    NEW = (
        "        _attn_out = self.self_attn(\n"
        "            hidden_states=hidden_states,\n"
        "            attention_mask=attention_mask,\n"
        "            layer_head_mask=layer_head_mask,\n"
        "            output_attentions=output_attentions,\n"
        "            past_key_value=past_key_values,\n"
        "        )\n"
        "        if len(_attn_out) == 3:\n"
        "            hidden_states, attn_weights, past_key_values = _attn_out\n"
        "        else:\n"
        "            hidden_states, attn_weights = _attn_out"
    )
    if NEW.split("\n", 1)[0] in text:
        return text, False  # already patched
    if OLD not in text:
        return text, False  # not applicable (different revision?)
    return text.replace(OLD, NEW), True


def patch_seen_tokens(text: str) -> tuple[str, bool]:
    """Fix #2: DynamicCache.seen_tokens was removed in newer transformers."""
    OLD = (
        "                cache_length = past_key_values.get_seq_length()\n"
        "                past_length = past_key_values.seen_tokens"
    )
    NEW = (
        "                cache_length = past_key_values.get_seq_length()\n"
        "                past_length = getattr(past_key_values, \"seen_tokens\", cache_length)"
    )
    if 'getattr(past_key_values, "seen_tokens"' in text:
        return text, False  # already patched
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_chat_force_tts_template(text: str) -> tuple[str, bool]:
    """Fix #3: don't force ``use_tts_template=True`` on audio-containing content.

    MiniCPM-o's ``chat()`` assumes "audio in implies TTS audio out". For MCQ /
    freetext eval we want a text answer; the caller's ``use_tts_template`` kwarg
    (default ``False``) must win so the assistant prefix doesn't get
    ``<|tts_bos|>`` appended (which causes the LM to emit audio-codec ids that
    look like ``<think>`` repetitions when text-decoded).
    """
    OLD = (
        '                elif isinstance(c, np.ndarray):  # audio\n'
        '                    audios.append(c)\n'
        '                    audio_parts.append(i)\n'
        '                    cur_msgs.append("<audio>./</audio>")\n'
        '                    use_tts_template = True\n'
    )
    NEW = (
        '                elif isinstance(c, np.ndarray):  # audio\n'
        '                    audios.append(c)\n'
        '                    audio_parts.append(i)\n'
        '                    cur_msgs.append("<audio>./</audio>")\n'
        '                    # PATCHED: honour caller-provided use_tts_template.\n'
        '                    # Upstream force-sets True on any audio, which makes the model\n'
        '                    # generate TTS codec ids (look like <think> noise as text).\n'
    )
    if "PATCHED: honour caller-provided use_tts_template" in text:
        return text, False
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_processor_image_bounds(text: str) -> tuple[str, bool]:
    """Fix ``image_bounds`` when start/end marker counts disagree (truncation)."""
    OLD = "        valid_image_nums = max(len(image_start_idx), len(image_end_idx))"
    NEW = (
        "        # Pair only complete spans; max() breaks torch.hstack if counts differ.\n"
        "        valid_image_nums = min(len(image_start_idx), len(image_end_idx))"
    )
    if "valid_image_nums = min(len(image_start_idx), len(image_end_idx))" in text:
        return text, False
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_file(path: Path) -> bool:
    original = path.read_text()
    text = original
    any_change = False

    text, c1 = patch_whisper_unpack(text)
    any_change |= c1
    text, c2 = patch_seen_tokens(text)
    any_change |= c2
    text, c3 = patch_chat_force_tts_template(text)
    any_change |= c3

    if any_change:
        backup = path.with_suffix(path.suffix + ".bak")
        if not backup.exists():
            backup.write_text(original)
            print(f"[patch] Backup -> {backup}")
        path.write_text(text)
        print(f"[patch] Patched {path.name}: "
              f"whisper_unpack={c1}, seen_tokens={c2}, chat_tts_template={c3}")
    else:
        print("[patch] No changes needed (already patched or unknown revision)")
    return any_change


def patch_processing_file(path: Path) -> bool:
    """Patch ``processing_minicpmo.py`` (image_bounds hstack)."""
    original = path.read_text()
    text = original
    text, c = patch_processor_image_bounds(text)
    if not c:
        print(f"[patch] {path.name}: image_bounds already patched or pattern missing")
        return False
    backup = path.with_suffix(path.suffix + ".bak")
    if not backup.exists():
        backup.write_text(original)
        print(f"[patch] Backup -> {backup}")
    path.write_text(text)
    print(f"[patch] Patched {path.name}: image_bounds min() fix")
    return True


def main() -> int:
    path = _find_modeling_file()
    if path is None:
        print("[patch] modeling_minicpmo.py not cached yet; fetching from HF...")
        _download_model_code()
        path = _find_modeling_file()
    if path is None:
        print("[patch] ERROR: could not locate modeling_minicpmo.py", file=sys.stderr)
        return 1
    print(f"[patch] Target: {path}")
    patch_file(path)

    proc = _find_processing_file()
    if proc is not None:
        print(f"[patch] Target: {proc}")
        patch_processing_file(proc)
    else:
        print("[patch] (warn) processing_minicpmo.py not found next to modeling; "
              "run once with HF cache populated")

    # Invalidate __pycache__ so the edited file is re-imported.
    import shutil
    for pc in path.parent.rglob("__pycache__"):
        shutil.rmtree(pc, ignore_errors=True)
    return 0


if __name__ == "__main__":
    sys.exit(main())
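
Each patch function returns `(text, changed)` and checks for its own marker first, so re-running is safe. That property can be exercised offline, without touching the HF cache; a minimal self-test sketch against the processor fix:

```python
from patch_minicpmo import patch_processor_image_bounds

sample = "        valid_image_nums = max(len(image_start_idx), len(image_end_idx))"
once, changed1 = patch_processor_image_bounds(sample)
twice, changed2 = patch_processor_image_bounds(once)
assert changed1 and not changed2  # applies exactly once, then no-ops
assert once == twice
```
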
scripts/test_minicpmo.py
ADDED
@@ -0,0 +1,62 @@
#!/usr/bin/env python3
"""
Sanity check: load MiniCPM-o 4.5 and run a single sample through it.

Picks one video from the sync eval set, passes video + audio + prompt, and
prints the model's response.
"""

from __future__ import annotations

import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))
from minicpmo_inference import load_model, run_inference


def main():
    # Pick the first original video in the sync eval set
    original_root = Path("/opt/dlami/nvme/video_source/original/uag_oops")
    audio_root = Path("/opt/dlami/nvme/video_source/extracted_audio/original/uag_oops")

    videos = sorted(original_root.glob("*.mp4"))
    if not videos:
        print(f"ERROR: no videos found at {original_root}")
        sys.exit(1)

    video_path = videos[0]
    audio_path = audio_root / f"{video_path.stem}.wav"
    if not audio_path.exists():
        print(f"ERROR: audio not found for {video_path.name}")
        sys.exit(1)

    print(f"Video: {video_path}")
    print(f"Audio: {audio_path}")
    print()

    model, tokenizer = load_model()

    prompt = (
        "Watch this video and listen to its audio carefully. "
        "Determine whether the audio and video tracks are synchronized. "
        "Explain your reasoning."
    )

    print("=== Running inference ===")
    response = run_inference(
        model, tokenizer,
        video_path=str(video_path),
        audio_path=str(audio_path),
        prompt=prompt,
        max_new_tokens=128,
        temperature=0.0,
    )
    print()
    print("=== Response ===")
    print(response)


if __name__ == "__main__":
    main()
scripts/upload_to_hf_model.py
ADDED
@@ -0,0 +1,84 @@
#!/usr/bin/env python3
"""Create or update a Hugging Face **model** repo with this evaluation codebase.

This upload is **code and docs only** (no MiniCPM-o weights). HF allows model
repos to host auxiliary artifacts; use ``--private`` if you do not want the
scripts public.

Prerequisites::

    pip install huggingface_hub
    export HF_TOKEN=hf_...      # or: huggingface-cli login

Usage::

    cd MiniCPM-Evaluation
    python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation

Private repo::

    python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
"""
from __future__ import annotations

import argparse
import sys
from pathlib import Path


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--repo-id",
        required=True,
        help="HF repository id, e.g. username/MiniCPM-Evaluation",
    )
    p.add_argument(
        "--private",
        action="store_true",
        help="Create the repo as private (only for first create; ignored if repo exists).",
    )
    p.add_argument(
        "--token",
        default=None,
        help="HF token (default: HF_TOKEN env or cached huggingface-cli login).",
    )
    args = p.parse_args()

    root = Path(__file__).resolve().parent.parent
    if not (root / "README.md").is_file():
        print(f"error: unexpected layout; expected README.md under {root}", file=sys.stderr)
        return 1

    try:
        from huggingface_hub import HfApi, create_repo
    except ImportError:
        print("error: install huggingface_hub: pip install huggingface_hub", file=sys.stderr)
        return 1

    api = HfApi(token=args.token)
    create_repo(
        repo_id=args.repo_id,
        repo_type="model",
        private=args.private,
        exist_ok=True,
        token=args.token,  # pass --token through; otherwise HF_TOKEN / cached login is used
    )
    api.upload_folder(
        folder_path=str(root),
        repo_id=args.repo_id,
        repo_type="model",
        ignore_patterns=[
            ".git/**",
            ".git",
            "**/__pycache__/**",
            "**/*.pyc",
            "**/.DS_Store",
        ],
    )
    print(f"Uploaded: {root}")
    print(f"URL: https://huggingface.co/{args.repo_id}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
setup_env.sh
ADDED
@@ -0,0 +1,80 @@
#!/usr/bin/env bash
# MiniCPM-o 4.5 evaluation environment setup.
#
# Creates a separate conda env 'minicpmo' because MiniCPM-o has its own
# dependency stack (librosa, decord, sentencepiece pin, etc.) that may conflict
# with the Qwen3-Omni 'video' env. Safer to keep them isolated.
#
# Usage:
#   bash setup_env.sh
#
set -euo pipefail

CONDA_ENV="${CONDA_ENV:-minicpmo}"
PYTHON_VER="${PYTHON_VER:-3.12}"
INSTALL_DIR="${INSTALL_DIR:-${HOME}/anaconda3}"

log() { echo "[setup_env] $*"; }

log "Bootstrapping conda..."
if ! command -v conda &>/dev/null; then
  if [[ -f "${INSTALL_DIR}/etc/profile.d/conda.sh" ]]; then
    source "${INSTALL_DIR}/etc/profile.d/conda.sh"
  else
    echo "Error: conda not found. Install Anaconda first (see CleverHans-Evaluation/setup_env.sh)."
    exit 1
  fi
fi
eval "$(conda shell.bash hook)"

log "Creating conda env '${CONDA_ENV}' (python=${PYTHON_VER})..."
if conda env list | awk '{print $1}' | grep -Fxq "${CONDA_ENV}"; then
  log "Env '${CONDA_ENV}' already exists; activating."
  conda activate "${CONDA_ENV}"
else
  conda create -n "${CONDA_ENV}" "python=${PYTHON_VER}" -y
  conda activate "${CONDA_ENV}"
fi

log "Installing FFmpeg 6 (for audio/video decoding)..."
conda install -y -c conda-forge 'ffmpeg>=6,<7' || log "Warning: conda-forge ffmpeg failed."

log "Installing PyTorch 2.6 (MiniCPM-o stable target; newer torch may work)..."
pip install --upgrade pip
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

log "Installing MiniCPM-o core dependencies..."
# MiniCPM-o 4.5 uses Qwen3Config (needs transformers >=4.52).
pip install 'transformers>=4.52,<4.58' accelerate==0.33.0
pip install Pillow==10.4.0
pip install sentencepiece==0.2.0
pip install decord==0.6.0 librosa==0.10.2 soundfile==0.12.1 moviepy==1.0.3
pip install vocos==0.1.0
pip install huggingface_hub==0.26.5
pip install einops==0.8.0
pip install tqdm openai

# CleverHans-Evaluation loaders used by MiniCPM-o eval scripts (imported via _common.ch):
#   - eval_worldsense.py → pandas + pyarrow (parquet)
#   - eval_videomme.py   → datasets (lmms-lab/Video-MME)
#   - eval_lvbench.py    → datasets (lmms-lab/LVBench)
log "Installing eval data-loader deps (datasets, pandas, pyarrow)..."
pip install datasets pandas pyarrow

# MiniCPM-o 4.5 custom modeling file imports 'minicpmo' (PyPI package) for TTS utils.
# The package drags in cosyvoice + stepaudio2 which need these downstream deps.
pip install minicpmo==0.1.2
pip install onnx onnxruntime hyperpyyaml diffusers

log "Patching MiniCPM-o modeling file for transformers>=4.52 compatibility..."
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python "${SCRIPT_DIR}/scripts/patch_minicpmo.py" || log "Warning: patch_minicpmo.py failed (non-fatal; see errors above)."

log "Done."
echo ""
echo "  Active env: ${CONDA_ENV}"
echo "  Python:     $(command -v python)"
echo ""
echo "Next: conda activate ${CONDA_ENV}"
echo "      Then try: python scripts/test_minicpmo.py"