Rakancorle11 committed on
Commit b2c2640 · verified · Parent: b2adc53

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,82 @@
+ # MiniCPM-o 4.5 Evaluation
+
+ Evaluation scripts for `openbmb/MiniCPM-o-4_5` on the same 6 benchmarks as `CleverHans-Evaluation`:
+
+ - Sync (DPO test set: synced / delay / early)
+ - VGGSoundSync (3k freetext)
+ - VideoMME (MCQ A/B/C/D)
+ - LVBench (MCQ)
+ - WorldSense (MCQ)
+ - Daily-Omni (MCQ)
+
+ ## Why a separate folder
+
+ MiniCPM-o 4.5 has a completely different architecture (SigLip2 + Whisper + Qwen3-8B, 9B params) and API (`model.chat(msgs=...)` style) from Qwen3-Omni (`generate()` + `qwen_omni_utils`), so sharing inference code is impractical; data loading and metrics can still be reused from the other repo.
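+
+ A minimal sketch of the MiniCPM-o call style (assuming the 4.5 checkpoint keeps the published MiniCPM-o/V `chat` API; the exact kwargs used by this repo live in `scripts/minicpmo_inference.py` and may differ):
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained(
+     "openbmb/MiniCPM-o-4_5", trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+ ).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained(
+     "openbmb/MiniCPM-o-4_5", trust_remote_code=True)
+
+ # Placeholder frames; real ones are decoded from the benchmark video.
+ frames = [Image.open("frame_000.jpg"), Image.open("frame_001.jpg")]
+ msgs = [{"role": "user", "content": frames + ["Describe the clip."]}]
+ answer = model.chat(msgs=msgs, tokenizer=tokenizer)
+ ```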
+
+ ## Setup
+
+ ```bash
+ bash setup_env.sh   # install MiniCPM-o dependencies in conda env 'minicpmo'
+ ```
+
+ ## Layout
+
+ ```
+ MiniCPM-Evaluation/
+ ├── README.md
+ ├── setup_env.sh
+ └── scripts/
+     ├── minicpmo_inference.py   # common inference wrapper
+     ├── test_minicpmo.py        # quick sanity check (single sample)
+     ├── eval_videomme.py        # per-benchmark evaluators
+     ├── eval_lvbench.py
+     ├── eval_worldsense.py
+     ├── eval_daily_omni.py
+     ├── eval_vggsoundsync.py
+     └── eval_dpo_sync.py
+ ```
+
+ ## Quick Start
+
+ ```bash
+ conda activate minicpmo
+ cd /home/ubuntu/MiniCPM-Evaluation
+
+ # 1. Sanity check: single-sample inference
+ python scripts/test_minicpmo.py
+
+ # 2. Run a full benchmark (e.g. Daily-Omni)
+ python scripts/eval_daily_omni.py \
+     --data-dir /opt/dlami/nvme/daily_omni \
+     --output-dir /home/ubuntu/eval_results/daily_omni \
+     --label do_minicpmo_45
+ ```
+
+ ## Publish to Hugging Face (model repo)
+
+ This tree is **evaluation code only** (no model weights). You can still host it
+ under a Hugging Face **model** repo as a snapshot (e.g. next to weight releases).
+
+ ```bash
+ pip install huggingface_hub
+ export HF_TOKEN=hf_...   # or: huggingface-cli login
+ cd MiniCPM-Evaluation
+ python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation
+ ```
+
+ Private repo:
+
+ ```bash
+ python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
+ ```
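+
+ For reference, a minimal version of what `upload_to_hf_model.py` presumably does, using the documented `huggingface_hub` API (a sketch, not the script's actual contents):
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()  # picks up HF_TOKEN from the environment
+ api.create_repo("YourUsername/MiniCPM-Evaluation", repo_type="model",
+                 private=False, exist_ok=True)
+ api.upload_folder(repo_id="YourUsername/MiniCPM-Evaluation",
+                   repo_type="model", folder_path=".")
+ ```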
+
+ ## Data paths (reused from CleverHans-Evaluation)
+
+ | Benchmark | Path |
+ |---|---|
+ | Sync videos | `/opt/dlami/nvme/video_source/{original,random_shift_video,extracted_audio}` |
+ | VGGSoundSync | `/opt/dlami/nvme/vggsoundsync_test/` |
+ | VideoMME | `/opt/dlami/nvme/videomme/data/data/` |
+ | LVBench | `/opt/dlami/nvme/lvbench/` |
+ | WorldSense | `/opt/dlami/nvme/worldsense/` |
+ | Daily-Omni | `/opt/dlami/nvme/daily_omni/` |
run_all.sh ADDED
@@ -0,0 +1,151 @@
+ #!/usr/bin/env bash
+ # Run all 6 benchmarks for MiniCPM-o 4.5 with 4-GPU data parallelism.
+ #
+ # For each bench, launches NUM_SHARDS python workers simultaneously (one per
+ # GPU); each processes 1/NUM_SHARDS of the samples. After all shards finish,
+ # merge_shards.py aggregates the per-shard jsonls and computes metrics.
+ # Only ONE bench runs at a time; benches run sequentially.
+ #
+ # Two sync benches use freetext + GPT judge (matches Qwen3-Omni reference).
+ #
+ # Usage:
+ #   export OPENAI_API_KEY=sk-...
+ #   bash run_all.sh
+ #
+ # Override via env vars, e.g.:
+ #   CUDA_VISIBLE_DEVICES=4,5,6,7 LABEL=minicpmo_ckpt200 bash run_all.sh
+ #   NUM_SHARDS=2 CUDA_VISIBLE_DEVICES=6,7 bash run_all.sh
+
+ set -uo pipefail  # no -e: one bench failure shouldn't block the rest
+
+ # ── Config ─────────────────────────────────────────────────────────────────────
+ export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
+ MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
+ LABEL="${LABEL:-minicpmo_4_5}"
+ SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
+ CONDA_ENV="${CONDA_ENV:-minicpmo}"
+
+ # Data parallel: how many shards (= number of GPUs to use)
+ IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
+ NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"
+
+ # Data paths (match Qwen3-Omni reference config)
+ DATA_ROOT="${DATA_ROOT:-/opt/dlami/nvme/video_source}"
+ SYNC_TEST_JSONL="${SYNC_TEST_JSONL:-/home/ubuntu/CleverHans-Evaluation/data/kto_training_data_v2_test.jsonl}"
+ VGG_TEST_JSONL="${VGG_TEST_JSONL:-/opt/dlami/nvme/vggsoundsync_test/test_3k.jsonl}"
+ WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
+ DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
+ VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
+ LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"
+
+ EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"
+
+ # ── Conda ──────────────────────────────────────────────────────────────────────
+ if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
+   source "${HOME}/anaconda3/etc/profile.d/conda.sh"
+ fi
+ conda activate "${CONDA_ENV}"
+
+ echo "=== Model: $MODEL | Label: $LABEL"
+ echo "=== GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"
+
+ # ── Helper: run one bench with data-parallel sharding ─────────────────────────
+ #   $1  -- bench short name (matches merge_shards.py --bench)
+ #   $2  -- eval script path
+ #   $3  -- full label (e.g. sync_minicpmo_4_5)
+ #   $4  -- output-dir root (e.g. $EVAL_ROOT/sync)
+ #   $5+ -- extra args passed to each eval script
+ run_bench_dp() {
+   local bench="$1"; shift
+   local script="$1"; shift
+   local full_label="$1"; shift
+   local out_root="$1"; shift
+   local label_dir="${out_root}/${full_label}"
+   mkdir -p "${label_dir}/logs"
+
+   echo ""
+   echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
+   local pids=()
+   for (( i=0; i<NUM_SHARDS; i++ )); do
+     local gpu="${GPU_ARR[$i]}"
+     local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
+     echo "  → shard $i on GPU $gpu (log: $log)"
+     CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
+       "$@" \
+       --output-dir "$out_root" \
+       --label "$full_label" \
+       --shard "$i" --num-shards "$NUM_SHARDS" \
+       > "$log" 2>&1 &
+     pids+=($!)
+   done
+
+   # Wait for all shard workers
+   local fail=0
+   for pid in "${pids[@]}"; do
+     wait "$pid" || fail=$((fail+1))
+   done
+   if (( fail > 0 )); then
+     echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
+   fi
+
+   # Merge
+   echo "  → merging shards ..."
+   python "$SCRIPTS/merge_shards.py" \
+     --bench "$bench" \
+     --label-dir "$label_dir" || echo "  !! merge failed"
+ }
+
+ # ── 1/6  Sync (in-domain) — freetext + GPT judge ────────────────────────────
+ run_bench_dp dpo_sync "$SCRIPTS/eval_dpo_sync.py" \
+   "sync_${LABEL}" "$EVAL_ROOT/sync" \
+   --model-id "$MODEL" \
+   --data-root "$DATA_ROOT" \
+   --test-jsonl "$SYNC_TEST_JSONL" \
+   --gpt-judge
+
+ # ── 2/6  VGGSoundSync — freetext + GPT judge ────────────────────────────────
+ run_bench_dp vggsoundsync "$SCRIPTS/eval_vggsoundsync.py" \
+   "vggsync_freetext_${LABEL}_3k" "$EVAL_ROOT/vggsoundsync" \
+   --model-id "$MODEL" \
+   --test-jsonl "$VGG_TEST_JSONL" \
+   --mode freetext --gpt-judge
+
+ # ── 3/6  WorldSense ───────────────────────────────────────────────────────────
+ run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
+   "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
+   --model-id "$MODEL" \
+   --data-dir "$WORLDSENSE_DIR" \
+   --max-samples -1
+
+ # ── 4/6  Daily-Omni ───────────────────────────────────────────────────────────
+ run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
+   "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
+   --model-id "$MODEL" \
+   --data-dir "$DAILY_OMNI_DIR" \
+   --max-samples -1
+
+ # ── 5/6  Video-MME ────────────────────────────────────────────────────────────
+ run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
+   "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
+   --model-id "$MODEL" \
+   --video-dir "$VIDEOMME_DIR" \
+   --max-samples -1
+
+ # ── 6/6  LVBench ──────────────────────────────────────────────────────────────
+ run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
+   "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
+   --model-id "$MODEL" \
+   --video-dir "$LVBENCH_DIR" \
+   --max-samples -1
+
+ echo ""
+ echo "=== All done: $LABEL ==="
+ for b_out in \
+   "$EVAL_ROOT/sync/sync_${LABEL}" \
+   "$EVAL_ROOT/vggsoundsync/vggsync_freetext_${LABEL}_3k" \
+   "$EVAL_ROOT/worldsense/ws_${LABEL}" \
+   "$EVAL_ROOT/daily_omni/do_${LABEL}" \
+   "$EVAL_ROOT/videomme/vmme_${LABEL}" \
+   "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
+   echo "  ${b_out}/metrics.json"
+ done
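For readers who think in Python, `run_bench_dp`'s fan-out/fan-in maps onto the sketch below (the repo does this in bash; the helper name and log layout mirror the script above, everything else is illustrative):

```python
# One worker per GPU, each pinned via CUDA_VISIBLE_DEVICES and handed its shard
# index; wait() collects exit codes so failed shards can be reported.
import os
import subprocess

def run_bench_dp(script, gpus, extra_args, out_root, label):
    os.makedirs(f"{out_root}/{label}/logs", exist_ok=True)
    procs = []
    for shard, gpu in enumerate(gpus):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        log = open(f"{out_root}/{label}/logs/shard{shard}of{len(gpus)}.log", "w")
        procs.append(subprocess.Popen(
            ["python", script, *extra_args,
             "--output-dir", out_root, "--label", label,
             "--shard", str(shard), "--num-shards", str(len(gpus))],
            env=env, stdout=log, stderr=subprocess.STDOUT))
    return [p.wait() for p in procs]  # nonzero entries = failed shards
```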
run_video_rerun.sh ADDED
@@ -0,0 +1,94 @@
+ #!/usr/bin/env bash
+ # Re-run the four benchmarks that were broken by the missing video-chat flags
+ # (`use_image_id=False, max_slice_nums=1`) and the audio -> TTS template
+ # force-enable. After the patches in `patch_minicpmo.py` and
+ # `minicpmo_inference.py`, these benches now produce real MCQ answers
+ # (smoke-test verified).
+ #
+ # Usage: CUDA_VISIBLE_DEVICES=4,5,6,7 bash run_video_rerun.sh
+
+ set -uo pipefail
+
+ export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
+ MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
+ LABEL="${LABEL:-minicpmo_4_5}"
+ SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
+ CONDA_ENV="${CONDA_ENV:-minicpmo}"
+
+ IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
+ NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"
+
+ WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
+ DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
+ VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
+ LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"
+ EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"
+
+ if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
+   source "${HOME}/anaconda3/etc/profile.d/conda.sh"
+ fi
+ conda activate "${CONDA_ENV}"
+
+ echo "=== Re-running video benches with fixed inference"
+ echo "=== Model: $MODEL | Label: $LABEL | GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"
+
+ run_bench_dp() {
+   local bench="$1"; shift
+   local script="$1"; shift
+   local full_label="$1"; shift
+   local out_root="$1"; shift
+   local label_dir="${out_root}/${full_label}"
+   mkdir -p "${label_dir}/logs"
+
+   echo ""
+   echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
+   local pids=()
+   for (( i=0; i<NUM_SHARDS; i++ )); do
+     local gpu="${GPU_ARR[$i]}"
+     local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
+     echo "  -> shard $i on GPU $gpu (log: $log)"
+     CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
+       "$@" \
+       --output-dir "$out_root" \
+       --label "$full_label" \
+       --shard "$i" --num-shards "$NUM_SHARDS" \
+       > "$log" 2>&1 &
+     pids+=($!)
+   done
+   local fail=0
+   for pid in "${pids[@]}"; do
+     wait "$pid" || fail=$((fail+1))
+   done
+   if (( fail > 0 )); then
+     echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
+   fi
+   echo "  -> merging shards ..."
+   python "$SCRIPTS/merge_shards.py" \
+     --bench "$bench" \
+     --label-dir "$label_dir" || echo "  !! merge failed"
+ }
+
+ run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
+   "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
+   --model-id "$MODEL" --data-dir "$WORLDSENSE_DIR" --max-samples -1
+
+ run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
+   "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
+   --model-id "$MODEL" --data-dir "$DAILY_OMNI_DIR" --max-samples -1
+
+ run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
+   "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
+   --model-id "$MODEL" --video-dir "$VIDEOMME_DIR" --max-samples -1
+
+ run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
+   "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
+   --model-id "$MODEL" --video-dir "$LVBENCH_DIR" --max-samples -1
+
+ echo ""
+ echo "=== Rerun done: $LABEL ==="
+ for d in \
+   "$EVAL_ROOT/worldsense/ws_${LABEL}" \
+   "$EVAL_ROOT/daily_omni/do_${LABEL}" \
+   "$EVAL_ROOT/videomme/vmme_${LABEL}" \
+   "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
+   echo "  ${d}/metrics.json"
+ done
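The two flags named in the header are MiniCPM-o/V `chat()` kwargs for video input. A sketch of the patched call (the real call site is `scripts/minicpmo_inference.py`; treat this as illustrative, not a copy):

```python
# Video chat per the MiniCPM-V/o examples: frames are plain PIL images in one
# user turn; use_image_id=False and max_slice_nums=1 disable the per-image ID
# tags and high-resolution slicing that only make sense for single still images.
answer = model.chat(
    msgs=[{"role": "user", "content": frames + [question]}],
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
)
```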
scripts/_common.py ADDED
@@ -0,0 +1,66 @@
+ """Shared glue for all MiniCPM-o eval scripts.
+
+ Loads the CleverHans-Evaluation counterpart scripts under aliased module
+ names (prefixed with `ch_`), so the MiniCPM-o eval scripts can import their
+ data loaders / metric functions without filename shadowing.
+
+ Usage in an eval script:
+
+     import _common  # noqa: F401
+     from _common import ch  # loader that returns ch_eval_<name> modules
+
+     ch_videomme = ch("videomme")
+     data = ch_videomme.load_videomme(...)
+ """
+ from __future__ import annotations
+
+ import importlib.util
+ import os
+ import sys
+ import types
+ from pathlib import Path
+
+ _HERE = Path(__file__).resolve().parent
+ _CLEVERHANS_SCRIPTS = Path(
+     os.environ.get(
+         "CLEVERHANS_SCRIPTS",
+         "/home/ubuntu/CleverHans-Evaluation/scripts",
+     )
+ ).resolve()
+
+ # Make local (MiniCPM-o) modules importable without package setup.
+ if str(_HERE) not in sys.path:
+     sys.path.insert(0, str(_HERE))
+
+ _CACHE: dict[str, types.ModuleType] = {}
+
+
+ def ch(short_name: str) -> types.ModuleType:
+     """Load a CleverHans-Evaluation script by short name (e.g. 'videomme',
+     'lvbench', 'dpo_sync') and return the module object.
+
+     The module is registered under the alias `ch_eval_<short_name>` so it
+     doesn't collide with same-named files in this directory.
+     """
+     if short_name in _CACHE:
+         return _CACHE[short_name]
+
+     script_path = _CLEVERHANS_SCRIPTS / f"eval_{short_name}.py"
+     if not script_path.is_file():
+         raise FileNotFoundError(
+             f"CleverHans-Evaluation script not found: {script_path}\n"
+             f"Set the CLEVERHANS_SCRIPTS env var to the correct directory."
+         )
+
+     alias = f"ch_eval_{short_name}"
+     spec = importlib.util.spec_from_file_location(alias, str(script_path))
+     if spec is None or spec.loader is None:
+         raise ImportError(f"Could not create spec for {script_path}")
+     module = importlib.util.module_from_spec(spec)
+     sys.modules[alias] = module
+     spec.loader.exec_module(module)
+     _CACHE[short_name] = module
+     return module
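`ch()` leans on `importlib.util` so two files with the same basename can coexist as distinct modules. A self-contained illustration of the aliasing trick (paths here are hypothetical):

```python
import importlib.util
import sys

def load_as(alias: str, path: str):
    """Load `path` as a module registered under `alias` in sys.modules."""
    spec = importlib.util.spec_from_file_location(alias, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module  # register before exec_module runs the file
    spec.loader.exec_module(module)
    return module

# Both files are named eval_videomme.py, but the aliases keep them apart.
local = load_as("eval_videomme", "scripts/eval_videomme.py")
remote = load_as("ch_eval_videomme",
                 "/home/ubuntu/CleverHans-Evaluation/scripts/eval_videomme.py")
```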
scripts/eval_daily_omni.py ADDED
@@ -0,0 +1,197 @@
+ #!/usr/bin/env python3
+ """Evaluate MiniCPM-o 4.5 on Daily-Omni.
+
+ Daily-Omni videos include embedded audio; we extract it and feed both frames
+ and waveform to MiniCPM-o.
+ """
+ from __future__ import annotations
+
+ import _common  # must come first: adds this dir to sys.path and provides ch()
+
+ import argparse
+ import contextlib
+ import gc
+ import io
+ import json
+ import traceback
+ from pathlib import Path
+
+ import torch
+ from tqdm import tqdm
+
+ ch = _common.ch("daily_omni")
+ load_daily_omni = ch.load_daily_omni
+ extract_answer = ch.extract_answer
+ compute_metrics = ch.compute_metrics
+ print_summary = ch.print_summary
+ DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
+ DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
+
+ from minicpmo_inference import load_model, run_inference
+
+
+ def parse_args() -> argparse.Namespace:
+     p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Daily-Omni.")
+     p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
+     p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
+     p.add_argument("--output-dir", type=Path,
+                    default=Path("/home/ubuntu/eval_results/daily_omni_minicpmo"))
+     p.add_argument("--max-samples", type=int, default=-1)
+     p.add_argument("--max-new-tokens", type=int, default=32)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--label", type=str, default="minicpmo_daily_omni")
+     p.add_argument("--max-frames", type=int, default=64)
+     p.add_argument("--fps", type=float, default=1.0)
+     p.add_argument("--attn", type=str, default="flash_attention_2",
+                    choices=["sdpa", "flash_attention_2", "eager"])
+     p.add_argument("--no-audio", action="store_true",
+                    help="Video-only mode (skip audio extraction).")
+     p.add_argument(
+         "--skip-audio-durations",
+         type=str,
+         default="",
+         help=(
+             "Comma-separated `video_duration` values from the dataset for which "
+             "audio is omitted (video-only for those clips). Useful when "
+             "MiniCPM-o forward fails on some lengths with audio+vision "
+             '(e.g. empty `raw_output` and log errors like "Expected size 122 '
+             'but got size 121"). Example: --skip-audio-durations 60s'
+         ),
+     )
+     # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
+     p.add_argument("--vllm", action="store_true", default=False,
+                    help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
+     p.add_argument("--tp", type=int, default=None)
+     p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
+     p.add_argument("--max-model-len", type=int, default=65536)
+     p.add_argument("--batch-size", type=int, default=32)
+     # Data-parallel sharding
+     p.add_argument("--shard", type=int, default=0)
+     p.add_argument("--num-shards", type=int, default=1)
+     return p.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     out_dir = args.output_dir / args.label
+     out_dir.mkdir(parents=True, exist_ok=True)
+     shard_suffix = (f".shard{args.shard}of{args.num_shards}"
+                     if args.num_shards > 1 else "")
+     results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
+     metrics_json = out_dir / "metrics.json"
+     summary_txt = out_dir / "summary.txt"
+
+     if args.vllm:
+         print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
+               "supported upstream yet; falling back to transformers.")
+     print("[data] Loading Daily-Omni dataset...")
+     test_data = load_daily_omni(args.data_dir, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
+     else:
+         print(f"[data] {len(test_data)} questions ready")
+
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["question_id"])
+         print(f"[resume] {len(processed)} already processed")
+
+     model, tokenizer = load_model(
+         args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
+     )
+
+     skip_audio_durs = {
+         x.strip()
+         for x in args.skip_audio_durations.split(",")
+         if x.strip()
+     }
+
+     for item in tqdm(test_data, desc="Daily-Omni", unit="q"):
+         if item["question_id"] in processed:
+             continue
+         use_audio = not args.no_audio and (
+             item.get("video_duration", "") not in skip_audio_durs
+         )
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=None,
+                 prompt=item["prompt"],
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+                 use_audio_from_video=use_audio,
+             )
+         except Exception as exc:
+             print(f"  [error] {item['question_id']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = extract_answer(raw_output)
+         result = {
+             "question_id": item["question_id"],
+             "video_id": item["video_id"],
+             "question_type": item.get("question_type", ""),
+             "content_parent_category": item.get("content_parent_category", ""),
+             "content_fine_category": item.get("content_fine_category", ""),
+             "video_category": item.get("video_category", ""),
+             "video_duration": item.get("video_duration", ""),
+             "question": item["question"],
+             "choices": item["choices"],
+             "gt_answer": item["gt_answer"],
+             "pred_answer": pred,
+             "raw_output": raw_output,
+         }
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["question_id"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench daily_omni --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "data_dir": str(args.data_dir),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+         "no_audio": args.no_audio,
+         "skip_audio_durations": sorted(skip_audio_durs),
+     }
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
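All six evaluators share the crash-safe resume idiom visible above: each finished sample is appended to the results JSONL immediately, and a restart re-reads the file to skip completed ids. Distilled into one helper (the id key is `question_id` here; other benches key on `uid` or `video`):

```python
import json
from pathlib import Path

def resume_loop(items, results_jsonl: Path, id_key, run_one):
    done = set()
    if results_jsonl.exists():  # prior rows mark work already finished
        with results_jsonl.open(encoding="utf-8") as f:
            done = {json.loads(line)[id_key] for line in f}
    for item in items:
        if item[id_key] in done:
            continue
        row = run_one(item)
        with results_jsonl.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")  # append per sample
```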
scripts/eval_dpo_sync.py ADDED
@@ -0,0 +1,205 @@
+ #!/usr/bin/env python3
+ """Evaluate MiniCPM-o 4.5 on the in-domain DPO sync test set.
+
+ Reuses CleverHans-Evaluation's eval_dpo_sync.py for data loading, GT parsing,
+ the regex prediction extractor, the optional GPT judge, and metrics. Only the
+ inference path is replaced with MiniCPM-o.
+ """
+ from __future__ import annotations
+
+ import _common
+
+ import argparse
+ import contextlib
+ import gc
+ import io
+ import json
+ import traceback
+ from pathlib import Path
+
+ import torch
+ from tqdm import tqdm
+
+ ch = _common.ch("dpo_sync")
+ EVAL_PROMPT = ch.EVAL_PROMPT
+ load_test_data = ch.load_test_data
+ set_data_root = ch.set_data_root
+ extract_prediction = ch.extract_prediction
+ gpt_extract_prediction = ch.gpt_extract_prediction
+ _get_openai_client = ch._get_openai_client
+ compute_metrics = ch.compute_metrics
+ print_summary = ch.print_summary
+
+ from minicpmo_inference import load_model, run_inference
+
+
+ def parse_args() -> argparse.Namespace:
+     p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on DPO sync test set.")
+     p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
+     p.add_argument("--data-root", type=Path,
+                    default=Path("/opt/dlami/nvme/video_source"))
+     p.add_argument("--test-jsonl", type=Path, default=None,
+                    help="Default: <data-root>/kto_training_data_v2_test.jsonl")
+     p.add_argument("--output-dir", type=Path,
+                    default=Path("/home/ubuntu/eval_results/sync_minicpmo"))
+     p.add_argument("--max-samples", type=int, default=-1)
+     p.add_argument("--max-new-tokens", type=int, default=256)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--label", type=str, default="minicpmo_sync")
+     p.add_argument("--max-frames", type=int, default=32,
+                    help="Sync clips are short (<30s); 32 frames is plenty.")
+     p.add_argument("--fps", type=float, default=2.0)
+     p.add_argument("--attn", type=str, default="flash_attention_2",
+                    choices=["sdpa", "flash_attention_2", "eager"])
+     # vLLM flags: accepted for CLI parity with Qwen3-Omni. MiniCPM-o 4.5
+     # multimodal vLLM support is not yet available upstream, so these are
+     # currently a no-op (we always run transformers). Kept so the same
+     # run_*.sh scripts work across the two models.
+     p.add_argument("--vllm", action="store_true", default=False,
+                    help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
+     p.add_argument("--tp", type=int, default=None)
+     p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
+     p.add_argument("--max-model-len", type=int, default=65536)
+     p.add_argument("--gpt-judge", action="store_true", default=False)
+     p.add_argument("--openai-api-key", type=str, default=None)
+     p.add_argument("--gpt-model", type=str, default="gpt-5.4")
+     # Data-parallel sharding
+     p.add_argument("--shard", type=int, default=0)
+     p.add_argument("--num-shards", type=int, default=1)
+     return p.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     set_data_root(args.data_root)
+     test_jsonl = args.test_jsonl or (args.data_root / "kto_training_data_v2_test.jsonl")
+
+     if args.vllm:
+         print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
+               "supported upstream yet; falling back to transformers.")
+     if args.gpt_judge:
+         if _get_openai_client(args.openai_api_key) is None:
+             print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
+             raise SystemExit(1)
+
+     out_dir = args.output_dir / args.label
+     out_dir.mkdir(parents=True, exist_ok=True)
+     shard_suffix = (f".shard{args.shard}of{args.num_shards}"
+                     if args.num_shards > 1 else "")
+     results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
+     metrics_json = out_dir / "metrics.json"
+     summary_txt = out_dir / "summary.txt"
+
+     test_data = load_test_data(test_jsonl, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples")
+     else:
+         print(f"[data] {len(test_data)} test samples")
+
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["video"])
+         print(f"[resume] {len(processed)} already processed")
+
+     def _do_extract(raw_output: str):
+         if args.gpt_judge and raw_output:
+             gpt_pred = gpt_extract_prediction(
+                 raw_output, api_key=args.openai_api_key, model=args.gpt_model,
+             )
+             if gpt_pred is not None:
+                 return gpt_pred
+         return extract_prediction(raw_output)
+
+     model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
+                                   init_audio=True)
+
+     for item in tqdm(test_data, desc="Sync", unit="sample"):
+         if item["video"] in processed:
+             continue
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=item["audio_path"],
+                 prompt=EVAL_PROMPT,
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+             )
+         except Exception as exc:
+             print(f"  [error] {item['video']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = _do_extract(raw_output)
+         result = {
+             "video": item["video"],
+             "video_path": item["video_path"],
+             "gt_synced": item["gt_synced"],
+             "gt_direction": item["gt_direction"],
+             "gt_offset_sec": item["gt_offset_sec"],
+             "gt_t_v": item["gt_t_v"],
+             "gt_t_a": item["gt_t_a"],
+             "pred_synced": pred["pred_synced"],
+             "pred_direction": pred["pred_direction"],
+             "pred_offset_sec": pred["pred_offset_sec"],
+             "pred_t_v": pred.get("pred_t_v"),
+             "pred_t_a": pred.get("pred_t_a"),
+             "pred_explanation": pred.get("pred_explanation", ""),
+             "parse_method": pred["parse_method"],
+             "raw_output": raw_output,
+         }
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["video"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench dpo_sync --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "data_root": str(args.data_root),
+         "test_jsonl": str(test_jsonl),
+         "total_test_samples": len(test_data),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+         "gpt_judge": args.gpt_judge,
+         "gpt_model": args.gpt_model if args.gpt_judge else None,
+     }
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
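`_do_extract` above is a two-tier parser: when the judge is enabled and there is output, ask GPT to structure the free-text answer; otherwise (or when the judge returns `None`) fall back to the deterministic regex extractor. The shape, isolated:

```python
from typing import Callable, Optional

def two_tier_extract(raw: str, use_judge: bool,
                     judge: Callable[[str], Optional[dict]],
                     regex: Callable[[str], dict]) -> dict:
    if use_judge and raw:
        parsed = judge(raw)  # may be None on refusal or parse failure
        if parsed is not None:
            return parsed
    return regex(raw)  # always yields a (possibly empty) prediction dict
```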
scripts/eval_lvbench.py ADDED
@@ -0,0 +1,168 @@
+ #!/usr/bin/env python3
+ """Evaluate MiniCPM-o 4.5 on LVBench.
+
+ Reuses data loader and metrics from CleverHans-Evaluation's eval_lvbench.py.
+ LVBench is video-only (long video QA); no audio is passed.
+ """
+ from __future__ import annotations
+
+ import _common
+
+ import argparse
+ import contextlib
+ import gc
+ import io
+ import json
+ import traceback
+ from pathlib import Path
+
+ import torch
+ from tqdm import tqdm
+
+ ch = _common.ch("lvbench")
+ load_lvbench = ch.load_lvbench
+ extract_answer = ch.extract_answer
+ compute_metrics = ch.compute_metrics
+ print_summary = ch.print_summary
+ DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
+ DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
+
+ from minicpmo_inference import load_model, run_inference
+
+
+ def parse_args() -> argparse.Namespace:
+     p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on LVBench.")
+     p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
+     p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
+     p.add_argument("--output-dir", type=Path,
+                    default=Path("/home/ubuntu/eval_results/lvbench_minicpmo"))
+     p.add_argument("--max-samples", type=int, default=-1)
+     p.add_argument("--max-new-tokens", type=int, default=32)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--label", type=str, default="minicpmo_lvbench")
+     p.add_argument("--max-frames", type=int, default=96,
+                    help="LVBench has long videos; larger frame budget helps.")
+     p.add_argument("--fps", type=float, default=0.5)
+     p.add_argument("--attn", type=str, default="flash_attention_2",
+                    choices=["sdpa", "flash_attention_2", "eager"])
+     # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
+     p.add_argument("--vllm", action="store_true", default=False,
+                    help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
+     p.add_argument("--tp", type=int, default=None)
+     p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
+     p.add_argument("--max-model-len", type=int, default=65536)
+     p.add_argument("--batch-size", type=int, default=32)
+     # Data-parallel sharding
+     p.add_argument("--shard", type=int, default=0)
+     p.add_argument("--num-shards", type=int, default=1)
+     return p.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     out_dir = args.output_dir / args.label
+     out_dir.mkdir(parents=True, exist_ok=True)
+     shard_suffix = (f".shard{args.shard}of{args.num_shards}"
+                     if args.num_shards > 1 else "")
+     results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
+     metrics_json = out_dir / "metrics.json"
+     summary_txt = out_dir / "summary.txt"
+
+     if args.vllm:
+         print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
+               "supported upstream yet; falling back to transformers.")
+     print("[data] Loading LVBench dataset...")
+     test_data = load_lvbench(args.video_dir, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
+     else:
+         print(f"[data] {len(test_data)} questions ready")
+
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["uid"])
+         print(f"[resume] {len(processed)} already processed")
+
+     model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
+                                   init_audio=False)
+
+     for item in tqdm(test_data, desc="LVBench", unit="q"):
+         if item["uid"] in processed:
+             continue
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=None,
+                 prompt=item["prompt"],
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+             )
+         except Exception as exc:
+             print(f"  [error] {item['uid']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = extract_answer(raw_output)
+         result = {
+             "uid": item["uid"],
+             "video_id": item["video_id"],
+             "video_type": item["video_type"],
+             "question_type": item["question_type"],
+             "question": item["question"],
+             "gt_answer": item["gt_answer"],
+             "time_reference": item.get("time_reference", ""),
+             "pred_answer": pred,
+             "raw_output": raw_output,
+         }
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["uid"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench lvbench --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "video_dir": str(args.video_dir),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+     }
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
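The `--max-frames 96` / `--fps 0.5` defaults trade temporal coverage against context length on LVBench's hour-scale videos. One plausible way those two knobs interact (the actual sampler lives in `minicpmo_inference.run_inference`, so this is an assumption):

```python
import numpy as np

def choose_frame_indices(total_frames: int, video_fps: float,
                         sample_fps: float, max_frames: int) -> np.ndarray:
    # Sample at `sample_fps`, then uniformly thin down to the frame budget.
    step = max(video_fps / sample_fps, 1.0)
    idx = np.arange(0, total_frames, step).astype(int)
    if len(idx) > max_frames:
        idx = idx[np.linspace(0, len(idx) - 1, max_frames).astype(int)]
    return idx

# A 1-hour 30 fps video sampled at 0.5 fps yields 1800 candidates, thinned to 96.
assert len(choose_frame_indices(108_000, 30.0, 0.5, 96)) == 96
```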
scripts/eval_vggsoundsync.py ADDED
@@ -0,0 +1,195 @@
+ #!/usr/bin/env python3
+ """Evaluate MiniCPM-o 4.5 on VGG-Sound Sync (out-of-domain sync).
+
+ Reuses the data loader, MCQ / freetext prompts, answer parsers, GPT judge,
+ and metrics from CleverHans-Evaluation's eval_vggsoundsync.py. Only the
+ inference path is replaced with MiniCPM-o.
+ """
+ from __future__ import annotations
+
+ import _common
+
+ import argparse
+ import contextlib
+ import gc
+ import io
+ import json
+ import traceback
+ from pathlib import Path
+
+ import torch
+ from tqdm import tqdm
+
+ ch = _common.ch("vggsoundsync")
+ MCQ_PROMPT = ch.MCQ_PROMPT
+ FREETEXT_PROMPT = ch.FREETEXT_PROMPT
+ load_test_data = ch.load_test_data
+ extract_mcq_answer = ch.extract_mcq_answer
+ extract_freetext_prediction = ch.extract_freetext_prediction
+ gpt_extract_prediction = ch.gpt_extract_prediction
+ _get_openai_client = ch._get_openai_client
+ compute_metrics = ch.compute_metrics
+ print_summary = ch.print_summary
+ _build_result = ch._build_result
+ DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
+
+ from minicpmo_inference import load_model, run_inference
+
+
+ def parse_args() -> argparse.Namespace:
+     p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on VGG-Sound Sync.")
+     p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
+     p.add_argument("--test-jsonl", type=Path, required=True,
+                    help="test.jsonl from prepare_vggsoundsync.py")
+     p.add_argument("--output-dir", type=Path,
+                    default=Path("/home/ubuntu/eval_results/vggsoundsync_minicpmo"))
+     p.add_argument("--mode", choices=["mcq", "freetext"], default="mcq")
+     p.add_argument("--max-samples", type=int, default=-1)
+     p.add_argument("--max-new-tokens", type=int, default=64)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--label", type=str, default="minicpmo_vggsync")
+     p.add_argument("--max-frames", type=int, default=32)
+     p.add_argument("--fps", type=float, default=2.0)
+     p.add_argument("--attn", type=str, default="flash_attention_2",
+                    choices=["sdpa", "flash_attention_2", "eager"])
+     # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
+     p.add_argument("--vllm", action="store_true", default=False,
+                    help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
+     p.add_argument("--tp", type=int, default=None)
+     p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
+     p.add_argument("--max-model-len", type=int, default=65536)
+     p.add_argument("--batch-size", type=int, default=16)
+     p.add_argument("--gpt-judge", action="store_true", default=False)
+     p.add_argument("--openai-api-key", type=str, default=None)
+     p.add_argument("--gpt-model", type=str, default="gpt-5.4")
+     # Data-parallel sharding
+     p.add_argument("--shard", type=int, default=0)
+     p.add_argument("--num-shards", type=int, default=1)
+     return p.parse_args()
+
+
+ def _extract_pred(raw_output, mode, gpt_judge, api_key, gpt_model, answer_map=None):
+     if mode == "mcq":
+         return extract_mcq_answer(raw_output, answer_map=answer_map)
+     if gpt_judge and raw_output:
+         gpt_pred = gpt_extract_prediction(raw_output, api_key=api_key, model=gpt_model)
+         if gpt_pred is not None:
+             return gpt_pred
+     return extract_freetext_prediction(raw_output)
+
+
+ def main() -> None:
+     args = parse_args()
+     default_prompt = MCQ_PROMPT if args.mode == "mcq" else FREETEXT_PROMPT
+
+     if args.vllm:
+         print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
+               "supported upstream yet; falling back to transformers.")
+
+     if args.gpt_judge and args.mode == "freetext":
+         if _get_openai_client(args.openai_api_key) is None:
+             print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
+             raise SystemExit(1)
+
+     out_dir = args.output_dir / args.label
+     out_dir.mkdir(parents=True, exist_ok=True)
+     shard_suffix = (f".shard{args.shard}of{args.num_shards}"
+                     if args.num_shards > 1 else "")
+     results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
+     metrics_json = out_dir / "metrics.json"
+     summary_txt = out_dir / "summary.txt"
+
+     test_data = load_test_data(args.test_jsonl, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples (mode={args.mode})")
+     else:
+         print(f"[data] {len(test_data)} samples loaded (mode={args.mode})")
+
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["uid"])
+         print(f"[resume] {len(processed)} already processed")
+
+     model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
+                                   init_audio=True)
+
+     for item in tqdm(test_data, desc="VGGSync", unit="sample"):
+         if item["uid"] in processed:
+             continue
+
+         item_prompt = item.get("mcq_prompt", default_prompt) if args.mode == "mcq" else default_prompt
+         item_answer_map = item.get("mcq_answer_map") if args.mode == "mcq" else None
+
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=item["audio_path"],
+                 prompt=item_prompt,
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+             )
+         except Exception as exc:
+             print(f"  [error] {item['uid']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = _extract_pred(raw_output, args.mode, args.gpt_judge,
+                              args.openai_api_key, args.gpt_model,
+                              answer_map=item_answer_map)
+         result = _build_result(item, pred, raw_output, args.mode)
+
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["uid"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench vggsoundsync --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "mode": args.mode,
+         "test_jsonl": str(args.test_jsonl),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+         "gpt_judge": args.gpt_judge,
+     }
+
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
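Unlike the other MCQ benches, VGGSoundSync MCQ items can carry their own `mcq_prompt` and `mcq_answer_map`, presumably because option order varies per sample; the extractor needs the map to turn the chosen letter back into a sync label. A hypothetical map, for illustration only (real labels come from `prepare_vggsoundsync.py`):

```python
# Hypothetical per-item answer map and the letter -> label lookup it enables.
item_answer_map = {"A": "out-of-sync", "B": "in-sync"}

def letter_to_label(raw_letter: str, answer_map: dict[str, str]) -> str | None:
    cleaned = raw_letter.strip().upper()[:1]
    return answer_map.get(cleaned)
```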
scripts/eval_videomme.py ADDED
@@ -0,0 +1,173 @@
+ #!/usr/bin/env python3
+ """Evaluate MiniCPM-o 4.5 on Video-MME.
+
+ Reuses the data loader and metrics from CleverHans-Evaluation's Qwen3-Omni
+ eval_videomme.py and swaps out the inference with MiniCPM-o. Video-MME is
+ video-only (no audio), so we do NOT pass audio in.
+ """
+ from __future__ import annotations
+
+ import _common
+
+ import argparse
+ import contextlib
+ import gc
+ import io
+ import json
+ import traceback
+ from pathlib import Path
+
+ import torch
+ from tqdm import tqdm
+
+ ch = _common.ch("videomme")
+ load_videomme = ch.load_videomme
+ extract_answer = ch.extract_answer
+ compute_metrics = ch.compute_metrics
+ print_summary = ch.print_summary
+ DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
+ DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
+
+ from minicpmo_inference import load_model, run_inference
+
+
+ def parse_args() -> argparse.Namespace:
+     p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Video-MME.")
+     p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
+     p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
+     p.add_argument("--output-dir", type=Path,
+                    default=Path("/home/ubuntu/eval_results/videomme_minicpmo"))
+     p.add_argument("--max-samples", type=int, default=-1)
+     p.add_argument("--max-new-tokens", type=int, default=32)
+     p.add_argument("--temperature", type=float, default=0.0)
+     p.add_argument("--label", type=str, default="minicpmo_videomme")
+     p.add_argument("--max-frames", type=int, default=64,
+                    help="Max frames sampled from each video (MiniCPM-o uses "
+                         "PIL images).")
+     p.add_argument("--fps", type=float, default=1.0)
+     p.add_argument("--attn", type=str, default="flash_attention_2",
+                    choices=["sdpa", "flash_attention_2", "eager"])
+     # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
+     p.add_argument("--vllm", action="store_true", default=False,
+                    help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
+     p.add_argument("--tp", type=int, default=None)
+     p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
+     p.add_argument("--max-model-len", type=int, default=65536)
+     p.add_argument("--batch-size", type=int, default=32)
+     # Data-parallel sharding: split test set into K slices, process slice N
+     p.add_argument("--shard", type=int, default=0)
+     p.add_argument("--num-shards", type=int, default=1)
+     return p.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     out_dir = args.output_dir / args.label
+     out_dir.mkdir(parents=True, exist_ok=True)
+     shard_suffix = (f".shard{args.shard}of{args.num_shards}"
+                     if args.num_shards > 1 else "")
+     results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
+     metrics_json = out_dir / "metrics.json"
+     summary_txt = out_dir / "summary.txt"
+
+     if args.vllm:
+         print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
+               "supported upstream yet; falling back to transformers.")
+     print("[data] Loading Video-MME dataset...")
+     test_data = load_videomme(args.video_dir, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
+     else:
+         print(f"[data] {len(test_data)} questions ready")
+
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["question_id"])
+         print(f"[resume] {len(processed)} already processed, skipping")
+
+     model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
+                                   init_audio=False)
+
+     for item in tqdm(test_data, desc="Video-MME", unit="q"):
+         if item["question_id"] in processed:
+             continue
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=None,
+                 prompt=item["prompt"],
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+             )
+         except Exception as exc:
+             print(f"  [error] {item['question_id']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = extract_answer(raw_output)
+         result = {
+             "question_id": item["question_id"],
+             "video_id": item["video_id"],
+             "duration": item["duration"],
+             "domain": item["domain"],
+             "sub_category": item["sub_category"],
+             "task_type": item["task_type"],
+             "question": item["question"],
+             "options": item["options"],
+             "gt_answer": item["gt_answer"],
+             "pred_answer": pred,
+             "raw_output": raw_output,
+         }
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["question_id"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench videomme --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "video_dir": str(args.video_dir),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+     }
+
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
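The `--shard/--num-shards` filter used by every evaluator is an interleaved split: worker `k` keeps the items whose index is congruent to `k` modulo `N`, which is exactly the stride slice `data[k::N]`. When the dataset is ordered (Video-MME groups questions by duration), this spreads long and short videos across GPUs more evenly than contiguous chunks would:

```python
# Interleaved sharding: i % num_shards == shard  <=>  data[shard::num_shards]
data = list(range(10))
num_shards = 4
for shard in range(num_shards):
    kept = [x for i, x in enumerate(data) if i % num_shards == shard]
    assert kept == data[shard::num_shards]
```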
scripts/eval_worldsense.py ADDED
@@ -0,0 +1,175 @@
1
+ #!/usr/bin/env python3
2
+ """Evaluate MiniCPM-o 4.5 on WorldSense.
3
+
4
+ WorldSense videos have embedded audio; we extract it via ffmpeg and feed
5
+ both the video frames and the audio waveform to MiniCPM-o.
6
+ """
7
+ from __future__ import annotations
8
+
9
+ import _common
10
+
11
+ import argparse
12
+ import gc
13
+ import io
14
+ import contextlib
15
+ import json
16
+ from pathlib import Path
17
+
18
+ import torch
19
+ from tqdm import tqdm
20
+
21
+ ch = _common.ch("worldsense")
22
+ load_worldsense = ch.load_worldsense
23
+ extract_answer = ch.extract_answer
24
+ compute_metrics = ch.compute_metrics
25
+ print_summary = ch.print_summary
26
+ DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
27
+ DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
28
+
29
+ from minicpmo_inference import load_model, run_inference
30
+
31
+
32
+ def parse_args() -> argparse.Namespace:
33
+ p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on WorldSense.")
34
+ p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
35
+ p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
36
+ p.add_argument("--output-dir", type=Path,
37
+ default=Path("/home/ubuntu/eval_results/worldsense_minicpmo"))
38
+ p.add_argument("--max-samples", type=int, default=-1)
39
+ p.add_argument("--max-new-tokens", type=int, default=32)
40
+ p.add_argument("--temperature", type=float, default=0.0)
41
+ p.add_argument("--label", type=str, default="minicpmo_worldsense")
42
+ p.add_argument("--max-frames", type=int, default=64)
43
+ p.add_argument("--fps", type=float, default=1.0)
44
+ p.add_argument("--attn", type=str, default="flash_attention_2",
45
+ choices=["sdpa", "flash_attention_2", "eager"])
46
+ p.add_argument("--no-audio", action="store_true",
47
+ help="Video-only mode (skip audio extraction).")
48
+ # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
49
+ p.add_argument("--vllm", action="store_true", default=False,
50
+ help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
51
+ p.add_argument("--tp", type=int, default=None)
52
+ p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
53
+ p.add_argument("--max-model-len", type=int, default=65536)
54
+ p.add_argument("--batch-size", type=int, default=32)
55
+ # Data-parallel sharding
56
+ p.add_argument("--shard", type=int, default=0)
57
+ p.add_argument("--num-shards", type=int, default=1)
58
+ return p.parse_args()
59
+
60
+
61
+ def main() -> None:
62
+ args = parse_args()
63
+ out_dir = args.output_dir / args.label
64
+ out_dir.mkdir(parents=True, exist_ok=True)
65
+ shard_suffix = (f".shard{args.shard}of{args.num_shards}"
66
+ if args.num_shards > 1 else "")
67
+ results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
68
+ metrics_json = out_dir / "metrics.json"
69
+ summary_txt = out_dir / "summary.txt"
70
+
71
+ if args.vllm:
72
+ print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
73
+ "supported upstream yet; falling back to transformers.")
74
+ print("[data] Loading WorldSense dataset...")
75
+ test_data = load_worldsense(args.data_dir, args.max_samples)
+     if args.num_shards > 1:
+         test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
+         print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
+     else:
+         print(f"[data] {len(test_data)} questions ready")
+
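+     # Resume support: results are appended one JSON object per line, so a
+     # crashed or restarted run skips the question_ids it already answered.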
+     processed: set = set()
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 obj = json.loads(line)
+                 processed.add(obj["question_id"])
+         print(f"[resume] {len(processed)} already processed")
+
+     model, tokenizer = load_model(
+         args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
+     )
+
+     for item in tqdm(test_data, desc="WorldSense", unit="q"):
+         if item["question_id"] in processed:
+             continue
+         try:
+             raw_output = run_inference(
+                 model, tokenizer,
+                 video_path=item["video_path"],
+                 audio_path=None,
+                 prompt=item["prompt"],
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 max_frames=args.max_frames,
+                 fps=args.fps,
+                 use_audio_from_video=not args.no_audio,
+             )
+         except Exception as exc:
+             import traceback
+             print(f"  [error] {item['question_id']}: {exc}")
+             traceback.print_exc()
+             raw_output = ""
+
+         pred = extract_answer(raw_output)
+         result = {
+             "question_id": item["question_id"],
+             "video_id": item["video_id"],
+             "duration": item["duration"],
+             "domain": item["domain"],
+             "sub_category": item["sub_category"],
+             "task_domain": item["task_domain"],
+             "task_type": item["task_type"],
+             "question": item["question"],
+             "candidates": item["candidates"],
+             "gt_answer": item["gt_answer"],
+             "pred_answer": pred,
+             "raw_output": raw_output,
+         }
+         with open(results_jsonl, "a", encoding="utf-8") as f:
+             f.write(json.dumps(result, ensure_ascii=False) + "\n")
+
+         processed.add(item["question_id"])
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if args.num_shards > 1:
+         print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
+         print(f"[shard] Run merge_shards.py --bench worldsense --label-dir {out_dir}")
+         return
+
+     all_results = []
+     if results_jsonl.exists():
+         with open(results_jsonl) as f:
+             for line in f:
+                 all_results.append(json.loads(line))
+
+     metrics = compute_metrics(all_results)
+     metrics["eval_config"] = {
+         "model_id": args.model_id,
+         "data_dir": str(args.data_dir),
+         "max_new_tokens": args.max_new_tokens,
+         "temperature": args.temperature,
+         "max_frames": args.max_frames,
+         "fps": args.fps,
+         "attn": args.attn,
+         "no_audio": args.no_audio,
+     }
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     print_summary(metrics, args.label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         buf = io.StringIO()
+         with contextlib.redirect_stdout(buf):
+             print_summary(metrics, args.label)
+         f.write(buf.getvalue())
+
+     print(f"\n[output] Results: {results_jsonl}")
+     print(f"[output] Metrics: {metrics_json}")
+     print(f"[output] Summary: {summary_txt}")
+
+
+ if __name__ == "__main__":
+     main()
scripts/merge_shards.py ADDED
@@ -0,0 +1,128 @@
+ #!/usr/bin/env python3
+ """Merge sharded eval_results.shard*.jsonl files and recompute metrics.
+
+ Usage:
+     python merge_shards.py --bench videomme \
+         --label-dir /home/ubuntu/eval_results/videomme/vmme_minicpmo_4_5
+
+ The script finds all `eval_results.shard*.jsonl` under `--label-dir`,
+ concatenates them into `eval_results.jsonl` (deduping by a bench-specific
+ primary key), then re-runs the bench's `compute_metrics` + `print_summary`.
+ Final outputs: `eval_results.jsonl`, `metrics.json`, `summary.txt`.
+ """
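+ # For a hypothetical 4-shard run, --label-dir holds eval_results.shard0of4.jsonl
+ # through eval_results.shard3of4.jsonl; the merge leaves eval_results.jsonl,
+ # metrics.json, and summary.txt beside them.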
+ from __future__ import annotations
+
+ import _common  # noqa: F401
+
+ import argparse
+ import contextlib
+ import io
+ import json
+ import sys
+ from pathlib import Path
+
+
+ # Primary key per bench (must match the field written by each eval script).
+ PK = {
+     "videomme": "question_id",
+     "lvbench": "uid",
+     "worldsense": "question_id",
+     "daily_omni": "question_id",
+     "dpo_sync": "video",
+     "vggsoundsync": "uid",
+ }
+
+ # Human-readable benchmark names (kept for reference; print_summary is
+ # currently called with the label-dir name instead).
+ LABEL_HINT = {
+     "videomme": "Video-MME",
+     "lvbench": "LVBench",
+     "worldsense": "WorldSense",
+     "daily_omni": "Daily-Omni",
+     "dpo_sync": "Sync",
+     "vggsoundsync": "VGGSoundSync",
+ }
+
+
+ def main() -> int:
+     p = argparse.ArgumentParser()
+     p.add_argument("--bench", required=True,
+                    choices=list(PK.keys()),
+                    help="Which benchmark this label-dir belongs to.")
+     p.add_argument("--label-dir", type=Path, required=True,
+                    help="Eval output dir containing eval_results.shard*.jsonl.")
+     args = p.parse_args()
+
+     ch = _common.ch(args.bench)
+     pk = PK[args.bench]
+
+     shard_files = sorted(args.label_dir.glob("eval_results.shard*.jsonl"))
+     if not shard_files:
+         print(f"[merge] ERROR: no eval_results.shard*.jsonl in {args.label_dir}",
+               file=sys.stderr)
+         return 1
+
+     print(f"[merge] Found {len(shard_files)} shard file(s):")
+     for sf in shard_files:
+         print(f"  - {sf.name}")
+
+     merged_path = args.label_dir / "eval_results.jsonl"
+     all_results = []
+     seen: set = set()
+     n_dup = 0
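+     # First occurrence of each primary key wins; duplicates typically come
+     # from a shard that was re-run after an earlier partial pass.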
+     with open(merged_path, "w", encoding="utf-8") as out:
+         for sf in shard_files:
+             with open(sf) as f:
+                 for line in f:
+                     line = line.strip()
+                     if not line:
+                         continue
+                     obj = json.loads(line)
+                     key = obj.get(pk)
+                     if key in seen:
+                         n_dup += 1
+                         continue
+                     seen.add(key)
+                     out.write(line + "\n")
+                     all_results.append(obj)
+
+     print(f"[merge] Merged {len(all_results)} unique results "
+           f"({n_dup} duplicates skipped) -> {merged_path}")
+
+     metrics = ch.compute_metrics(all_results)
+     # Best-effort: carry over eval_config if any shard's first row has one.
+     # (Current eval scripts only write eval_config into metrics.json, so this
+     # usually finds nothing and the merged metrics simply omit it.)
+     for sf in shard_files:
+         try:
+             with open(sf) as f:
+                 first = f.readline().strip()
+                 if first:
+                     obj = json.loads(first)
+                     if "eval_config" in obj:
+                         metrics["eval_config"] = obj["eval_config"]
+                         break
+         except Exception:
+             pass
+
+     metrics_json = args.label_dir / "metrics.json"
+     summary_txt = args.label_dir / "summary.txt"
+
+     with open(metrics_json, "w", encoding="utf-8") as f:
+         json.dump(metrics, f, indent=2, ensure_ascii=False)
+
+     label = args.label_dir.name
+     ch.print_summary(metrics, label)
+
+     buf = io.StringIO()
+     with contextlib.redirect_stdout(buf):
+         ch.print_summary(metrics, label)
+     with open(summary_txt, "w", encoding="utf-8") as f:
+         f.write(buf.getvalue())
+
+     print("\n[merge] Done.")
+     print(f"  Results: {merged_path}")
+     print(f"  Metrics: {metrics_json}")
+     print(f"  Summary: {summary_txt}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
scripts/minicpmo_inference.py ADDED
@@ -0,0 +1,264 @@
+ """
+ Common inference wrapper for MiniCPM-o 4.5.
+
+ MiniCPM-o's API is `model.chat(msgs=[...], tokenizer=...)` where `msgs` is a
+ list of `{"role": ..., "content": [image, audio, ..., text]}`. This module
+ hides that detail behind `run_inference(model, tokenizer, video, audio,
+ prompt)` so the 6 benchmark eval scripts can share one inference code path.
+
+ Also runs the compatibility patcher on import so users who haven't run
+ `setup_env.sh` still get a working model.
+ """
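+ # Shape of the messages run_inference() assembles below (frames, then the
+ # optional waveform, then the text prompt, all in one user turn):
+ #   msgs = [{"role": "user",
+ #            "content": [<PIL.Image>, ..., <np.ndarray float32 @ 16 kHz>, "prompt"]}]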
+
+ from __future__ import annotations
+
+ import os
+ import shutil
+ import subprocess
+ import tempfile
+ from typing import Any, List, Optional
+
+ import numpy as np
+
+
+ # ---------------------------------------------------------------------------
+ # Apply transformers>=4.52 compatibility patches lazily on import.
+ # Safe to call multiple times; idempotent.
+ # ---------------------------------------------------------------------------
+ def _maybe_patch_once() -> None:
+     try:
+         from patch_minicpmo import (
+             _find_modeling_file,
+             _find_processing_file,
+             patch_file,
+             patch_processing_file,
+         )
+     except ImportError:
+         return
+     path = _find_modeling_file()
+     if path is not None:
+         try:
+             patch_file(path)
+         except Exception as exc:  # pragma: no cover
+             print(f"[minicpmo] (warn) patch failed: {exc}")
+     proc = _find_processing_file()
+     if proc is not None:
+         try:
+             patch_processing_file(proc)
+         except Exception as exc:  # pragma: no cover
+             print(f"[minicpmo] (warn) processing patch failed: {exc}")
+
+
+ _maybe_patch_once()
+
+
+ def _max_inp_length_for_chat(model: Any, max_new_tokens: int) -> int:
+     """Upper bound for ``model.chat(..., max_inp_length=...)`` (defaults to 8192).
+
+     Many frames × per-frame image placeholders can exceed 8k text tokens; the
+     processor then truncates ``input_ids`` and image start/end counts diverge,
+     causing ``RuntimeError`` in ``processing_minicpmo._convert``.
+     """
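+     # Worked example: max_position_embeddings=32768 and max_new_tokens=32 give
+     # reserve = 32 + 1024 = 1056, so the bound is min(32768, 32768 - 1056) = 31712.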
+     reserve = int(max_new_tokens) + 1024
+     best = 32768
+     for cfg in (
+         getattr(model, "config", None),
+         getattr(getattr(model, "llm", None), "config", None),
+     ):
+         if cfg is None:
+             continue
+         npos = getattr(cfg, "max_position_embeddings", None)
+         if isinstance(npos, int) and npos > 8192:
+             best = min(best, max(npos - reserve, 16384))
+     return best
+
+
+ # ---------------------------------------------------------------------------
+ # Frame / audio loaders
+ # ---------------------------------------------------------------------------
+ def load_video_frames(video_path: str, max_frames: int = 32,
+                       fps: float = 1.0) -> List:
+     """Sample PIL RGB frames uniformly from a video.
+
+     MiniCPM-o expects a list of PIL Images (not a tensor). `fps=1.0,
+     max_frames=32` covers ~32s; longer videos get sparser sampling.
+     """
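+     # Worked example: a 120 s clip at fps=1.0 asks for 120 frames; the cap
+     # trims that to max_frames=32, i.e. roughly one frame every 3.75 s.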
+     from PIL import Image
+     import decord
+
+     vr = decord.VideoReader(video_path, num_threads=1)
+     total_frames = len(vr)
+     video_fps = vr.get_avg_fps()
+     duration = total_frames / max(video_fps, 1e-6)
+
+     target = max(int(round(fps * duration)), 2)
+     target = min(target, max_frames)
+     target = min(target, total_frames)
+
+     idx = np.linspace(0, total_frames - 1, target).round().astype(int).tolist()
+     frames = vr.get_batch(idx).asnumpy()
+     return [Image.fromarray(f).convert("RGB") for f in frames]
+
+
+ def load_audio_waveform(audio_path: str, target_sr: int = 16000) -> np.ndarray:
+     """Load audio as float32 numpy in [-1, 1] at `target_sr`."""
+     import librosa
+     y, _ = librosa.load(audio_path, sr=target_sr, mono=True)
+     return y.astype(np.float32)
+
+
+ def extract_audio_from_video(video_path: str, target_sr: int = 16000,
+                              tmp_dir: Optional[str] = None) -> Optional[str]:
+     """Extract the audio track from a video file to a temp .wav via ffmpeg.
+
+     Returns the path to the .wav file, or None if the video has no audio
+     track or extraction fails. Caller is responsible for cleanup.
+     """
+     tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="mo_audio_")
+     out = os.path.join(tmp_dir, "audio.wav")
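+     # -vn drops the video stream, -ac 1 downmixes to mono, and -ar resamples
+     # to target_sr (16 kHz by default, the rate the Whisper encoder expects).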
+     try:
+         subprocess.run(
+             ["ffmpeg", "-y", "-loglevel", "error", "-i", video_path,
+              "-vn", "-ac", "1", "-ar", str(target_sr), out],
+             check=True,
+             stdout=subprocess.DEVNULL,
+             stderr=subprocess.PIPE,
+             timeout=120,
+         )
+     except Exception:
+         return None
+     if not os.path.isfile(out) or os.path.getsize(out) < 64:
+         return None
+     return out
+
+
+ # ---------------------------------------------------------------------------
+ # Model loading
+ # ---------------------------------------------------------------------------
+ def load_model(model_id: str = "openbmb/MiniCPM-o-4_5",
+                device: str = "cuda",
+                dtype: str = "bfloat16",
+                init_audio: bool = True,
+                attn_implementation: str = "flash_attention_2"):
+     """Load MiniCPM-o model + tokenizer. Returns (model, tokenizer).
+
+     Tries `attn_implementation` first; if flash_attention_2 isn't installed or
+     the backbone doesn't support it, falls back to sdpa automatically.
+     """
+     import torch
+     from transformers import AutoModel, AutoTokenizer
+
+     torch_dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16,
+                    "float32": torch.float32}[dtype]
+
+     def _try_load(attn: str):
+         print(f"[minicpmo] Loading {model_id} (dtype={dtype}, device={device}, "
+               f"init_audio={init_audio}, attn={attn})...")
+         return AutoModel.from_pretrained(
+             model_id,
+             trust_remote_code=True,
+             attn_implementation=attn,
+             torch_dtype=torch_dtype,
+             init_vision=True,
+             init_audio=init_audio,
+             init_tts=False,
+         )
+
+     try:
+         model = _try_load(attn_implementation)
+     except Exception as exc:
+         if attn_implementation != "sdpa":
+             print(f"[minicpmo] (warn) {attn_implementation} failed ({exc}); falling back to sdpa.")
+             model = _try_load("sdpa")
+         else:
+             raise
+
+     model = model.eval().to(device)
+     tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+     print("[minicpmo] Model ready.")
+     return model, tokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Inference
+ # ---------------------------------------------------------------------------
+ def run_inference(
+     model,
+     tokenizer,
+     video_path: Optional[str],
+     audio_path: Optional[str],
+     prompt: str,
+     max_new_tokens: int = 256,
+     temperature: float = 0.0,
+     max_frames: int = 32,
+     fps: float = 1.0,
+     use_audio_from_video: bool = False,
+ ) -> str:
+     """Run MiniCPM-o chat inference.
+
+     Args:
+         video_path: optional path to an mp4/etc. file.
+         audio_path: optional path to a wav file. If `use_audio_from_video` is
+             True and `audio_path` is None, we extract audio from the video.
+         prompt: user instruction text.
+         temperature: 0 means greedy.
+         use_audio_from_video: if True, extract audio from the video automatically
+             (useful for WorldSense / Daily-Omni where the video has embedded
+             audio but no separate wav is provided).
+     """
+     content: List[Any] = []
+     tmp_audio_dir: Optional[str] = None
+
+     if video_path is not None:
+         frames = load_video_frames(video_path, max_frames=max_frames, fps=fps)
+         content.extend(frames)
+
+     if audio_path is None and use_audio_from_video and video_path is not None:
+         tmp_audio_dir = tempfile.mkdtemp(prefix="mo_audio_")
+         audio_path = extract_audio_from_video(video_path, tmp_dir=tmp_audio_dir)
+
+     if audio_path is not None:
+         try:
+             audio = load_audio_waveform(audio_path, target_sr=16000)
+             if audio.size > 0:
+                 content.append(audio)
+         except Exception as exc:
+             print(f"  [minicpmo] (warn) audio load failed: {exc}")
+
+     content.append(prompt)
+
+     msgs = [{"role": "user", "content": content}]
+
+     # Critical defaults for video understanding (see MiniCPM-o 4.5 HF README
+     # "Chat with Video"): without ``use_image_id=False, max_slice_nums=1`` the
+     # processor treats each frame as an independent HD image, slicing it into
+     # multiple sub-images with per-image ID tokens. That token distribution is
+     # OOD for the video-trained model and produces degenerate output (repeated
+     # training-data fragments, e.g. "the image description of the first image
+     # you see as a brief description ...").
+     gen_kwargs = dict(
+         max_new_tokens=max_new_tokens,
+         do_sample=temperature > 0,
+         temperature=temperature if temperature > 0 else 1.0,
+         top_p=0.9 if temperature > 0 else 1.0,
+         max_inp_length=_max_inp_length_for_chat(model, max_new_tokens),
+         use_tts_template=False,
+         enable_thinking=False,
+     )
+     if video_path is not None:
+         gen_kwargs["use_image_id"] = False
+         gen_kwargs["max_slice_nums"] = 1
+     if use_audio_from_video and video_path is not None:
+         gen_kwargs.setdefault("omni_mode", True)
+     try:
+         try:
+             res = model.chat(msgs=msgs, tokenizer=tokenizer, **gen_kwargs)
+         except TypeError:
+             # Older chat() signatures may reject some kwargs; retry bare.
+             res = model.chat(msgs=msgs, tokenizer=tokenizer)
+     finally:
+         # Remove the extracted-audio temp dir even if chat() raises, so long
+         # eval loops don't accumulate orphaned directories under /tmp.
+         if tmp_audio_dir is not None:
+             shutil.rmtree(tmp_audio_dir, ignore_errors=True)
+
+     if isinstance(res, tuple):
+         res = res[0]
+     return str(res).strip()
scripts/patch_minicpmo.py ADDED
@@ -0,0 +1,255 @@
+ #!/usr/bin/env python3
+ """Patch MiniCPM-o 4.5 custom code in the Hugging Face modules cache.
+
+ ``modeling_minicpmo.py`` (transformers >= 4.52):
+
+ 1. `WhisperEncoderLayer.forward` unpacks 3 values from `self.self_attn(...)`,
+    but the new `WhisperAttention.forward` returns 2 values.
+ 2. `prepare_inputs_for_generation` reads `past_key_values.seen_tokens`, which
+    was removed from `DynamicCache`.
+ 3. `chat()` force-sets ``use_tts_template = True`` whenever audio is in the
+    ``content`` list. That appends ``<|tts_bos|>`` to the assistant prefix
+    and the model then generates **audio (TTS codec) ids**; decoded as text
+    they look like ``<think>`` floods / gibberish. We want audio-in +
+    **text-out** for benchmark eval, so respect the caller's kwarg instead.
+
+ ``processing_minicpmo.py``:
+
+ 4. `_convert` used ``max(len(image_start_idx), len(image_end_idx))`` when
+    building ``image_bounds``; after ``max_length`` truncation start/end counts
+    can differ by one and ``torch.hstack`` raises (common with many video
+    frames under the default ``chat(..., max_inp_length=8192)``). Use ``min``.
+
+ Idempotent. Also downloads the model code on demand so the files exist before
+ patching.
+ """
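+ # Usage: `python scripts/patch_minicpmo.py` (setup_env.sh runs it once, and
+ # minicpmo_inference.py re-applies it on import as a safety net).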
+ from __future__ import annotations
+
+ import os
+ import shutil
+ import sys
+ from pathlib import Path
+
+ MODEL_ID = "openbmb/MiniCPM-o-4_5"
+
+
+ def _find_modeling_file() -> Path | None:
+     """Locate the cached modeling_minicpmo.py (matches HF's module dir naming)."""
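+     # Typically somewhere under ~/.cache/huggingface/modules/transformers_modules/
+     # (the exact subdirectory layout varies across transformers versions).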
+     home = Path(os.path.expanduser("~"))
+     candidates = [
+         home / ".cache" / "huggingface" / "modules" / "transformers_modules",
+     ]
+     hits: list[Path] = []
+     for root in candidates:
+         if not root.exists():
+             continue
+         for p in root.rglob("modeling_minicpmo.py"):
+             hits.append(p)
+     if not hits:
+         return None
+     # Prefer the deepest (snapshot-hashed) one.
+     hits.sort(key=lambda p: len(p.parts), reverse=True)
+     return hits[0]
+
+
+ def _find_processing_file() -> Path | None:
+     """``processing_minicpmo.py`` lives next to the cached ``modeling_minicpmo.py``."""
+     modeling = _find_modeling_file()
+     if modeling is None:
+         return None
+     proc = modeling.parent / "processing_minicpmo.py"
+     return proc if proc.is_file() else None
+
+
+ def _download_model_code() -> None:
+     """Force HF to download MiniCPM-o's custom code so the file is cached.
+
+     We only need the Python files + config (not weights) for patching. We use
+     `hf_hub_download` for the individual code files to avoid fetching the
+     multi-GB safetensors shards just to patch a .py file.
+     """
+     try:
+         from huggingface_hub import hf_hub_download
+     except ImportError:
+         print("[patch] huggingface_hub not installed; skipping auto-download.")
+         return
+
+     for fn in [
+         "config.json",
+         "configuration_minicpm.py",
+         "modeling_minicpmo.py",
+         "modeling_navit_siglip.py",
+         "processing_minicpmo.py",
+         "resampler.py",
+         "utils.py",
+     ]:
+         try:
+             hf_hub_download(repo_id=MODEL_ID, filename=fn)
+         except Exception as exc:
+             # Some files may not exist in every revision; that's fine.
+             print(f"[patch] (warn) could not fetch {fn}: {exc}")
+
+
+ def patch_whisper_unpack(text: str) -> tuple[str, bool]:
+     """Fix #1: WhisperAttention now returns 2 values, not 3."""
+     OLD = (
+         "        hidden_states, attn_weights, past_key_values = self.self_attn(\n"
+         "            hidden_states=hidden_states,\n"
+         "            attention_mask=attention_mask,\n"
+         "            layer_head_mask=layer_head_mask,\n"
+         "            output_attentions=output_attentions,\n"
+         "            past_key_value=past_key_values,\n"
+         "        )"
+     )
+     NEW = (
+         "        _attn_out = self.self_attn(\n"
+         "            hidden_states=hidden_states,\n"
+         "            attention_mask=attention_mask,\n"
+         "            layer_head_mask=layer_head_mask,\n"
+         "            output_attentions=output_attentions,\n"
+         "            past_key_value=past_key_values,\n"
+         "        )\n"
+         "        if len(_attn_out) == 3:\n"
+         "            hidden_states, attn_weights, past_key_values = _attn_out\n"
+         "        else:\n"
+         "            hidden_states, attn_weights = _attn_out"
+     )
+     if NEW.split("\n", 1)[0] in text:
+         return text, False  # already patched
+     if OLD not in text:
+         return text, False  # not applicable (different revision?)
+     return text.replace(OLD, NEW), True
+
+
+ def patch_seen_tokens(text: str) -> tuple[str, bool]:
+     """Fix #2: DynamicCache.seen_tokens was removed in newer transformers."""
+     OLD = (
+         "            cache_length = past_key_values.get_seq_length()\n"
+         "            past_length = past_key_values.seen_tokens"
+     )
+     NEW = (
+         "            cache_length = past_key_values.get_seq_length()\n"
+         "            past_length = getattr(past_key_values, \"seen_tokens\", cache_length)"
+     )
+     if 'getattr(past_key_values, "seen_tokens"' in text:
+         return text, False  # already patched
+     if OLD not in text:
+         return text, False
+     return text.replace(OLD, NEW), True
+
+
+ def patch_chat_force_tts_template(text: str) -> tuple[str, bool]:
+     """Fix #3: don't force ``use_tts_template=True`` on audio-containing content.
+
+     MiniCPM-o's ``chat()`` assumes "audio in implies TTS audio out". For MCQ /
+     freetext eval we want a text answer; the caller's ``use_tts_template`` kwarg
+     (default ``False``) must win so the assistant prefix doesn't get
+     ``<|tts_bos|>`` appended (which causes the LM to emit audio-codec ids that
+     look like ``<think>`` repetitions when text-decoded).
+     """
+     OLD = (
+         '                elif isinstance(c, np.ndarray): # audio\n'
+         '                    audios.append(c)\n'
+         '                    audio_parts.append(i)\n'
+         '                    cur_msgs.append("<audio>./</audio>")\n'
+         '                    use_tts_template = True\n'
+     )
+     NEW = (
+         '                elif isinstance(c, np.ndarray): # audio\n'
+         '                    audios.append(c)\n'
+         '                    audio_parts.append(i)\n'
+         '                    cur_msgs.append("<audio>./</audio>")\n'
+         '                    # PATCHED: honour caller-provided use_tts_template.\n'
+         '                    # Upstream force-sets True on any audio, which makes the model\n'
+         '                    # generate TTS codec ids (look like <think> noise as text).\n'
+     )
+     if "PATCHED: honour caller-provided use_tts_template" in text:
+         return text, False
+     if OLD not in text:
+         return text, False
+     return text.replace(OLD, NEW), True
+
+
+ def patch_processor_image_bounds(text: str) -> tuple[str, bool]:
+     """Fix ``image_bounds`` when start/end marker counts disagree (truncation)."""
+     OLD = "        valid_image_nums = max(len(image_start_idx), len(image_end_idx))"
+     NEW = (
+         "        # Pair only complete spans; max() breaks torch.hstack if counts differ.\n"
+         "        valid_image_nums = min(len(image_start_idx), len(image_end_idx))"
+     )
+     if "valid_image_nums = min(len(image_start_idx), len(image_end_idx))" in text:
+         return text, False
+     if OLD not in text:
+         return text, False
+     return text.replace(OLD, NEW), True
+
+
+ def patch_file(path: Path) -> bool:
+     original = path.read_text()
+     text = original
+     any_change = False
+
+     text, c1 = patch_whisper_unpack(text)
+     any_change |= c1
+     text, c2 = patch_seen_tokens(text)
+     any_change |= c2
+     text, c3 = patch_chat_force_tts_template(text)
+     any_change |= c3
+
+     if any_change:
+         backup = path.with_suffix(path.suffix + ".bak")
+         if not backup.exists():
+             backup.write_text(original)
+             print(f"[patch] Backup -> {backup}")
+         path.write_text(text)
+         print(f"[patch] Patched {path.name}: "
+               f"whisper_unpack={c1}, seen_tokens={c2}, chat_tts_template={c3}")
+     else:
+         print("[patch] No changes needed (already patched or unknown revision)")
+     return any_change
+
+
+ def patch_processing_file(path: Path) -> bool:
+     """Patch ``processing_minicpmo.py`` (image_bounds hstack)."""
+     original = path.read_text()
+     text = original
+     text, c = patch_processor_image_bounds(text)
+     if not c:
+         print(f"[patch] {path.name}: image_bounds already patched or pattern missing")
+         return False
+     backup = path.with_suffix(path.suffix + ".bak")
+     if not backup.exists():
+         backup.write_text(original)
+         print(f"[patch] Backup -> {backup}")
+     path.write_text(text)
+     print(f"[patch] Patched {path.name}: image_bounds min() fix")
+     return True
+
+
+ def main() -> int:
+     path = _find_modeling_file()
+     if path is None:
+         print("[patch] modeling_minicpmo.py not cached yet; fetching from HF...")
+         _download_model_code()
+         path = _find_modeling_file()
+     if path is None:
+         print("[patch] ERROR: could not locate modeling_minicpmo.py", file=sys.stderr)
+         return 1
+     print(f"[patch] Target: {path}")
+     patch_file(path)
+
+     proc = _find_processing_file()
+     if proc is not None:
+         print(f"[patch] Target: {proc}")
+         patch_processing_file(proc)
+     else:
+         print("[patch] (warn) processing_minicpmo.py not found next to modeling; "
+               "run once more after the HF cache is populated")
+
+     # Invalidate __pycache__ so the edited file is re-imported.
+     for pc in path.parent.rglob("__pycache__"):
+         shutil.rmtree(pc, ignore_errors=True)
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
scripts/test_minicpmo.py ADDED
@@ -0,0 +1,62 @@
+ #!/usr/bin/env python3
+ """
+ Sanity check: load MiniCPM-o 4.5 and run a single sample through it.
+
+ Picks one video from the sync eval set, passes video + audio + prompt, and
+ prints the model's response.
+ """
+
+ from __future__ import annotations
+
+ import sys
+ from pathlib import Path
+
+ sys.path.insert(0, str(Path(__file__).parent))
+ from minicpmo_inference import load_model, run_inference
+
+
+ def main():
+     # Pick the first original video in the sync eval set
+     original_root = Path("/opt/dlami/nvme/video_source/original/uag_oops")
+     audio_root = Path("/opt/dlami/nvme/video_source/extracted_audio/original/uag_oops")
+
+     videos = sorted(original_root.glob("*.mp4"))
+     if not videos:
+         print(f"ERROR: no videos found at {original_root}")
+         sys.exit(1)
+
+     video_path = videos[0]
+     audio_path = audio_root / f"{video_path.stem}.wav"
+     if not audio_path.exists():
+         print(f"ERROR: audio not found for {video_path.name}")
+         sys.exit(1)
+
+     print(f"Video: {video_path}")
+     print(f"Audio: {audio_path}")
+     print()
+
+     model, tokenizer = load_model()
+
+     prompt = (
+         "Watch this video and listen to its audio carefully. "
+         "Determine whether the audio and video tracks are synchronized. "
+         "Explain your reasoning."
+     )
+
+     print("=== Running inference ===")
+     response = run_inference(
+         model, tokenizer,
+         video_path=str(video_path),
+         audio_path=str(audio_path),
+         prompt=prompt,
+         max_new_tokens=128,
+         temperature=0.0,
+     )
+     print()
+     print("=== Response ===")
+     print(response)
+
+
+ if __name__ == "__main__":
+     main()
scripts/upload_to_hf_model.py ADDED
@@ -0,0 +1,84 @@
+ #!/usr/bin/env python3
+ """Create or update a Hugging Face **model** repo with this evaluation codebase.
+
+ This upload is **code and docs only** (no MiniCPM-o weights). HF allows model
+ repos to host auxiliary artifacts; use ``--private`` if you do not want the
+ scripts public.
+
+ Prerequisites::
+
+     pip install huggingface_hub
+     export HF_TOKEN=hf_...   # or: huggingface-cli login
+
+ Usage::
+
+     cd MiniCPM-Evaluation
+     python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation
+
+ Private repo::
+
+     python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
+ """
+ from __future__ import annotations
+
+ import argparse
+ import sys
+ from pathlib import Path
+
+
+ def main() -> int:
+     p = argparse.ArgumentParser(description=__doc__)
+     p.add_argument(
+         "--repo-id",
+         required=True,
+         help="HF repository id, e.g. username/MiniCPM-Evaluation",
+     )
+     p.add_argument(
+         "--private",
+         action="store_true",
+         help="Create the repo as private (only on first create; ignored if the repo exists).",
+     )
+     p.add_argument(
+         "--token",
+         default=None,
+         help="HF token (default: HF_TOKEN env or cached huggingface-cli login).",
+     )
+     args = p.parse_args()
+
+     root = Path(__file__).resolve().parent.parent
+     if not (root / "README.md").is_file():
+         print(f"error: unexpected layout; expected README.md under {root}", file=sys.stderr)
+         return 1
+
+     try:
+         from huggingface_hub import HfApi
+     except ImportError:
+         print("error: install huggingface_hub: pip install huggingface_hub", file=sys.stderr)
+         return 1
+
+     api = HfApi(token=args.token)
+     # Go through the HfApi instance so a --token passed on the CLI applies to
+     # repo creation as well as to the upload.
+     api.create_repo(
+         repo_id=args.repo_id,
+         repo_type="model",
+         private=args.private,
+         exist_ok=True,
+     )
+     api.upload_folder(
+         folder_path=str(root),
+         repo_id=args.repo_id,
+         repo_type="model",
+         ignore_patterns=[
+             ".git/**",
+             ".git",
+             "**/__pycache__/**",
+             "**/*.pyc",
+             "**/.DS_Store",
+         ],
+     )
+     print(f"Uploaded: {root}")
+     print(f"URL:      https://huggingface.co/{args.repo_id}")
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
setup_env.sh ADDED
@@ -0,0 +1,80 @@
+ #!/usr/bin/env bash
+ # MiniCPM-o 4.5 evaluation environment setup.
+ #
+ # Creates a separate conda env 'minicpmo' because MiniCPM-o has its own
+ # dependency stack (librosa, decord, sentencepiece pin, etc.) that may conflict
+ # with the Qwen3-Omni 'video' env. It is safer to keep them isolated.
+ #
+ # Usage:
+ #   bash setup_env.sh
+ #
+ set -euo pipefail
+
+ CONDA_ENV="${CONDA_ENV:-minicpmo}"
+ PYTHON_VER="${PYTHON_VER:-3.12}"
+ INSTALL_DIR="${INSTALL_DIR:-${HOME}/anaconda3}"
+
+ log() { echo "[setup_env] $*"; }
+
+ log "Bootstrapping conda..."
+ if ! command -v conda &>/dev/null; then
+     if [[ -f "${INSTALL_DIR}/etc/profile.d/conda.sh" ]]; then
+         source "${INSTALL_DIR}/etc/profile.d/conda.sh"
+     else
+         echo "Error: conda not found. Install Anaconda first (see CleverHans-Evaluation/setup_env.sh)."
+         exit 1
+     fi
+ fi
+ eval "$(conda shell.bash hook)"
+
+ log "Ensuring conda env '${CONDA_ENV}' exists (python=${PYTHON_VER})..."
+ if conda env list | awk '{print $1}' | grep -Fxq "${CONDA_ENV}"; then
+     log "Env '${CONDA_ENV}' already exists; activating."
+     conda activate "${CONDA_ENV}"
+ else
+     conda create -n "${CONDA_ENV}" "python=${PYTHON_VER}" -y
+     conda activate "${CONDA_ENV}"
+ fi
+
+ log "Installing FFmpeg 6 (for audio/video decoding)..."
+ conda install -y -c conda-forge 'ffmpeg>=6,<7' || log "Warning: conda-forge ffmpeg install failed."
+
+ log "Installing PyTorch 2.6 (MiniCPM-o's stable target; newer torch may work)..."
+ pip install --upgrade pip
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
+     --index-url https://download.pytorch.org/whl/cu124
+
+ log "Installing MiniCPM-o core dependencies..."
+ # MiniCPM-o 4.5 uses Qwen3Config (needs transformers >= 4.52).
+ pip install 'transformers>=4.52,<4.58' accelerate==0.33.0
+ pip install Pillow==10.4.0
+ pip install sentencepiece==0.2.0
+ pip install decord==0.6.0 librosa==0.10.2 soundfile==0.12.1 moviepy==1.0.3
+ pip install vocos==0.1.0
+ pip install huggingface_hub==0.26.5
+ pip install einops==0.8.0
+ pip install tqdm openai
+
+ # CleverHans-Evaluation loaders used by the MiniCPM-o eval scripts (imported via _common.ch):
+ #   - eval_worldsense.py → pandas + pyarrow (parquet)
+ #   - eval_videomme.py   → datasets (lmms-lab/Video-MME)
+ #   - eval_lvbench.py    → datasets (lmms-lab/LVBench)
+ log "Installing eval data-loader deps (datasets, pandas, pyarrow)..."
+ pip install datasets pandas pyarrow
+
+ # MiniCPM-o 4.5's custom modeling file imports 'minicpmo' (a PyPI package) for
+ # TTS utils. That package drags in cosyvoice + stepaudio2, which need these
+ # downstream deps.
+ pip install minicpmo==0.1.2
+ pip install onnx onnxruntime hyperpyyaml diffusers
+
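+ # Optional sanity check: fail fast on broken imports before the first
+ # multi-gigabyte model download. Non-fatal by design.
+ log "Verifying core imports..."
+ python -c "import torch, transformers, decord, librosa" \
+     || log "Warning: import check failed (see errors above)."
+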
+ log "Patching MiniCPM-o modeling file for transformers>=4.52 compatibility..."
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ python "${SCRIPT_DIR}/scripts/patch_minicpmo.py" || log "Warning: patch_minicpmo.py failed (non-fatal; see errors above)."
+
+ log "Done."
+ echo ""
+ echo "  Active env: ${CONDA_ENV}"
+ echo "  Python:     $(command -v python)"
+ echo ""
+ echo "Next: conda activate ${CONDA_ENV}"
+ echo "      Then try: python scripts/test_minicpmo.py"