Upload folder using huggingface_hub
- README.md +82 -0
- run_all.sh +151 -0
- run_video_rerun.sh +94 -0
- scripts/_common.py +66 -0
- scripts/eval_daily_omni.py +197 -0
- scripts/eval_dpo_sync.py +205 -0
- scripts/eval_lvbench.py +168 -0
- scripts/eval_vggsoundsync.py +195 -0
- scripts/eval_videomme.py +173 -0
- scripts/eval_worldsense.py +175 -0
- scripts/merge_shards.py +128 -0
- scripts/minicpmo_inference.py +264 -0
- scripts/patch_minicpmo.py +255 -0
- scripts/test_minicpmo.py +62 -0
- scripts/upload_to_hf_model.py +84 -0
- setup_env.sh +80 -0
README.md
ADDED
@@ -0,0 +1,82 @@
# MiniCPM-o 4.5 Evaluation

Evaluation scripts for `openbmb/MiniCPM-o-4_5` on the same 6 benchmarks as `CleverHans-Evaluation`:

- Sync (DPO test set: synced / delay / early)
- VGGSoundSync (3k freetext)
- VideoMME (MCQ A/B/C/D)
- LVBench (MCQ)
- WorldSense (MCQ)
- Daily-Omni (MCQ)

## Why a separate folder

MiniCPM-o 4.5 has a completely different architecture (SigLip2 + Whisper + Qwen3-8B, 9B params) and API (`model.chat(msgs=...)` style) vs. Qwen3-Omni (`generate()` + `qwen_omni_utils`). Sharing the inference code is impractical, but data loading and metrics can still be reused from the other repo.
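
For orientation, the call shape looks roughly like the sketch below. This follows the published MiniCPM-o / MiniCPM-V `model.chat` pattern and is *not* this repo's actual wrapper (that lives in `scripts/minicpmo_inference.py`); the frame file and question are placeholders, and the exact kwargs for 4.5 may differ.

```python
# Hedged sketch of the `model.chat(msgs=...)` style; see
# scripts/minicpmo_inference.py for the wrapper this repo actually uses.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-4_5"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

frames = [Image.open("frame_000.jpg")]  # video frames sampled upstream (placeholder)
msgs = [{"role": "user", "content": frames + ["Is the audio in sync with the video?"]}]
print(model.chat(msgs=msgs, tokenizer=tokenizer, max_new_tokens=64))
```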

## Setup

```bash
bash setup_env.sh   # install MiniCPM-o dependencies in conda env 'minicpmo'
```

## Layout

```
MiniCPM-Evaluation/
├── README.md
├── setup_env.sh
└── scripts/
    ├── minicpmo_inference.py   # common inference wrapper
    ├── test_minicpmo.py        # quick sanity check (single sample)
    ├── eval_videomme.py        # per-benchmark evaluators
    ├── eval_lvbench.py
    ├── eval_worldsense.py
    ├── eval_daily_omni.py
    ├── eval_vggsoundsync.py
    └── eval_dpo_sync.py
```

## Quick Start

```bash
conda activate minicpmo
cd /home/ubuntu/MiniCPM-Evaluation

# 1. Sanity check: single-sample inference
python scripts/test_minicpmo.py

# 2. Run a full benchmark (e.g. Daily-Omni)
python scripts/eval_daily_omni.py \
    --data-dir /opt/dlami/nvme/daily_omni \
    --output-dir /home/ubuntu/eval_results/daily_omni \
    --label do_minicpmo_45
```
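
To sweep all six benchmarks with data-parallel sharding across GPUs, use `run_all.sh`: set `export OPENAI_API_KEY=sk-...` for the two GPT-judged sync benches, then `bash run_all.sh` (GPUs, label, and shard count are overridable via env vars, e.g. `NUM_SHARDS=2 CUDA_VISIBLE_DEVICES=6,7 bash run_all.sh`). To re-run only the four video benches after the inference fixes, use `bash run_video_rerun.sh`.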

## Publish to Hugging Face (model repo)

This tree is **evaluation code only** (no model weights). You can still host it
under a Hugging Face **model** repo as a snapshot (e.g. next to weight releases).

```bash
pip install huggingface_hub
export HF_TOKEN=hf_...   # or: huggingface-cli login
cd MiniCPM-Evaluation
python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation
```

Private repo:

```bash
python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
```
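
For reference, the core of such an upload only needs the stock `huggingface_hub` API. A minimal stand-in is sketched below (the repo id is a placeholder; the actual `scripts/upload_to_hf_model.py` may add path filtering or other options):

```python
# Minimal sketch of what an upload script like scripts/upload_to_hf_model.py
# can do with the public huggingface_hub API; not the script's actual code.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.create_repo("YourUsername/MiniCPM-Evaluation", repo_type="model",
                private=False, exist_ok=True)
api.upload_folder(
    folder_path=".",  # run from inside MiniCPM-Evaluation/
    repo_id="YourUsername/MiniCPM-Evaluation",
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```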

## Data paths (reused from CleverHans-Evaluation)

| Benchmark | Path |
|---|---|
| Sync videos | `/opt/dlami/nvme/video_source/{original,random_shift_video,extracted_audio}` |
| VGGSoundSync | `/opt/dlami/nvme/vggsoundsync_test/` |
| VideoMME | `/opt/dlami/nvme/videomme/data/data/` |
| LVBench | `/opt/dlami/nvme/lvbench/` |
| WorldSense | `/opt/dlami/nvme/worldsense/` |
| Daily-Omni | `/opt/dlami/nvme/daily_omni/` |
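Each path can be overridden per run through the matching env var in `run_all.sh` (`DATA_ROOT`, `SYNC_TEST_JSONL`, `VGG_TEST_JSONL`, `WORLDSENSE_DIR`, `DAILY_OMNI_DIR`, `VIDEOMME_DIR`, `LVBENCH_DIR`), plus `EVAL_ROOT` for where results land.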
run_all.sh
ADDED
@@ -0,0 +1,151 @@
#!/usr/bin/env bash
# Run all 6 benchmarks for MiniCPM-o 4.5 with 4-GPU data parallelism.
#
# For each bench, launches NUM_SHARDS python workers simultaneously (one per
# GPU); each processes 1/NUM_SHARDS of the samples. After all shards finish,
# merge_shards.py aggregates the per-shard jsonls and computes metrics.
# Only ONE bench runs at a time; benches run sequentially.
#
# Two sync benches use freetext + GPT judge (matches Qwen3-Omni reference).
#
# Usage:
#   export OPENAI_API_KEY=sk-...
#   bash run_all.sh
#
# Override via env vars, e.g.:
#   CUDA_VISIBLE_DEVICES=4,5,6,7 LABEL=minicpmo_ckpt200 bash run_all.sh
#   NUM_SHARDS=2 CUDA_VISIBLE_DEVICES=6,7 bash run_all.sh

set -uo pipefail  # no -e: one bench failure shouldn't block the rest

# ── Config ─────────────────────────────────────────────────────────────────────
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
LABEL="${LABEL:-minicpmo_4_5}"
SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
CONDA_ENV="${CONDA_ENV:-minicpmo}"

# Data parallel: how many shards (= number of GPUs to use)
IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"

# Data paths (match Qwen3-Omni reference config)
DATA_ROOT="${DATA_ROOT:-/opt/dlami/nvme/video_source}"
SYNC_TEST_JSONL="${SYNC_TEST_JSONL:-/home/ubuntu/CleverHans-Evaluation/data/kto_training_data_v2_test.jsonl}"
VGG_TEST_JSONL="${VGG_TEST_JSONL:-/opt/dlami/nvme/vggsoundsync_test/test_3k.jsonl}"
WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"

EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"

# ── Conda ──────────────────────────────────────────────────────────────────────
if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
  source "${HOME}/anaconda3/etc/profile.d/conda.sh"
fi
conda activate "${CONDA_ENV}"

echo "=== Model: $MODEL | Label: $LABEL"
echo "=== GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"

# ── Helper: run one bench with data-parallel sharding ─────────────────────
# $1  -- bench short name (matches merge_shards.py --bench)
# $2  -- eval script path
# $3  -- full label (e.g. sync_minicpmo_4_5)
# $4  -- output-dir root (e.g. $EVAL_ROOT/sync)
# $5+ -- extra args passed to each eval script
run_bench_dp() {
  local bench="$1"; shift
  local script="$1"; shift
  local full_label="$1"; shift
  local out_root="$1"; shift
  local label_dir="${out_root}/${full_label}"
  mkdir -p "${label_dir}/logs"

  echo ""
  echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
  local pids=()
  for (( i=0; i<NUM_SHARDS; i++ )); do
    local gpu="${GPU_ARR[$i]}"
    local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
    echo "  → shard $i on GPU $gpu (log: $log)"
    CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
      "$@" \
      --output-dir "$out_root" \
      --label "$full_label" \
      --shard "$i" --num-shards "$NUM_SHARDS" \
      > "$log" 2>&1 &
    pids+=($!)
  done

  # Wait for all shard workers
  local fail=0
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=$((fail+1))
  done
  if (( fail > 0 )); then
    echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
  fi

  # Merge
  echo "  → merging shards ..."
  python "$SCRIPTS/merge_shards.py" \
    --bench "$bench" \
    --label-dir "$label_dir" || echo "  !! merge failed"
}

# ── 1/6 Sync (in-domain) — freetext + GPT judge ────────────────────────────
run_bench_dp dpo_sync "$SCRIPTS/eval_dpo_sync.py" \
  "sync_${LABEL}" "$EVAL_ROOT/sync" \
  --model-id "$MODEL" \
  --data-root "$DATA_ROOT" \
  --test-jsonl "$SYNC_TEST_JSONL" \
  --gpt-judge

# ── 2/6 VGGSoundSync — freetext + GPT judge ────────────────────────────────
run_bench_dp vggsoundsync "$SCRIPTS/eval_vggsoundsync.py" \
  "vggsync_freetext_${LABEL}_3k" "$EVAL_ROOT/vggsoundsync" \
  --model-id "$MODEL" \
  --test-jsonl "$VGG_TEST_JSONL" \
  --mode freetext --gpt-judge

# ── 3/6 WorldSense ───────────────────────────────────────────────────────────
run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
  "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
  --model-id "$MODEL" \
  --data-dir "$WORLDSENSE_DIR" \
  --max-samples -1

# ── 4/6 Daily-Omni ───────────────────────────────────────────────────────────
run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
  "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
  --model-id "$MODEL" \
  --data-dir "$DAILY_OMNI_DIR" \
  --max-samples -1

# ── 5/6 Video-MME ────────────────────────────────────────────────────────────
run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
  "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
  --model-id "$MODEL" \
  --video-dir "$VIDEOMME_DIR" \
  --max-samples -1

# ── 6/6 LVBench ──────────────────────────────────────────────────────────────
run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
  "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
  --model-id "$MODEL" \
  --video-dir "$LVBENCH_DIR" \
  --max-samples -1

echo ""
echo "=== All done: $LABEL ==="
for b_out in \
  "$EVAL_ROOT/sync/sync_${LABEL}" \
  "$EVAL_ROOT/vggsoundsync/vggsync_freetext_${LABEL}_3k" \
  "$EVAL_ROOT/worldsense/ws_${LABEL}" \
  "$EVAL_ROOT/daily_omni/do_${LABEL}" \
  "$EVAL_ROOT/videomme/vmme_${LABEL}" \
  "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
  echo "  ${b_out}/metrics.json"
done
run_video_rerun.sh
ADDED
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
# Re-run the four benchmarks that were broken by the missing video-chat flags
# (`use_image_id=False, max_slice_nums=1`) and the audio -> TTS template
# force-enable. After patches in `patch_minicpmo.py` + `minicpmo_inference.py`
# these should now produce real MCQ answers (smoke-test-verified).
#
# Usage: CUDA_VISIBLE_DEVICES=4,5,6,7 bash run_video_rerun.sh

set -uo pipefail

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-4,5,6,7}"
MODEL="${MODEL:-openbmb/MiniCPM-o-4_5}"
LABEL="${LABEL:-minicpmo_4_5}"
SCRIPTS="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/scripts"
CONDA_ENV="${CONDA_ENV:-minicpmo}"

IFS=',' read -ra GPU_ARR <<< "$CUDA_VISIBLE_DEVICES"
NUM_SHARDS="${NUM_SHARDS:-${#GPU_ARR[@]}}"

WORLDSENSE_DIR="${WORLDSENSE_DIR:-/opt/dlami/nvme/worldsense}"
DAILY_OMNI_DIR="${DAILY_OMNI_DIR:-/opt/dlami/nvme/daily_omni}"
VIDEOMME_DIR="${VIDEOMME_DIR:-/opt/dlami/nvme/videomme/data/data}"
LVBENCH_DIR="${LVBENCH_DIR:-/opt/dlami/nvme/lvbench}"
EVAL_ROOT="${EVAL_ROOT:-/home/ubuntu/eval_results}"

if [[ -f "${HOME}/anaconda3/etc/profile.d/conda.sh" ]]; then
  source "${HOME}/anaconda3/etc/profile.d/conda.sh"
fi
conda activate "${CONDA_ENV}"

echo "=== Re-running video benches with fixed inference"
echo "=== Model: $MODEL | Label: $LABEL | GPUs: ${GPU_ARR[*]} | Shards: $NUM_SHARDS"

run_bench_dp() {
  local bench="$1"; shift
  local script="$1"; shift
  local full_label="$1"; shift
  local out_root="$1"; shift
  local label_dir="${out_root}/${full_label}"
  mkdir -p "${label_dir}/logs"

  echo ""
  echo "==== [$(date +%T)] Bench: $bench | Label: $full_label ===="
  local pids=()
  for (( i=0; i<NUM_SHARDS; i++ )); do
    local gpu="${GPU_ARR[$i]}"
    local log="${label_dir}/logs/shard${i}of${NUM_SHARDS}.log"
    echo "  -> shard $i on GPU $gpu (log: $log)"
    CUDA_VISIBLE_DEVICES="$gpu" python "$script" \
      "$@" \
      --output-dir "$out_root" \
      --label "$full_label" \
      --shard "$i" --num-shards "$NUM_SHARDS" \
      > "$log" 2>&1 &
    pids+=($!)
  done
  local fail=0
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=$((fail+1))
  done
  if (( fail > 0 )); then
    echo "  !! $fail shard(s) exited with error; check ${label_dir}/logs/"
  fi
  echo "  -> merging shards ..."
  python "$SCRIPTS/merge_shards.py" \
    --bench "$bench" \
    --label-dir "$label_dir" || echo "  !! merge failed"
}

run_bench_dp worldsense "$SCRIPTS/eval_worldsense.py" \
  "ws_${LABEL}" "$EVAL_ROOT/worldsense" \
  --model-id "$MODEL" --data-dir "$WORLDSENSE_DIR" --max-samples -1

run_bench_dp daily_omni "$SCRIPTS/eval_daily_omni.py" \
  "do_${LABEL}" "$EVAL_ROOT/daily_omni" \
  --model-id "$MODEL" --data-dir "$DAILY_OMNI_DIR" --max-samples -1

run_bench_dp videomme "$SCRIPTS/eval_videomme.py" \
  "vmme_${LABEL}" "$EVAL_ROOT/videomme" \
  --model-id "$MODEL" --video-dir "$VIDEOMME_DIR" --max-samples -1

run_bench_dp lvbench "$SCRIPTS/eval_lvbench.py" \
  "lvb_${LABEL}" "$EVAL_ROOT/lvbench" \
  --model-id "$MODEL" --video-dir "$LVBENCH_DIR" --max-samples -1

echo ""
echo "=== Rerun done: $LABEL ==="
for d in \
  "$EVAL_ROOT/worldsense/ws_${LABEL}" \
  "$EVAL_ROOT/daily_omni/do_${LABEL}" \
  "$EVAL_ROOT/videomme/vmme_${LABEL}" \
  "$EVAL_ROOT/lvbench/lvb_${LABEL}"; do
  echo "  ${d}/metrics.json"
done
scripts/_common.py
ADDED
@@ -0,0 +1,66 @@
"""Shared glue for all MiniCPM-o eval scripts.

Loads the CleverHans-Evaluation counterpart scripts under aliased module
names (prefixed with `ch_`), so the MiniCPM-o eval scripts can import their
data loaders / metric functions without filename shadowing.

Usage in an eval script:

    import _common  # noqa: F401
    from _common import ch  # namespace holding ch_eval_videomme etc.

    ch_videomme = ch("videomme")
    data = ch_videomme.load_videomme(...)
"""
from __future__ import annotations

import importlib.util
import os
import sys
import types
from pathlib import Path


_HERE = Path(__file__).resolve().parent
_CLEVERHANS_SCRIPTS = Path(
    os.environ.get(
        "CLEVERHANS_SCRIPTS",
        "/home/ubuntu/CleverHans-Evaluation/scripts",
    )
).resolve()

# Make local (MiniCPM-o) modules importable without package setup.
if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))


_CACHE: dict[str, types.ModuleType] = {}


def ch(short_name: str) -> types.ModuleType:
    """Load a CleverHans-Evaluation script by short name (e.g., 'videomme',
    'lvbench', 'dpo_sync'). Returns the module object.

    Loaded under an aliased module name `ch_eval_<short_name>` so it doesn't
    collide with same-named files in this directory.
    """
    cache_key = short_name
    if cache_key in _CACHE:
        return _CACHE[cache_key]

    script_path = _CLEVERHANS_SCRIPTS / f"eval_{short_name}.py"
    if not script_path.is_file():
        raise FileNotFoundError(
            f"CleverHans-Evaluation script not found: {script_path}\n"
            f"Set CLEVERHANS_SCRIPTS env var to the correct directory."
        )

    alias = f"ch_eval_{short_name}"
    spec = importlib.util.spec_from_file_location(alias, str(script_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not create spec for {script_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module
    spec.loader.exec_module(module)
    _CACHE[cache_key] = module
    return module
scripts/eval_daily_omni.py
ADDED
@@ -0,0 +1,197 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on Daily-Omni.

Daily-Omni videos include embedded audio; we extract it and feed both frames
and waveform to MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("daily_omni")
load_daily_omni = ch.load_daily_omni
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Daily-Omni.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/daily_omni_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_daily_omni")
    p.add_argument("--max-frames", type=int, default=64)
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    p.add_argument("--no-audio", action="store_true",
                   help="Video-only mode (skip audio extraction).")
    p.add_argument(
        "--skip-audio-durations",
        type=str,
        default="",
        help=(
            "Comma-separated `video_duration` values from the dataset for which "
            "audio is omitted (video-only for those clips). Useful when "
            "MiniCPM-o forward fails on some lengths with audio+vision "
            '(e.g. empty `raw_output` and log errors like "Expected size 122 '
            'but got size 121"). Example: --skip-audio-durations 60s'
        ),
    )
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading Daily-Omni dataset...")
    test_data = load_daily_omni(args.data_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(
        args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
    )

    skip_audio_durs = {
        x.strip()
        for x in args.skip_audio_durations.split(",")
        if x.strip()
    }

    for item in tqdm(test_data, desc="Daily-Omni", unit="q"):
        if item["question_id"] in processed:
            continue
        use_audio = not args.no_audio and (
            item.get("video_duration", "") not in skip_audio_durs
        )
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
                use_audio_from_video=use_audio,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "question_type": item.get("question_type", ""),
            "content_parent_category": item.get("content_parent_category", ""),
            "content_fine_category": item.get("content_fine_category", ""),
            "video_category": item.get("video_category", ""),
            "video_duration": item.get("video_duration", ""),
            "question": item["question"],
            "choices": item["choices"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench daily_omni --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_dir": str(args.data_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "no_audio": args.no_audio,
        "skip_audio_durations": sorted(skip_audio_durs),
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_dpo_sync.py
ADDED
@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on the in-domain DPO sync test set.

Reuses the CleverHans-Evaluation dpo_sync eval_dpo_sync.py for data loading,
GT parsing, regex prediction extractor, optional GPT judge, and metrics.
Only the inference path is replaced with MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("dpo_sync")
EVAL_PROMPT = ch.EVAL_PROMPT
load_test_data = ch.load_test_data
set_data_root = ch.set_data_root
extract_prediction = ch.extract_prediction
gpt_extract_prediction = ch.gpt_extract_prediction
_get_openai_client = ch._get_openai_client
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on DPO sync test set.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-root", type=Path,
                   default=Path("/opt/dlami/nvme/video_source"))
    p.add_argument("--test-jsonl", type=Path, default=None,
                   help="Default: <data-root>/kto_training_data_v2_test.jsonl")
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/sync_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=256)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_sync")
    p.add_argument("--max-frames", type=int, default=32,
                   help="Sync clips are short (<30s); 32 frames is plenty.")
    p.add_argument("--fps", type=float, default=2.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: accepted for CLI parity with Qwen3-Omni. MiniCPM-o 4.5
    # multimodal vLLM support is not yet available upstream, so these are
    # currently a no-op (we always run transformers). Kept so the same
    # run_*.sh scripts work across the two models.
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--gpt-judge", action="store_true", default=False)
    p.add_argument("--openai-api-key", type=str, default=None)
    p.add_argument("--gpt-model", type=str, default="gpt-5.4")
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    set_data_root(args.data_root)
    test_jsonl = args.test_jsonl or (args.data_root / "kto_training_data_v2_test.jsonl")

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    if args.gpt_judge:
        if _get_openai_client(args.openai_api_key) is None:
            print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
            raise SystemExit(1)

    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    test_data = load_test_data(test_jsonl, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples")
    else:
        print(f"[data] {len(test_data)} test samples")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["video"])
        print(f"[resume] {len(processed)} already processed")

    def _do_extract(raw_output: str):
        if args.gpt_judge and raw_output:
            gpt_pred = gpt_extract_prediction(
                raw_output, api_key=args.openai_api_key, model=args.gpt_model,
            )
            if gpt_pred is not None:
                return gpt_pred
        return extract_prediction(raw_output)

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=True)

    for item in tqdm(test_data, desc="Sync", unit="sample"):
        if item["video"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=item["audio_path"],
                prompt=EVAL_PROMPT,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['video']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = _do_extract(raw_output)
        result = {
            "video": item["video"],
            "video_path": item["video_path"],
            "gt_synced": item["gt_synced"],
            "gt_direction": item["gt_direction"],
            "gt_offset_sec": item["gt_offset_sec"],
            "gt_t_v": item["gt_t_v"],
            "gt_t_a": item["gt_t_a"],
            "pred_synced": pred["pred_synced"],
            "pred_direction": pred["pred_direction"],
            "pred_offset_sec": pred["pred_offset_sec"],
            "pred_t_v": pred.get("pred_t_v"),
            "pred_t_a": pred.get("pred_t_a"),
            "pred_explanation": pred.get("pred_explanation", ""),
            "parse_method": pred["parse_method"],
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["video"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench dpo_sync --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_root": str(args.data_root),
        "test_jsonl": str(test_jsonl),
        "total_test_samples": len(test_data),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "gpt_judge": args.gpt_judge,
        "gpt_model": args.gpt_model if args.gpt_judge else None,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_lvbench.py
ADDED
@@ -0,0 +1,168 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on LVBench.

Reuses data loader and metrics from CleverHans-Evaluation's eval_lvbench.py.
LVBench is video-only (long video QA); no audio is passed.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("lvbench")
load_lvbench = ch.load_lvbench
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on LVBench.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/lvbench_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_lvbench")
    p.add_argument("--max-frames", type=int, default=96,
                   help="LVBench has long videos; larger frame budget helps.")
    p.add_argument("--fps", type=float, default=0.5)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading LVBench dataset...")
    test_data = load_lvbench(args.video_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["uid"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=False)

    for item in tqdm(test_data, desc="LVBench", unit="q"):
        if item["uid"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f"  [error] {item['uid']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "uid": item["uid"],
            "video_id": item["video_id"],
            "video_type": item["video_type"],
            "question_type": item["question_type"],
            "question": item["question"],
            "gt_answer": item["gt_answer"],
            "time_reference": item.get("time_reference", ""),
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["uid"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench lvbench --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "video_dir": str(args.video_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
scripts/eval_vggsoundsync.py
ADDED
|
@@ -0,0 +1,195 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Evaluate MiniCPM-o 4.5 on VGG-Sound Sync (out-of-domain sync).
|
| 3 |
+
|
| 4 |
+
Reuses the data loader, MCQ / freetext prompts, answer parsers, GPT judge,
|
| 5 |
+
and metrics from CleverHans-Evaluation's eval_vggsoundsync.py. Only the
|
| 6 |
+
inference path is replaced with MiniCPM-o.
|
| 7 |
+
"""
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import _common
|
| 11 |
+
|
| 12 |
+
import argparse
|
| 13 |
+
import gc
|
| 14 |
+
import io
|
| 15 |
+
import contextlib
|
| 16 |
+
import json
|
| 17 |
+
from pathlib import Path
|
| 18 |
+
|
| 19 |
+
import torch
|
| 20 |
+
from tqdm import tqdm
|
| 21 |
+
|
| 22 |
+
ch = _common.ch("vggsoundsync")
|
| 23 |
+
MCQ_PROMPT = ch.MCQ_PROMPT
|
| 24 |
+
FREETEXT_PROMPT = ch.FREETEXT_PROMPT
|
| 25 |
+
load_test_data = ch.load_test_data
|
| 26 |
+
extract_mcq_answer = ch.extract_mcq_answer
|
| 27 |
+
extract_freetext_prediction = ch.extract_freetext_prediction
|
| 28 |
+
gpt_extract_prediction = ch.gpt_extract_prediction
|
| 29 |
+
_get_openai_client = ch._get_openai_client
|
| 30 |
+
compute_metrics = ch.compute_metrics
|
| 31 |
+
print_summary = ch.print_summary
|
| 32 |
+
_build_result = ch._build_result
|
| 33 |
+
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR
|
| 34 |
+
|
| 35 |
+
from minicpmo_inference import load_model, run_inference
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def parse_args() -> argparse.Namespace:
|
| 39 |
+
p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on VGG-Sound Sync.")
|
| 40 |
+
p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
|
| 41 |
+
p.add_argument("--test-jsonl", type=Path, required=True,
|
| 42 |
+
help="test.jsonl from prepare_vggsoundsync.py")
|
| 43 |
+
p.add_argument("--output-dir", type=Path,
|
| 44 |
+
default=Path("/home/ubuntu/eval_results/vggsoundsync_minicpmo"))
|
| 45 |
+
p.add_argument("--mode", choices=["mcq", "freetext"], default="mcq")
|
| 46 |
+
p.add_argument("--max-samples", type=int, default=-1)
|
| 47 |
+
p.add_argument("--max-new-tokens", type=int, default=64)
|
| 48 |
+
p.add_argument("--temperature", type=float, default=0.0)
|
| 49 |
+
p.add_argument("--label", type=str, default="minicpmo_vggsync")
|
| 50 |
+
p.add_argument("--max-frames", type=int, default=32)
|
| 51 |
+
p.add_argument("--fps", type=float, default=2.0)
|
| 52 |
+
p.add_argument("--attn", type=str, default="flash_attention_2",
|
| 53 |
+
choices=["sdpa", "flash_attention_2", "eager"])
|
| 54 |
+
# vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
|
| 55 |
+
p.add_argument("--vllm", action="store_true", default=False,
|
| 56 |
+
help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
|
| 57 |
+
p.add_argument("--tp", type=int, default=None)
|
| 58 |
+
p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
|
| 59 |
+
p.add_argument("--max-model-len", type=int, default=65536)
|
| 60 |
+
p.add_argument("--batch-size", type=int, default=16)
|
| 61 |
+
p.add_argument("--gpt-judge", action="store_true", default=False)
|
| 62 |
+
p.add_argument("--openai-api-key", type=str, default=None)
|
| 63 |
+
p.add_argument("--gpt-model", type=str, default="gpt-5.4")
|
| 64 |
+
# Data-parallel sharding
|
| 65 |
+
p.add_argument("--shard", type=int, default=0)
|
| 66 |
+
p.add_argument("--num-shards", type=int, default=1)
|
| 67 |
+
return p.parse_args()
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def _extract_pred(raw_output, mode, gpt_judge, api_key, gpt_model, answer_map=None):
|
| 71 |
+
if mode == "mcq":
|
| 72 |
+
return extract_mcq_answer(raw_output, answer_map=answer_map)
|
| 73 |
+
if gpt_judge and raw_output:
|
| 74 |
+
        gpt_pred = gpt_extract_prediction(raw_output, api_key=api_key, model=gpt_model)
        if gpt_pred is not None:
            return gpt_pred
    return extract_freetext_prediction(raw_output)


def main() -> None:
    args = parse_args()
    default_prompt = MCQ_PROMPT if args.mode == "mcq" else FREETEXT_PROMPT

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")

    if args.gpt_judge and args.mode == "freetext":
        if _get_openai_client(args.openai_api_key) is None:
            print("[ERROR] --gpt-judge requires OPENAI_API_KEY or --openai-api-key.")
            raise SystemExit(1)

    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    test_data = load_test_data(args.test_jsonl, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} samples (mode={args.mode})")
    else:
        print(f"[data] {len(test_data)} samples loaded (mode={args.mode})")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["uid"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=True)

    for item in tqdm(test_data, desc="VGGSync", unit="sample"):
        if item["uid"] in processed:
            continue

        item_prompt = item.get("mcq_prompt", default_prompt) if args.mode == "mcq" else default_prompt
        item_answer_map = item.get("mcq_answer_map") if args.mode == "mcq" else None

        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=item["audio_path"],
                prompt=item_prompt,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['uid']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = _extract_pred(raw_output, args.mode, args.gpt_judge,
                             args.openai_api_key, args.gpt_model,
                             answer_map=item_answer_map)
        result = _build_result(item, pred, raw_output, args.mode)

        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["uid"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench vggsoundsync --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "mode": args.mode,
        "test_jsonl": str(args.test_jsonl),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "gpt_judge": args.gpt_judge,
    }

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
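
The `--shard`/`--num-shards` pair above is a plain round-robin split (`i % num_shards == shard`), so K concurrent processes cover the test set exactly once, and each shard resumes independently from its own `eval_results.shardKofN.jsonl`. A minimal launcher sketch, assuming one shard per GPU and the shard count / label below (both hypothetical):

```python
#!/usr/bin/env python3
# Illustrative launcher only; shard count, label, and paths are assumptions.
import os
import subprocess

NUM_SHARDS = 4  # hypothetical: one shard per visible GPU

procs = []
for shard in range(NUM_SHARDS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(shard))  # pin one GPU per process
    procs.append(subprocess.Popen(
        ["python", "scripts/eval_vggsoundsync.py",
         "--shard", str(shard), "--num-shards", str(NUM_SHARDS),
         "--label", "vgg_minicpmo_45"],
        env=env,
    ))
for p in procs:
    p.wait()
# Then: python scripts/merge_shards.py --bench vggsoundsync --label-dir <output-dir>/vgg_minicpmo_45
```
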
scripts/eval_videomme.py
ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on Video-MME.

Reuses the data loader and metrics from CleverHans-Evaluation's Qwen3-Omni
eval_videomme.py and swaps out the inference with MiniCPM-o. Video-MME is
video-only (no audio), so we do NOT pass audio in.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("videomme")
load_videomme = ch.load_videomme
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_VIDEO_DIR = ch.DEFAULT_VIDEO_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on Video-MME.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--video-dir", type=Path, default=DEFAULT_VIDEO_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/videomme_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_videomme")
    p.add_argument("--max-frames", type=int, default=64,
                   help="Max frames sampled from each video (MiniCPM-o uses "
                        "PIL images).")
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding: split test set into K slices, process slice N
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading Video-MME dataset...")
    test_data = load_videomme(args.video_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed, skipping")

    model, tokenizer = load_model(args.model_id, attn_implementation=args.attn,
                                  init_audio=False)

    for item in tqdm(test_data, desc="Video-MME", unit="q"):
        if item["question_id"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "duration": item["duration"],
            "domain": item["domain"],
            "sub_category": item["sub_category"],
            "task_type": item["task_type"],
            "question": item["question"],
            "options": item["options"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench videomme --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "video_dir": str(args.video_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
    }

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
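
`extract_answer` above comes from the shared CleverHans-Evaluation code and is not reproduced in this repo. As a rough illustration of what MCQ letter extraction typically has to handle (this sketch is not the shared implementation):

```python
import re
from typing import Optional


def extract_answer_sketch(raw_output: str) -> Optional[str]:
    """Illustrative stand-in only: pick the first standalone A-D token.

    Covers outputs such as "B", "Answer: C.", or "The answer is (D)".
    A bare article "A" can false-positive; a real extractor is stricter.
    """
    m = re.search(r"\b([A-D])\b", raw_output.upper())
    return m.group(1) if m else None
```
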
scripts/eval_worldsense.py
ADDED
@@ -0,0 +1,175 @@
#!/usr/bin/env python3
"""Evaluate MiniCPM-o 4.5 on WorldSense.

WorldSense videos have embedded audio; we extract it via ffmpeg and feed
both the video frames and the audio waveform to MiniCPM-o.
"""
from __future__ import annotations

import _common

import argparse
import gc
import io
import contextlib
import json
from pathlib import Path

import torch
from tqdm import tqdm

ch = _common.ch("worldsense")
load_worldsense = ch.load_worldsense
extract_answer = ch.extract_answer
compute_metrics = ch.compute_metrics
print_summary = ch.print_summary
DEFAULT_DATA_DIR = ch.DEFAULT_DATA_DIR
DEFAULT_OUTPUT_DIR = ch.DEFAULT_OUTPUT_DIR

from minicpmo_inference import load_model, run_inference


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Evaluate MiniCPM-o on WorldSense.")
    p.add_argument("--model-id", type=str, default="openbmb/MiniCPM-o-4_5")
    p.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    p.add_argument("--output-dir", type=Path,
                   default=Path("/home/ubuntu/eval_results/worldsense_minicpmo"))
    p.add_argument("--max-samples", type=int, default=-1)
    p.add_argument("--max-new-tokens", type=int, default=32)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--label", type=str, default="minicpmo_worldsense")
    p.add_argument("--max-frames", type=int, default=64)
    p.add_argument("--fps", type=float, default=1.0)
    p.add_argument("--attn", type=str, default="flash_attention_2",
                   choices=["sdpa", "flash_attention_2", "eager"])
    p.add_argument("--no-audio", action="store_true",
                   help="Video-only mode (skip audio extraction).")
    # vLLM flags: parity-only (MiniCPM-o 4.5 multimodal vLLM not yet supported).
    p.add_argument("--vllm", action="store_true", default=False,
                   help="(no-op for MiniCPM-o 4.5; auto-falls back to transformers).")
    p.add_argument("--tp", type=int, default=None)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.90)
    p.add_argument("--max-model-len", type=int, default=65536)
    p.add_argument("--batch-size", type=int, default=32)
    # Data-parallel sharding
    p.add_argument("--shard", type=int, default=0)
    p.add_argument("--num-shards", type=int, default=1)
    return p.parse_args()


def main() -> None:
    args = parse_args()
    out_dir = args.output_dir / args.label
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_suffix = (f".shard{args.shard}of{args.num_shards}"
                    if args.num_shards > 1 else "")
    results_jsonl = out_dir / f"eval_results{shard_suffix}.jsonl"
    metrics_json = out_dir / "metrics.json"
    summary_txt = out_dir / "summary.txt"

    if args.vllm:
        print("[warn] --vllm requested but MiniCPM-o 4.5 multimodal vLLM is not "
              "supported upstream yet; falling back to transformers.")
    print("[data] Loading WorldSense dataset...")
    test_data = load_worldsense(args.data_dir, args.max_samples)
    if args.num_shards > 1:
        test_data = [x for i, x in enumerate(test_data) if i % args.num_shards == args.shard]
        print(f"[shard] shard {args.shard}/{args.num_shards}: {len(test_data)} questions")
    else:
        print(f"[data] {len(test_data)} questions ready")

    processed: set = set()
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                obj = json.loads(line)
                processed.add(obj["question_id"])
        print(f"[resume] {len(processed)} already processed")

    model, tokenizer = load_model(
        args.model_id, attn_implementation=args.attn, init_audio=not args.no_audio,
    )

    for item in tqdm(test_data, desc="WorldSense", unit="q"):
        if item["question_id"] in processed:
            continue
        try:
            raw_output = run_inference(
                model, tokenizer,
                video_path=item["video_path"],
                audio_path=None,
                prompt=item["prompt"],
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                max_frames=args.max_frames,
                fps=args.fps,
                use_audio_from_video=not args.no_audio,
            )
        except Exception as exc:
            import traceback
            print(f" [error] {item['question_id']}: {exc}")
            traceback.print_exc()
            raw_output = ""

        pred = extract_answer(raw_output)
        result = {
            "question_id": item["question_id"],
            "video_id": item["video_id"],
            "duration": item["duration"],
            "domain": item["domain"],
            "sub_category": item["sub_category"],
            "task_domain": item["task_domain"],
            "task_type": item["task_type"],
            "question": item["question"],
            "candidates": item["candidates"],
            "gt_answer": item["gt_answer"],
            "pred_answer": pred,
            "raw_output": raw_output,
        }
        with open(results_jsonl, "a", encoding="utf-8") as f:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

        processed.add(item["question_id"])
        gc.collect()
        torch.cuda.empty_cache()

    if args.num_shards > 1:
        print(f"\n[shard {args.shard}/{args.num_shards}] Done. Results: {results_jsonl}")
        print(f"[shard] Run merge_shards.py --bench worldsense --label-dir {out_dir}")
        return

    all_results = []
    if results_jsonl.exists():
        with open(results_jsonl) as f:
            for line in f:
                all_results.append(json.loads(line))

    metrics = compute_metrics(all_results)
    metrics["eval_config"] = {
        "model_id": args.model_id,
        "data_dir": str(args.data_dir),
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "max_frames": args.max_frames,
        "fps": args.fps,
        "attn": args.attn,
        "no_audio": args.no_audio,
    }
    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    print_summary(metrics, args.label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            print_summary(metrics, args.label)
        f.write(buf.getvalue())

    print(f"\n[output] Results: {results_jsonl}")
    print(f"[output] Metrics: {metrics_json}")
    print(f"[output] Summary: {summary_txt}")


if __name__ == "__main__":
    main()
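
WorldSense relies on the embedded audio track, but not every mp4 necessarily carries one; `extract_audio_from_video` in minicpmo_inference.py (further down) simply returns None in that case. To pre-screen a directory instead, a small ffprobe check could look like this (assumes ffprobe is on PATH):

```python
import json
import subprocess


def has_audio_stream(video_path: str) -> bool:
    """True if ffprobe reports at least one audio stream in the container."""
    proc = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=index", "-of", "json", video_path],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode != 0:
        return False
    return bool(json.loads(proc.stdout or "{}").get("streams"))
```
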
scripts/merge_shards.py
ADDED
@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""Merge sharded eval_results.shard*.jsonl files and recompute metrics.

Usage:
    python merge_shards.py --bench videomme \
        --label-dir /home/ubuntu/eval_results/videomme/vmme_minicpmo_4_5

The script finds all `eval_results.shard*.jsonl` under `--label-dir`,
concatenates them into `eval_results.jsonl` (deduping by a bench-specific
primary key), then re-runs the bench's `compute_metrics` + `print_summary`.
Final outputs: `eval_results.jsonl`, `metrics.json`, `summary.txt`.
"""
from __future__ import annotations

import _common

import argparse
import contextlib
import io
import json
import sys
from pathlib import Path


# Primary key per bench (must match the field written by each eval script).
PK = {
    "videomme": "question_id",
    "lvbench": "uid",
    "worldsense": "question_id",
    "daily_omni": "question_id",
    "dpo_sync": "video",
    "vggsoundsync": "uid",
}

# Extra label used when printing the summary
LABEL_HINT = {
    "videomme": "Video-MME",
    "lvbench": "LVBench",
    "worldsense": "WorldSense",
    "daily_omni": "Daily-Omni",
    "dpo_sync": "Sync",
    "vggsoundsync": "VGGSoundSync",
}


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--bench", required=True,
                   choices=list(PK.keys()),
                   help="Which benchmark this label-dir belongs to.")
    p.add_argument("--label-dir", type=Path, required=True,
                   help="Eval output dir containing eval_results.shard*.jsonl.")
    args = p.parse_args()

    ch = _common.ch(args.bench)
    pk = PK[args.bench]

    shard_files = sorted(args.label_dir.glob("eval_results.shard*.jsonl"))
    if not shard_files:
        print(f"[merge] ERROR: no eval_results.shard*.jsonl in {args.label_dir}",
              file=sys.stderr)
        return 1

    print(f"[merge] Found {len(shard_files)} shard file(s):")
    for sf in shard_files:
        print(f" - {sf.name}")

    merged_path = args.label_dir / "eval_results.jsonl"
    all_results = []
    seen: set = set()
    n_dup = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for sf in shard_files:
            with open(sf) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    obj = json.loads(line)
                    key = obj.get(pk)
                    if key in seen:
                        n_dup += 1
                        continue
                    seen.add(key)
                    out.write(line + "\n")
                    all_results.append(obj)

    print(f"[merge] Merged {len(all_results)} unique results "
          f"({n_dup} duplicates skipped) -> {merged_path}")

    metrics = ch.compute_metrics(all_results)
    # Preserve eval_config from any shard if present
    for sf in shard_files:
        try:
            with open(sf) as f:
                first = f.readline().strip()
                if first:
                    obj = json.loads(first)
                    if "eval_config" in obj:
                        metrics["eval_config"] = obj["eval_config"]
                        break
        except Exception:
            pass

    metrics_json = args.label_dir / "metrics.json"
    summary_txt = args.label_dir / "summary.txt"

    with open(metrics_json, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, ensure_ascii=False)

    label = args.label_dir.name
    ch.print_summary(metrics, label)

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        ch.print_summary(metrics, label)
    with open(summary_txt, "w", encoding="utf-8") as f:
        f.write(buf.getvalue())

    print("\n[merge] Done.")
    print(f" Results: {merged_path}")
    print(f" Metrics: {metrics_json}")
    print(f" Summary: {summary_txt}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
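
After merging, it is worth checking that the shards actually covered the full test set; the dedupe above only removes overlap, it cannot detect a shard that died early. A sketch, where `expected_total` is whatever the unsharded loader reports:

```python
import json
from pathlib import Path


def check_merged(merged_path: Path, expected_total: int, pk: str = "question_id") -> None:
    """Verify one row per primary key and report how many items are missing."""
    keys = [json.loads(line)[pk]
            for line in merged_path.read_text().splitlines() if line.strip()]
    assert len(keys) == len(set(keys)), "duplicate primary keys survived the merge"
    print(f"{len(keys)}/{expected_total} results present "
          f"({expected_total - len(keys)} missing)")
```
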
scripts/minicpmo_inference.py
ADDED
@@ -0,0 +1,264 @@
"""
Common inference wrapper for MiniCPM-o 4.5.

MiniCPM-o's API is `model.chat(msgs=[...], tokenizer=...)` where `msgs` is a
list of `{"role": ..., "content": [image, audio, ..., text]}`. This module
hides that detail behind `run_inference(model, tokenizer, video, audio,
prompt)` so the 6 benchmark eval scripts can share one inference code path.

Also runs the compatibility patcher on import so users who haven't run
`setup_env.sh` still get a working model.
"""

from __future__ import annotations

import os
import subprocess
import tempfile
from pathlib import Path
from typing import Any, List, Optional, Tuple

import numpy as np


# ---------------------------------------------------------------------------
# Apply transformers>=4.52 compatibility patches lazily on import.
# Safe to call multiple times; idempotent.
# ---------------------------------------------------------------------------
def _maybe_patch_once() -> None:
    try:
        from patch_minicpmo import (
            _find_modeling_file,
            _find_processing_file,
            patch_file,
            patch_processing_file,
        )
    except ImportError:
        return
    path = _find_modeling_file()
    if path is not None:
        try:
            patch_file(path)
        except Exception as exc:  # pragma: no cover
            print(f"[minicpmo] (warn) patch failed: {exc}")
    proc = _find_processing_file()
    if proc is not None:
        try:
            patch_processing_file(proc)
        except Exception as exc:  # pragma: no cover
            print(f"[minicpmo] (warn) processing patch failed: {exc}")


_maybe_patch_once()


def _max_inp_length_for_chat(model: Any, max_new_tokens: int) -> int:
    """Upper bound for ``model.chat(..., max_inp_length=...)`` (defaults to 8192).

    Many frames × per-frame image placeholders can exceed 8k text tokens; the
    processor then truncates ``input_ids`` and image start/end counts diverge,
    causing ``RuntimeError`` in ``processing_minicpmo._convert``.
    """
    reserve = int(max_new_tokens) + 1024
    best = 32768
    for cfg in (
        getattr(model, "config", None),
        getattr(getattr(model, "llm", None), "config", None),
    ):
        if cfg is None:
            continue
        npos = getattr(cfg, "max_position_embeddings", None)
        if isinstance(npos, int) and npos > 8192:
            best = min(best, max(npos - reserve, 16384))
    return best


# ---------------------------------------------------------------------------
# Frame / audio loaders
# ---------------------------------------------------------------------------
def load_video_frames(video_path: str, max_frames: int = 32,
                      fps: float = 1.0) -> List:
    """Sample PIL RGB frames uniformly from a video.

    MiniCPM-o expects a list of PIL Images (not a tensor). `fps=1.0,
    max_frames=32` covers ~32s; longer videos get sparser sampling.
    """
    from PIL import Image
    import decord

    vr = decord.VideoReader(video_path, num_threads=1)
    total_frames = len(vr)
    video_fps = vr.get_avg_fps()
    duration = total_frames / max(video_fps, 1e-6)

    target = max(int(round(fps * duration)), 2)
    target = min(target, max_frames)
    target = min(target, total_frames)

    idx = np.linspace(0, total_frames - 1, target).round().astype(int).tolist()
    frames = vr.get_batch(idx).asnumpy()
    return [Image.fromarray(f).convert("RGB") for f in frames]


def load_audio_waveform(audio_path: str, target_sr: int = 16000) -> np.ndarray:
    """Load audio as float32 numpy in [-1, 1] at `target_sr`."""
    import librosa
    y, _ = librosa.load(audio_path, sr=target_sr, mono=True)
    return y.astype(np.float32)


def extract_audio_from_video(video_path: str, target_sr: int = 16000,
                             tmp_dir: Optional[str] = None) -> Optional[str]:
    """Extract the audio track from a video file to a temp .wav via ffmpeg.

    Returns the path to the .wav file, or None if the video has no audio
    track or extraction fails. Caller is responsible for cleanup.
    """
    tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="mo_audio_")
    out = os.path.join(tmp_dir, "audio.wav")
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "error", "-i", video_path,
             "-vn", "-ac", "1", "-ar", str(target_sr), out],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=120,
        )
    except Exception:
        return None
    if not os.path.isfile(out) or os.path.getsize(out) < 64:
        return None
    return out


# ---------------------------------------------------------------------------
# Model loading
# ---------------------------------------------------------------------------
def load_model(model_id: str = "openbmb/MiniCPM-o-4_5",
               device: str = "cuda",
               dtype: str = "bfloat16",
               init_audio: bool = True,
               attn_implementation: str = "flash_attention_2"):
    """Load MiniCPM-o model + tokenizer. Returns (model, tokenizer).

    Tries `attn_implementation` first; if flash_attention_2 isn't installed or
    the backbone doesn't support it, falls back to sdpa automatically.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    torch_dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16,
                   "float32": torch.float32}[dtype]

    def _try_load(attn: str):
        print(f"[minicpmo] Loading {model_id} (dtype={dtype}, device={device}, "
              f"init_audio={init_audio}, attn={attn})...")
        return AutoModel.from_pretrained(
            model_id,
            trust_remote_code=True,
            attn_implementation=attn,
            torch_dtype=torch_dtype,
            init_vision=True,
            init_audio=init_audio,
            init_tts=False,
        )

    try:
        model = _try_load(attn_implementation)
    except Exception as exc:
        if attn_implementation != "sdpa":
            print(f"[minicpmo] (warn) {attn_implementation} failed ({exc}); falling back to sdpa.")
            model = _try_load("sdpa")
        else:
            raise

    model = model.eval().to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print("[minicpmo] Model ready.")
    return model, tokenizer


# ---------------------------------------------------------------------------
# Inference
# ---------------------------------------------------------------------------
def run_inference(
    model,
    tokenizer,
    video_path: Optional[str],
    audio_path: Optional[str],
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.0,
    max_frames: int = 32,
    fps: float = 1.0,
    use_audio_from_video: bool = False,
) -> str:
    """Run MiniCPM-o chat inference.

    Args:
        video_path: optional path to an mp4/etc. file.
        audio_path: optional path to a wav file. If `use_audio_from_video` is
            True and `audio_path` is None, we extract audio from the video.
        prompt: user instruction text.
        temperature: 0 means greedy.
        use_audio_from_video: if True, extract audio from the video automatically
            (useful for WorldSense / Daily-Omni where video has embedded audio but
            no separate wav is provided).
    """
    content: List[Any] = []
    tmp_audio_dir: Optional[str] = None

    if video_path is not None:
        frames = load_video_frames(video_path, max_frames=max_frames, fps=fps)
        content.extend(frames)

    if audio_path is None and use_audio_from_video and video_path is not None:
        tmp_audio_dir = tempfile.mkdtemp(prefix="mo_audio_")
        audio_path = extract_audio_from_video(video_path, tmp_dir=tmp_audio_dir)

    if audio_path is not None:
        try:
            audio = load_audio_waveform(audio_path, target_sr=16000)
            if audio.size > 0:
                content.append(audio)
        except Exception as exc:
            print(f" [minicpmo] (warn) audio load failed: {exc}")

    content.append(prompt)

    msgs = [{"role": "user", "content": content}]

    # Critical defaults for video understanding (see MiniCPM-o 4.5 HF README
    # "Chat with Video"): without ``use_image_id=False, max_slice_nums=1`` the
    # processor treats each frame as an independent HD image, slicing it into
    # multiple sub-images with per-image ID tokens. That token distribution is
    # OOD for the video-trained model and produces degenerate output (repeated
    # training-data fragments, e.g. "the image description of the first image
    # you see as a brief description ...").
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else 1.0,
        top_p=0.9 if temperature > 0 else 1.0,
        max_inp_length=_max_inp_length_for_chat(model, max_new_tokens),
        use_tts_template=False,
        enable_thinking=False,
    )
    if video_path is not None:
        gen_kwargs["use_image_id"] = False
        gen_kwargs["max_slice_nums"] = 1
    if use_audio_from_video and video_path is not None:
        gen_kwargs.setdefault("omni_mode", True)
    try:
        res = model.chat(msgs=msgs, tokenizer=tokenizer, **gen_kwargs)
    except TypeError:
        res = model.chat(msgs=msgs, tokenizer=tokenizer)

    if tmp_audio_dir is not None:
        import shutil
        shutil.rmtree(tmp_audio_dir, ignore_errors=True)

    if isinstance(res, tuple):
        res = res[0]
    return str(res).strip()
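
Typical interactive use of this wrapper (the media path below is a placeholder):

```python
from minicpmo_inference import load_model, run_inference

model, tokenizer = load_model(init_audio=True)  # first call downloads the weights
answer = run_inference(
    model, tokenizer,
    video_path="/path/to/clip.mp4",   # placeholder
    audio_path=None,
    prompt="Describe what happens in this video in one sentence.",
    max_new_tokens=64,
    temperature=0.0,                  # greedy decoding
    max_frames=32,
    use_audio_from_video=True,        # feed the embedded audio track too
)
print(answer)
```
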
scripts/patch_minicpmo.py
ADDED
@@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""Patch MiniCPM-o 4.5 custom code in the Hugging Face modules cache.

``modeling_minicpmo.py`` (transformers >= 4.52):

1. `WhisperEncoderLayer.forward` unpacks 3 values from `self.self_attn(...)`,
   but the new `WhisperAttention.forward` returns 2 values.
2. `prepare_inputs_for_generation` reads `past_key_values.seen_tokens`, which
   was removed from `DynamicCache`.
3. `chat()` force-sets ``use_tts_template = True`` whenever audio is in the
   ``content`` list. That appends ``<|tts_bos|>`` to the assistant prefix
   and the model then generates **audio (TTS codec) ids**; decoded as text
   they look like ``<think>`` floods / gibberish. We want audio-in +
   **text-out** for benchmark eval, so respect the caller's kwarg instead.

``processing_minicpmo.py``:

4. `_convert` used ``max(len(image_start_idx), len(image_end_idx))`` when
   building ``image_bounds``; after ``max_length`` truncation start/end counts
   can differ by one and ``torch.hstack`` raises (common with many video
   frames under the default ``chat(..., max_inp_length=8192)``). Use ``min``.

Idempotent. Also downloads model code on demand so files exist before patching.
"""
from __future__ import annotations

import os
import sys
from pathlib import Path

MODEL_ID = "openbmb/MiniCPM-o-4_5"


def _find_modeling_file() -> Path | None:
    """Locate the cached modeling_minicpmo.py (matches HF's module dir naming)."""
    home = Path(os.path.expanduser("~"))
    candidates = [
        home / ".cache" / "huggingface" / "modules" / "transformers_modules",
    ]
    hits: list[Path] = []
    for root in candidates:
        if not root.exists():
            continue
        for p in root.rglob("modeling_minicpmo.py"):
            hits.append(p)
    if not hits:
        return None
    # Prefer the deepest (snapshot-hashed) one.
    hits.sort(key=lambda p: len(p.parts), reverse=True)
    return hits[0]


def _find_processing_file() -> Path | None:
    """``processing_minicpmo.py`` lives next to the cached ``modeling_minicpmo.py``."""
    modeling = _find_modeling_file()
    if modeling is None:
        return None
    proc = modeling.parent / "processing_minicpmo.py"
    return proc if proc.is_file() else None


def _download_model_code() -> None:
    """Force HF to download MiniCPM-o's custom code so the file is cached.

    We only need the Python files + config (not weights) for patching. We use
    `hf_hub_download` for the individual code files to avoid fetching the
    multi-GB safetensors shards just to patch a .py file.
    """
    try:
        from huggingface_hub import hf_hub_download
    except ImportError:
        print("[patch] huggingface_hub not installed; skipping auto-download.")
        return

    for fn in [
        "config.json",
        "configuration_minicpm.py",
        "modeling_minicpmo.py",
        "modeling_navit_siglip.py",
        "processing_minicpmo.py",
        "resampler.py",
        "utils.py",
    ]:
        try:
            hf_hub_download(repo_id=MODEL_ID, filename=fn)
        except Exception as exc:
            # Some files may not exist in every revision; that's fine.
            print(f"[patch] (warn) could not fetch {fn}: {exc}")


def patch_whisper_unpack(text: str) -> tuple[str, bool]:
    """Fix #1: WhisperAttention now returns 2 values, not 3."""
    # NOTE: the literals below must byte-match the upstream source, including
    # indentation, or the patch no-ops (handled gracefully below).
    OLD = (
        "        hidden_states, attn_weights, past_key_values = self.self_attn(\n"
        "            hidden_states=hidden_states,\n"
        "            attention_mask=attention_mask,\n"
        "            layer_head_mask=layer_head_mask,\n"
        "            output_attentions=output_attentions,\n"
        "            past_key_value=past_key_values,\n"
        "        )"
    )
    NEW = (
        "        _attn_out = self.self_attn(\n"
        "            hidden_states=hidden_states,\n"
        "            attention_mask=attention_mask,\n"
        "            layer_head_mask=layer_head_mask,\n"
        "            output_attentions=output_attentions,\n"
        "            past_key_value=past_key_values,\n"
        "        )\n"
        "        if len(_attn_out) == 3:\n"
        "            hidden_states, attn_weights, past_key_values = _attn_out\n"
        "        else:\n"
        "            hidden_states, attn_weights = _attn_out"
    )
    if NEW.split("\n", 1)[0] in text:
        return text, False  # already patched
    if OLD not in text:
        return text, False  # not applicable (different revision?)
    return text.replace(OLD, NEW), True


def patch_seen_tokens(text: str) -> tuple[str, bool]:
    """Fix #2: DynamicCache.seen_tokens was removed in newer transformers."""
    OLD = (
        "                cache_length = past_key_values.get_seq_length()\n"
        "                past_length = past_key_values.seen_tokens"
    )
    NEW = (
        "                cache_length = past_key_values.get_seq_length()\n"
        "                past_length = getattr(past_key_values, \"seen_tokens\", cache_length)"
    )
    if 'getattr(past_key_values, "seen_tokens"' in text:
        return text, False  # already patched
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_chat_force_tts_template(text: str) -> tuple[str, bool]:
    """Fix #3: don't force ``use_tts_template=True`` on audio-containing content.

    MiniCPM-o's ``chat()`` assumes "audio in implies TTS audio out". For MCQ /
    freetext eval we want a text answer; the caller's ``use_tts_template`` kwarg
    (default ``False``) must win so the assistant prefix doesn't get
    ``<|tts_bos|>`` appended (which causes the LM to emit audio-codec ids that
    look like ``<think>`` repetitions when text-decoded).
    """
    OLD = (
        '                elif isinstance(c, np.ndarray):  # audio\n'
        '                    audios.append(c)\n'
        '                    audio_parts.append(i)\n'
        '                    cur_msgs.append("<audio>./</audio>")\n'
        '                    use_tts_template = True\n'
    )
    NEW = (
        '                elif isinstance(c, np.ndarray):  # audio\n'
        '                    audios.append(c)\n'
        '                    audio_parts.append(i)\n'
        '                    cur_msgs.append("<audio>./</audio>")\n'
        '                    # PATCHED: honour caller-provided use_tts_template.\n'
        '                    # Upstream force-sets True on any audio, which makes the model\n'
        '                    # generate TTS codec ids (look like <think> noise as text).\n'
    )
    if "PATCHED: honour caller-provided use_tts_template" in text:
        return text, False
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_processor_image_bounds(text: str) -> tuple[str, bool]:
    """Fix ``image_bounds`` when start/end marker counts disagree (truncation)."""
    OLD = "        valid_image_nums = max(len(image_start_idx), len(image_end_idx))"
    NEW = (
        "        # Pair only complete spans; max() breaks torch.hstack if counts differ.\n"
        "        valid_image_nums = min(len(image_start_idx), len(image_end_idx))"
    )
    if "valid_image_nums = min(len(image_start_idx), len(image_end_idx))" in text:
        return text, False
    if OLD not in text:
        return text, False
    return text.replace(OLD, NEW), True


def patch_file(path: Path) -> bool:
    original = path.read_text()
    text = original
    any_change = False

    text, c1 = patch_whisper_unpack(text)
    any_change |= c1
    text, c2 = patch_seen_tokens(text)
    any_change |= c2
    text, c3 = patch_chat_force_tts_template(text)
    any_change |= c3

    if any_change:
        backup = path.with_suffix(path.suffix + ".bak")
        if not backup.exists():
            backup.write_text(original)
            print(f"[patch] Backup -> {backup}")
        path.write_text(text)
        print(f"[patch] Patched {path.name}: "
              f"whisper_unpack={c1}, seen_tokens={c2}, chat_tts_template={c3}")
    else:
        print("[patch] No changes needed (already patched or unknown revision)")
    return any_change


def patch_processing_file(path: Path) -> bool:
    """Patch ``processing_minicpmo.py`` (image_bounds hstack)."""
    original = path.read_text()
    text = original
    text, c = patch_processor_image_bounds(text)
    if not c:
        print(f"[patch] {path.name}: image_bounds already patched or pattern missing")
        return False
    backup = path.with_suffix(path.suffix + ".bak")
    if not backup.exists():
        backup.write_text(original)
        print(f"[patch] Backup -> {backup}")
    path.write_text(text)
    print(f"[patch] Patched {path.name}: image_bounds min() fix")
    return True


def main() -> int:
    path = _find_modeling_file()
    if path is None:
        print("[patch] modeling_minicpmo.py not cached yet; fetching from HF...")
        _download_model_code()
        path = _find_modeling_file()
    if path is None:
        print("[patch] ERROR: could not locate modeling_minicpmo.py", file=sys.stderr)
        return 1
    print(f"[patch] Target: {path}")
    patch_file(path)

    proc = _find_processing_file()
    if proc is not None:
        print(f"[patch] Target: {proc}")
        patch_processing_file(proc)
    else:
        print("[patch] (warn) processing_minicpmo.py not found next to modeling; "
              "run once with HF cache populated")

    # Invalidate __pycache__ so the edited file is re-imported.
    import shutil
    for pc in path.parent.rglob("__pycache__"):
        shutil.rmtree(pc, ignore_errors=True)
    return 0


if __name__ == "__main__":
    sys.exit(main())
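
Each patch function returns `(text, changed)` and checks for its own marker first, so re-running is safe. That property can be exercised offline, without touching the HF cache; a minimal self-test sketch against the processor fix:

```python
from patch_minicpmo import patch_processor_image_bounds

sample = "        valid_image_nums = max(len(image_start_idx), len(image_end_idx))"
once, changed1 = patch_processor_image_bounds(sample)
twice, changed2 = patch_processor_image_bounds(once)
assert changed1 and not changed2  # applies exactly once, then no-ops
assert once == twice
```
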
scripts/test_minicpmo.py
ADDED
@@ -0,0 +1,62 @@
#!/usr/bin/env python3
"""
Sanity check: load MiniCPM-o 4.5 and run a single sample through it.

Picks one video from the sync eval set, passes video + audio + prompt, and
prints the model's response.
"""

from __future__ import annotations

import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))
from minicpmo_inference import load_model, run_inference


def main():
    # Pick the first original video in the sync eval set
    original_root = Path("/opt/dlami/nvme/video_source/original/uag_oops")
    audio_root = Path("/opt/dlami/nvme/video_source/extracted_audio/original/uag_oops")

    videos = sorted(original_root.glob("*.mp4"))
    if not videos:
        print(f"ERROR: no videos found at {original_root}")
        sys.exit(1)

    video_path = videos[0]
    audio_path = audio_root / f"{video_path.stem}.wav"
    if not audio_path.exists():
        print(f"ERROR: audio not found for {video_path.name}")
        sys.exit(1)

    print(f"Video: {video_path}")
    print(f"Audio: {audio_path}")
    print()

    model, tokenizer = load_model()

    prompt = (
        "Watch this video and listen to its audio carefully. "
        "Determine whether the audio and video tracks are synchronized. "
        "Explain your reasoning."
    )

    print("=== Running inference ===")
    response = run_inference(
        model, tokenizer,
        video_path=str(video_path),
        audio_path=str(audio_path),
        prompt=prompt,
        max_new_tokens=128,
        temperature=0.0,
    )
    print()
    print("=== Response ===")
    print(response)


if __name__ == "__main__":
    main()
scripts/upload_to_hf_model.py
ADDED
@@ -0,0 +1,84 @@
#!/usr/bin/env python3
"""Create or update a Hugging Face **model** repo with this evaluation codebase.

This upload is **code and docs only** (no MiniCPM-o weights). HF allows model
repos to host auxiliary artifacts; use ``--private`` if you do not want the
scripts public.

Prerequisites::

    pip install huggingface_hub
    export HF_TOKEN=hf_...      # or: huggingface-cli login

Usage::

    cd MiniCPM-Evaluation
    python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation

Private repo::

    python scripts/upload_to_hf_model.py --repo-id YourUsername/MiniCPM-Evaluation --private
"""
from __future__ import annotations

import argparse
import sys
from pathlib import Path


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--repo-id",
        required=True,
        help="HF repository id, e.g. username/MiniCPM-Evaluation",
    )
    p.add_argument(
        "--private",
        action="store_true",
        help="Create the repo as private (only for first create; ignored if repo exists).",
    )
    p.add_argument(
        "--token",
        default=None,
        help="HF token (default: HF_TOKEN env or cached huggingface-cli login).",
    )
    args = p.parse_args()

    root = Path(__file__).resolve().parent.parent
    if not (root / "README.md").is_file():
        print(f"error: unexpected layout; expected README.md under {root}", file=sys.stderr)
        return 1

    try:
        from huggingface_hub import HfApi, create_repo
    except ImportError:
        print("error: install huggingface_hub: pip install huggingface_hub", file=sys.stderr)
        return 1

    api = HfApi(token=args.token)
    create_repo(
        repo_id=args.repo_id,
        repo_type="model",
        private=args.private,
        exist_ok=True,
        token=args.token,  # pass --token through; otherwise HF_TOKEN / cached login is used
    )
    api.upload_folder(
        folder_path=str(root),
        repo_id=args.repo_id,
        repo_type="model",
        ignore_patterns=[
            ".git/**",
            ".git",
            "**/__pycache__/**",
            "**/*.pyc",
            "**/.DS_Store",
        ],
    )
    print(f"Uploaded: {root}")
    print(f"URL: https://huggingface.co/{args.repo_id}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
setup_env.sh
ADDED
@@ -0,0 +1,80 @@
#!/usr/bin/env bash
# MiniCPM-o 4.5 evaluation environment setup.
#
# Creates a separate conda env 'minicpmo' because MiniCPM-o has its own
# dependency stack (librosa, decord, sentencepiece pin, etc.) that may conflict
# with the Qwen3-Omni 'video' env. Safer to keep them isolated.
#
# Usage:
#   bash setup_env.sh
#
set -euo pipefail

CONDA_ENV="${CONDA_ENV:-minicpmo}"
PYTHON_VER="${PYTHON_VER:-3.12}"
INSTALL_DIR="${INSTALL_DIR:-${HOME}/anaconda3}"

log() { echo "[setup_env] $*"; }

log "Bootstrapping conda..."
if ! command -v conda &>/dev/null; then
  if [[ -f "${INSTALL_DIR}/etc/profile.d/conda.sh" ]]; then
    source "${INSTALL_DIR}/etc/profile.d/conda.sh"
  else
    echo "Error: conda not found. Install Anaconda first (see CleverHans-Evaluation/setup_env.sh)."
    exit 1
  fi
fi
eval "$(conda shell.bash hook)"

log "Creating conda env '${CONDA_ENV}' (python=${PYTHON_VER})..."
if conda env list | awk '{print $1}' | grep -Fxq "${CONDA_ENV}"; then
  log "Env '${CONDA_ENV}' already exists; activating."
  conda activate "${CONDA_ENV}"
else
  conda create -n "${CONDA_ENV}" "python=${PYTHON_VER}" -y
  conda activate "${CONDA_ENV}"
fi

log "Installing FFmpeg 6 (for audio/video decoding)..."
conda install -y -c conda-forge 'ffmpeg>=6,<7' || log "Warning: conda-forge ffmpeg failed."

log "Installing PyTorch 2.6 (MiniCPM-o stable target; newer torch may work)..."
pip install --upgrade pip
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

log "Installing MiniCPM-o core dependencies..."
# MiniCPM-o 4.5 uses Qwen3Config (needs transformers >=4.52).
pip install 'transformers>=4.52,<4.58' accelerate==0.33.0
pip install Pillow==10.4.0
pip install sentencepiece==0.2.0
pip install decord==0.6.0 librosa==0.10.2 soundfile==0.12.1 moviepy==1.0.3
pip install vocos==0.1.0
pip install huggingface_hub==0.26.5
pip install einops==0.8.0
pip install tqdm openai

# CleverHans-Evaluation loaders used by MiniCPM-o eval scripts (imported via _common.ch):
#   - eval_worldsense.py → pandas + pyarrow (parquet)
#   - eval_videomme.py   → datasets (lmms-lab/Video-MME)
#   - eval_lvbench.py    → datasets (lmms-lab/LVBench)
log "Installing eval data-loader deps (datasets, pandas, pyarrow)..."
pip install datasets pandas pyarrow

# MiniCPM-o 4.5 custom modeling file imports 'minicpmo' (PyPI package) for TTS utils.
# The package drags in cosyvoice + stepaudio2 which need these downstream deps.
pip install minicpmo==0.1.2
pip install onnx onnxruntime hyperpyyaml diffusers

log "Patching MiniCPM-o modeling file for transformers>=4.52 compatibility..."
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python "${SCRIPT_DIR}/scripts/patch_minicpmo.py" || log "Warning: patch_minicpmo.py failed (non-fatal; see errors above)."

log "Done."
echo ""
echo "  Active env: ${CONDA_ENV}"
echo "  Python:     $(command -v python)"
echo ""
echo "Next: conda activate ${CONDA_ENV}"
echo "      Then try: python scripts/test_minicpmo.py"