
Phase 0 EvalMDE Adaptation: Handoff

Date: 2026-05-14
Status: EvalMDE workspace bootstrapped; inference driver (scripts/run_inference.py) written; metrics script + sbatch still to write.


Goal

Run the 7 MoGe-Phase-0 models on Infinigen 95 scenes under the EvalMDE protocol (raw native input, no homography warp), producing RelNormal + SAWA-H + standard metrics.

EvalMDE and MoGe are independent workflows. EvalMDE workspace is at /home/ywan0794/EvalMDE/. Model wrappers are copied from MoGe (single source of truth still in MoGe/baselines/), because the wrappers' infer(image, intrinsics) API doesn't depend on MoGe's eval pipeline.


What's done

1. EvalMDE env (Python 3.10): built and verified

The evalmde conda env has: torch 2.7.0+cu126, opencv, scipy, utils3d, pipeline, evalmde package, bpy 4.0 (Blender Python, for textureless-relighting visualization). Sample run: python compute_metrics_example.py outputs sawa_h=1.268, rel_normal=0.390 ✓.

2. 7 baselines (model wrappers): copied from MoGe + verified

/home/ywan0794/EvalMDE/baselines/:

  • depth_pro.py → emits depth_metric (+ intrinsics from FOV head)
  • marigold.py → emits depth_affine_invariant (paper: scale_inv+shift_inv → affine)
  • lotus.py → emits disparity_affine_invariant when --disparity set
  • depthmaster.py → emits depth_affine_invariant
  • ppd.py → emits depth_affine_invariant (training quantile normalization)
  • da3_mono.py → emits depth_scale_invariant
  • fe2e.py → emits depth_affine_invariant (Lpred clamped to [0,1])

MGEBaselineInterface copied to /home/ywan0794/EvalMDE/test/baseline.py.

3. EvalMDE-native dataloader skeleton: written

/home/ywan0794/EvalMDE/scripts/dataloader.py (EvalMDELoaderPipeline):

  • Reads <scene>/rgb.png + <scene>/gt_depth.npz (keys: depth (H,W), intr (4,) [fx,fy,cx,cy]px, valid (H,W) bool)
  • Pixel intrinsics → 3×3 normalized matrix [fx/W, fy/H, cx/W, cy/H] (MoGe convention)
  • Computes 3D pointmap from depth + native pixel intrinsics
  • NaN/invalid pixels replaced with 1.0 (matches evalmde/utils/depth.py:load_data convention)
  • Returns dict with: image [3,H,W] float [0,1], depth, depth_mask, intrinsics (3,3), points (H,W,3), is_metric=True, _intr_px (4,) (for EvalMDE metrics raw npz)
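A minimal NumPy sketch of the intrinsics normalization, unprojection, and invalid-pixel handling listed above. The function names are hypothetical (not taken from dataloader.py); depth is assumed to be z-depth, and pixel (u, v) is assumed to sit at integer coordinates, so add 0.5 if the dataset uses pixel-center offsets.

```python
import numpy as np


def normalize_intrinsics(intr_px, width, height):
    """[fx, fy, cx, cy] in pixels -> 3x3 matrix normalized by image size."""
    fx, fy, cx, cy = intr_px
    return np.array([[fx / width, 0.0, cx / width],
                     [0.0, fy / height, cy / height],
                     [0.0, 0.0, 1.0]], dtype=np.float32)


def depth_to_points(depth, intr_px):
    """Unproject an (H, W) z-depth map to an (H, W, 3) camera-space pointmap."""
    h, w = depth.shape
    fx, fy, cx, cy = intr_px
    # u indexes columns, v indexes rows (assumed integer pixel coordinates).
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def sanitize_depth(depth, valid):
    """Replace NaN/inf/invalid pixels with 1.0; return cleaned depth and mask."""
    mask = valid & np.isfinite(depth)
    return np.where(mask, depth, 1.0).astype(np.float32), mask
```

The normalized matrix is what the model wrappers receive, while the pointmap uses the native pixel intrinsics, matching the split described in the list.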

4. Infinigen download: IN PROGRESS (background)

  • Source: Princeton GDrive 1amzb6KyF2USFQ5W4CeYKFCh1F-yOQsmp
  • Target: /home/ywan0794/EvalMDE/data/infinigen/
  • Log: /tmp/dl_infinigen.log
  • Estimated 50-100 GB
  • Check state: du -sh /home/ywan0794/EvalMDE/data/infinigen/

5. Production MoGe-protocol eval: independent track, already running

  • sbatch eval_scripts/eval_all_slurm.sh submitted earlier (job 12110 etc.)
  • 5 models pending (Marigold/Lotus/DepthMaster/PPD/FE2E), 2 already done (DA3-Mono/Depth Pro)
  • Results in /home/ywan0794/MoGe/eval_output/<model>_<TS>.json
  • EvalMDE adaptation is a separate effort, doesn't block production MoGe eval.

TODO (was 4 items, now 2 remain)

✅ TODO-1: Fix baseline imports (SUPERSEDED by sys.path approach in run_inference.py)

EvalMDE/baselines/*.py still have from moge.test.baseline import MGEBaselineInterface. Resolved via Option A: scripts/run_inference.py does sys.path.insert(0, '/home/ywan0794/MoGe') so baselines still resolve their interface from MoGe. No sed needed.

✅ TODO-2 (inference driver): scripts/run_inference.py (WRITTEN)

/home/ywan0794/EvalMDE/scripts/run_inference.py:

  • Click CLI with --baseline /path/to/baselines/<m>.py --data-root <infinigen> --output-root <out> --model-name <name>
  • Passes remaining click args through to baseline's load.main(ctx.args)
  • For each scene with rgb.png + gt_depth.npz: loads rgb, builds normalized 3×3 K from GT pixel intr, calls baseline.infer_for_evaluation(image, K_norm), picks depth in priority order (depth_metric > depth_scale_invariant > depth_affine_invariant > 1/disparity_affine_invariant), writes <out>/<model>/<scene>/pred_depth.npz with EvalMDE keys {depth, intr (4,) px, valid}
  • For pred intrinsics: uses model-predicted intr if present (Depth Pro), else GT intr
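The depth-selection and output steps above could look roughly like the sketch below. It is not the code in run_inference.py: pick_depth and save_prediction are hypothetical names, predictions are assumed to already be NumPy arrays, and the eps clamp on disparity and the "finite and positive" validity rule are assumptions.

```python
import numpy as np

# Priority order taken from the handoff: metric > scale-invariant > affine-invariant.
DEPTH_KEYS = ("depth_metric", "depth_scale_invariant", "depth_affine_invariant")


def pick_depth(pred, eps=1e-6):
    """Return (depth, source_key) from a wrapper output dict."""
    for key in DEPTH_KEYS:
        if key in pred:
            return np.asarray(pred[key], dtype=np.float32), key
    if "disparity_affine_invariant" in pred:
        disp = np.asarray(pred["disparity_affine_invariant"], dtype=np.float32)
        # Assumed guard: clamp disparity away from zero before inverting.
        return 1.0 / np.maximum(disp, eps), "disparity_affine_invariant"
    raise KeyError("no depth or disparity key in prediction")


def save_prediction(path, depth, intr_px):
    """Write an npz with the EvalMDE keys {depth, intr, valid}."""
    valid = np.isfinite(depth) & (depth > 0)  # assumed validity rule
    np.savez_compressed(path, depth=depth,
                        intr=np.asarray(intr_px, dtype=np.float32),
                        valid=valid)
```

For intr_px, the caller would pass the model-predicted intrinsics when the wrapper returns them and the GT intrinsics otherwise, as the last bullet describes.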

❗ Original TODO-2 (script/eval.py) was REWORKED into 2 stages: inference + metric.

This is cleaner: inference runs in per-model env, metric runs in evalmde env.

TODO-3: Write scripts/compute_metrics.py (run in evalmde env)

Reads each model's pred_depth.npz + GT gt_depth.npz, computes EvalMDE metrics + standard MDE metrics.

Pseudocode:

import json
import click
from pathlib import Path
import numpy as np

from evalmde.utils.depth import load_data
from evalmde.metrics.rel_normal import compute_rel_normal
from evalmde.metrics.sawa_h     import compute_sawa_h

@click.command()
@click.option('--gt-root',   required=True, type=click.Path())  # Infinigen root
@click.option('--pred-root', required=True, type=click.Path())  # output of run_inference.py
@click.option('--model-name', required=True, type=str)
@click.option('--output',    required=True, type=click.Path())
def main(gt_root, pred_root, model_name, output):
    gt_root = Path(gt_root)
    pred_root = Path(pred_root) / model_name
    scenes = sorted(d.name for d in pred_root.iterdir() if (d / 'pred_depth.npz').exists())

    results = []
    for scene in scenes:
        gt_d,  gt_intr,  gt_v = load_data(gt_root  / scene / 'gt_depth.npz')
        pr_d,  pr_intr,  pr_v = load_data(pred_root / scene / 'pred_depth.npz')

        # SAWA-H aligns internally (affine via least-squares). RelNormal uses surface normals,
        # which are invariant to scale but NOT to shift: for affine-invariant preds, the
        # shift will skew normals at far depths. Acceptable caveat in Phase 0; document it.
        sawa  = compute_sawa_h    (pr_d, pr_intr, pr_v, gt_d, gt_intr, gt_v)
        rnorm = compute_rel_normal(pr_d, pr_intr, pr_v, gt_d, gt_intr, gt_v)

        # Standard AbsRel + δ1 after affine alignment (re-implement, ~10 lines):
        mask  = gt_v & pr_v
        gtm, prm = gt_d[mask], pr_d[mask]
        # fit y = a*x + b on (prm, gtm)
        A = np.stack([prm, np.ones_like(prm)], axis=-1)
        a, b = np.linalg.lstsq(A, gtm, rcond=None)[0]
        am = a * prm + b
        abs_rel = np.mean(np.abs(am - gtm) / np.maximum(gtm, 1e-6))
        # Clamp before taking ratios: the affine fit can push values to <= 0,
        # which would otherwise divide by zero or count as δ1 inliers.
        am_pos, gt_pos = np.maximum(am, 1e-6), np.maximum(gtm, 1e-6)
        delta1  = np.mean(np.maximum(am_pos / gt_pos, gt_pos / am_pos) < 1.25)

        results.append({'scene': scene, 'sawa_h': float(sawa), 'rel_normal': float(rnorm),
                        'abs_rel': float(abs_rel), 'delta1': float(delta1)})

    # Per-scene + aggregate mean
    summary = {'per_scene': results,
               'mean': {k: float(np.mean([r[k] for r in results])) for k in ['sawa_h','rel_normal','abs_rel','delta1']}}
    with open(output, 'w') as f:
        json.dump(summary, f, indent=2)


if __name__ == '__main__':
    main()

Note on alignment: compute_sawa_h aligns internally (via align_depth_least_square + align_affine_lstsq), so passing RAW pred (affine-invariant) is correct. compute_rel_normal does NOT align β€” its inputs should be in a comparable depth scale. For Phase 0 simplicity, pass raw pred; document the affine-shift caveat in the analysis. For stricter eval, pre-affine-align before RelNormal.
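For the stricter variant mentioned in the note, the pre-alignment could be a small helper like the one below. This is only a sketch: affine_align is a hypothetical name, and it assumes the aligned map is then passed to compute_rel_normal in place of the raw prediction.

```python
import numpy as np


def affine_align(pred, gt, mask):
    """Least-squares fit gt ~ a * pred + b over mask, applied to the full map.

    Returns the aligned depth map and the fitted (a, b).
    """
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=-1)
    (a, b), *_ = np.linalg.lstsq(A, gt[mask], rcond=None)
    return a * pred + b, float(a), float(b)
```

Fitting on the joint valid mask (gt_v & pr_v in the pseudocode) keeps invalid pixels from biasing the fit. Aligned values can still be non-positive in poorly fitted regions, so a caller may want to mask those out before computing normals.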

TODO-4: Scene list / config

Once Infinigen download succeeds (currently blocked, see issue below), run_inference.py auto-discovers all scene dirs under --data-root. If a subset is wanted, write scenes.txt and add filtering in run_inference.py (~3 lines).
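The subset filtering could be as small as the sketch below. discover_scenes is a hypothetical helper, and the scenes.txt format (one scene directory name per line) is an assumption.

```python
from pathlib import Path


def discover_scenes(data_root, scene_list=None):
    """List scene dirs that contain rgb.png and gt_depth.npz, optionally filtered."""
    root = Path(data_root)
    scenes = sorted(
        d.name for d in root.iterdir()
        if d.is_dir() and (d / "rgb.png").exists() and (d / "gt_depth.npz").exists()
    )
    if scene_list is not None:
        # Assumed format: one scene name per line, blank lines ignored.
        lines = Path(scene_list).read_text().splitlines()
        wanted = {ln.strip() for ln in lines if ln.strip()}
        scenes = [s for s in scenes if s in wanted]
    return scenes
```

If the downloaded layout turns out to be nested, only the directory walk needs to change; the filter stays the same.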

TODO-5: sbatch eval_scripts/eval_evalmde_all_slurm.sh

Same pattern as MoGe's sanity_all_slurm.sh: single sbatch, single H100, serial per-model.

  • For each of the 7 models: conda activate <env>; python scripts/run_inference.py --baseline baselines/<m>.py ...
  • After all 7 inference runs are done: conda activate evalmde; for m in ...; do python scripts/compute_metrics.py --model-name $m ...; done

Per-model envs only run inference (torch + model wrapper deps), so they do not need the evalmde package. Only the metric-aggregation stage imports evalmde.metrics, and it runs in the evalmde env. No extra installs are required in the model envs.
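To preview the serial schedule before writing the sbatch script, a dry-run generator like the one below may help. It is a sketch only: it uses conda run -n <env> instead of conda activate, takes the env names from the per-model mapping in this handoff, omits the model-specific arguments, and the <model>_metrics.json output name is hypothetical.

```python
# Per-model conda envs, as listed in this handoff.
ENVS = {
    "depth_pro": "depth-pro",
    "marigold": "marigold",
    "lotus": "lotus",
    "depthmaster": "depthmaster",
    "ppd": "ppd",
    "da3_mono": "da3",
    "fe2e": "fe2e",
}


def build_commands(data_root, output_root):
    """Return inference commands for all models, then metric commands."""
    cmds = []
    for model, env in ENVS.items():
        cmds.append(["conda", "run", "-n", env, "python", "scripts/run_inference.py",
                     "--baseline", f"baselines/{model}.py",
                     "--data-root", data_root,
                     "--output-root", output_root,
                     "--model-name", model])
    for model in ENVS:
        cmds.append(["conda", "run", "-n", "evalmde", "python",
                     "scripts/compute_metrics.py",
                     "--gt-root", data_root,
                     "--pred-root", output_root,
                     "--model-name", model,
                     # Hypothetical output file name.
                     "--output", f"{output_root}/{model}_metrics.json"])
    return cmds


if __name__ == "__main__":
    for cmd in build_commands("data/infinigen", "results"):
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) once the per-model arguments from the parameter table are appended.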

TODO-4 (earlier notes): Scene list / config

Once Infinigen download finishes, inspect actual layout:

ls /home/ywan0794/EvalMDE/data/infinigen/ | head -20

If scenes are scene_001/, scene_002/, ...: dataloader auto-discovers them. If grouped under sub-folders or different naming: may need a manual scenes.txt split file.

TODO-5 (earlier notes): sbatch EvalMDE/eval_scripts/eval_evalmde_all_slurm.sh

Mirror MoGe's sanity_all_slurm.sh structure:

  • Single sbatch, single H100, serial per-model
  • For each model: activate model's conda env, run python scripts/eval.py --baseline baselines/<m>.py --data-root data/infinigen --output results/<m>.json
  • After all inference done, optionally re-aggregate in evalmde env for cross-model summary

Per-model env mapping same as MoGe:

| model | env |
|---|---|
| depth_pro | depth-pro |
| marigold | marigold |
| lotus | lotus |
| depthmaster | depthmaster |
| ppd | ppd |
| da3_mono | da3 |
| fe2e | fe2e |

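Installing the evalmde package in each model env (pip install -e /home/ywan0794/EvalMDE) was only needed when metrics were computed inside the model envs via from evalmde.metrics.* import compute_rel_normal, compute_sawa_h. With the two-stage split (inference in model envs, metrics in the evalmde env) this install is not required.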


Paper-canonical inference parameters (locked, confirmed against each repo)

| Model | Args | Source |
|---|---|---|
| Depth Pro | --precision fp32 | create_model_and_transforms() default |
| Marigold | v1-1 + --denoise_steps 4 --ensemble_size 1 | (user decision: balanced speed) |
| Lotus | g-v2-1-disparity + --mode generation --disparity --timestep 999 --fp16 --seed 42 | Lotus/eval.sh |
| DepthMaster | --processing_res 768 | DepthMaster/scripts/infer.sh |
| PPD | --semantics_model MoGe2 --semantics_pth checkpoints/moge2.pt --model_pth checkpoints/ppd_moge.pth --sampling_steps 4 | PPD/ppd/configs/eval.yaml |
| DA3-Mono | --hf_id depth-anything/DA3MONO-LARGE | DA3 README |
| FE2E | --prompt_type empty --single_denoise --cfg_guidance 6.0 --size_level 768 | FE2E/README.md eval block |

Key insights to preserve

  1. EvalMDE protocol uses raw native input, no homography warp. MoGe's eval pipeline does aggressive canonical-view warping (dataloader.py:_process_instance:119-180). That is MoGe-paper-specific; EvalMDE explicitly uses raw inputs (see compute_metrics_example.py).

  2. Output key contract (per MGEBaselineInterface):

    • depth_metric → metric depth in meters (Depth Pro)
    • depth_scale_invariant → scale-invariant relative depth (DA3-Mono)
    • depth_affine_invariant → affine-invariant depth (Marigold/DepthMaster/PPD/FE2E)
    • disparity_affine_invariant → affine-invariant disparity (Lotus disparity ckpts)
  3. Pre-alignment for SAWA-H/RelNormal: SAWA-H itself does affine alignment internally (evalmde/metrics/sawa_h.py:compute_sawa_h uses align_depth_least_square + align_affine_lstsq), so you can pass RAW pred depth to SAWA-H. RelNormal works on normals, which are scale-invariant in the limit, but a shift in depth space WILL skew normals at far depths. So for affine-invariant pred models, do an affine align before passing to compute_rel_normal.

  4. MoGe's eval can run in parallel with EvalMDE work. Production eval_all_slurm.sh already running. Don't disturb.

  5. Lotus disparity ckpt inversion was numerically unstable (1/disp blows up near disparity=0). For EvalMDE, only emit disparity_affine_invariant from Lotus, then convert: aligned_disp = scale*disp + shift (fit in disp space), aligned_depth = 1/aligned_disp.clamp(1/gt_depth_max). Reference: moge/test/metrics.py:202-218 disparity_affine_invariant block.
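The disparity-space conversion in item 5 could be sketched as follows. This is not the code in moge/test/metrics.py: disparity_to_aligned_depth is a hypothetical name, the fit is a plain least-squares fit against 1/gt_depth on valid pixels, and defaulting gt_depth_max to the largest valid GT depth is an assumption. Because the fit needs GT, it belongs in the metric stage rather than in inference.

```python
import numpy as np


def disparity_to_aligned_depth(disp, gt_depth, mask, gt_depth_max=None):
    """Fit scale/shift in disparity space, clamp, then invert to depth.

    mask must select pixels where gt_depth is finite and > 0.
    """
    gt_disp = 1.0 / gt_depth[mask]
    A = np.stack([disp[mask], np.ones(int(mask.sum()))], axis=-1)
    (scale, shift), *_ = np.linalg.lstsq(A, gt_disp, rcond=None)
    aligned_disp = scale * disp + shift
    if gt_depth_max is None:
        gt_depth_max = float(gt_depth[mask].max())  # assumed default
    # Clamp from below so 1/disparity cannot blow up or turn negative.
    aligned_disp = np.maximum(aligned_disp, 1.0 / gt_depth_max)
    return 1.0 / aligned_disp
```

The lower clamp caps the recovered depth at gt_depth_max, which is what keeps near-zero or negative aligned disparities from producing unbounded depths.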


Resume instructions

  1. cd /home/ywan0794/EvalMDE
  2. Check Infinigen download: du -sh data/infinigen; tail /tmp/dl_infinigen.log
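  3. Imports (TODO-1): nothing to do. scripts/run_inference.py inserts /home/ywan0794/MoGe into sys.path, so the earlier sed rewrite (sed -i 's|from moge.test.baseline|from test.baseline|g' baselines/*.py) is no longer needed.
  4. Write scripts/compute_metrics.py (TODO-3) using the pseudocode above.
  5. Test with depth_pro. Inference, in the depth-pro env: python scripts/run_inference.py --baseline baselines/depth_pro.py --data-root data/infinigen --output-root /tmp/test_out --model-name depth_pro --repo /home/ywan0794/EvalMDE/ml-depth-pro --checkpoint /home/ywan0794/EvalMDE/ml-depth-pro/checkpoints/depth_pro.pt. Metrics, in the evalmde env: python scripts/compute_metrics.py --gt-root data/infinigen --pred-root /tmp/test_out --model-name depth_pro --output /tmp/test.json
  6. Inspect /tmp/test.json. If sane (rel_normal in [0, 1] rad, sawa_h plausible), proceed to write sbatch (TODO-5).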
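The inspection in step 6 can be partly automated with a check like the one below. It assumes the summary layout from the compute_metrics.py pseudocode (a per_scene list with sawa_h, rel_normal, abs_rel, delta1); check_summary is a hypothetical helper, and the [0, 1] bound on rel_normal is the rough sanity range from this handoff, not a hard limit of the metric.

```python
import math

KEYS = ("sawa_h", "rel_normal", "abs_rel", "delta1")
# Expected ranges; sawa_h is only checked for finiteness.
RANGES = {
    "rel_normal": (0.0, 1.0),
    "abs_rel": (0.0, math.inf),
    "delta1": (0.0, 1.0),
}


def check_summary(summary):
    """Return a list of human-readable problems found in a metrics summary."""
    problems = []
    for row in summary["per_scene"]:
        for key in KEYS:
            val = row[key]
            if not math.isfinite(val):
                problems.append(f"{row['scene']}: {key} is not finite")
                continue
            if key in RANGES:
                lo, hi = RANGES[key]
                if not lo <= val <= hi:
                    problems.append(f"{row['scene']}: {key}={val} outside [{lo}, {hi}]")
    return problems

# Usage (file path from the resume steps):
#   import json
#   with open("/tmp/test.json") as f:
#       print(check_summary(json.load(f)))
```

An empty list means every per-scene value is finite and inside the expected range; whether sawa_h is plausible still needs a manual look.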

End of handoff.