Sprint 10A: shadow_eval.py - compare candidate vs baseline safely (bc30484, verified). Rohan03 committed 12 days ago.