Sprint 10A: shadow_eval.py - compare candidate vs baseline safely (commit bc30484, verified; Rohan03 committed 11 days ago)