Running Email Triage Environment Tests
=============================================
✅ Task1 reset
✅ Task1 list inbox
❌ Task1 read email:
✅ Task1 correct label reward
✅ Task1 wrong label no reward
❌ Task1 full score 1.0: Expected 1.0, got 0.2
❌ Task1 partial score 0.4: Expected 0.4, got 0.2
✅ Task2 reset
✅ Task2 no reply = 0.0
✅ Task2 good reply > 0.5
✅ Task2 short reply error
✅ Task3 reset
✅ Task3 archive spam no penalty
✅ Task3 archive urgent = penalty
✅ Task3 flag ambiguous
✅ Task3 loop detection
✅ Task3 not found error
14/17 tests passed