Upload eval_results.md with huggingface_hub
eval_results.md (+15 -15)
@@ -8,27 +8,27 @@ Paired evaluation: identical (seed, profile) pairs across both models.
 
 | Metric | Difficulty | Base | Trained | Delta |
 |---|---|---|---|---|
-| Rule violations per episode (lower is better) | D1 | 0.
-| Rule violations per episode (lower is better) | D2 | 1.000 |
+| Rule violations per episode (lower is better) | D1 | 0.250 | 0.000 | ↓ -0.250 |
+| Rule violations per episode (lower is better) | D2 | 1.000 | 0.750 | ↓ -0.250 |
 | Rule violations per episode (lower is better) | D3 | 1.000 | 1.250 | ↑ +0.250 |
-| Rule violations per episode (lower is better) | D4 | 0.000 | 0.
-| Correct step-ordering rate | D1 |
-| Correct step-ordering rate | D2 | 0.800 | 0.
+| Rule violations per episode (lower is better) | D4 | 0.000 | 0.250 | ↑ +0.250 |
+| Correct step-ordering rate | D1 | 0.917 | 1.000 | ↑ +0.083 |
+| Correct step-ordering rate | D2 | 0.800 | 0.850 | ↑ +0.050 |
 | Correct step-ordering rate | D3 | 0.571 | 0.571 | · +0.000 |
-| Correct step-ordering rate | D4 |
-| Successful close rate | D1 |
-| Successful close rate | D2 | 0.000 | 0.
+| Correct step-ordering rate | D4 | 0.750 | 1.000 | ↑ +0.250 |
+| Successful close rate | D1 | 0.750 | 1.000 | ↑ +0.250 |
+| Successful close rate | D2 | 0.000 | 0.250 | ↑ +0.250 |
 | Successful close rate | D3 | 0.000 | 0.000 | · +0.000 |
 | Successful close rate | D4 | 0.000 | 0.000 | · +0.000 |
 | Correct disqualification rate | D1 | 0.000 | 0.000 | · +0.000 |
 | Correct disqualification rate | D2 | 0.000 | 0.000 | · +0.000 |
 | Correct disqualification rate | D3 | 0.000 | 0.000 | · +0.000 |
-| Correct disqualification rate | D4 |
-| Mean cumulative episode reward | D1 |
-| Mean cumulative episode reward | D2 | 0.580 | 0.
+| Correct disqualification rate | D4 | 0.750 | 1.000 | ↑ +0.250 |
+| Mean cumulative episode reward | D1 | 0.955 | 1.090 | ↑ +0.135 |
+| Mean cumulative episode reward | D2 | 0.580 | 0.710 | ↑ +0.130 |
 | Mean cumulative episode reward | D3 | 0.534 | 0.539 | ↑ +0.005 |
-| Mean cumulative episode reward | D4 | 0.
-| Mean final-turn reward | D1 | 0.
-| Mean final-turn reward | D2 | -0.080 |
+| Mean cumulative episode reward | D4 | -0.675 | 0.705 | ↑ +1.380 |
+| Mean final-turn reward | D1 | 0.246 | 0.357 | ↑ +0.110 |
+| Mean final-turn reward | D2 | -0.080 | 0.025 | ↑ +0.105 |
 | Mean final-turn reward | D3 | -0.080 | -0.080 | · +0.000 |
-| Mean final-turn reward | D4 | 0.
+| Mean final-turn reward | D4 | 0.075 | 0.200 | ↑ +0.125 |
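
In the rows of this diff, the Delta column reads as Trained minus Base, with an arrow showing the direction of change (↑ increase, ↓ decrease, · no change) rather than whether the change is an improvement. The sketch below shows one way such a cell could be produced from paired per-episode scores. The function names, the `eps` threshold, and the sample scores are hypothetical and not taken from the evaluation script. Deltas in the table may have been computed from unrounded means, so recomputing from the rounded Base and Trained cells can differ in the last digit.

```python
from statistics import mean


def paired_means(pairs):
    """Return (base_mean, trained_mean) for per-episode (base, trained) scores.

    Each pair is assumed to come from the same (seed, profile) run on both models.
    """
    base_scores = [b for b, _ in pairs]
    trained_scores = [t for _, t in pairs]
    return mean(base_scores), mean(trained_scores)


def format_delta(base, trained, eps=5e-4):
    """Format Trained - Base as a table cell such as '↑ +0.250'.

    eps is an assumed threshold below which the change is shown as zero.
    """
    delta = trained - base
    if abs(delta) < eps:
        return "· +0.000"
    arrow = "↑" if delta > 0 else "↓"
    return f"{arrow} {delta:+.3f}"


# Hypothetical per-episode rule-violation counts for four paired episodes.
example_pairs = [(1, 0), (0, 0), (0, 0), (0, 0)]
base_mean, trained_mean = paired_means(example_pairs)
cell = format_delta(base_mean, trained_mean)
```

With these made-up scores, `base_mean` is 0.25 and `trained_mean` is 0, so `cell` has the same form as the D1 rule-violations row.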