DJLougen committed on
Commit
0c42e4a
·
verified ·
1 Parent(s): 463b596

Upload V12 resolved eval artifacts

eval/evaluation_report_v12_harness_v1_resolved_full.md ADDED
@@ -0,0 +1,64 @@
+ # BusyBeaver Evaluation Report
+
+ - Mode: model
+ - Rows: 320
+
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.7000
+ - argument_semantic_match: 0.9000
+ - unnecessary_escalation_rate: 0.0000
+ - unsafe_command_rate: 0.0000
+ - placeholder_rate: 0.0000
+ - correct_tool_and_arg_semantic: 0.9000
+ - repeated_action_loop_rate: 0.0000
+ - concrete_argument_semantic_match: 0.9000
+ - recovery_action_accuracy: 0.8000
+ - concrete_argument_rows: 320.0000
+ - recovery_rows: 160.0000
+
+ ## Grouped Metrics
+
+ ### edit (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### escalate (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### inspect (n=96)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### other (n=128)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.2500
+ - argument_semantic_match: 0.7500
+ ### test (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
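
The grouped figures can be cross-checked against the overall ones. The sketch below is not part of the evaluation harness. It assumes that each overall rate is the row-weighted mean of the per-group rates, and the `groups` dictionary and the `weighted_rate` helper are hypothetical names introduced only for this check. The group sizes and rates are copied from the report.

```python
# Group sizes and per-group rates copied from the "Grouped Metrics" section.
# Assumption: overall rates are row-weighted means of the group rates.
groups = {
    "edit":     {"n": 32,  "argument_exact_match": 1.00, "argument_semantic_match": 1.00},
    "escalate": {"n": 32,  "argument_exact_match": 1.00, "argument_semantic_match": 1.00},
    "inspect":  {"n": 96,  "argument_exact_match": 1.00, "argument_semantic_match": 1.00},
    "other":    {"n": 128, "argument_exact_match": 0.25, "argument_semantic_match": 0.75},
    "test":     {"n": 32,  "argument_exact_match": 1.00, "argument_semantic_match": 1.00},
}


def weighted_rate(groups, metric):
    """Return the row-weighted mean of `metric` across all groups."""
    total = sum(g["n"] for g in groups.values())
    return sum(g["n"] * g[metric] for g in groups.values()) / total


total_rows = sum(g["n"] for g in groups.values())
exact = weighted_rate(groups, "argument_exact_match")
semantic = weighted_rate(groups, "argument_semantic_match")

print(f"rows={total_rows} exact={exact:.4f} semantic={semantic:.4f}")
```

Under that assumption the group sizes sum to the reported 320 rows, and the weighted means reproduce the overall `argument_exact_match` of 0.7000 and `argument_semantic_match` of 0.9000. Because every other group scores 1.0000 on both metrics, the entire shortfall in these two metrics comes from the `other` group.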