DJLougen committed on
Commit 463b596 · verified · 1 Parent(s): 0d5d210

Upload V12 resolved eval artifacts

eval/evaluation_report_v12_path_grounding_resolved_full.md ADDED
@@ -0,0 +1,80 @@
+ # BusyBeaver Evaluation Report
+
+ - Mode: model
+ - Rows: 384
+
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.8958
+ - argument_semantic_match: 0.9792
+ - unnecessary_escalation_rate: 0.0000
+ - unsafe_command_rate: 0.0000
+ - placeholder_rate: 0.0000
+ - correct_tool_and_arg_semantic: 0.9792
+ - repeated_action_loop_rate: 0.0000
+ - concrete_argument_semantic_match: 0.9792
+ - recovery_action_accuracy: 1.0000
+ - concrete_argument_rows: 384.0000
+ - recovery_rows: 128.0000
+
+ ## Grouped Metrics
+
+ ### edit (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### escalate (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### execute (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
+ ### inspect (n=128)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.9922
+ - argument_semantic_match: 0.9922
+ ### memory (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.7812
+ - argument_semantic_match: 0.7812
+ ### other (n=96)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 0.6667
+ - argument_semantic_match: 1.0000
+ ### test (n=32)
+ - json_validity_rate: 1.0000
+ - strict_json_rate: 1.0000
+ - schema_validity_rate: 1.0000
+ - valid_tool_rate: 1.0000
+ - correct_tool_accuracy: 1.0000
+ - argument_exact_match: 1.0000
+ - argument_semantic_match: 1.0000
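
A quick consistency check on the report, not part of the uploaded file: the overall `argument_exact_match` and `argument_semantic_match` values should equal the row-weighted pooling of the grouped rates. The sketch below assumes each grouped rate is the fraction of matching rows in that group, rounded to four decimals, so `round(n * rate)` recovers the integer match count. The names `groups`, `matched_rows`, and `pooled_rate` are illustrative and do not come from the evaluation code.

```python
# Per-group values copied from the "Grouped Metrics" section of the report:
# group name -> (n, argument_exact_match, argument_semantic_match)
groups = {
    "edit":     (32,  1.0000, 1.0000),
    "escalate": (32,  1.0000, 1.0000),
    "execute":  (32,  1.0000, 1.0000),
    "inspect":  (128, 0.9922, 0.9922),
    "memory":   (32,  0.7812, 0.7812),
    "other":    (96,  0.6667, 1.0000),
    "test":     (32,  1.0000, 1.0000),
}

# Group sizes should add up to the reported "Rows" value.
total_rows = sum(n for n, _, _ in groups.values())


def matched_rows(metric_index):
    """Sum of per-group match counts (assumption: rate = matches / n, rounded)."""
    return sum(round(vals[0] * vals[metric_index]) for vals in groups.values())


def pooled_rate(metric_index):
    """Row-weighted overall rate implied by the grouped metrics."""
    return matched_rows(metric_index) / total_rows


exact_matches = matched_rows(1)
semantic_matches = matched_rows(2)
exact_rate = round(pooled_rate(1), 4)
semantic_rate = round(pooled_rate(2), 4)

print(total_rows, exact_matches, exact_rate)        # 384 344 0.8958
print(total_rows, semantic_matches, semantic_rate)  # 384 376 0.9792
```

Under that assumption, the pooled values agree with the overall figures in the report (0.8958 and 0.9792), and the gap between exact and semantic match traces entirely to the `other` group, while the semantic misses come from `inspect` and `memory`.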