Uploaded by fblgit (commit f0f5785: "Upload folder using huggingface_hub")
Benchmark: A: openmed-base vs B: haremb
Dataset: nvidia/Nemotron-PII, split=test, eval_pct=1.0, ctx=1024, seed=42, n_docs=1000
Eval tokens scored: 212,909
=== Aggregate (spans = gold/pred/correct) ===
A: openmed-base RAW span_F1=0.9174 P=0.9125 R=0.9223 token_acc=0.9895 non_o_recall=0.9685 spans=8627/8720/7957
A: openmed-base VITERBI span_F1=0.9434 P=0.9531 R=0.9338 token_acc=0.9900 non_o_recall=0.9703 spans=8627/8452/8056
B: haremb RAW span_F1=0.7741 P=0.7186 R=0.8388 token_acc=0.9831 non_o_recall=0.9467 spans=8627/10069/7236
B: haremb VITERBI span_F1=0.9288 P=0.9396 R=0.9182 token_acc=0.9885 non_o_recall=0.9637 spans=8627/8430/7921
Gap B vs A (viterbi span_F1): -0.0146
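Each `spans=` triple is gold/predicted/correct span counts. Precision is correct/pred, recall is correct/gold, and F1 is their harmonic mean, which equals 2·correct/(gold+pred). The minimal sketch below recomputes the VITERBI rows and the reported gap from those triples. The helper name `span_prf` is hypothetical and is not taken from the benchmark script.

```python
def span_prf(gold: int, pred: int, correct: int) -> tuple[float, float, float]:
    """Span-level precision, recall and F1 from exact-match span counts."""
    precision = correct / pred if pred else 0.0
    recall = correct / gold if gold else 0.0
    # Harmonic mean of P and R, written directly in terms of the counts.
    f1 = 2 * correct / (gold + pred) if gold + pred else 0.0
    return precision, recall, f1


# (gold, pred, correct) triples copied from the VITERBI rows of the log.
viterbi = {
    "A: openmed-base": (8627, 8452, 8056),
    "B: haremb": (8627, 8430, 7921),
}

scores = {name: span_prf(*counts) for name, counts in viterbi.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:16s} P={p:.4f} R={r:.4f} F1={f1:.4f}")

gap = scores["B: haremb"][2] - scores["A: openmed-base"][2]
print(f"Gap B vs A (viterbi span_F1): {gap:+.4f}")
```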
Throughput:
  A: openmed-base 3293 tok/s (1399.61M params)
  B: haremb       6343 tok/s (287.11M params)
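The log does not say how the VITERBI decoding is configured. This sketch assumes a common setup: BIO tagging, where B-X opens a span of type X and I-X continues it, combined with hard transition constraints. Under those constraints, I-X may only follow B-X or I-X, and a sequence may not open with I-X. The hypothetical `viterbi_bio` picks the highest-scoring label path under that rule instead of taking each token's argmax independently. It illustrates the technique and is not the benchmark's own decoder.

```python
import math


def allowed(prev: str, curr: str) -> bool:
    """Assumed BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", curr)
    return True


def viterbi_bio(log_probs: list[list[float]], labels: list[str]) -> list[str]:
    """Highest-scoring label path that never breaks the BIO constraint.

    log_probs[t][j] is the per-token log-probability of labels[j] at token t.
    Transitions are hard allow/forbid rules; no learned transition scores.
    """
    n = len(labels)
    # Treat the sequence start like a preceding "O", so it cannot open with I-X.
    score = [lp if allowed("O", labels[j]) else -math.inf
             for j, lp in enumerate(log_probs[0])]
    backpointers = []
    for emissions in log_probs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor among labels allowed to transition into labels[j].
            candidates = [(score[i], i) for i in range(n) if allowed(labels[i], labels[j])]
            best, best_i = max(candidates)
            new_score.append(best + emissions[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best-scoring final label.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]


labels = ["O", "B-first_name", "I-first_name"]
# Per-token argmax would pick I-first_name twice, which is an ill-formed span.
log_probs = [[math.log(p) for p in row]
             for row in ([0.30, 0.25, 0.45], [0.10, 0.20, 0.70])]
print(viterbi_bio(log_probs, labels))  # ['B-first_name', 'I-first_name']
```

Repairs of this kind are one plausible reason for the change in haremb's span precision. Under RAW decoding it is 0.7186 over 10,069 predicted spans; under VITERBI it rises to 0.9396 over 8,430 spans. The log alone cannot confirm this explanation.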
=== Performance ===
metric                          A: openmed-base B: haremb B vs A
------------------------------- --------------- --------- -------------
total params (M)                        1399.61    287.11 4.87× smaller
dense params (M)                         139.35    129.58 1.08× smaller
MoE expert params (M)                   1260.26    157.53 8.00× smaller
active params/token (M, mem)             178.73    134.50 1.33× less
compute params/token (M, FLOPs)           50.69      6.46 7.85× cheaper
GFLOP / token                            0.1014    0.0129 7.85× cheaper
disk size (MiB)                             n/a     547.6 n/a
weights in RAM (MiB)                     2669.5     547.6 4.87× smaller
peak GPU mem eval (MiB)                  3376.2    1248.6 2.70× less
throughput (tok/s)                         3293      6343 1.93× faster
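Several rows of the table follow directly from the parameter counts. GFLOP/token matches the common rule of thumb of 2 × compute params per token, counting one multiply and one add per weight used. The weights-in-RAM figures match 2 bytes per parameter, which implies 16-bit (bf16 or fp16) storage. Both readings are inferred from the numbers and are not stated in the log. A sketch that reproduces those rows under those assumptions:

```python
MIB = 2**20
BYTES_PER_PARAM = 2  # assumption: 16-bit weights (bf16 or fp16)

# name: (total params, compute params per token), in millions, from the table
models = {
    "A: openmed-base": (1399.61, 50.69),
    "B: haremb": (287.11, 6.46),
}

derived = {}
for name, (total_m, compute_m) in models.items():
    # ~2 FLOPs (one multiply, one add) per compute parameter per token
    gflop_per_token = 2 * compute_m * 1e6 / 1e9
    # resident weight memory if every parameter is stored in 16 bits
    weights_mib = total_m * 1e6 * BYTES_PER_PARAM / MIB
    derived[name] = (gflop_per_token, weights_mib)
    print(f"{name:16s} GFLOP/token={gflop_per_token:.4f} weights={weights_mib:.1f} MiB")
```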
=== Pairwise (viterbi, all gold tokens) — A: openmed-base vs B: haremb ===
both_correct 209830 (98.55%)
only_A_correct 958 (0.45%)
only_B_correct 633 (0.30%)
both_wrong 1488 (0.70%)
=== Pairwise (viterbi, gold non-O tokens) — A: openmed-base vs B: haremb ===
both_correct 43902 (95.47%)
only_A_correct 717 (1.56%)
only_B_correct 417 (0.91%)
both_wrong 951 (2.07%)
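The four pairwise buckets partition the scored tokens, so each model's accuracy can be recovered from them. A is correct on both_correct + only_A_correct, and B is correct on both_correct + only_B_correct. Over all 212,909 tokens this reproduces the VITERBI token_acc values, and over the gold non-O tokens it reproduces non_o_recall. A short check (the helper `accuracies` is illustrative):

```python
def accuracies(both: int, only_a: int, only_b: int, both_wrong: int) -> tuple[float, float]:
    """Per-model accuracy from a 2x2 agreement table over the same tokens."""
    total = both + only_a + only_b + both_wrong
    return (both + only_a) / total, (both + only_b) / total


all_tokens = accuracies(209830, 958, 633, 1488)  # token_acc (A, B)
non_o_tokens = accuracies(43902, 717, 417, 951)  # non_o_recall (A, B)
print("token_acc    A=%.4f B=%.4f" % all_tokens)
print("non_o_recall A=%.4f B=%.4f" % non_o_tokens)
```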
=== Largest net losses for B by gold category (A: openmed-base ahead, top 15; net_B = B_only - A_only) ===
company_name net_B= -142 A_only= 216 B_only= 74 both_wrong= 119
first_name net_B= -75 A_only= 82 B_only= 7 both_wrong= 19
last_name net_B= -65 A_only= 67 B_only= 2 both_wrong= 38
occupation net_B= -55 A_only= 79 B_only= 24 both_wrong= 286
device_identifier net_B= -29 A_only= 29 B_only= 0 both_wrong= 0
user_name net_B= -26 A_only= 29 B_only= 3 both_wrong= 10
city net_B= -16 A_only= 32 B_only= 16 both_wrong= 36
street_address net_B= -13 A_only= 14 B_only= 1 both_wrong= 0
date_of_birth net_B= -12 A_only= 12 B_only= 0 both_wrong= 0
email net_B= -8 A_only= 9 B_only= 1 both_wrong= 0
medical_record_number net_B= -7 A_only= 7 B_only= 0 both_wrong= 0
phone_number net_B= -6 A_only= 6 B_only= 0 both_wrong= 0
account_number net_B= -6 A_only= 10 B_only= 4 both_wrong= 0
tax_id net_B= -6 A_only= 6 B_only= 0 both_wrong= 0
race_ethnicity net_B= -5 A_only= 15 B_only= 10 both_wrong= 11
=== Largest net gains for B by gold category (B: haremb ahead, top 15) ===
date net_B= +46 A_only= 15 B_only= 61 both_wrong= 145
fax_number net_B= +29 A_only= 1 B_only= 30 both_wrong= 44
unique_id net_B= +26 A_only= 4 B_only= 30 both_wrong= 0
ssn net_B= +18 A_only= 0 B_only= 18 both_wrong= 0
time net_B= +12 A_only= 9 B_only= 21 both_wrong= 87
political_view net_B= +11 A_only= 3 B_only= 14 both_wrong= 6
coordinate net_B= +11 A_only= 0 B_only= 11 both_wrong= 0
customer_id net_B= +8 A_only= 4 B_only= 12 both_wrong= 7
certificate_license_number net_B= +7 A_only= 0 B_only= 7 both_wrong= 0
education_level net_B= +6 A_only= 2 B_only= 8 both_wrong= 28
state net_B= +6 A_only= 13 B_only= 19 both_wrong= 20
blood_type net_B= +6 A_only= 0 B_only= 6 both_wrong= 0
gender net_B= +2 A_only= 0 B_only= 2 both_wrong= 2
http_cookie net_B= +2 A_only= 0 B_only= 2 both_wrong= 3
country net_B= +2 A_only= 0 B_only= 2 both_wrong= 19
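In both lists, net_B is B_only minus A_only, counted over the gold tokens of each category. both_wrong does not enter the score. The lists show the 15 most negative and the 15 most positive categories. The sketch below uses a handful of rows copied from above; the full ranking would be built the same way.

```python
# (A_only, B_only) token counts per gold category, copied from a few rows above.
counts = {
    "company_name": (216, 74),
    "first_name": (82, 7),
    "occupation": (79, 24),
    "date": (15, 61),
    "fax_number": (1, 30),
    "ssn": (0, 18),
    "state": (13, 19),
}

# Positive net_B means B is right on more tokens of that category than A.
net_b = {cat: b_only - a_only for cat, (a_only, b_only) in counts.items()}

ranked = sorted(net_b.items(), key=lambda kv: kv[1])
worst, best = ranked[:3], ranked[::-1][:3]
print("A ahead:", worst)
print("B ahead:", best)
```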
=== Per-category span F1 (viterbi; counts = gold/pred/correct spans) ===
-- A: openmed-base --
account_number F1=0.9929 P=0.9929 R=0.9929 (140/140/139)
age F1=0.8840 P=0.8511 R=0.9195 (87/94/80)
api_key F1=0.9921 P=0.9844 R=1.0000 (63/64/63)
bank_routing_number F1=0.9867 P=0.9867 R=0.9867 (75/75/74)
biometric_identifier F1=1.0000 P=1.0000 R=1.0000 (113/113/113)
blood_type F1=0.9032 P=0.9032 R=0.9032 (62/62/56)
certificate_license_number F1=0.9697 P=1.0000 R=0.9412 (34/32/32)
city F1=0.9154 P=0.9583 R=0.8762 (210/192/184)
company_name F1=0.8824 P=0.9143 R=0.8526 (563/525/480)
coordinate F1=0.8000 P=0.8000 R=0.8000 (55/55/44)
country F1=0.9431 P=0.9324 R=0.9539 (217/222/207)
county F1=0.9519 P=0.9612 R=0.9429 (105/103/99)
credit_debit_card F1=0.9967 P=0.9934 R=1.0000 (150/151/150)
customer_id F1=0.9849 P=1.0000 R=0.9703 (202/196/196)
cvv F1=0.9787 P=1.0000 R=0.9583 (48/46/46)
date F1=0.9440 P=0.9571 R=0.9312 (814/792/758)
date_of_birth F1=1.0000 P=1.0000 R=1.0000 (164/164/164)
date_time F1=0.9635 P=0.9429 R=0.9851 (134/140/132)
device_identifier F1=0.9714 P=0.9444 R=1.0000 (51/54/51)
education_level F1=0.9091 P=0.9524 R=0.8696 (92/84/80)
email F1=0.9971 P=0.9961 R=0.9980 (511/512/510)
employee_id F1=0.9948 P=1.0000 R=0.9896 (96/95/95)
employment_status F1=0.9478 P=0.9593 R=0.9365 (126/123/118)
fax_number F1=0.9091 P=0.9848 R=0.8442 (77/66/65)
first_name F1=0.9766 P=0.9716 R=0.9816 (871/880/855)
gender F1=0.9737 P=0.9867 R=0.9610 (77/75/74)
health_plan_beneficiary_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
http_cookie F1=0.9307 P=0.9400 R=0.9216 (51/50/47)
ipv4 F1=1.0000 P=1.0000 R=1.0000 (59/59/59)
ipv6 F1=1.0000 P=1.0000 R=1.0000 (21/21/21)
language F1=0.9000 P=0.9000 R=0.9000 (90/90/81)
last_name F1=0.9744 P=0.9767 R=0.9721 (646/643/628)
license_plate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
mac_address F1=1.0000 P=1.0000 R=1.0000 (30/30/30)
medical_record_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
national_id F1=1.0000 P=1.0000 R=1.0000 (28/28/28)
occupation F1=0.6522 P=0.7721 R=0.5645 (372/272/210)
password F1=0.9217 P=0.9636 R=0.8833 (60/55/53)
phone_number F1=0.9751 P=0.9514 R=1.0000 (235/247/235)
pin F1=0.9302 P=0.8955 R=0.9677 (62/67/60)
political_view F1=0.8387 P=0.8125 R=0.8667 (45/48/39)
postcode F1=0.9934 P=0.9868 R=1.0000 (75/76/75)
race_ethnicity F1=0.8889 P=0.8889 R=0.8889 (81/81/72)
religious_belief F1=0.8936 P=0.8750 R=0.9130 (46/48/42)
sexuality F1=0.9667 P=1.0000 R=0.9355 (31/29/29)
ssn F1=0.9440 P=0.9365 R=0.9516 (62/63/59)
state F1=0.9198 P=0.9399 R=0.9005 (191/183/172)
street_address F1=0.9894 P=0.9842 R=0.9947 (188/190/187)
swift_bic F1=0.9905 P=0.9811 R=1.0000 (52/53/52)
tax_id F1=1.0000 P=1.0000 R=1.0000 (15/15/15)
time F1=0.8209 P=0.8514 R=0.7926 (188/175/149)
unique_id F1=0.9600 P=1.0000 R=0.9231 (13/12/12)
url F1=0.9725 P=0.9687 R=0.9763 (380/383/371)
user_name F1=0.9497 P=0.9264 R=0.9742 (155/163/151)
vehicle_identifier F1=0.9815 P=0.9636 R=1.0000 (53/55/53)
-- B: haremb --
account_number F1=0.9751 P=0.9716 R=0.9786 (140/141/137)
age F1=0.8571 P=0.8211 R=0.8966 (87/95/78)
api_key F1=0.9921 P=0.9844 R=1.0000 (63/64/63)
bank_routing_number F1=0.9933 P=1.0000 R=0.9867 (75/74/74)
biometric_identifier F1=1.0000 P=1.0000 R=1.0000 (113/113/113)
blood_type F1=1.0000 P=1.0000 R=1.0000 (62/62/62)
certificate_license_number F1=0.9855 P=0.9714 R=1.0000 (34/35/34)
city F1=0.8932 P=0.9109 R=0.8762 (210/202/184)
company_name F1=0.7766 P=0.8120 R=0.7442 (563/516/419)
coordinate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
country F1=0.9543 P=0.9457 R=0.9631 (217/221/209)
county F1=0.9340 P=0.9252 R=0.9429 (105/107/99)
credit_debit_card F1=0.9934 P=0.9868 R=1.0000 (150/152/150)
customer_id F1=0.9779 P=0.9707 R=0.9851 (202/205/199)
cvv F1=0.9792 P=0.9792 R=0.9792 (48/48/47)
date F1=0.9510 P=0.9599 R=0.9423 (814/799/767)
date_of_birth F1=0.9939 P=1.0000 R=0.9878 (164/162/162)
date_time F1=0.9635 P=0.9429 R=0.9851 (134/140/132)
device_identifier F1=0.9515 P=0.9423 R=0.9608 (51/52/49)
education_level F1=0.9091 P=0.9524 R=0.8696 (92/84/80)
email F1=0.9912 P=0.9883 R=0.9941 (511/514/508)
employee_id F1=0.9895 P=1.0000 R=0.9792 (96/94/94)
employment_status F1=0.9562 P=0.9600 R=0.9524 (126/125/120)
fax_number F1=0.9396 P=0.9722 R=0.9091 (77/72/70)
first_name F1=0.9299 P=0.9231 R=0.9369 (871/884/816)
gender F1=0.9870 P=0.9870 R=0.9870 (77/77/76)
health_plan_beneficiary_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
http_cookie F1=0.9608 P=0.9608 R=0.9608 (51/51/49)
ipv4 F1=1.0000 P=1.0000 R=1.0000 (59/59/59)
ipv6 F1=1.0000 P=1.0000 R=1.0000 (21/21/21)
language F1=0.8966 P=0.9286 R=0.8667 (90/84/78)
last_name F1=0.9308 P=0.9457 R=0.9164 (646/626/592)
license_plate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
mac_address F1=1.0000 P=1.0000 R=1.0000 (30/30/30)
medical_record_number F1=0.9903 P=0.9903 R=0.9903 (103/103/102)
national_id F1=0.9825 P=0.9655 R=1.0000 (28/29/28)
occupation F1=0.5981 P=0.7440 R=0.5000 (372/250/186)
password F1=0.9391 P=0.9818 R=0.9000 (60/55/54)
phone_number F1=0.9730 P=0.9512 R=0.9957 (235/246/234)
pin F1=0.9508 P=0.9667 R=0.9355 (62/60/58)
political_view F1=0.8723 P=0.8367 R=0.9111 (45/49/41)
postcode F1=0.9934 P=0.9868 R=1.0000 (75/76/75)
race_ethnicity F1=0.8590 P=0.8933 R=0.8272 (81/75/67)
religious_belief F1=0.9348 P=0.9348 R=0.9348 (46/46/43)
sexuality F1=0.9492 P=1.0000 R=0.9032 (31/28/28)
ssn F1=0.9688 P=0.9394 R=1.0000 (62/66/62)
state F1=0.9105 P=0.9153 R=0.9058 (191/189/173)
street_address F1=0.9894 P=0.9894 R=0.9894 (188/188/186)
swift_bic F1=0.9905 P=0.9811 R=1.0000 (52/53/52)
tax_id F1=0.9655 P=1.0000 R=0.9333 (15/14/14)
time F1=0.8421 P=0.8786 R=0.8085 (188/173/152)
unique_id F1=0.8571 P=0.8000 R=0.9231 (13/15/12)
url F1=0.9752 P=0.9688 R=0.9816 (380/385/373)
user_name F1=0.9416 P=0.9477 R=0.9355 (155/153/145)
vehicle_identifier F1=0.9630 P=0.9455 R=0.9811 (53/55/52)
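Each per-category line ends with the same gold/pred/correct span counts as the aggregate rows. Summing those counts over all categories gives micro-averaged scores. For both models the per-category columns sum exactly to the aggregate VITERBI triples: 8627/8452/8056 for A and 8627/8430/7921 for B. The sketch below parses the line format and re-derives F1 from the counts. The regex is written for the layout shown here and is an assumption about any other output.

```python
import re

# Matches lines like "occupation F1=0.6522 P=0.7721 R=0.5645 (372/272/210)".
LINE = re.compile(
    r"^(?P<cat>\w+)\s+F1=(?P<f1>[\d.]+)\s+P=(?P<p>[\d.]+)\s+R=(?P<r>[\d.]+)\s+"
    r"\((?P<gold>\d+)/(?P<pred>\d+)/(?P<correct>\d+)\)$"
)


def parse(line: str) -> tuple[str, float, int, int, int]:
    """Return (category, reported F1, gold, pred, correct) for one log line."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return m["cat"], float(m["f1"]), int(m["gold"]), int(m["pred"]), int(m["correct"])


def micro_f1(rows: list[tuple[str, float, int, int, int]]) -> float:
    """Micro-averaged span F1: pool the counts, then compute F1 once."""
    gold = sum(r[2] for r in rows)
    pred = sum(r[3] for r in rows)
    correct = sum(r[4] for r in rows)
    return 2 * correct / (gold + pred) if gold + pred else 0.0


sample = [
    "occupation F1=0.6522 P=0.7721 R=0.5645 (372/272/210)",
    "company_name F1=0.8824 P=0.9143 R=0.8526 (563/525/480)",
]
rows = [parse(s) for s in sample]
for cat, reported, gold, pred, correct in rows:
    recomputed = 2 * correct / (gold + pred)
    # Reported values are rounded to 4 decimals, so allow half a unit of error.
    assert abs(recomputed - reported) <= 5e-5, cat
print(f"micro F1 over sample: {micro_f1(rows):.4f}")
```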