Results
======================================================================
Calculating evaluation metrics...
=== Evaluating on 496 samples ===
--- Categorical and Accuracy-Based Evaluation ---
Weather Accuracy: 0.00% (36/496 samples)
Time Accuracy: 0.00% (1/496 samples)
Illumination Accuracy: 28.92% (408/496 samples)
Visibility Accuracy: 7.69% (416/496 samples)
Road Surface Accuracy: 31.19% (388/496 samples)
Traffic Lights Accuracy: 83.33% (6/496 samples)
Road Type Accuracy: 0.00% (3/496 samples)
Location Accuracy: 0.00% (1/496 samples)
Risk Accuracy: 45.45% (11/496 samples)
Intention Accuracy: 72.73% (11/496 samples)
Exact Match Accuracy: 0.00%
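The percentages above are consistent with each field being scored only on the samples that had a usable label: the count before the slash looks like the number of scored samples, not the number of correct ones (for Risk, 5 correct out of 11 scored gives 45.45%). A minimal Python sketch of that calculation, assuming this reading is right; `field_accuracy` and the sample data are hypothetical and are not taken from the evaluation script:

```python
def field_accuracy(pairs, total):
    """Return (accuracy_percent, scored, total) for one field.

    pairs: list of (prediction, label) tuples; a label of None marks a
    sample with no usable ground truth, which is skipped (assumption).
    """
    scored = [(p, y) for p, y in pairs if y is not None]
    if not scored:
        return 0.0, 0, total
    correct = sum(1 for p, y in scored if p == y)
    return 100.0 * correct / len(scored), len(scored), total


# Hypothetical data shaped like the Risk row: 11 scored samples, 5 correct,
# and 485 samples without a label.
pairs = [("Yes", "Yes")] * 3 + [("No", "No")] * 2 + [("No", "Yes")] * 6
pairs += [("Yes", None)] * 485

acc, scored, total = field_accuracy(pairs, total=496)
print(f"Risk Accuracy: {acc:.2f}% ({scored}/{total} samples)")
```

Read this way, fields such as Time and Location rest on a single scored sample each, so their 0.00% values carry very little statistical weight.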
--- 'Risk' Field Classification Report ---
              precision    recall  f1-score   support
       ** No       0.00      0.00      0.00         0
          <1       0.00      0.00      0.00         0
          No       0.40      0.67      0.50         3
         Yes       0.75      0.38      0.50         8
    accuracy                           0.45        11
   macro avg       0.29      0.26      0.25        11
weighted avg       0.65      0.45      0.50        11
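The `macro avg` and `weighted avg` rows can be reproduced from per-label counts: the macro average is a plain mean over all four labels (including the two malformed outputs with zero support), while the weighted average uses each label's support as its weight. The sketch below is a hand-rolled illustration in plain Python; the counts are inferred from the rounded values in the report and are an assumption, not data read from the results file.

```python
# Per-label counts inferred from the report (assumption):
# label -> (true positives, times predicted, support)
counts = {
    "** No": (0, 1, 0),
    "<1": (0, 1, 0),
    "No": (2, 5, 3),
    "Yes": (3, 4, 8),
}


def prf(tp, predicted, support):
    """Precision, recall and F1 for one label; 0.0 when undefined."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


rows = {label: prf(*c) for label, c in counts.items()}
supports = {label: c[2] for label, c in counts.items()}
n = sum(supports.values())

# Macro average: plain mean over labels, so zero-support labels pull it down.
macro = [sum(r[i] for r in rows.values()) / len(rows) for i in range(3)]
# Weighted average: each label weighted by its support.
weighted = [sum(rows[k][i] * supports[k] for k in rows) / n for i in range(3)]
accuracy = sum(c[0] for c in counts.values()) / n

print("accuracy     %.2f" % accuracy)
print("macro avg    %.2f %.2f %.2f" % tuple(macro))
print("weighted avg %.2f %.2f %.2f" % tuple(weighted))
```

Under these counts the printed values agree with the three summary rows of the report. The gap between the macro and weighted rows comes largely from the two malformed labels, which contribute zeros to the macro mean but have no support and so no weight in the weighted one.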
--- Performance Summary ---
Average Accuracy across all fields: 26.93%
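The 26.93% figure matches an unweighted mean of the ten per-field accuracies, so a field scored on a single sample (Time, Location) counts as much as one scored on 416 (Visibility). A short Python check using the values reported in the log; the sample-weighted variant is shown only for comparison and is not something the evaluation script is known to compute:

```python
# (accuracy %, scored samples) per field, copied from the log.
fields = {
    "Weather": (0.00, 36),
    "Time": (0.00, 1),
    "Illumination": (28.92, 408),
    "Visibility": (7.69, 416),
    "Road Surface": (31.19, 388),
    "Traffic Lights": (83.33, 6),
    "Road Type": (0.00, 3),
    "Location": (0.00, 1),
    "Risk": (45.45, 11),
    "Intention": (72.73, 11),
}

# Unweighted mean: every field counts equally.
unweighted = sum(acc for acc, _ in fields.values()) / len(fields)

# Sample-weighted mean: fields with more scored samples count more.
total_scored = sum(n for _, n in fields.values())
sample_weighted = sum(acc * n for acc, n in fields.values()) / total_scored

print(f"Unweighted mean: {unweighted:.2f}%")
print(f"Sample-weighted mean: {sample_weighted:.2f}%")
```

Because the two means weight the fields differently, the headline average is best read together with the per-field sample counts.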
--- Inference Time Analysis ---
Average inference time: 1893.4 ms
Median inference time: 1427.1 ms
Min/Max: 1328.3 / 3684.9 ms
=== Evaluation Complete ===
======================================================================
EVALUATION SUMMARY
Model: enpeizhao/internvl2-1b-odd-distilled-merged
Test Samples: 500
Results File: test_results_distilled_20260129_203526.csv
Field Accuracies:
Traffic Lights   83.3%
Intention        72.7%
Risk             45.5%
Road Surface     31.2%
Illumination     28.9%
Visibility        7.7%
Weather           0.0%
Time              0.0%
Road Type         0.0%
Location          0.0%
AVERAGE          26.9%
======================================================================