YOLOv11s – Automated Road Surface Damage Detection
Model Description
This model is a fine-tuned YOLOv11s object detection model trained to detect and classify four types of road surface damage from street-level imagery. Given a dashcam or road-facing image, the model outputs bounding boxes and class labels identifying where damage is located and what type it is.
Training approach: Fine-tuned from yolo11s.pt (pre-trained on COCO) on a curated subset of the RDD2022 dataset via Roboflow.
Intended use cases:
- Municipal road maintenance prioritization
- Automated road condition monitoring using dashcam footage
- State DOT inspection support to reduce manual survey labor costs
- Research into scalable road infrastructure assessment tools
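For the monitoring use cases above, inference might look like the following sketch (assuming the `ultralytics` package is installed; the weight filename `best.pt` and the class-index order are assumptions, so check the exported `data.yaml` for the real mapping):

```python
def detect(image_path, weights="best.pt"):
    """Run the fine-tuned detector on one image; returns (class_id, confidence) pairs."""
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights)
    results = model.predict(image_path, imgsz=640, conf=0.25)
    return [(int(b.cls), float(b.conf)) for b in results[0].boxes]

# Class-index order is an assumption; verify against the dataset's data.yaml.
CLASS_NAMES = {0: "Alligator Cracks", 1: "Longitudinal Cracks",
               2: "Transverse Cracks", 3: "Potholes"}

def summarize(detections):
    """Count detections per class from (class_id, confidence) pairs."""
    counts = {}
    for cls_id, _conf in detections:
        name = CLASS_NAMES.get(cls_id, "unknown")
        counts[name] = counts.get(name, 0) + 1
    return counts
```

A per-image class summary like this is the natural unit for maintenance prioritization pipelines, since it aggregates raw boxes into damage counts per road segment.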
Training Data
Dataset: Road Damage Dataset 2022 (RDD2022) – Roboflow fork
Source: Roboflow Universe – road-damage-vmqh5
Original dataset: Arya et al. (2022), RDD2022: A multi-national image dataset for automatic Road Damage Detection
Total images: 9,732
Number of classes: 4
Class Distribution (Training Set)
| Class | Training Instances |
|---|---|
| Alligator Cracks | ~3,200 |
| Longitudinal Cracks | ~2,800 |
| Transverse Cracks | ~1,900 |
| Potholes | ~1,400 |
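Counts like those in the table can be reproduced from any YOLO-format export with a short script (a sketch; the labels directory layout is an assumption, and each non-empty label line is expected to start with an integer class id):

```python
from collections import Counter
from pathlib import Path

def class_counts(labels_dir, names):
    """Count object instances per class across YOLO-format label files.

    Each non-empty line of a label .txt file is: <class_id> <x> <y> <w> <h>.
    """
    counts = Counter()
    for txt in sorted(Path(labels_dir).glob("*.txt")):
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[names[int(line.split()[0])]] += 1
    return dict(counts)
```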
Three classes present in the original Roboflow dataset (Damaged Crosswalk, Damaged Paint, Manhole Cover) were removed as they fell outside the scope of structural road surface damage detection.
Train / Val / Test split: 70% / 20% / 10%
Data collection: Images sourced from the Japan subset of RDD2022, collected via dashcam from moving vehicles across diverse road conditions and environments.
Annotation process: The dataset came pre-annotated by the original research teams in YOLO format, so no format conversion was needed. Annotations were reviewed for label accuracy and bounding box tightness, and three irrelevant classes were identified and removed. The value added in this project was quality filtering, class remapping, and dataset curation through Roboflow rather than annotation from scratch.
Augmentation applied:
- Horizontal flip (p=0.5)
- HSV value shift (hsv_v=0.4)
- Mosaic augmentation (mosaic=1.0)
- Vertical flip disabled (flipud=0.0)
Known biases and limitations in training data:
- Geographically limited to Japanese road infrastructure; damage patterns, surface materials, and road marking styles differ significantly in other countries
- Class imbalance: Potholes have significantly fewer training instances than the crack classes
- Images are captured during daylight under varied but generally clear conditions; performance in rain, night, or heavy occlusion is untested
Training Procedure
Framework: Ultralytics YOLOv11
Base model: yolo11s.pt (COCO pre-trained)
Hardware: Google Colab – NVIDIA T4 GPU (15 GB VRAM)
Training time: ~55 minutes (15 epochs on T4)
| Hyperparameter | Value |
|---|---|
| Epochs | 15 |
| Image size | 640 Γ 640 |
| Batch size | 16 |
| Patience (early stopping) | 10 |
| Optimizer | AdamW (auto) |
| Horizontal flip | 0.5 |
| Mosaic | 1.0 |
| hsv_v | 0.4 |
Note on epochs: Training was limited to 15 epochs due to compute constraints on the free Colab tier. The loss curves at epoch 15 show the model had not yet plateaued, so performance is expected to improve with longer training. A full 50-epoch run on the complete dataset is the primary next step.
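The run described above can be reproduced with the Ultralytics Python API roughly as follows (a sketch; the `data.yaml` path is an assumption and comes from the Roboflow export):

```python
# Hyperparameters copied from the table above; the augmentation keys
# (fliplr, flipud, mosaic, hsv_v) mirror the augmentation section.
TRAIN_ARGS = dict(
    data="data.yaml",   # assumed path to the Roboflow dataset config
    epochs=15,
    imgsz=640,
    batch=16,
    patience=10,
    optimizer="auto",   # Ultralytics selected AdamW automatically here
    fliplr=0.5,
    flipud=0.0,
    mosaic=1.0,
    hsv_v=0.4,
)

def train(weights="yolo11s.pt"):
    """Fine-tune the COCO pre-trained base model; requires the ultralytics package."""
    from ultralytics import YOLO
    return YOLO(weights).train(**TRAIN_ARGS)
```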
Preprocessing: Images resized to 640Γ640. No additional normalization beyond standard Ultralytics defaults.
Evaluation Results
Metrics derived from the held-out test set (~975 images, 10% split). Precision and recall calculated from the confusion matrix.
Overall Metrics
| Metric | Score |
|---|---|
| mAP50 | ~0.47 |
| mAP50-95 | ~0.19 |
| Precision | 0.63 |
| Recall | 0.40 |
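Since precision and recall here are derived from the confusion matrix, the derivation can be sketched as follows (assuming the Ultralytics plotting convention of rows = predicted class, columns = true class, with a trailing background row/column):

```python
def precision_recall(cm, i):
    """Per-class precision and recall from a confusion matrix.

    Assumes rows are predicted classes and columns are true classes
    (the Ultralytics plotting convention), with background last.
    """
    tp = cm[i][i]
    predicted = sum(cm[i])               # TP + FP: everything predicted as class i
    actual = sum(row[i] for row in cm)   # TP + FN: every true instance of class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```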
Per-Class Breakdown
| Class | Correct Detections | Precision | Recall | F1 |
|---|---|---|---|---|
| Alligator Cracks | 655 | 0.73 | 0.52 | 0.60 |
| Longitudinal Cracks | 325 | 0.60 | 0.34 | 0.43 |
| Potholes | 234 | 0.56 | 0.45 | 0.50 |
| Transverse Cracks | 213 | 0.63 | 0.28 | 0.39 |
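The F1 column follows from the precision and recall columns as their harmonic mean; the table values agree with this within rounding:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. Longitudinal Cracks: f1(0.60, 0.34) rounds to 0.43, matching the table.
```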
Visual Examples of Each Class
Alligator Cracks β Interconnected web-like fracture patterns spreading across the road surface, resembling alligator skin. Typically indicate deep structural fatigue beneath the surface layer. Visually the most distinctive class in this dataset.
Longitudinal Cracks β Linear cracks running parallel to the direction of traffic. Often caused by pavement shrinkage, lane edge stress, or subgrade settlement. Can be subtle and easily confused with road markings under low contrast.
Transverse Cracks β Linear cracks running perpendicular to traffic direction. Typically caused by thermal contraction or reflection cracking from underlying structural layers. Visually similar to lane markings in some lighting conditions.
Potholes β Bowl-shaped surface depressions caused by progressive failure of the road surface under repeated traffic load and water infiltration. The least represented class in the training set.
Key Visualizations
1. Loss & Metrics Curves
All three training losses (box, classification, DFL) decrease consistently across 15 epochs. The mAP50 curve trends upward throughout training with no sign of plateauing, confirming the model is still actively learning at the point training was stopped. Train and validation losses track closely together, indicating no overfitting at this stage.
2. Confusion Matrix
Alligator Cracks achieve the highest correct detection count (655), reflecting their visual distinctiveness and higher training representation. The most notable pattern across all classes is the background row: a large proportion of true instances are missed entirely and classified as background. Inter-class confusion between the four damage types is low, meaning when the model does fire a detection, it classifies the damage type correctly. The main failure mode is missed detections, not misclassification.
3. Sample Predictions
Real validation set predictions showing all four damage classes detected on Japan road imagery.
Limitations and Biases
Known failure cases:
- The model misses a substantial portion of instances across all four classes, classifying them as background. This is the primary failure mode and is expected to reduce significantly with more training epochs and more training data
- Damage near road markings causes confusion: Transverse Cracks in particular are missed when they visually overlap with painted lane lines or crosswalks
- Small, shallow, or partially occluded cracks are frequently missed, especially when debris or tire marks reduce the visible crack area
Poor performing classes:
- Transverse Cracks have the lowest recall (0.28): the model detects fewer than 1 in 3 true instances. This is likely a combination of fewer training examples and high visual similarity to road markings
- Longitudinal Cracks recall is also low (0.34) despite being the most common class in the original full dataset, suggesting the Roboflow subset used may not be fully representative
- Potholes perform moderately (F1 0.50) but have the fewest training instances, making performance estimates less statistically reliable
Data biases:
- Geographic: Training data is limited to Japanese road imagery. Road surface materials, damage patterns, and marking styles differ across countries. Performance on US or European roads is untested and likely lower
- Environmental: All images appear captured in daylight under clear or partly cloudy conditions. Night-time driving, heavy rain, snow, or glare are not represented in training data
- Class imbalance: Potholes are underrepresented relative to crack classes, which likely suppresses recall for that class
Inappropriate use cases:
- Should not be used as the sole basis for infrastructure safety or repair decisions without human review
- Not suitable for detecting damage types outside the four trained classes
- Should not be deployed in countries with significantly different road infrastructure without fine-tuning on local data
- Not validated for real-time inference from moving vehicles; only tested on static image inputs
Ethical considerations:
- Automated road prioritization systems could inadvertently reinforce existing maintenance inequities if the underlying routing logic is not designed with equity in mind. The model itself has no mechanism to prevent this.
Sample size limitations:
- The test set contains approximately 975 images total. Per-class instance counts for lower-frequency classes (Potholes, Transverse Cracks) may be too small for statistically robust precision and recall estimates
- 15 epochs of training is below the recommended threshold for this dataset size β all metrics should be treated as a baseline rather than a final assessment of the model's capability
Model trained as part of B DATA 497 – Computer Vision Techniques, University of Washington, March 2025. Author: Lewi Alemayehu