YOLOv11s: Automated Road Surface Damage Detection

Model Description

This model is a fine-tuned YOLOv11s object detection model trained to detect and classify four types of road surface damage from street-level imagery. Given a dashcam or road-facing image, the model outputs bounding boxes and class labels identifying where damage is located and what type it is.

Training approach: Fine-tuned from yolo11s.pt (pre-trained on COCO) on a curated subset of the RDD2022 dataset via Roboflow.

Intended use cases:

  • Municipal road maintenance prioritization
  • Automated road condition monitoring using dashcam footage
  • State DOT inspection support to reduce manual survey labor costs
  • Research into scalable road infrastructure assessment tools
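Inference follows the standard Ultralytics API; a minimal sketch, assuming the ultralytics package is installed, with placeholder file names (best.pt for the fine-tuned checkpoint, dashcam_frame.jpg for an input image — neither ships with this card):

```python
# Minimal inference sketch. "best.pt" and "dashcam_frame.jpg" are
# placeholders for the fine-tuned weights and an input image.
from ultralytics import YOLO

model = YOLO("best.pt")  # fine-tuned YOLOv11s checkpoint
results = model("dashcam_frame.jpg", imgsz=640, conf=0.25)

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{cls_name}: conf={float(box.conf):.2f} "
              f"box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

Lowering the conf threshold trades precision for recall, which matters here given the model's tendency toward missed detections (see Evaluation Results).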

Training Data

Dataset: Road Damage Dataset 2022 (RDD2022), Roboflow fork
Source: Roboflow Universe, road-damage-vmqh5
Original dataset: Arya et al. (2022), RDD2022: A multi-national image dataset for automatic Road Damage Detection

Total images: 9,732
Number of classes: 4

Class Distribution (Training Set)

| Class | Training Instances |
|---|---|
| Alligator Cracks | ~3,200 |
| Longitudinal Cracks | ~2,800 |
| Transverse Cracks | ~1,900 |
| Potholes | ~1,400 |

Three classes present in the original Roboflow dataset (Damaged Crosswalk, Damaged Paint, Manhole Cover) were removed as they fell outside the scope of structural road surface damage detection.

Train / Val / Test split: 70% / 20% / 10%
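The Roboflow export is consumed through a standard Ultralytics dataset YAML; a hypothetical sketch of it is below (the path, folder names, and class index order are illustrative, not the exact export layout):

```yaml
# Hypothetical data.yaml for the curated 4-class split.
path: datasets/road-damage
train: train/images
val: valid/images
test: test/images
names:
  0: Alligator Cracks
  1: Longitudinal Cracks
  2: Transverse Cracks
  3: Potholes
```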

Data collection: Images sourced from the Japan subset of RDD2022, collected via dashcam from moving vehicles across diverse road conditions and environments.

Annotation process: The dataset came pre-annotated by research teams in YOLO format, so no format conversion was needed. Annotations were reviewed for label accuracy and bounding box tightness, and three irrelevant classes were identified and removed. The value added in this project was quality filtering, class remapping, and dataset curation through Roboflow rather than raw annotation from scratch.

Augmentation applied:

  • Horizontal flip (p=0.5)
  • HSV value shift (hsv_v=0.4)
  • Mosaic augmentation (mosaic=1.0)
  • Vertical flip disabled (flipud=0.0)

Known biases and limitations in training data:

  • Geographically limited to Japan road infrastructure: damage patterns, surface materials, and road marking styles differ significantly in other countries
  • Class imbalance exists: Potholes have significantly fewer training instances than crack classes
  • Images are captured during daylight under varied but generally clear conditions; performance in rain, night, or heavy occlusion is untested

Training Procedure

Framework: Ultralytics YOLOv11
Base model: yolo11s.pt (COCO pre-trained)
Hardware: Google Colab, NVIDIA T4 GPU (15 GB VRAM)
Training time: ~55 minutes (15 epochs on T4)

| Hyperparameter | Value |
|---|---|
| Epochs | 15 |
| Image size | 640 × 640 |
| Batch size | 16 |
| Patience (early stopping) | 10 |
| Optimizer | AdamW (auto) |
| Horizontal flip | 0.5 |
| Mosaic | 1.0 |
| hsv_v | 0.4 |

Note on epochs: Training was limited to 15 epochs due to compute constraints on the free Colab tier. The loss curves at epoch 15 show the model had not yet plateaued, so performance is expected to improve substantially with longer training. A full 50-epoch run on the complete dataset is the primary next step.

Preprocessing: Images resized to 640 × 640. No additional normalization beyond standard Ultralytics defaults.
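Put together, the run above corresponds roughly to the following Ultralytics training call; this is a sketch, assuming a data.yaml from the Roboflow export (the file name is a placeholder):

```python
# Training configuration sketch matching the hyperparameter table above.
# "data.yaml" is a placeholder for the Roboflow export's dataset config.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # COCO pre-trained base
model.train(
    data="data.yaml",
    epochs=15,
    imgsz=640,
    batch=16,
    patience=10,
    optimizer="auto",  # resolved to AdamW for this run
    fliplr=0.5,        # horizontal flip probability
    flipud=0.0,        # vertical flip disabled
    mosaic=1.0,
    hsv_v=0.4,
)
```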


Evaluation Results

Metrics derived from the held-out test set (~975 images, 10% split). Precision and recall calculated from the confusion matrix.

Overall Metrics

| Metric | Score |
|---|---|
| mAP50 | ~0.47 |
| mAP50-95 | ~0.19 |
| Precision | 0.63 |
| Recall | 0.40 |
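To make the confusion-matrix derivation concrete, here is a sketch with made-up counts (chosen to land near the per-class numbers reported in this card, not the model's actual matrix). Rows are predicted classes, columns are true classes, and the extra background row/column holds missed and spurious detections:

```python
# Illustrative sketch: per-class precision and recall from a detection
# confusion matrix. All counts are invented for the example.
classes = ["Alligator", "Longitudinal", "Transverse", "Pothole", "background"]
# cm[i][j] = detections predicted as class i whose true label is class j
cm = [
    [655,  10,   5,   3, 220],   # predicted Alligator
    [ 12, 325,   8,   2, 195],   # predicted Longitudinal
    [  4,   9, 213,   1, 110],   # predicted Transverse
    [  6,   3,   2, 234, 170],   # predicted Pothole
    [580, 610, 530, 280,   0],   # predicted background (missed instances)
]

for i, name in enumerate(classes[:-1]):
    tp = cm[i][i]
    predicted = sum(cm[i])               # everything the model called class i
    actual = sum(row[i] for row in cm)   # every true instance of class i
    precision = tp / predicted           # TP / (TP + FP)
    recall = tp / actual                 # TP / (TP + FN)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The large background row dominates each class's false negatives, which is exactly the missed-detection failure mode discussed under Key Visualizations.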

Per-Class Breakdown

| Class | Correct Detections | Precision | Recall | F1 |
|---|---|---|---|---|
| Alligator Cracks | 655 | 0.73 | 0.52 | 0.60 |
| Longitudinal Cracks | 325 | 0.60 | 0.34 | 0.43 |
| Potholes | 234 | 0.56 | 0.45 | 0.50 |
| Transverse Cracks | 213 | 0.63 | 0.28 | 0.39 |
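The F1 column follows from the precision and recall columns via F1 = 2PR / (P + R); a quick check (small deviations come from the table's two-decimal rounding):

```python
# Recompute F1 from the per-class precision/recall values in the table.
per_class = {
    "Alligator Cracks":    (0.73, 0.52),
    "Longitudinal Cracks": (0.60, 0.34),
    "Potholes":            (0.56, 0.45),
    "Transverse Cracks":   (0.63, 0.28),
}
for name, (p, r) in per_class.items():
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{name}: F1 = {f1:.2f}")
```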

Visual Examples of Each Class

Alligator Cracks: Interconnected web-like fracture patterns spreading across the road surface, resembling alligator skin. Typically indicate deep structural fatigue beneath the surface layer. Visually the most distinctive class in this dataset.

Longitudinal Cracks: Linear cracks running parallel to the direction of traffic. Often caused by pavement shrinkage, lane edge stress, or subgrade settlement. Can be subtle and easily confused with road markings under low contrast.

Transverse Cracks: Linear cracks running perpendicular to traffic direction. Typically caused by thermal contraction or reflection cracking from underlying structural layers. Visually similar to lane markings in some lighting conditions.

Potholes: Bowl-shaped surface depressions caused by progressive failure of the road surface under repeated traffic load and water infiltration. The least represented class in the training set.

Key Visualizations

1. Loss & Metrics Curves

All three training losses (box, classification, DFL) decrease consistently across 15 epochs. The mAP50 curve trends upward throughout training with no sign of plateauing, confirming the model is still actively learning at the point training was stopped. Train and validation losses track closely together, indicating no overfitting at this stage.

2. Confusion Matrix

Alligator Cracks achieve the highest correct detection count (655), reflecting their visual distinctiveness and higher training representation. The most notable pattern across all classes is the background row: a large proportion of true instances are missed entirely and classified as background. Inter-class confusion between the four damage types is low, meaning that when the model does fire a detection, it classifies the damage type correctly. The main failure mode is missed detections, not misclassification.

3. Sample Predictions

Real validation set predictions showing all four damage classes detected on Japan road imagery.


Limitations and Biases

Known failure cases:

  • The model misses a substantial portion of instances across all four classes, classifying them as background. This is the primary failure mode and is expected to diminish significantly with more training epochs and more training data
  • Damage near road markings causes confusion: Transverse Cracks in particular are missed when they visually overlap with painted lane lines or crosswalks
  • Small, shallow, or partially occluded cracks are frequently missed, especially when debris or tire marks reduce the visible crack area

Poor performing classes:

  • Transverse Cracks have the lowest recall (0.28): the model detects fewer than 1 in 3 true instances. This is likely a combination of fewer training examples and high visual similarity to road markings
  • Longitudinal Cracks recall is also low (0.34) despite the class being the most common in the original full dataset, suggesting the Roboflow subset used may not be fully representative
  • Potholes perform moderately (F1 0.50) but have the fewest training instances, making performance estimates less statistically reliable

Data biases:

  • Geographic: Training data is limited to Japan road imagery. Road surface materials, damage patterns, and marking styles differ across countries. Performance on US or European roads is untested and likely lower
  • Environmental: All images appear captured in daylight under clear or partly cloudy conditions. Night-time driving, heavy rain, snow, or glare are not represented in training data
  • Class imbalance: Potholes are underrepresented relative to crack classes, which likely suppresses recall for that class

Inappropriate use cases:

  • Should not be used as the sole basis for infrastructure safety or repair decisions without human review
  • Not suitable for detecting damage types outside the four trained classes
  • Should not be deployed in countries with significantly different road infrastructure without fine-tuning on local data
  • Not validated for real-time inference from moving vehicles; only tested on static image inputs

Ethical considerations:

  • Automated road prioritization systems could inadvertently reinforce existing maintenance inequities if the underlying routing logic is not designed with equity in mind. The model itself has no mechanism to prevent this.

Sample size limitations:

  • The test set contains approximately 975 images total. Per-class instance counts for lower-frequency classes (Potholes, Transverse Cracks) may be too small for statistically robust precision and recall estimates
  • 15 epochs of training is below the recommended threshold for this dataset size; all metrics should be treated as a baseline rather than a final assessment of the model's capability

Model trained as part of B DATA 497 (Computer Vision Techniques), University of Washington, March 2025. Author: Lewi Alemayehu

