YOLOv11s – Automated Road Surface Damage Detection
Model Description
This model is a fine-tuned YOLOv11s object detection model trained to detect and classify four types of road surface damage from street-level imagery. Given a dashcam or road-facing image, the model outputs bounding boxes and class labels identifying where damage is located and what type it is.
Training approach: Fine-tuned from yolo11s.pt (pre-trained on COCO) on a curated subset of the RDD2022 dataset via Roboflow.
Intended use cases:
- Municipal road maintenance prioritization
- Automated road condition monitoring using dashcam footage
- State DOT inspection support to reduce manual survey labor costs
- Research into scalable road infrastructure assessment tools
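For the monitoring use cases above, inference might look like the following sketch (assuming the `ultralytics` package is installed; the weight filename `best.pt` and the class-index order are assumptions, so check the exported `data.yaml` for the real mapping):

```python
def detect(image_path, weights="best.pt"):
    """Run the fine-tuned detector on one image; returns (class_id, confidence) pairs."""
    from ultralytics import YOLO  # requires the ultralytics package
    model = YOLO(weights)
    results = model.predict(image_path, imgsz=640, conf=0.25)
    return [(int(b.cls), float(b.conf)) for b in results[0].boxes]

# Class-index order is an assumption; verify against the dataset's data.yaml.
CLASS_NAMES = {0: "Alligator Cracks", 1: "Longitudinal Cracks",
               2: "Transverse Cracks", 3: "Potholes"}

def summarize(detections):
    """Count detections per class from (class_id, confidence) pairs."""
    counts = {}
    for cls_id, _conf in detections:
        name = CLASS_NAMES.get(cls_id, "unknown")
        counts[name] = counts.get(name, 0) + 1
    return counts
```

A per-image class summary like this is the natural unit for maintenance prioritization pipelines, since it aggregates raw boxes into damage counts per road segment.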
Training Data
Dataset: Road Damage Dataset 2022 (RDD2022) – Roboflow fork
Source: Roboflow Universe – road-damage-vmqh5
Original dataset: Arya et al. (2022), RDD2022: A multi-national image dataset for automatic Road Damage Detection
Total images: 9,732
Number of classes: 4
Class Distribution (Training Set)
| Class | Training Instances |
|---|---|
| Alligator Cracks | ~3,200 |
| Longitudinal Cracks | ~2,800 |
| Transverse Cracks | ~1,900 |
| Potholes | ~1,400 |
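Counts like those in the table can be reproduced from any YOLO-format export with a short script (a sketch; the labels directory layout is an assumption, and each non-empty label line is expected to start with an integer class id):

```python
from collections import Counter
from pathlib import Path

def class_counts(labels_dir, names):
    """Count object instances per class across YOLO-format label files.

    Each non-empty line of a label .txt file is: <class_id> <x> <y> <w> <h>.
    """
    counts = Counter()
    for txt in sorted(Path(labels_dir).glob("*.txt")):
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[names[int(line.split()[0])]] += 1
    return dict(counts)
```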
Three classes present in the original Roboflow dataset (Damaged Crosswalk, Damaged Paint, Manhole Cover) were removed as they fell outside the scope of structural road surface damage detection.
Train / Val / Test split: 70% / 20% / 10%
Data collection: Images sourced from the Japan subset of RDD2022, collected via dashcam from moving vehicles across diverse road conditions and environments.
Annotation process: The dataset came pre-annotated by the original research teams in YOLO format, so no format conversion was needed. Annotations were reviewed for label accuracy and bounding box tightness, and three irrelevant classes were identified and removed. The value added in this project was quality filtering, class remapping, and dataset curation through Roboflow rather than annotation from scratch.
Augmentation applied:
- Horizontal flip (p=0.5)
- HSV value shift (hsv_v=0.4)
- Mosaic augmentation (mosaic=1.0)
- Vertical flip disabled (flipud=0.0)
Known biases and limitations in training data:
- Geographically limited to Japanese road infrastructure; damage patterns, surface materials, and road marking styles differ significantly in other countries
- Class imbalance: Potholes have significantly fewer training instances than the crack classes
- Images are captured during daylight under varied but generally clear conditions; performance in rain, night, or heavy occlusion is untested
Training Procedure
Framework: Ultralytics YOLOv11
Base model: yolo11s.pt (COCO pre-trained)
Hardware: Google Colab – NVIDIA T4 GPU (15 GB VRAM)
Training time: ~55 minutes (15 epochs on T4)
| Hyperparameter | Value |
|---|---|
| Epochs | 15 |
| Image size | 640 Γ 640 |
| Batch size | 16 |
| Patience (early stopping) | 10 |
| Optimizer | AdamW (auto) |
| Horizontal flip | 0.5 |
| Mosaic | 1.0 |
| hsv_v | 0.4 |
Note on epochs: Training was limited to 15 epochs due to compute constraints on the free Colab tier. The loss curves at epoch 15 show the model had not yet plateaued, so performance is expected to improve with longer training. A full 50-epoch run on the complete dataset is the primary next step.
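The run described above can be reproduced with the Ultralytics Python API roughly as follows (a sketch; the `data.yaml` path is an assumption and comes from the Roboflow export):

```python
# Hyperparameters copied from the table above; the augmentation keys
# (fliplr, flipud, mosaic, hsv_v) mirror the augmentation section.
TRAIN_ARGS = dict(
    data="data.yaml",   # assumed path to the Roboflow dataset config
    epochs=15,
    imgsz=640,
    batch=16,
    patience=10,
    optimizer="auto",   # Ultralytics selected AdamW automatically here
    fliplr=0.5,
    flipud=0.0,
    mosaic=1.0,
    hsv_v=0.4,
)

def train(weights="yolo11s.pt"):
    """Fine-tune the COCO pre-trained base model; requires the ultralytics package."""
    from ultralytics import YOLO
    return YOLO(weights).train(**TRAIN_ARGS)
```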
Preprocessing: Images resized to 640Γ640. No additional normalization beyond standard Ultralytics defaults.
Evaluation Results
Metrics derived from the held-out test set (~975 images, 10% split). Precision and recall calculated from the confusion matrix.
Overall Metrics
| Metric | Score |
|---|---|
| mAP50 | ~0.47 |
| mAP50-95 | ~0.19 |
| Precision | 0.63 |
| Recall | 0.40 |
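Since precision and recall here are derived from the confusion matrix, the derivation can be sketched as follows (assuming the Ultralytics plotting convention of rows = predicted class, columns = true class, with a trailing background row/column):

```python
def precision_recall(cm, i):
    """Per-class precision and recall from a confusion matrix.

    Assumes rows are predicted classes and columns are true classes
    (the Ultralytics plotting convention), with background last.
    """
    tp = cm[i][i]
    predicted = sum(cm[i])               # TP + FP: everything predicted as class i
    actual = sum(row[i] for row in cm)   # TP + FN: every true instance of class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```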
Per-Class Breakdown
| Class | Correct Detections | Precision | Recall | F1 |
|---|---|---|---|---|
| Alligator Cracks | 655 | 0.73 | 0.52 | 0.60 |
| Longitudinal Cracks | 325 | 0.60 | 0.34 | 0.43 |
| Potholes | 234 | 0.56 | 0.45 | 0.50 |
| Transverse Cracks | 213 | 0.63 | 0.28 | 0.39 |
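The F1 column follows from the precision and recall columns as their harmonic mean; the table values agree with this within rounding:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. Longitudinal Cracks: f1(0.60, 0.34) rounds to 0.43, matching the table.
```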
Visual Examples of Each Class
Alligator Cracks β Interconnected web-like fracture patterns spreading across the road surface, resembling alligator skin. Typically indicate deep structural fatigue beneath the surface layer. Visually the most distinctive class in this dataset.
Longitudinal Cracks β Linear cracks running parallel to the direction of traffic. Often caused by pavement shrinkage, lane edge stress, or subgrade settlement. Can be subtle and easily confused with road markings under low contrast.
Transverse Cracks β Linear cracks running perpendicular to traffic direction. Typically caused by thermal contraction or reflection cracking from underlying structural layers. Visually similar to lane markings in some lighting conditions.
Potholes β Bowl-shaped surface depressions caused by progressive failure of the road surface under repeated traffic load and water infiltration. The least represented class in the training set.
Key Visualizations
1. Loss & Metrics Curves
All three training losses (box, classification, DFL) decrease consistently across 15 epochs. The mAP50 curve trends upward throughout training with no sign of plateauing, confirming the model is still actively learning at the point training was stopped. Train and validation losses track closely together, indicating no overfitting at this stage.
2. Confusion Matrix
Alligator Cracks achieve the highest correct detection count (655), reflecting their visual distinctiveness and higher training representation. The most notable pattern across all classes is the background row: a large proportion of true instances are missed entirely and classified as background. Inter-class confusion between the four damage types is low, meaning when the model does fire a detection, it classifies the damage type correctly. The main failure mode is missed detections, not misclassification.
3. Sample Predictions
Real validation set predictions showing all four damage classes detected on Japan road imagery.
Limitations and Biases
Known failure cases:
- The model misses a substantial portion of instances across all four classes, classifying them as background. This is the primary failure mode and is expected to reduce significantly with more training epochs and more training data
- Damage near road markings causes confusion: Transverse Cracks in particular are missed when they visually overlap with painted lane lines or crosswalks
- Small, shallow, or partially occluded cracks are frequently missed, especially when debris or tire marks reduce the visible crack area
Poor performing classes:
- Transverse Cracks have the lowest recall (0.28): the model detects fewer than 1 in 3 true instances. This is likely a combination of fewer training examples and high visual similarity to road markings
- Longitudinal Cracks recall is also low (0.34) despite being the most common class in the original full dataset, suggesting the Roboflow subset used may not be fully representative
- Potholes perform moderately (F1 0.50) but have the fewest training instances, making performance estimates less statistically reliable
Data biases:
- Geographic: Training data is limited to Japanese road imagery. Road surface materials, damage patterns, and marking styles differ across countries. Performance on US or European roads is untested and likely lower
- Environmental: All images appear captured in daylight under clear or partly cloudy conditions. Night-time driving, heavy rain, snow, or glare are not represented in training data
- Class imbalance: Potholes are underrepresented relative to crack classes, which likely suppresses recall for that class
Inappropriate use cases:
- Should not be used as the sole basis for infrastructure safety or repair decisions without human review
- Not suitable for detecting damage types outside the four trained classes
- Should not be deployed in countries with significantly different road infrastructure without fine-tuning on local data
- Not validated for real-time inference from moving vehicles; only tested on static image inputs
Ethical considerations:
- Automated road prioritization systems could inadvertently reinforce existing maintenance inequities if the underlying routing logic is not designed with equity in mind. The model itself has no mechanism to prevent this.
Sample size limitations:
- The test set contains approximately 975 images total. Per-class instance counts for lower-frequency classes (Potholes, Transverse Cracks) may be too small for statistically robust precision and recall estimates
- 15 epochs of training is below the recommended threshold for this dataset size β all metrics should be treated as a baseline rather than a final assessment of the model's capability
Model trained as part of B DATA 497 – Computer Vision Techniques, University of Washington, March 2025. Author: Lewi Alemayehu