---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- expert-generated
license: cc-by-sa-4.0
multimodality:
- video
- text
pretty_name: 'SynWTS: Synthetic Woven Traffic Safety Dataset'
size_categories:
- n<1K
source_datasets:
- WTS (Woven Traffic Safety)
tags:
- traffic-safety
- sim2real
- video-captioning
- vqa
- autonomous-driving
- ai-city-challenge
- vlm
- multimodal
- description
task_categories:
- visual-question-answering
- video-classification
- question-answering
- text-generation
- video-text-to-text
task_ids:
- visual-question-answering
- natural-language-inference
- closed-domain-qa
- multiple-choice-qa
contact: David C. Anastasiu danastasiu@scu.edu
---
# SynWTS: Synthetic Woven Traffic Safety Dataset
SynWTS is a high-fidelity synthetic dataset built as a **Digital Twin** of the [Woven Traffic Safety (WTS) dataset](https://woven-visionai.github.io/wts-dataset-homepage/). It is developed for the [**2026 AI City Challenge (Track 2)**](https://www.aicitychallenge.org/2026-track2/) to advance Sim2Real research in transportation safety understanding.
## Dataset Summary
Participants in the Sim2Real challenge must train models exclusively on this synthetic data and evaluate performance on real-world video. SynWTS provides a geometric match to real-world test locations, focusing on pedestrian-involved incidents with multi-view 1080p video, structured temporal captions, and complex Visual Question Answering (VQA) pairs.
### Key Features
- **Sim2Real Benchmark:** Specifically designed to bridge the gap between NVIDIA Isaac Sim environments and real-world traffic scenarios.
- **Multi-View Perception:** Synchronized views from overhead infrastructure cameras and vehicle-ego perspectives.
- **Temporal Segmentation:** Scenarios are partitioned into five safety-critical phases: *Pre-recognition, Recognition, Judgment, Action, and Avoidance* (see the sketch after this list).
- **Structured Annotations:** Descriptions cover four pillars: **Location, Attention, Behavior, and Context.**
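To make the phase labels concrete, here is a minimal Python sketch of the phase vocabulary. The mapping from the numeric `labels` values used in the caption files to phase names is an assumption, not documented on this card: in the caption sample further below, label `"4"` describes a pedestrian who has not yet noticed the vehicle, which suggests the labels count down from Pre-recognition to Avoidance. Verify against the WTS annotation conventions before relying on it.

```python
# Hypothetical mapping, not documented on this card: the numeric "labels"
# in the caption files are assumed to count down from Pre-recognition ("4")
# to Avoidance ("0"). Verify against the WTS annotation conventions.
PHASE_NAMES = {
    "4": "Pre-recognition",
    "3": "Recognition",
    "2": "Judgment",
    "1": "Action",
    "0": "Avoidance",
}

# The four description pillars covered by every caption.
PILLARS = ("Location", "Attention", "Behavior", "Context")
```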
---
## Dataset Structure
### Directory Layout
```text
data/
├── videos/
│   └── {split}/{scenario}/{view}/*.mp4
└── annotations/
    ├── caption/
    │   └── {split}/{scenario}/{view}/{scenario}_caption.json
    ├── bbox_annotated/
    │   ├── pedestrian/{split}/{scenario}/{view}/{scenario}_{camera_id}_bbox.json
    │   └── vehicle/{split}/{scenario}/overhead_view/{scenario}_{camera_id}_bbox.json
    └── vqa/
        └── {split}/{scenario}/{view}/{scenario}.json
```
*{split} = train | val | test*
*{view} = overhead_view | vehicle_view | environment*
*{camera_id} = {camera_ip_address}_{direction_id} | vehicle_view*
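As a quick orientation, here is a minimal Python sketch that resolves these placeholders into concrete paths. The `root` argument and both helper names are hypothetical; only the layout shown above is assumed.

```python
from pathlib import Path

def caption_path(root: str, split: str, scenario: str, view: str) -> Path:
    """Expected caption file for one scenario/view, per the layout above."""
    return (Path(root) / "annotations" / "caption" / split / scenario / view
            / f"{scenario}_caption.json")

def video_paths(root: str, split: str, scenario: str, view: str):
    """Yield every .mp4 clip for one scenario/view, in sorted order."""
    yield from sorted((Path(root) / "videos" / split / scenario / view).glob("*.mp4"))
```

For example, `caption_path("data", "train", scenario, "overhead_view")` returns the file the layout predicts for a given scenario; actual scenario names come from the directory listing itself.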
### Data Fields & Samples
#### 1. Fine-Grained Captions
Captions are generated from a checklist of 170+ traffic items. Each event phase contains a distinct caption for the pedestrian and for the vehicle. We reuse the WTS annotations, updating only the details that could not be reproduced in the current version of the simulation.
**Sample (from overhead_view_caption.json):**
```json
{
"id": 765,
"event_phase": [
{
"labels": ["4"],
"caption_pedestrian": "The pedestrian was a male in his 30s walking slowly... He was standing close behind a vehicle... Although he almost noticed the vehicle, he seemed unaware of it.",
"caption_vehicle": "The vehicle was on the left side of the pedestrian and was close to them... The vehicle slightly collided with the pedestrian while moving at a speed of 0 km/h.",
"start_time": "8.993",
"end_time": "14.903"
}
]
}
```
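A minimal sketch of reading one caption file, assuming each file holds a single JSON object shaped like the sample above; note that the timestamps are stored as strings, so they are parsed to floats here.

```python
import json

def load_event_phases(caption_file: str):
    """Yield (label, start_s, end_s, pedestrian_caption, vehicle_caption)."""
    with open(caption_file) as f:
        record = json.load(f)
    for phase in record["event_phase"]:
        yield (
            phase["labels"][0],          # phase label, e.g. "4"
            float(phase["start_time"]),  # seconds, stored as a string
            float(phase["end_time"]),
            phase["caption_pedestrian"],
            phase["caption_vehicle"],
        )
```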
#### 2. Visual Question Answering (VQA)
Includes multiple-choice questions covering position, distance, visibility, and actions.
**Sample (from vqa-vehicle_view.json):**
```json
{
"question": "What is the action taken by vehicle?",
"a": "Swerved to the left to avoid",
"b": "Swerved to the right, but could not avoid",
"c": "Tried sudden braking but could not avoid",
"d": "Collided with the pedestrian",
"correct": "d"
}
```
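And a minimal scoring sketch for the VQA entries, assuming each file holds a JSON list of objects shaped like the sample above; `predict` stands in for any model callable and is hypothetical.

```python
import json

CHOICE_KEYS = ("a", "b", "c", "d")

def vqa_accuracy(vqa_file: str, predict) -> float:
    """Fraction of questions where the prediction matches "correct".

    `predict` maps (question_text, {"a": ..., "b": ..., "c": ..., "d": ...})
    to one of the keys in CHOICE_KEYS.
    """
    with open(vqa_file) as f:
        items = json.load(f)  # assumed: a list of question objects
    hits = sum(
        predict(item["question"], {k: item[k] for k in CHOICE_KEYS})
        == item["correct"]
        for item in items
    )
    return hits / len(items)
```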
---
## Technical Specifications & Limitations
### Digital Twin Characteristics
- **Environmental Fidelity:** Roads and buildings are a close geometric match to real-world WTS locations.
- **No 3D Gaze:** Unlike the original WTS, 3D gaze and head bounding boxes are not included due to simulation constraints.
- **Character Dynamics:** Poses are simulated and may not perfectly replicate real-world physics.
- **Object Limitations:** Characters do not hold hand-held objects (umbrellas, phones) that may appear in the real-world test set. Labels/VQA have been adjusted accordingly.
---
## Test Set
This dataset includes only the `train` and `val` splits. The test set will be the "internal" (main) subset of the [WTS Dataset](https://github.com/woven-visionai/wts-dataset). Note that WTS also contains a BDD_PC_5K subset in its train/val/test splits; it will not be used in this challenge, since synthetic versions of those scenarios are not included in our training and validation sets.
---
## Release Schedule
- **Initial Release:** 80 scenarios (May 1, 2026)
- **Mid-May Update:** 144 scenarios (May 11, 2026)
- **Final Dataset:** ~249 scenarios total (expected May 25, 2026)
---
## Team & Credits
### Santa Clara University
Dhanishtha Patil, Ridham Kachhadiya, Andrew Vattuone, and David C. Anastasiu
### NVIDIA
Haoquan Liang, Jiajun Li, Yuxing Wang, and Thomas Tang
### Woven by Toyota
Ashutosh Kumar and Quan Kong
**Point of Contact:**
For questions regarding the SynWTS dataset or the AI City Challenge Track 2, please contact:
> David C. Anastasiu
>
> Email: danastasiu@scu.edu
---
## Citation
Please cite the original WTS paper and the 2026 AI City Challenge:
```bibtex
@article{kong2024wts,
title={WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding},
author={Kong, Quan and Kumar, Ashutosh and others},
journal={arXiv preprint arXiv:2407.15350},
year={2024}
}
```
Stay tuned for an updated citation to our dataset paper. |