Hosting 3D Medical Image Datasets on Hugging Face: A Deep Dive into MedVision
- MedVision relies on a remote data loading script.
- `trust_remote_code` is no longer supported in `datasets>=4.0.0`. Please install `datasets` with `pip install datasets==3.6.0`.
Hosting large-scale, complex medical datasets requires more than just uploading files; it demands a robust architecture to handle 3D volumes, diverse modalities, and precise annotations. In this post, we explore how the MedVision dataset leverages Hugging Face's advanced dataset features to manage this complexity.
What is MedVision?
MedVision is a large-scale, multi-anatomy, multi-modality dataset designed for quantitative medical image analysis. It standardizes diverse public datasets (like BraTS24, MSD, OAIZIB-CM) into a unified structure suitable for training massive foundation models.
Key Features
- Automatic Data Handling: Automatic downloading and processing of 3D images.
- Dynamic Slicing: Dynamic loading of 2D slices from local 3D volumes.
- Quantitative Annotations: Detailed annotations including mask size, tumor/lesion size, and angle/distance measurements.
- Dataset Codebase: A dedicated codebase for robust dataset construction.
Deep Dive: The Data Processing Script
The magic behind MedVision's integration with the datasets library lies in its processing script, MedVision.py. This script orchestrates everything from dependency management to dynamic data slicing.
1. Configuration Definition (MedVisionConfig)
Medical datasets often require distinct subsets. For example, a user might need 2D sagittal slices for segmenting target A, but 2D axial slices for target B. To handle this, MedVision implements a custom configuration class inheriting from datasets.BuilderConfig.
How it works:
The MedVisionConfig class defines essential parameters such as taskType, imageType (2D/3D), and imageSliceType (axial/coronal/sagittal) to ensure the correct data view is loaded. In short, the configuration determines what data will be extracted from the raw data storage.
Documentation: Hugging Face BuilderConfig
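To make the pattern concrete, here is a minimal, runnable sketch of a config class carrying those parameters. The field names mirror the ones described above; the base class is a stdlib stand-in for datasets.BuilderConfig (so the snippet runs without the datasets library), and the exact signatures in MedVision.py may differ.

```python
# Stand-in sketch of a MedVisionConfig-style BuilderConfig subclass.
# BuilderConfigSketch mimics datasets.BuilderConfig with stdlib dataclasses;
# the real class in MedVision.py inherits from datasets.BuilderConfig.
from dataclasses import dataclass


@dataclass
class BuilderConfigSketch:  # stand-in for datasets.BuilderConfig
    name: str
    version: str = "1.0.0"
    description: str = ""


@dataclass
class MedVisionConfigSketch(BuilderConfigSketch):
    taskType: str = "segmentation"   # which annotation task to load
    imageType: str = "2D"            # "2D" or "3D"
    imageSliceType: str = "axial"    # axial / coronal / sagittal

# Each named config selects one view of the raw data storage.
cfg = MedVisionConfigSketch(
    name="example_2d_axial",         # illustrative config name
    imageType="2D",
    imageSliceType="axial",
)
print(cfg.imageSliceType)  # -> axial
```

A builder would expose one such config per subset (e.g., sagittal slices for one target, axial for another), and users select it by name when calling load_dataset.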
2. Data Preparation (_split_generators)
The _split_generators method is responsible for downloading the data and organizing it into splits (Train/Test).
Key Features:
- Dataset Codebase: Uses `medvision_ds`, a distinct codebase located in the `src` folder of the repo. It handles the heavy lifting of data downloading, processing, and annotation generation (via benchmark planners). For advanced usage and installation instructions, see the official guide.
- Raw Image Download: Checks the `MedVision_DATA_DIR` environment variable and saves the data there.
- Preprocessing: Invokes specific download scripts to fetch raw image files and standardizes them (e.g., converting to NIfTI format, reorienting to RAS+ orientation).
- Annotation Handling: Loads annotations and metadata directly from the benchmark planners, the JSON files released within the dataset repository.
Documentation: Hugging Face SplitGenerator
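The steps above can be sketched as follows. The `MedVision_DATA_DIR` variable name comes from the post; the function name, return shape, and planner file names are illustrative assumptions, not the actual MedVision.py code (which returns datasets.SplitGenerator objects).

```python
# Sketch of the _split_generators pattern: resolve the data directory from
# an environment variable, then describe the train/test splits. The real
# script additionally runs download/preprocessing scripts here and reads
# annotations from the benchmark-planner JSON files.
import os
from pathlib import Path


def split_generators_sketch():
    # Fall back to a default cache dir if MedVision_DATA_DIR is unset
    # (the default path below is an assumption for illustration).
    data_dir = Path(
        os.environ.get("MedVision_DATA_DIR", "~/.medvision")
    ).expanduser()
    return [
        {"split": "train", "data_dir": str(data_dir), "planner": "train.json"},
        {"split": "test", "data_dir": str(data_dir), "planner": "test.json"},
    ]


splits = split_generators_sketch()
print([s["split"] for s in splits])  # -> ['train', 'test']
```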
3. Data Loading (_generate_examples)
This method yields the actual training samples. For 3D medical images, simply reading a 3D volume file isn't always sufficient, as many current Vision-Language Models (VLMs) operate on 2D image inputs. Therefore, a flexible method to load 2D slices from 3D volumes is essential.
How it works: For a specific dataset configuration, the script iterates through the cases in the benchmark planner. It dynamically processes the data and filters out invalid samples.
Documentation: Hugging Face: Build and Load
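The dynamic-slicing idea can be sketched with a toy volume. The real script reads NIfTI volumes (e.g., via a library like nibabel) and applies the configured slice orientation; the function name and the all-zeros filter below are illustrative assumptions.

```python
# Sketch of dynamic 2D slicing from a 3D volume, the pattern
# _generate_examples uses to feed 2D model inputs. The volume here is a
# toy nested list shaped (depth, height, width); real code would load a
# NIfTI file and pick the axis matching imageSliceType.
def iter_axial_slices(volume):
    """Yield (index, 2D slice) pairs along the first axis, skipping
    slices that are entirely background (all zeros)."""
    for idx, sl in enumerate(volume):
        # Filter out invalid/empty samples, as the post describes.
        if any(any(v != 0 for v in row) for row in sl):
            yield idx, sl


volume = [
    [[0, 0], [0, 0]],   # empty slice -> filtered out
    [[0, 1], [2, 0]],
    [[3, 0], [0, 4]],
]
print([idx for idx, _ in iter_axial_slices(volume)])  # -> [1, 2]
```

In the actual builder, each yielded slice would be packaged with its annotations (masks, measurements) as one training example.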
Data Downloading & Advanced Usage
While the script automates much of the process, some datasets (like SKM-TEA or ToothFairy2) have restrictive licenses that prevent direct automatic downloading. For these, MedVision provides a Data Downloading guide. Users must manually download the raw data, process it using the provided tools, and format it correctly before the Hugging Face script can load it.
- Read more: MedVision Data Downloading Guide
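For restricted datasets, a loading script typically verifies that the user has placed the manually downloaded data in the expected location before proceeding. The helper below is a hypothetical sketch of that check; the function name and directory layout are assumptions, while `MedVision_DATA_DIR` and the dataset names come from the post.

```python
# Hypothetical pre-flight check for manually downloaded datasets
# (e.g., SKM-TEA, ToothFairy2): confirm the raw data directory exists
# under MedVision_DATA_DIR before the loading script runs.
import os
from pathlib import Path


def check_manual_dataset(name: str) -> bool:
    """Return True if the named dataset's directory is present."""
    root = Path(
        os.environ.get("MedVision_DATA_DIR", "~/.medvision")
    ).expanduser()
    return (root / name).is_dir()

# e.g., call check_manual_dataset("SKM-TEA") and raise a helpful error
# pointing at the Data Downloading guide when it returns False.
```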
Key Takeaways for Your Own Datasets
- Use `BuilderConfig` to organize complex datasets with multiple subsets or tasks.
- Automate installation inside `_split_generators` if your dataset requires custom helper code.
- Process dynamically in `_generate_examples` to save disk space and allow for flexible data views (e.g., generating 2D slices from 3D volumes on the fly).
Related Docs
- Details on dataset concepts, the content of the returned dataset, and the dataset building workflow are described here.
