Hosting 3D Medical Image Datasets on Hugging Face: A Deep Dive into MedVision

Community Article Published December 10, 2025

  • MedVision relies on a remote data loading script.
  • trust_remote_code is no longer supported in datasets>=4.0.0. Please install datasets with pip install datasets==3.6.0.

Hosting large-scale, complex medical datasets requires more than just uploading files; it demands a robust architecture to handle 3D volumes, diverse modalities, and precise annotations. In this post, we explore how the MedVision dataset leverages Hugging Face's advanced dataset features to manage this complexity.

๐Ÿฉป What is MedVision?

MedVision is a large-scale, multi-anatomy, multi-modality dataset designed for quantitative medical image analysis. It standardizes diverse public datasets (like BraTS24, MSD, OAIZIB-CM) into a unified structure suitable for training massive foundation models.

โœจ Key Features

  1. ๐Ÿ“ฆ Automatic Data Handling: Automatic downloading and processing of 3D images.
  2. โœ‚๏ธ Dynamic Slicing: Dynamic loading of 2D slices from local 3D volumes.
  3. ๐Ÿ“ Quantitative Annotations: Detailed annotations including mask size, tumor/lesion size, and angle/distance measurements.
  4. ๐Ÿ—๏ธ Dataset Codebase: A dedicated codebase for robust dataset construction.

๐Ÿ› ๏ธ Deep Dive: The Data Processing Script

The magic behind MedVision's integration with the datasets library lies in its processing script, MedVision.py. This script orchestrates everything from dependency management to dynamic data slicing.

1. โš™๏ธ Configuration Definition (MedVisionConfig)

Medical datasets often require distinct subsets. For example, a user might need 2D sagittal slices for segmenting target A, but 2D axial slices for target B. To handle this, MedVision implements a custom configuration class inheriting from datasets.BuilderConfig.

How it works: The MedVisionConfig class defines essential parameters like taskType, imageType (2D/3D), and imageSliceType (axial/coronal/sagittal) to ensure the correct data view is loaded. Basically, data configuration defines what data will be extracted from the raw data storage.

Documentation: Hugging Face BuilderConfig

2. ๐Ÿ—‚๏ธ Data Preparation (_split_generators)

The _split_generators method is responsible for downloading the data and organizing it into splits (Train/Test).

Key Features:

  1. ๐Ÿงฉ Dataset Codebase: Usage of medvision_ds, a distinct codebase located in the src folder of the repo. This handles the heavy lifting of data downloading, processing, and annotation generation (via benchmark planners).

    For advanced usage and installation instructions, see the official guide.

  2. ๐Ÿ“ฅ Raw Image Download: It checks the MedVision_DATA_DIR environment variable and saves the data there.

  3. ๐Ÿ”„ Preprocessing: It invokes specific download scripts to fetch raw image files and standardizes them (e.g., converting to NIfTI format, reorienting to RAS+ orientation).

  4. ๐Ÿ“ Annotation Handling: It loads annotations and metadata directly from the benchmark planners, the JSON files released within the dataset repository.

Documentation: Hugging Face SplitGenerator

3. ๐Ÿ’พ Data Loading (_generate_examples)

This method yields the actual training samples. For 3D medical images, simply reading a 3D volume file isn't always sufficient, as many current Vision-Language Models (VLMs) operate on 2D image inputs. Therefore, a flexible method to load 2D slices from 3D volumes is essential.

How it works: For a specific dataset configuration, the script iterates through the cases in the benchmark planner. It dynamically processes the data and filters out invalid samples.

Documentation: Hugging Face: Build and Load

๐Ÿ“ฅ Data Downloading & Advanced Usage

While the script automates much of the process, some datasets (like SKM-TEA or ToothFairy2) have restrictive licenses that prevent direct automatic downloading. For these, MedVision provides a Data Downloading guide. Users must manually download the raw data, process it using the provided tools, and format it correctly before the Hugging Face script can load it.


๐ŸŒŸ Key Takeaways for Your Own Datasets

  1. Use BuilderConfig to organize complex datasets with multiple subsets or tasks.
  2. Automate Installation inside _split_generators if your dataset requires custom helper code.
  3. Process Dynamically in _generate_examples to save disk space and allow for flexible data views (e.g., generating 2D slices from 3D volumes on the fly).

๐Ÿ“š Related Docs

  • Details in dataset concepts, content of the returned dataset, and the dataset building workflow are described here

Community

Sign up or log in to comment