Hosting 3D Medical Image Datasets on Hugging Face: A Deep Dive into MedVision
- MedVision relies on a remote data loading script.
- `trust_remote_code` is no longer supported in `datasets>=4.0.0`. Please install `datasets` with `pip install datasets==3.6.0`.
Hosting large-scale, complex medical datasets requires more than just uploading files; it demands a robust architecture to handle 3D volumes, diverse modalities, and precise annotations. In this post, we explore how the MedVision dataset leverages Hugging Face's advanced dataset features to manage this complexity.
What is MedVision?
MedVision is a large-scale, multi-anatomy, multi-modality dataset designed for quantitative medical image analysis. It standardizes diverse public datasets (like BraTS24, MSD, OAIZIB-CM) into a unified structure suitable for training massive foundation models.
Key Features
- Automatic Data Handling: Automatic downloading and processing of 3D images.
- Dynamic Slicing: Dynamic loading of 2D slices from local 3D volumes.
- Quantitative Annotations: Detailed annotations including mask size, tumor/lesion size, and angle/distance measurements.
- Dataset Codebase: A dedicated codebase for robust dataset construction.
Deep Dive: The Data Processing Script
The magic behind MedVision's integration with the datasets library lies in its processing script, MedVision.py. This script orchestrates everything from dependency management to dynamic data slicing.
1. Configuration Definition (MedVisionConfig)
Medical datasets often require distinct subsets. For example, a user might need 2D sagittal slices for segmenting target A, but 2D axial slices for target B. To handle this, MedVision implements a custom configuration class inheriting from datasets.BuilderConfig.
How it works:
The MedVisionConfig class defines essential parameters such as taskType, imageType (2D/3D), and imageSliceType (axial/coronal/sagittal) to ensure the correct data view is loaded. In short, the configuration determines what data will be extracted from the raw data storage.
Documentation: Hugging Face BuilderConfig
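To make the pattern concrete, here is a minimal, runnable sketch of a config class carrying those parameters. The field names mirror the ones described above; the base class is a stdlib stand-in for datasets.BuilderConfig (so the snippet runs without the datasets library), and the exact signatures in MedVision.py may differ.

```python
# Stand-in sketch of a MedVisionConfig-style BuilderConfig subclass.
# BuilderConfigSketch mimics datasets.BuilderConfig with stdlib dataclasses;
# the real class in MedVision.py inherits from datasets.BuilderConfig.
from dataclasses import dataclass


@dataclass
class BuilderConfigSketch:  # stand-in for datasets.BuilderConfig
    name: str
    version: str = "1.0.0"
    description: str = ""


@dataclass
class MedVisionConfigSketch(BuilderConfigSketch):
    taskType: str = "segmentation"   # which annotation task to load
    imageType: str = "2D"            # "2D" or "3D"
    imageSliceType: str = "axial"    # axial / coronal / sagittal

# Each named config selects one view of the raw data storage.
cfg = MedVisionConfigSketch(
    name="example_2d_axial",         # illustrative config name
    imageType="2D",
    imageSliceType="axial",
)
print(cfg.imageSliceType)  # -> axial
```

A builder would expose one such config per subset (e.g., sagittal slices for one target, axial for another), and users select it by name when calling load_dataset.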
2. Data Preparation (_split_generators)
The _split_generators method is responsible for downloading the data and organizing it into splits (Train/Test).
Key Features:
- Dataset Codebase: Uses `medvision_ds`, a distinct codebase located in the `src` folder of the repo. It handles the heavy lifting of data downloading, processing, and annotation generation (via benchmark planners). For advanced usage and installation instructions, see the official guide.
- Raw Image Download: Checks the `MedVision_DATA_DIR` environment variable and saves the data there.
- Preprocessing: Invokes specific download scripts to fetch raw image files and standardizes them (e.g., converting to NIfTI format, reorienting to RAS+ orientation).
- Annotation Handling: Loads annotations and metadata directly from the benchmark planners, the JSON files released within the dataset repository.
Documentation: Hugging Face SplitGenerator
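The steps above can be sketched as follows. The `MedVision_DATA_DIR` variable name comes from the post; the function name, return shape, and planner file names are illustrative assumptions, not the actual MedVision.py code (which returns datasets.SplitGenerator objects).

```python
# Sketch of the _split_generators pattern: resolve the data directory from
# an environment variable, then describe the train/test splits. The real
# script additionally runs download/preprocessing scripts here and reads
# annotations from the benchmark-planner JSON files.
import os
from pathlib import Path


def split_generators_sketch():
    # Fall back to a default cache dir if MedVision_DATA_DIR is unset
    # (the default path below is an assumption for illustration).
    data_dir = Path(
        os.environ.get("MedVision_DATA_DIR", "~/.medvision")
    ).expanduser()
    return [
        {"split": "train", "data_dir": str(data_dir), "planner": "train.json"},
        {"split": "test", "data_dir": str(data_dir), "planner": "test.json"},
    ]


splits = split_generators_sketch()
print([s["split"] for s in splits])  # -> ['train', 'test']
```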
3. Data Loading (_generate_examples)
This method yields the actual training samples. For 3D medical images, simply reading a 3D volume file isn't always sufficient, as many current Vision-Language Models (VLMs) operate on 2D image inputs. Therefore, a flexible method to load 2D slices from 3D volumes is essential.
How it works: For a specific dataset configuration, the script iterates through the cases in the benchmark planner. It dynamically processes the data and filters out invalid samples.
Documentation: Hugging Face: Build and Load
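The dynamic-slicing idea can be sketched with a toy volume. The real script reads NIfTI volumes (e.g., via a library like nibabel) and applies the configured slice orientation; the function name and the all-zeros filter below are illustrative assumptions.

```python
# Sketch of dynamic 2D slicing from a 3D volume, the pattern
# _generate_examples uses to feed 2D model inputs. The volume here is a
# toy nested list shaped (depth, height, width); real code would load a
# NIfTI file and pick the axis matching imageSliceType.
def iter_axial_slices(volume):
    """Yield (index, 2D slice) pairs along the first axis, skipping
    slices that are entirely background (all zeros)."""
    for idx, sl in enumerate(volume):
        # Filter out invalid/empty samples, as the post describes.
        if any(any(v != 0 for v in row) for row in sl):
            yield idx, sl


volume = [
    [[0, 0], [0, 0]],   # empty slice -> filtered out
    [[0, 1], [2, 0]],
    [[3, 0], [0, 4]],
]
print([idx for idx, _ in iter_axial_slices(volume)])  # -> [1, 2]
```

In the actual builder, each yielded slice would be packaged with its annotations (masks, measurements) as one training example.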
Data Downloading & Advanced Usage
While the script automates much of the process, some datasets (like SKM-TEA or ToothFairy2) have restrictive licenses that prevent direct automatic downloading. For these, MedVision provides a Data Downloading guide. Users must manually download the raw data, process it using the provided tools, and format it correctly before the Hugging Face script can load it.
- Read more: MedVision Data Downloading Guide
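For restricted datasets, a loading script typically verifies that the user has placed the manually downloaded data in the expected location before proceeding. The helper below is a hypothetical sketch of that check; the function name and directory layout are assumptions, while `MedVision_DATA_DIR` and the dataset names come from the post.

```python
# Hypothetical pre-flight check for manually downloaded datasets
# (e.g., SKM-TEA, ToothFairy2): confirm the raw data directory exists
# under MedVision_DATA_DIR before the loading script runs.
import os
from pathlib import Path


def check_manual_dataset(name: str) -> bool:
    """Return True if the named dataset's directory is present."""
    root = Path(
        os.environ.get("MedVision_DATA_DIR", "~/.medvision")
    ).expanduser()
    return (root / name).is_dir()

# e.g., call check_manual_dataset("SKM-TEA") and raise a helpful error
# pointing at the Data Downloading guide when it returns False.
```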
Key Takeaways for Your Own Datasets
- Use `BuilderConfig` to organize complex datasets with multiple subsets or tasks.
- Automate installation inside `_split_generators` if your dataset requires custom helper code.
- Process dynamically in `_generate_examples` to save disk space and allow for flexible data views (e.g., generating 2D slices from 3D volumes on the fly).
Related Docs
- Details on dataset concepts, the content of the returned dataset, and the dataset building workflow are described here.
