---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---

# MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation

[GitHub](https://github.com/sirine-b/MultiDiffSense) · [Paper](https://arxiv.org/abs/2602.19348) · MIT License

MultiDiffSense is a **ControlNet-based diffusion model** that generates realistic, physically grounded tactile sensor images through dual conditioning on depth maps and text prompts. It translates rendered depth maps of 3D objects, together with text prompts, into tactile sensor outputs across three sensor modalities.

## Model Details

| | |
|---|---|
| **Architecture** | ControlNet built on Stable Diffusion 1.5 |
| **Task** | Depth map + text prompt → multi-modal tactile sensor image generation |
| **Input** | 512x512 depth map (viridis colourmap) + text prompt |
| **Output** | 512x512 tactile sensor image |
| **Training** | ~150 epochs, frozen SD backbone, lr = 1e-5, batch size 8 |
| **Parameters** | ~860M (SD 1.5) + ~360M (ControlNet copy) |
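
The conditioning image is a depth render of the object, colour-mapped with viridis and sized to 512x512 before being fed to the ControlNet branch. The snippet below is a minimal sketch of that preprocessing step; the file names, the normalisation, and the use of NumPy/Matplotlib/Pillow are illustrative assumptions, not the repository's own rendering pipeline.

```python
# Illustrative sketch (not from the MultiDiffSense repo): turn a raw depth array
# into the 512x512 viridis-colourmapped PNG described in the Model Details table.
import numpy as np
from matplotlib import cm
from PIL import Image

depth = np.load("depth.npy")                            # placeholder path: raw depth render, H x W
depth = (depth - depth.min()) / (np.ptp(depth) + 1e-8)  # normalise to [0, 1]

rgba = cm.viridis(depth)                                # apply viridis colourmap -> H x W x 4 floats
rgb = (rgba[..., :3] * 255).astype(np.uint8)            # drop alpha, convert to 8-bit RGB

Image.fromarray(rgb).resize((512, 512)).save("depth_map.png")
```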
## Supported Tactile Sensor Modalities

| Sensor | Description |
|--------|-------------|
| TacTip | Optical tactile sensor with pin-based deformation markers |
| ViTac | Vision-based tactile sensor (no markers) |
| ViTacTip | Combined vision-tactile sensor |
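
Each generation call is conditioned on a structured text prompt whose `sensor_context` field names the target sensor and whose `object_pose` field gives the contact pose (see the Usage examples below). The helper below is a hypothetical convenience for building that prompt string; the function name and signature are not part of the repository, only the JSON schema is taken from the Usage example.

```python
# Hypothetical helper (not part of the repo) for building a MultiDiffSense-style
# structured prompt: sensor description plus object/contact pose.
import json

def build_prompt(sensor_context: str, x: float, y: float, z: float, yaw: float) -> str:
    """Serialise the sensor context and contact pose as a JSON prompt string."""
    return json.dumps({
        "sensor_context": sensor_context,
        "object_pose": {"x": x, "y": y, "z": z, "yaw": yaw},
    })

# Mirrors the single-depth-map example in the Usage section below.
print(build_prompt("captured by a high-resolution vision only sensor ViTac.",
                   x=0.12, y=-0.34, z=1.5, yaw=15.0))
```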
## Files

| File | Description |
|------|-------------|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |

## Usage

Clone the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

# Single depth map:
python multidiffsense/controlnet/generate.py \
  --source_image path/to/depth_map.png \
  --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'

# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
  --dataset_dir datasets \
  --prompt_json datasets/test/prompt_ViTacTip.json
```

See the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.

## Citation

```bibtex
@inproceedings{multidiffsense2026,
  title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
  author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.19348}
}
```

## License

MIT