SsharvienKumar committed
Commit 4217d5b · verified · 1 Parent(s): e5ac407

Upload 3 files

Files changed (3)
  1. README.md +118 -3
  2. checkpoints/todo.txt +0 -0
  3. datasets/todo.txt +0 -0
README.md CHANGED
@@ -1,3 +1,118 @@
- ---
- license: cc-by-4.0
- ---
+ <div id="top" align="center">
+
+ # SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation (MICCAI 2026)
+ Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
+
+ [![arXiv](https://img.shields.io/badge/arXiv-xxxx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/xxxx.xxxxx)
+ [![Homepage](https://img.shields.io/badge/Homepage-Visit-blue)](https://ssharvienkumar.github.io/SWoMo/)
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/SsharvienKumar/SWoMo)
+
+ </div>
+
+ ***This framework lets you use any combination of text, graph, image, and video as conditioning for video synthesis. We provide sample configs for training and inference with each of these combinations. Feel free to use our work for comparisons and to cite it!***
+
+ ***STILL NEED TO UPDATE ARXIV, HUGGINGFACE, GIVE CONTACT INFO***
+
+ ## 🔑 Key Features
+ - SWoMo is a neuro-symbolic world model for surgical simulation that decouples interaction dynamics from visual appearance.
+ - Using an inverse pairing strategy, real surgical videos are reconstructed in a simulator to create paired data for training a video diffusion model for sim-to-real translation, with intermediate scene graphs serving as a constraint regularizer.
+ - We demonstrate improved phase recognition, unsupervised style transfer, and strong generalisation to unseen interaction geometries.
+
+ ![introduction](asset/concept.png)
+
+ ## 🛠 Setup
+ ```bash
+ git clone https://github.com/MECLabTUDA/SWoMo.git
+ cd SWoMo
+ conda env create -f environment.yaml
+ conda activate swomo
+ ```
+
+
+ ## 💾 Dataset Preparation and Annotation Tools
+ We released our interactive SAM2-based annotation tool in a separate repository: [IntrekSAM](https://github.com/MECLabTUDA/IntrekSAM). We found no existing video segmentation annotation tool that is free, open-source, locally deployable, easily modifiable, simple to set up, and supports multi-class segmentation. We therefore rewrote the GUI in Python while keeping the original SAM2 backend.
+
+ We also make our processed Cataract-1K data available on [Hugging Face](https://huggingface.co/SsharvienKumar/SWoMo/tree/main/datasets), including real videos, simulated videos, simulated segmentations, and scene graphs. If you would like to use our **manually annotated segmentations of the real videos (at 16 fps)** for the 1,068 videos from Cataract-1K and 50 videos from CATARACTS, please contact me here [TODO]. I am also happy to share additional annotations described in the paper, such as phase labels and tracking point annotations, upon request.
+
+
+ ## 🏁 Checkpoints
+ Download the checkpoints from [Hugging Face](https://huggingface.co/SsharvienKumar/SWoMo/tree/main/checkpoints) and place them in the [checkpoints](./checkpoints) folder.
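
The download step above could also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the script runs from the root of the cloned repository; the helper names are hypothetical and not part of SWoMo. It relies on `snapshot_download` with `allow_patterns` to fetch only the `checkpoints/` folder of the repository linked above.

```python
from pathlib import Path


def checkpoint_download_kwargs(local_dir="."):
    """Build the arguments for huggingface_hub.snapshot_download.

    The repo id and the `checkpoints/` prefix come from the link above;
    `local_dir` is an assumption (the root of the cloned SWoMo repository).
    """
    return {
        "repo_id": "SsharvienKumar/SWoMo",
        "allow_patterns": ["checkpoints/*"],  # fetch only the checkpoints folder
        "local_dir": str(Path(local_dir)),
    }


def download_checkpoints(local_dir="."):
    """Download the checkpoints so they land in `<local_dir>/checkpoints`."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**checkpoint_download_kwargs(local_dir))


# Example (needs network access):
#   download_checkpoints(".")
```

Because the pattern keeps the `checkpoints/` prefix, the files should end up in the same folder the manual instructions point to.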
+
+
+ ## 💥 Sampling with SWoMo
+ The command below samples with the initial frame, graph, and video as conditioning. For other conditioning combinations, see [./configs/inference](./configs/inference).
+ ```bash
+ python sample.py --inference_config ./configs/inference/inference_img_graph_vid_cataracts.yaml
+ ```
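
To try several conditioning combinations in one go, the sampling command can be generated per config file. This is a rough sketch rather than part of the repository: the helper names are hypothetical, and it assumes the inference configs use a `.yaml` extension and that the script runs from the repository root.

```python
import sys
from pathlib import Path


def build_sampling_command(config_path):
    """Return the `sample.py` call for one inference config as an argument list."""
    return [sys.executable, "sample.py", "--inference_config", str(config_path)]


def list_inference_configs(config_dir="./configs/inference"):
    """Collect the YAML configs in `config_dir` (assumes a `.yaml` extension)."""
    return sorted(Path(config_dir).glob("*.yaml"))


# Example (run from the repository root):
#   import subprocess
#   for cfg in list_inference_configs():
#       subprocess.run(build_sampling_command(cfg), check=True)
```

Only the `--inference_config` flag shown in the command above is used; no config names are assumed beyond what the folder contains.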
+
+ ## ⏳ Training SWoMo
+ **Step 1:** Train Image VQGAN and Segmentation VQGAN (For Graph Encoders)
+ ```bash
+ python swomo/taming/main.py --base configs/vae/config_image_autoencoder_vqgan_cataract.yaml -t --gpus 0, --logdir ./checkpoints/Cataract-1K
+
+ python swomo/taming/main.py --base configs/vae/config_segmentation_autoencoder_vqgan_cataract.yaml -t --gpus 0, --logdir ./checkpoints/Cataract-1K
+ ```
+
+ **Step 2:** Train Another VAE (For Video Diffusion Model)
+ ```bash
+ python swomo/ldm/main.py --base configs/vae/config_autoencoderkl_cataract.yaml -t --gpus 0, --logdir ./checkpoints/Cataract-1K
+
+ # Convert the CompVis VAE to the Diffusers VAE format
+ # IMPORTANT: First upgrade Diffusers to version 0.31.0, run the script, then downgrade back to 0.21.2
+ python scripts/ae_compvis_to_diffuser.py \
+   --vae_pt_path /path/to/checkpoints/last.ckpt \
+   --dump_path /path/to/save/vae_vid_diffusion
+ ```
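
Since the conversion and the rest of the pipeline expect different Diffusers versions, a small guard can catch a forgotten upgrade or downgrade. The sketch below is an illustration, not part of the repository: the function names are hypothetical, and it only compares the installed version string (read via the standard `importlib.metadata`) against the two versions named in the comment above.

```python
from importlib import metadata

CONVERSION_VERSION = "0.31.0"  # version the README asks for when converting
TRAINING_VERSION = "0.21.2"    # version the README asks to return to afterwards


def diffusers_is(required, installed=None):
    """Return True if the installed diffusers version equals `required`.

    `installed` can be passed explicitly (useful for testing); otherwise it is
    looked up from the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("diffusers")
        except metadata.PackageNotFoundError:
            return False  # diffusers is not installed at all
    return installed == required


def require_diffusers(required, installed=None):
    """Raise a clear error if the environment has the wrong diffusers version."""
    if not diffusers_is(required, installed):
        raise RuntimeError(f"diffusers=={required} is required for this step")
```

One possible use is to call `require_diffusers(CONVERSION_VERSION)` before running the conversion script and `require_diffusers(TRAINING_VERSION)` before training.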
+
+ **Step 3:** Train Both Graph Encoders
+ ```bash
+ python train_graph.py --name masked --config configs/graph/graph_cataract.yaml
+ python train_graph.py --name segclip --config configs/graph/graph_cataract.yaml
+ ```
+
+ **Step 4:** Train Video Diffusion Model (Without Video Conditioning)
+
+ Single-GPU Setup
+ ```bash
+ python train.py --config configs/training/training_img_graph_xvid_cataracts.yaml -n swomo_training
+ ```
+
+ Multi-GPU Setup (Single Node)
+ ```bash
+ python -m torch.distributed.run \
+   --nproc_per_node=${GPU_PER_NODE} \
+   --master_addr=127.0.0.1 \
+   --master_port=29501 \
+   --nnodes=1 \
+   --node_rank=0 \
+   train.py \
+   --config configs/training/training_img_graph_xvid_cataracts.yaml \
+   -n swomo_training
+ ```
+
+ **Step 5:** Train ControlNet (For Video Conditioning)
+
+ Set `finetuned_unet_path` in the config to the model trained in Step 4. ControlNet training can also be run in a multi-GPU setup, as in Step 4.
+
+ ```bash
+ python train.py --config configs/training/training_img_graph_vid_cataracts.yaml -n swomo_training
+ ```
+
+
+ ## 📜 Citations
+ If you use SWoMo in your work, please cite the following paper:
+ ```
+ @article{sivakumar2026swomo,
+   title={SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation},
+   author={Sivakumar, Ssharvien Kumar and Johnson, Akwele and Dhingra, Anirudh and Frisch, Yannik and Ghazaei, Ghazal and Mukhopadhyay, Anirban},
+   journal={arXiv preprint arXiv:xxxx.xxxxx},
+   year={2026}
+ }
+ ```
+
+
+ ## ⭐ Acknowledgement
+ Thanks to the following projects and theoretical works that we have used or drawn inspiration from:
+ - [SG2VID](https://github.com/MECLabTUDA/SG2VID)
+ - [VQGAN](https://github.com/CompVis/taming-transformers)
+
checkpoints/todo.txt ADDED
File without changes
datasets/todo.txt ADDED
File without changes