frankjiang committed on
Commit 8bdff1b · verified · 1 Parent(s): c8e38a1

Update README.md


Update detailed readme.

Files changed (1)
  1. README.md +133 -10
README.md CHANGED
@@ -8,27 +8,150 @@ pipeline_tag: robotics

  # FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

- **FantasyVLN** is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations.

- - **Paper:** [FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation](https://huggingface.co/papers/2601.13976)
- - **Project Page:** [https://fantasy-amap.github.io/fantasy-vln/](https://fantasy-amap.github.io/fantasy-vln/)
- - **Code:** [https://github.com/Fantasy-AMAP/fantasy-vln](https://github.com/Fantasy-AMAP/fantasy-vln)

  ## Introduction

- Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences.

- FantasyVLN combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a [VAR](https://github.com/FoundationVision/VAR) model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency, reducing inference latency by an order of magnitude compared to explicit CoT methods.
 
  ## Citation

- If you find this work helpful, please consider citing:

  ```bibtex
- @article{zuo2025fantasyvln,
    title={FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation},
    author={Zuo, Jing and Mu, Lingzhou and Jiang, Fan and Ma, Chengcheng and Xu, Mu and Qi, Yonggang},
- journal={arXiv preprint arXiv:2601.13976},
- year={2025}
  }
  ```
 

  # FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

+ [![Home Page](https://img.shields.io/badge/🌐%20%20Project-FantasyVLN-blue.svg)](https://fantasy-amap.github.io/fantasy-vln/)
+ [![arXiv](https://img.shields.io/badge/Arxiv-2601.13976-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2601.13976)
+ [![Code](https://img.shields.io/badge/Code-GitHub-181717.svg?logo=GitHub)](https://github.com/Fantasy-AMAP/fantasy-vln.git)
+ [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E)](https://huggingface.co/acvlab/FantasyVLN)
+ [![ModelScope](https://img.shields.io/badge/ModelScope-Model-624AFF)](https://modelscope.cn/amap_cvlab/FantasyVLN)

+ This project provides the online evaluation and distributed data parallel training code for **FantasyVLN**. The online evaluation is implemented on top of the [LH-VLN](https://github.com/HCPLab-SYSU/LH-VLN) benchmark, and the training code is built upon [ms-swift](https://github.com/modelscope/ms-swift) and [qwen-vl](https://github.com/QwenLM/Qwen3-VL).

  ## Introduction

+ ![Framework](https://github.com/Fantasy-AMAP/fantasy-vln/blob/main/assets/framework.jpg?raw=true)

+ **FantasyVLN** is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations. **FantasyVLN** combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a [VAR](https://github.com/FoundationVision/VAR) model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency.
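To make the efficiency point concrete, here is a back-of-the-envelope token count. This is an illustrative sketch only: it assumes the scale schedule commonly used by 256px VAR models, and the choice of "first 3 scales" is a hypothetical cutoff, not a documented FantasyVLN setting.

```shell
# Rough token arithmetic (illustrative only): a VAR latent pyramid with the
# common 256px scale schedule holds sum(s*s) tokens over all scales, while a
# low-scale visual CoT predicts only the first few (coarse) scales.
full=0
for s in 1 2 3 4 5 6 8 10 13 16; do
  full=$((full + s * s))
done
low=$((1*1 + 2*2 + 3*3))   # hypothetical cutoff: first 3 scales only
echo "all scales: $full tokens; first 3 scales: $low tokens"
```

Under these assumptions, predicting only the coarse scales emits a small fraction of the tokens a full pixel-level reconstruction would need, which is where the training and inference savings come from.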
+
+ ## Online Evaluation
+ We modify the [LH-VLN](https://github.com/HCPLab-SYSU/LH-VLN) codebase to support VLMs and multi-GPU inference.
+
+ ### Installation
+ You can use the following commands to install the required environment, or refer to the LH-VLN environment setup tutorial for more details.
+ ```bash
+ conda create -n fantasyvln_eval python=3.9
+ conda activate fantasyvln_eval
+ conda install habitat-sim==0.3.1 headless -c conda-forge -c aihabitat
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 xformers
+ pip install -r lhvln/requirements.txt
+ ```
+
+ ### Preparing Data
+
+ **HM3D**
+
+ LH-VLN uses [HM3D](https://aihabitat.org/datasets/hm3d/) as the scene dataset. The required data splits can be downloaded with the commands below. Note that an application must be submitted to [Matterport](https://matterport.com/legal/matterport-end-user-license-agreement-academic-use-model-data) before using the dataset. For more details, please refer to [this link](https://github.com/facebookresearch/habitat-sim/blob/main/DATASETS.md#habitat-matterport-3d-research-dataset-hm3d).
+
+ ```bash
+ python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_train_v0.2
+ python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2
+ ```
+
+ **LH-VLN**
+
+ The LH-VLN dataset is available on [Hugging Face](https://huggingface.co/datasets/Starry123/LHPR-VLN) and [ModelScope](https://modelscope.cn/datasets/starry123/LHPR-VLN). The zipped files included in the downloaded dataset are not required for online evaluation.
+
+ Your final directory structure should look like this:
+
+ ```
+ fantasy-vln/
+ ├── lhvln/
+ │   ├── data/
+ │   │   ├── hm3d/
+ │   │   │   ├── train/
+ │   │   │   ├── val/
+ │   │   │   └── hm3d_annotated_basis.scene_dataset_config.json
+ │   │   ├── task/
+ │   │   │   ├── batch_1/
+ │   │   │   ├── ...
+ │   │   │   └── batch_8/
+ │   │   ├── step_task/
+ │   │   │   ├── batch_1/
+ │   │   │   ├── ...
+ │   │   │   └── batch_8/
+ │   │   └── episode_task/
+ │   │       ├── batch_1.json.gz
+ │   │       ├── ...
+ │   │       └── batch_8.json.gz
+ ```
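Before launching the evaluation, a quick sanity check that the layout above is in place can save a failed run. The helper below is not part of the repo; it simply walks the required paths from the directory tree shown above.

```shell
# Hypothetical helper (not part of the released code): report which required
# evaluation data paths, per the directory tree above, are missing.
check_data_layout() {
  root="$1"
  miss=0
  for p in hm3d/train hm3d/val \
           hm3d/hm3d_annotated_basis.scene_dataset_config.json \
           task step_task episode_task; do
    if [ ! -e "$root/$p" ]; then
      echo "missing: $root/$p"
      miss=$((miss + 1))
    fi
  done
  echo "$miss path(s) missing under $root"
}

check_data_layout "lhvln/data"
```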
+
+ ## Run Evaluation
+
+ ```bash
+ ./eval.sh
+ ```
+ You must specify the following parameters before running the script:
+ - `HAB_GPU_ID`: GPU id used by Habitat-Sim for environment simulation; should be a valid physical GPU and must not overlap with `RUN_GPU_IDS`.
+ - `RUN_GPU_IDS`: Comma-separated list of GPU ids for inference processes; each GPU launches one process and corresponds to a subset of the test data.
+ - `SAVE_PATHS`: Comma-separated list of output directories where logs and evaluation results are saved.
+ - `MODEL_IDS`: Comma-separated list of model checkpoint paths; must have the same length and order as `SAVE_PATHS`.
+
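As an illustration, a hypothetical configuration of these four variables might look like the sketch below; the GPU ids and paths are example values, not defaults, and the length check at the end is our addition, not part of eval.sh.

```shell
# Hypothetical example values for eval.sh -- adjust to your hardware.
HAB_GPU_ID=0                                  # simulator GPU; must not appear in RUN_GPU_IDS
RUN_GPU_IDS="1,2,3"                           # one inference process per GPU
SAVE_PATHS="out/run_a,out/run_b,out/run_c"    # one output dir per checkpoint
MODEL_IDS="ckpts/step_1000,ckpts/step_2000,ckpts/step_3000"

# Sanity check (our addition): SAVE_PATHS and MODEL_IDS must be equally long.
n_save=$(echo "$SAVE_PATHS" | tr ',' '\n' | wc -l)
n_model=$(echo "$MODEL_IDS" | tr ',' '\n' | wc -l)
if [ "$n_save" -ne "$n_model" ]; then
  echo "length mismatch: $n_save save paths vs $n_model models" >&2
  exit 1
fi
echo "config OK: $n_save runs"
```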
+ ## Training
+
+ ### Installation
+ ```bash
+ conda create -n fantasyvln_train python=3.10
+ conda activate fantasyvln_train
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 xformers
+ pip install -r requirements.txt
+ ```
+
+ ### Prepare Training Data
+ You can generate training data by running the following commands:
+ ```bash
+ hf download Starry123/LHPR-VLN batch_{1..8}.zip --repo-type dataset --local-dir ./data/images
+ for z in data/images/batch_*.zip; do unzip -o "$z" -d "${z%.zip}"; done
+
+ # Prepare non-CoT json data
+ python data/prepare_swift_data.py --set_name train --base_dir ./data/images --data_augmentation
+ python data/prepare_swift_data.py --set_name val --base_dir ./data/images --data_augmentation
+
+ # Prepare T-CoT json data
+ python data/prepare_tocot_data.py --excel_path data/tcot_annotations/excel_files --input_jsonl data/json_files/swift_his_20_train_aug.jsonl
+
+ # Prepare V-CoT json data
+ python data/prepare_tocot_data.py --scale_schedule 3 --input_jsonl data/json_files/swift_his_20_train_aug.jsonl
+
+ # Prepare MM-CoT json data
+ python data/prepare_mmcot_data.py --vcot_json_path data/json_files/vcot_swift_his_20_train_aug.jsonl --tcot_json_path data/json_files/tcot_swift_his_20_train_aug.jsonl --save_as_ummcot_format True
+ ```
+ PS: We used Qwen-VL-Max to generate textual CoT annotations for the data in `swift_his_20_train_aug.jsonl`. However, due to data licensing and privacy compliance considerations, we cannot release these annotations publicly. You may reproduce them by following the same procedure (described in our paper).
+
+ The final directory structure should look like this:
+ ```
+ fantasy-vln/
+ ├── data/
+ │   ├── json_files/
+ │   │   ├── swift_his_20_train_aug.jsonl
+ │   │   ├── tcot_swift_his_20_train_aug.jsonl
+ │   │   ├── vcot_swift_his_20_train_aug.jsonl
+ │   │   └── ummcot_swift_his_20_train_aug.jsonl
+ │   └── images/
+ │       ├── batch_1
+ │       ├── batch_2
+ │       ├── batch_3
+ │       ├── batch_4
+ │       ├── batch_5
+ │       ├── batch_6
+ │       ├── batch_7
+ │       └── batch_8
+ ```
+
+ ### Run Training
+ ```bash
+ ./train.sh
+ ```

  ## Citation

+ If you find this work helpful, please consider giving us a ⭐️ and citing:

  ```bibtex
+ @inproceedings{fantasyvln2026zuo,
    title={FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation},
+ shorttitle={FantasyVLN},
    author={Zuo, Jing and Mu, Lingzhou and Jiang, Fan and Ma, Chengcheng and Xu, Mu and Qi, Yonggang},
+ booktitle={Proceedings of the {IEEE}/{CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
+ year={2026}
  }
  ```