# FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

- **Project Page:** [https://fantasy-amap.github.io/fantasy-vln/](https://fantasy-amap.github.io/fantasy-vln/)
- **Paper:** [https://arxiv.org/abs/2601.13976](https://arxiv.org/abs/2601.13976)
- **Code:** [https://github.com/Fantasy-AMAP/fantasy-vln](https://github.com/Fantasy-AMAP/fantasy-vln)
- **Hugging Face:** [https://huggingface.co/acvlab/FantasyVLN](https://huggingface.co/acvlab/FantasyVLN)
- **ModelScope:** [https://modelscope.cn/amap_cvlab/FantasyVLN](https://modelscope.cn/amap_cvlab/FantasyVLN)

This project provides the online evaluation and distributed data parallel training code for **FantasyVLN**. The online evaluation is implemented on top of the [LH-VLN](https://github.com/HCPLab-SYSU/LH-VLN) benchmark, and the training code is built upon [ms-swift](https://github.com/modelscope/ms-swift) and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL).

## Introduction

![FantasyVLN overview](./assets/overview.png)

**FantasyVLN** is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations. **FantasyVLN** combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a [VAR](https://github.com/FoundationVision/VAR) model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency.

## Online Evaluation

We modify the [LH-VLN](https://github.com/HCPLab-SYSU/LH-VLN) codebase to support VLMs and multi-GPU inference.

### Installation

You can use the following commands to install the required environment, or refer to the LH-VLN environment setup tutorial for more details.

```bash
conda create -n fantasyvln_eval python=3.9
conda activate fantasyvln_eval
conda install habitat-sim==0.3.1 headless -c conda-forge -c aihabitat
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 xformers
pip install -r lhvln/requirements.txt
```
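
Habitat-Sim's headless build is a common source of setup problems, so it is worth confirming that the simulator imports cleanly before preparing any data. This quick check is our own suggestion rather than part of the LH-VLN setup guide:

```bash
# Optional sanity check (not part of the LH-VLN guide): confirm that
# habitat-sim and torch import correctly in the fantasyvln_eval environment.
python -c "import habitat_sim, torch; print('habitat-sim OK, torch', torch.__version__)"
```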

### Preparing Data

**HM3D**

LH-VLN uses [HM3D](https://aihabitat.org/datasets/hm3d/) as the scene dataset. The required data splits can be downloaded with the commands below. Note that an application must be submitted to [Matterport](https://matterport.com/legal/matterport-end-user-license-agreement-academic-use-model-data) before using the dataset. For more details, please refer to [this link](https://github.com/facebookresearch/habitat-sim/blob/main/DATASETS.md#habitat-matterport-3d-research-dataset-hm3d).

```bash
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_train_v0.2
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2
```

**LH-VLN**

The LH-VLN dataset is available on [Hugging Face](https://huggingface.co/datasets/Starry123/LHPR-VLN) and [ModelScope](https://modelscope.cn/datasets/starry123/LHPR-VLN). The zipped files included in the downloaded dataset are not required for online evaluation.
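
If you prefer to fetch the annotations from the command line, a minimal sketch using the same `hf` CLI as in the training section is shown below. The exact layout of the dataset repository may differ, so adjust the target paths to match the tree that follows:

```bash
# Sketch (assumed layout): download the LH-VLN annotations from Hugging Face
# into lhvln/data/. The *.zip image batches are only needed for training, so
# they are excluded here; drop --exclude if your hf CLI version lacks it.
hf download Starry123/LHPR-VLN --repo-type dataset --local-dir ./lhvln/data --exclude "*.zip"
```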

Your final directory structure should look like this:

```
fantasy-vln/
└── lhvln/
    └── data/
        ├── hm3d/
        │   ├── train/
        │   ├── val/
        │   └── hm3d_annotated_basis.scene_dataset_config.json
        ├── task/
        │   ├── batch_1/
        │   ├── ...
        │   └── batch_8/
        ├── step_task/
        │   ├── batch_1/
        │   ├── ...
        │   └── batch_8/
        └── episode_task/
            ├── batch_1.json.gz
            ├── ...
            └── batch_8.json.gz
```

## Run Evaluation

```bash
./eval.sh
```

You must specify the following parameters before running the script (a configuration sketch follows the list):

- `HAB_GPU_ID`: GPU id used by Habitat-Sim for environment simulation; must be a valid physical GPU and must not overlap with `RUN_GPU_IDS`.
- `RUN_GPU_IDS`: Comma-separated list of GPU ids for the inference processes; each GPU launches one process and handles a subset of the test data.
- `SAVE_PATHS`: Comma-separated list of output directories where logs and evaluation results are saved.
- `MODEL_IDS`: Comma-separated list of model checkpoint paths; must have the same length and order as `SAVE_PATHS`.
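
As a concrete illustration, the snippet below shows how these variables might be filled in for a 4-GPU machine. All values (and the exact assignment syntax inside `eval.sh`) are placeholders, so check the script for the authoritative format:

```bash
# Hypothetical values for a 4-GPU machine: GPU 0 runs Habitat-Sim, while
# GPUs 1-3 each run one inference process over a slice of the test data.
HAB_GPU_ID=0
RUN_GPU_IDS=1,2,3
SAVE_PATHS=./eval_results/fantasyvln    # one entry per evaluated checkpoint
MODEL_IDS=./checkpoints/fantasyvln      # same length and order as SAVE_PATHS
```
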
## Training

### Installation

```bash
conda create -n fantasyvln_train python=3.10
conda activate fantasyvln_train
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 xformers
pip install -r requirements.txt
```

### Prepare Training Data

You can generate the training data by running the following commands:

```bash
hf download Starry123/LHPR-VLN batch_{1..8}.zip --repo-type dataset --local-dir ./data/images
for z in data/images/batch_*.zip; do unzip -o "$z" -d "${z%.zip}"; done

# Prepare non-CoT json data
python data/prepare_swift_data.py --set_name train --base_dir ./data/images --data_augmentation
python data/prepare_swift_data.py --set_name val --base_dir ./data/images --data_augmentation

# Prepare T-CoT json data
python data/prepare_tocot_data.py --excel_path data/tcot_annotations/excel_files --input_jsonl data/json_files/swift_his_20_train_aug.jsonl

# Prepare V-CoT json data
python data/prepare_tocot_data.py --scale_schedule 3 --input_jsonl data/json_files/swift_his_20_train_aug.jsonl

# Prepare MM-CoT json data
python data/prepare_mmcot_data.py --vcot_json_path data/json_files/vcot_swift_his_20_train_aug.jsonl --tcot_json_path data/json_files/tcot_swift_his_20_train_aug.jsonl --save_as_ummcot_format True
```
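
Before launching training, it can help to verify that all four jsonl files were actually written (file names taken from the directory tree below):

```bash
# Optional check: confirm the four training jsonl files exist and are non-empty.
wc -l data/json_files/{swift,tcot_swift,vcot_swift,ummcot_swift}_his_20_train_aug.jsonl
```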

Note: We used Qwen-VL-Max to generate the textual CoT annotations for the data in `swift_his_20_train_aug.jsonl`. However, due to data licensing and privacy compliance considerations, we cannot release these annotations publicly. You may reproduce them by following the procedure described in our paper.

The final directory structure should look like this:

```
fantasy-vln/
└── data/
    ├── json_files/
    │   ├── swift_his_20_train_aug.jsonl
    │   ├── tcot_swift_his_20_train_aug.jsonl
    │   ├── vcot_swift_his_20_train_aug.jsonl
    │   └── ummcot_swift_his_20_train_aug.jsonl
    └── images/
        ├── batch_1
        ├── batch_2
        ├── batch_3
        ├── batch_4
        ├── batch_5
        ├── batch_6
        ├── batch_7
        └── batch_8
```

### Run Training

```bash
./train.sh
```

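`train.sh` wraps an ms-swift launch, so depending on how the script is wired you may be able to control the data-parallel setup through the usual ms-swift environment variables. A hypothetical invocation (check `train.sh` for the actual mechanism):

```bash
# Hypothetical: restrict training to four GPUs. Whether train.sh honors these
# variables depends on how it invokes ms-swift, so verify against the script.
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 ./train.sh
```
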
## Citation

If you find this work helpful, please consider giving us a ⭐ and citing:

```bibtex
@inproceedings{fantasyvln2026zuo,
  title      = {FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation},
  shorttitle = {FantasyVLN},
  author     = {Zuo, Jing and Mu, Lingzhou and Jiang, Fan and Ma, Chengcheng and Xu, Mu and Qi, Yonggang},
  booktitle  = {Proceedings of the {IEEE}/{CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year       = {2026}
}
```