---
pipeline_tag: image-to-3d
---

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/69672d93bece445e6907b7a2/Ju4n-ceuPYTYlo__v9b7Q.png" width="50%">
</div>
<h2 align="center">AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</h2>

<div align="center">
<a href="https://yutian10.github.io">Yutian Chen</a>,
<a href="https://guoshi28.github.io">Shi Guo</a>,
<a href="https://rbjin.github.io/">Renbiao Jin</a>,
<a href="https://scholar.google.com/citations?user=9b5dE40AAAAJ&hl=en">Tianshuo Yang</a>,
<a href="https://caixin98.github.io/">Xin Cai</a>,
<a href="https://luo0207.github.io/yawenluo/">Yawen Luo</a>,
<a href="">Mingxin Yang</a>,
<a href="https://mulinyu.github.io/">Mulin Yu</a>,
<a href="https://eveneveno.github.io/lnxu/">Linning Xu</a>,
<a href="https://tianfan.info/">Tianfan Xue</a>
</div>

<br>
<p align="center"> <a href='https://yutian10.github.io/AnyRecon/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href="https://arxiv.org/pdf/2604.19747"><img src="https://img.shields.io/static/v1?label=Arxiv&message=AnyRecon&color=red&logo=arxiv"></a>
<a href='https://github.com/OpenImagingLab/AnyRecon'><img src='https://img.shields.io/badge/Github-Code-blue?logo=github'></a>
<a href='https://huggingface.co/Yutian10/AnyRecon/tree/main'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow'></a>
</p>
<p align="center">
<video
src="https://cdn-uploads.huggingface.co/production/uploads/69672d93bece445e6907b7a2/4NgvIKzEeIYeywJZAX2YG.mp4"
autoplay
muted
loop
playsinline
width="100%">
</video>
</p>

## Abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but it remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two captured frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes.


## Environment Setup

```bash
git clone https://github.com/OpenImagingLab/AnyRecon.git
cd AnyRecon
conda create -n anyrecon python=3.10 -y
conda activate anyrecon
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
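
After installation, a quick sanity check confirms that the CUDA build of PyTorch is active. This check is our addition, not part of the repo, and assumes a CUDA 11.8 machine to match the `cu118` wheels above:

```shell
# Should report a cu118 build and True on a working GPU setup
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```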


## Quick Start

### Inference
Run inference with the provided Python script (make sure the required weights have been downloaded and placed in the `./checkpoints` folder):

```bash
python run_AnyRecon.py \
  --root_dir example/valley \
  --output_dir example/valley \
  --lora_path full_attention.ckpt
```
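
To process several capture folders in one go, the single-scene command above can be wrapped in a small Python driver. This is a sketch, not part of the official repo: the `example/` folder layout and the default LoRA filename are taken from the inference command, and `build_command` is a hypothetical helper:

```python
import subprocess
from pathlib import Path

def build_command(scene_dir, lora_path="full_attention.ckpt"):
    """Assemble one run_AnyRecon.py invocation; outputs are written
    back into the scene folder, mirroring the example above."""
    return [
        "python", "run_AnyRecon.py",
        "--root_dir", str(scene_dir),
        "--output_dir", str(scene_dir),
        "--lora_path", lora_path,
    ]

if __name__ == "__main__":
    root = Path("example")
    if root.exists():
        # Run every scene folder under example/ sequentially
        for scene in sorted(p for p in root.iterdir() if p.is_dir()):
            subprocess.run(build_command(scene), check=True)
```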


## Citation
If you find our work helpful, please cite:
```bibtex
@article{chen2026anyrecon,
  title={AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model},
  author={Chen, Yutian and Guo, Shi and Jin, Renbiao and Yang, Tianshuo and Cai, Xin and Luo, Yawen and Yang, Mingxin and Yu, Mulin and Xu, Linning and Xue, Tianfan},
  journal={arXiv preprint arXiv:2604.19747},
  year={2026}
}
```


## Acknowledgments
Thanks to these great repositories: [Wan2.1](https://github.com/Wan-Video/Wan2.1) and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).