
	---
	license: cc-by-nc-nd-4.0
	language:
	- en
	base_model:
	- Wan-AI/Wan2.1-T2V-14B
	library_name: pytorch
	pipeline_tag: text-to-video
	tags:
	- video-generation
	- text-to-video
	- controllable-video-generation
	- generative-rendering
	- neural-rendering
	- computer-graphics
	- 3d-simulation
	- crowd-simulation
	- diffusion
	- wan
	- research
	- non-commercial

	gated: true
	extra_gated_prompt: >-
	  C2R model weights are released for non-commercial research and educational use
	  only under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
	  International (CC BY-NC-ND 4.0) license.

	  By requesting access, you agree that you will use the model weights only for
	  non-commercial research or educational purposes, will not use them for any
	  commercial product or service, will not redistribute the original or modified
	  weights, and will provide proper attribution when using this work.

	extra_gated_fields:
	  "Full name": text
	  "Affiliation": text
	  "Institutional or professional email": text
	  "Intended use":
	    type: select
	    options:
	      - Academic research
	      - Education
	      - Internal non-commercial evaluation
	      - Other non-commercial use
	  "I agree to use the model weights for non-commercial purposes only": checkbox
	  "I agree not to redistribute the original or modified model weights": checkbox
	  "I agree to provide proper attribution if I use this work": checkbox
	---

	<h1 align="center">C2R: Coarse-to-Real</h1>

	<p align="center">
	<a href="https://gonzalogn.com/">Gonzalo Gomez-Nogales</a><sup>1</sup>,
	<a href="https://yiconghong.me/">Yicong Hong</a><sup>2</sup>,
	<a href="https://chongjiange.github.io/">Chongjian Ge</a><sup>2</sup>,
	Peiye Zhuang<sup>3</sup>,
	<a href="https://dancasas.github.io/">Dan Casas</a><sup>1</sup>,
	<a href="https://zhouyisjtu.github.io/">Yi Zhou</a><sup>3</sup>
	</p>

	<p align="center">
	<sup>1</sup>Universidad Rey Juan Carlos
	<sup>2</sup>Adobe Research
	<sup>3</sup>Roblox
	</p>

	<p align="center">
	<a href="https://arxiv.org/abs/2601.22301">
	<img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv Paper">
	</a>
	<a href="https://github.com/GonzaloGNogales/coarse2real">
	<img src="https://img.shields.io/badge/Code-GitHub-black?logo=github" alt="GitHub Code">
	</a>
	</p>

	## Model Summary

	C2R (Coarse-to-Real) is a generative rendering framework that synthesizes realistic urban crowd videos from coarse 3D simulation videos. Given a text prompt and a coarse control video, C2R generates realistic videos while preserving the input scene layout, camera motion, and human trajectories.

	The model is designed for controllable video generation from minimal 3D input. It uses a two-stage synthetic-real domain-hedging strategy: first learning a strong video generative prior from large-scale real footage, then introducing controllability through a small amount of paired synthetic coarse-to-fine data.

	This Hugging Face repository contains the released C2R 14B model weights:

	- C2R DiT backbone checkpoint
	- C2R DINO adapter checkpoint

	The inference code is available in the GitHub repository:

	```bash
	git clone https://github.com/GonzaloGNogales/coarse2real.git
	```

	## Model Details

	- Model name: C2R: Coarse-to-Real
	- Task: Controllable video generation / generative rendering
	- Input: Text prompt + coarse 3D control video
	- Output: Realistic generated video
	- Backbone: Wan2.1 14B
	- Control features: DINOv3-based video features
	- Release type: Inference-only
	- License for weights: CC BY-NC-ND 4.0
	- Access: Gated access required

	## Repository Files

	This model repository provides the C2R-specific checkpoints:

	```text
	c2r-dit-backbone-14B.safetensors
	c2r-dino-adapter.safetensors
	```

	The Wan2.1 14B base model is required separately and should be downloaded from:

	```text
	Wan-AI/Wan2.1-T2V-14B
	```

	C2R uses the Wan2.1 14B base folder for the text encoder, VAE, and tokenizer assets.

	## Installation

	Please use the official C2R inference codebase:

	```bash
	git clone https://github.com/GonzaloGNogales/coarse2real.git
	cd coarse2real

	conda env create -f c2r-setup.yml
	conda activate coarse2real
	```

	The default environment includes the recommended runtime dependencies for inference.

	## Download Weights

	First, download the Wan2.1 14B base weights:

	```bash
	mkdir -p models/wan
	hf download Wan-AI/Wan2.1-T2V-14B \
	  --local-dir models/wan
	```

	Expected Wan2.1 files include:

	```text
	models/wan/models_t5_umt5-xxl-enc-bf16.pth
	models/wan/Wan2.1_VAE.pth
	models/wan/google/umt5-xxl/...
	```

	Then download the C2R DiT backbone:

	```bash
	mkdir -p models/pretrained_dit_backbone
	hf download gonsaBRK/coarse2real c2r-dit-backbone-14B.safetensors \
	  --local-dir models/pretrained_dit_backbone
	```

	Download the C2R DINO adapter:

	```bash
	mkdir -p models/dino_adapter
	hf download gonsaBRK/coarse2real c2r-dino-adapter.safetensors \
	  --local-dir models/dino_adapter
	```

	C2R also uses the DINOv3 backbone `facebook/dinov3-vitb16-pretrain-lvd1689m` for control-video features. For offline or cluster inference, download it locally:

	```bash
	mkdir -p models/dino/dinov3-vitb16-pretrain-lvd1689m
	hf download facebook/dinov3-vitb16-pretrain-lvd1689m \
	  --local-dir models/dino/dinov3-vitb16-pretrain-lvd1689m
	```

	Then set the local path in the inference config:

	```json
	"dino_model_path": "models/dino/dinov3-vitb16-pretrain-lvd1689m"
	```
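
Before launching inference, it can help to confirm that the downloads above landed where this README expects them. The following is a minimal sketch, not part of the C2R codebase: the function name `missing_weight_paths` and its `include_local_dino` flag are hypothetical, and the checked paths are only the ones listed in this section.

```python
from pathlib import Path

# Paths taken from the download commands in this README, relative to the
# root of the cloned coarse2real repository.
REQUIRED_PATHS = [
    "models/wan/models_t5_umt5-xxl-enc-bf16.pth",
    "models/wan/Wan2.1_VAE.pth",
    "models/wan/google/umt5-xxl",
    "models/pretrained_dit_backbone/c2r-dit-backbone-14B.safetensors",
    "models/dino_adapter/c2r-dino-adapter.safetensors",
]

# Only needed when a local DINOv3 copy is used (offline or cluster inference).
LOCAL_DINO_PATH = "models/dino/dinov3-vitb16-pretrain-lvd1689m"


def missing_weight_paths(repo_root=".", include_local_dino=False):
    """Return the expected paths that do not exist under repo_root."""
    expected = list(REQUIRED_PATHS)
    if include_local_dino:
        expected.append(LOCAL_DINO_PATH)
    root = Path(repo_root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_weight_paths(include_local_dino=True)
    if missing:
        print("Missing files or folders:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected weight paths are present.")
```

Run it from the repository root. An empty result only means the paths exist; it does not validate the checkpoint contents.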

	## Usage

	C2R requires:

	- A text prompt
	- A coarse 3D control video
	- The C2R DiT backbone checkpoint
	- The C2R DINO adapter checkpoint
	- The Wan2.1 14B base model assets

	Prompts are read from:

	```text
	inference/c2r-prompts.txt
	```

	Control videos are read from:

	```text
	inference/control_videos
	```

	Supported control video extensions:

	```text
	.mp4 .mov .mkv .avi .webm .m4v
	```
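
To preview which files in the control-video folder would qualify under the extension list above, a small helper like the one below can be used. This is a sketch only: `list_control_videos` is a hypothetical name, and the case-insensitive, non-recursive matching is an assumption that may differ from how the C2R loader actually scans the folder.

```python
from pathlib import Path

# Extensions listed in this README as supported.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm", ".m4v"}


def list_control_videos(folder="inference/control_videos"):
    """Return the sorted files in folder whose extension is supported."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        # Assumption: extensions are compared case-insensitively.
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```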

	## Run Inference

	Single GPU:

	```bash
	bash inference/launch_1gpu.sh
	```

	USP (unified sequence parallelism) multi-GPU, which splits a single generation across multiple GPUs:

	```bash
	bash inference/launch_multigpu_usp.sh
	```

	DP (data parallel) multi-GPU, which generates many results in parallel:

	```bash
	bash inference/launch_multigpu_dp.sh
	```

	You can also run a config directly:

	```bash
	python -m inference.run_inference --config inference/config_1gpu.json
	```

	or with `torchrun`:

	```bash
	torchrun --standalone --nproc_per_node=8 -m inference.run_inference \
	  --config inference/config_multigpu_usp.json
	```
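
If inference is driven from a Python script rather than a shell, the two direct invocations above can be assembled as argument lists. The sketch below is illustrative: `build_inference_command` is a hypothetical helper, and it assumes the USP config accepts GPU counts other than the 8 shown above, which this README does not confirm.

```python
import sys


def build_inference_command(num_gpus=1):
    """Build the argv list for a single-GPU or torchrun USP launch."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        # Mirrors: python -m inference.run_inference --config inference/config_1gpu.json
        return [
            sys.executable, "-m", "inference.run_inference",
            "--config", "inference/config_1gpu.json",
        ]
    # Mirrors the torchrun command above, with a configurable process count.
    return [
        "torchrun", "--standalone", f"--nproc_per_node={num_gpus}",
        "-m", "inference.run_inference",
        "--config", "inference/config_multigpu_usp.json",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root, so that the `inference` package and the relative config paths resolve.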

	## Gradio Demo

	The GitHub codebase also includes a local Gradio demo:

	```bash
	bash inference/launch_gradio.sh
	```

	By default, the demo binds to:

	```text
	127.0.0.1:7860
	```

	For remote cluster usage, open an SSH tunnel from your local machine:

	```bash
	ssh -L 7860:127.0.0.1:7860 your_user@cluster-login-host
	```

	Then open:

	```text
	http://127.0.0.1:7860
	```
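
If the page does not load, one quick check is whether anything is listening on the forwarded port. The following sketch uses only the Python standard library; `is_port_open` is a hypothetical helper name, and the default host and port are the ones stated above.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on your local machine, a `False` result suggests the SSH tunnel is not active. A `True` result only shows that the port accepts connections, not that the demo itself is healthy.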

	## Prompt Enhancement

	C2R supports optional prompt enhancement:

	```json
	"prompt_enhancement_mode": "enhanced"
	```

	This mode uses Qwen3 VLMs/LLMs to describe the control video and fuse that description with the user prompt before generation. It may improve generation quality but adds preprocessing time.

	For fastest inference, use:

	```json
	"prompt_enhancement_mode": "off"
	```
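
Config values such as `prompt_enhancement_mode` or `dino_model_path` can be edited by hand, or set from a script. The sketch below is a minimal example under one assumption that this README does not confirm: that these keys sit at the top level of the JSON config. `update_config` is a hypothetical helper, not part of the C2R codebase.

```python
import json
from pathlib import Path


def update_config(config_path, **overrides):
    """Load a JSON config, apply top-level overrides, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the overridden keys are top-level entries of the config.
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `update_config("inference/config_1gpu.json", prompt_enhancement_mode="off")` would disable prompt enhancement for the single-GPU config. If the real config nests these keys, the helper needs to be adapted accordingly.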

	## Intended Use

	This model is intended for:

	- Non-commercial research
	- Academic evaluation
	- Generative rendering research
	- Controllable video generation research
	- Computer graphics and simulation research
	- Testing coarse-to-real video synthesis from 3D simulation inputs

	## Out-of-Scope Use

	The model weights are not intended for:

	- Commercial use
	- Redistribution of modified versions
	- Production deployment without additional validation
	- Generating misleading, harmful, or deceptive media
	- Use cases that violate the license terms of this model or any upstream dependency

	## Limitations

	This is an inference-only research release. The generated videos may contain visual artifacts, temporal inconsistencies, inaccurate fine details, or deviations from the input prompt. Performance may vary depending on the quality, structure, and domain of the coarse control video.

	The model is optimized for coarse 3D simulation videos of populated urban scenes. Results outside this domain may be less reliable.

	## License

	The model weights in this repository are released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

	Allowed:

	- Use for non-commercial research and education
	- Sharing the original work with proper attribution

	Not allowed:

	- Commercial use
	- Redistribution of modified versions of the model weights

	The inference code is released separately under the PolyForm Noncommercial License 1.0.0 in the GitHub repository.

	Third-party dependencies and base models are subject to their own licenses.

	## Citation

	If you use this work in academic research, please cite:

	```bibtex
	@misc{gomeznogales2026coarsetoreal,
	title = {Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes},
	author = {Gomez-Nogales, Gonzalo and Hong, Yicong and Ge, Chongjian and Zhuang, Peiye and Comino-Trinidad, Marc and Casas, Dan and Zhou, Yi},
	year = {2026},
	eprint = {2601.22301},
	archivePrefix = {arXiv},
	primaryClass = {cs.CV},
	doi = {10.48550/arXiv.2601.22301},
	url = {https://arxiv.org/abs/2601.22301}
	}
	```

	## Contact

	For questions or collaborations, please contact:

	- Gonzalo Gomez-Nogales: [gonzalo.gomez@urjc.es](mailto:gonzalo.gomez@urjc.es)
	- Yi Zhou: [yizhou@roblox.com](mailto:yizhou@roblox.com), [zhouyisjtu2012@gmail.com](mailto:zhouyisjtu2012@gmail.com)