---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
- vision-language-navigation
- VLN
- VLNVerse
- InternVL
- robot-navigation
- sim2real
- embodied-AI
model-index:
- name: RyWorld VLN — Stage 1 Discrete (step 15000)
  results:
  - task:
      type: vision-language-navigation
      name: VLN coarse/val_unseen
    dataset:
      name: VLNVerse coarse/val_unseen
      type: Eyz/VLNVerse_data
      split: val_unseen
    metrics:
    - type: success_rate
      value: 51.14
      name: Success Rate (%)
    - type: spl
      value: 49.22
      name: SPL (%)
    - type: oracle_success_rate
      value: 64.79
      name: Oracle Success Rate (%)
    - type: navigation_error
      value: 3.727
      name: Navigation Error (m)
    - type: ndtw
      value: 0.9445
      name: nDTW
---

# RyWorld VLN — Stage 1 Discrete (step 15000)

**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training
set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using
the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.

## Headline result

On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:

| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |

## Comparison vs VLNVerse paper baselines

Our numbers were obtained inside the official `vlnverse_emr` framework on the same
`coarse/val_unseen` split. Baseline numbers are taken from the VLNVerse paper
([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):

| Method | SR ↑ | SPL ↑ | Δ vs RyWorld |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |

## Architecture

```
Inputs:
  - RGB 256x256 (Isaac live or pre-rendered training video)
  - Instruction text (formal variant)
  - Proprio history, N=8 keyframes
    (body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)])
  - Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
  - Discrete head xattn: 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
  - Stop head xattn: BCE-with-pos_weight on the soft target
    stop_proximity = exp(-d/tau), tau=4.33 aligned to success_radius=3 m

Backbone:  OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
           vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```

Detailed architecture & training in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
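
For concreteness, a minimal sketch of what the proprio pathway does is shown below, assuming a simple two-layer MLP. The class name, hidden width, and activation are illustrative guesses rather than the repo's actual `ProprioProjector`; only the input layout (N=8 keyframes of `[dx, dy, cos(dtheta), sin(dtheta)]`) and the 1024-d output come from the description above.

```python
import torch
import torch.nn as nn

class ProprioProjectorSketch(nn.Module):
    """Illustrative stand-in for the ProprioProjector connector.

    Maps an N=8 keyframe history of body-frame deltas
    [dx, dy, cos(dtheta), sin(dtheta)] to d_model=1024 embeddings,
    one token per keyframe. Hidden width and activation are assumptions.
    """

    def __init__(self, proprio_dim: int = 4, d_model: int = 1024, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, proprio_hist: torch.Tensor) -> torch.Tensor:
        # proprio_hist: (batch, 8, 4) keyframe deltas -> (batch, 8, 1024) proprio tokens
        return self.net(proprio_hist)

tokens = ProprioProjectorSketch()(torch.randn(1, 8, 4))
print(tokens.shape)  # torch.Size([1, 8, 1024])
```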
|
|
## Per-segment performance

SR broken down by reference path length (`shortest_path_length`):

| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0, 5) | 151 | 66.9% | 2.55 |
| [ 5, 8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |

The drop on long paths (>12 m) is the dominant remaining gap; closing it
likely requires either training-time long-horizon planning supervision or a
larger `forward_distance` per high-level action.

## Stop head behavior (151,740 chunk positions)

| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |

`stop_threshold=0.95` was selected via a 4-point sweep (0.88 / 0.92 / 0.95 / 0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause
overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot.
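
Read together with the table, the two fire paths correspond to a stop rule roughly like the sketch below. This is an assumed reconstruction for illustration, not the repo's inference code; the argument names (`action_logits`, `stop_prob`) are placeholders.

```python
import torch

STOP, FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3

def decide_action(action_logits: torch.Tensor,
                  stop_prob: float,
                  stop_threshold: float = 0.95) -> int:
    """Combine the discrete head and the stop head at one chunk position.

    pathA: the 4-way head itself picks Stop (argmax == 0).
    pathB: the stop head overrides when stop_prob >= stop_threshold,
           even though the 4-way head preferred a motion action.
    """
    action = int(torch.argmax(action_logits).item())
    if action == STOP:               # pathA fire (natural stop)
        return STOP
    if stop_prob >= stop_threshold:  # pathB fire (threshold override)
        return STOP
    return action                    # no-stop: Forward / TurnLeft / TurnRight

# The motion head prefers Forward, but the stop head is confident enough to override.
print(decide_action(torch.tensor([0.1, 2.0, 0.3, 0.2]), stop_prob=0.97))  # 0 (Stop)
```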
|
|
## How to use

### 1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

# Make the training repo importable before pulling in the ryworld modules.
sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Base config + production overlay (the same pair used for training).
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```

### 2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse   # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
    --ckpt ckpt_step0015000_final.pt \
    --tag eval_replicate \
    --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline: it records `meta.json` and `per_episode.csv` for each run and appends to `docs/eval_ledger.jsonl`.
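
To re-aggregate the released `eval_results/` artifacts without rerunning Isaac Sim, a small script along these lines is enough. The column names (`success`, `shortest_path_length`, `trajectory_length`, `navigation_error`) are assumptions about the `per_episode.csv` header, so check the file and adjust; SPL follows the standard definition.

```python
import csv

def aggregate(per_episode_csv: str) -> dict:
    """Recompute SR / SPL / NE from a per-episode results file.

    SPL_i = success_i * shortest_i / max(shortest_i, traveled_i)
    Column names are assumptions; adjust to the actual CSV header.
    """
    sr = spl = ne = 0.0
    n = 0
    with open(per_episode_csv, newline="") as f:
        for row in csv.DictReader(f):
            success = float(row["success"])
            shortest = float(row["shortest_path_length"])
            traveled = float(row["trajectory_length"])
            sr += success
            spl += success * shortest / max(shortest, traveled, 1e-6)
            ne += float(row["navigation_error"])
            n += 1
    return {"SR": 100 * sr / n, "SPL": 100 * spl / n, "NE": ne / n, "episodes": n}

# print(aggregate("eval_results/per_episode.csv"))
```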
|
|
## Training data

- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
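
The two objectives combine roughly as in the sketch below. Tensor names, shapes, and the relative weighting between the terms are assumptions; `chunk_size=4`, the 4-way action set, `pos_weight=5.0`, and the soft stop target `exp(-d/tau)` with `tau=4.33` come from the description above (with that tau the target crosses ~0.5 at the 3 m success radius).

```python
import torch
import torch.nn.functional as F

def stage1_loss(action_logits: torch.Tensor,   # (B, 4, 4): chunk_size=4 positions x 4 actions
                action_targets: torch.Tensor,  # (B, 4) int64 labels in {0, 1, 2, 3}
                stop_logits: torch.Tensor,     # (B, 4) raw StopHead logits
                dist_to_goal: torch.Tensor,    # (B, 4) remaining distance to goal in meters
                tau: float = 4.33,
                pos_weight: float = 5.0,
                stop_weight: float = 1.0) -> torch.Tensor:
    """Multi-step CE on the 4-way action head + BCE-with-pos_weight on the StopHead."""
    # Cross-entropy over every chunk position.
    ce = F.cross_entropy(action_logits.flatten(0, 1), action_targets.flatten())
    # Soft stop target: ~1 at the goal, ~0.5 at d = 3 m (success radius), -> 0 far away.
    stop_target = torch.exp(-dist_to_goal / tau)
    bce = F.binary_cross_entropy_with_logits(
        stop_logits, stop_target,
        pos_weight=torch.tensor(pos_weight, device=stop_logits.device))
    return ce + stop_weight * bce
```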

## Files in this repo

| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard `meta.json`, `per_episode.csv`, `server.log.gz` with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |
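
To peek inside the checkpoint without the training repo on your path, a plain `torch.load` is usually enough; how the state dict is packaged (top-level keys, whether optimizer state is included) is an assumption here, so treat this purely as an inspection helper.

```python
import torch

# weights_only=False is needed if the checkpoint pickles config objects
# alongside tensors (an assumption about how it is packaged).
ckpt = torch.load("ckpt_step0015000_final.pt", map_location="cpu", weights_only=False)

if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])
    # If the model weights live under a "model" key, list a few parameter shapes.
    state = ckpt.get("model", ckpt)
    for name, value in list(state.items())[:5]:
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(name, shape)
```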
|
|
## Citation

```bibtex
@misc{ryworld2026,
  title = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year = {2026},
  url = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:

```bibtex
@article{vlnverse2025,
  title = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year = {2025}
}
```

## License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).
|
|