---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
- vision-language-navigation
- VLN
- VLNVerse
- InternVL
- robot-navigation
- sim2real
- embodied-AI
model-index:
- name: RyWorld VLN — Stage 1 Discrete (step 15000)
  results:
  - task:
      type: vision-language-navigation
      name: VLN coarse/val_unseen
    dataset:
      name: VLNVerse coarse/val_unseen
      type: Eyz/VLNVerse_data
      split: val_unseen
    metrics:
    - type: success_rate
      value: 51.14
      name: Success Rate (%)
    - type: spl
      value: 49.22
      name: SPL (%)
    - type: oracle_success_rate
      value: 64.79
      name: Oracle Success Rate (%)
    - type: navigation_error
      value: 3.727
      name: Navigation Error (m)
    - type: ndtw
      value: 0.9445
      name: nDTW
---
# RyWorld VLN — Stage 1 Discrete (step 15000)
**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training
set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using
the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.
## Headline result
On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:
| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
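
For readers unfamiliar with these metrics, the sketch below shows how SR, OSR, SPL and NE are conventionally aggregated from per-episode records. It follows the standard definitions (SPL weights each success by shortest path over the longer of actual and shortest path) and is not the `vlnverse_emr` implementation; the `Episode` class and all of its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    # Hypothetical record layout; not the schema of per_episode.csv.
    success: bool          # agent stopped within the success radius
    oracle_success: bool   # some point on the path was within the radius
    path_length: float     # metres actually travelled
    shortest_path: float   # reference start-to-goal distance in metres (> 0)
    final_distance: float  # distance to goal at the stop, in metres


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate standard VLN metrics; rates are returned in percent."""
    n = len(episodes)
    sr = sum(e.success for e in episodes) / n
    osr = sum(e.oracle_success for e in episodes) / n
    ne = sum(e.final_distance for e in episodes) / n
    # SPL: only successful episodes contribute, scaled by path efficiency.
    spl = sum(
        e.shortest_path / max(e.path_length, e.shortest_path)
        for e in episodes
        if e.success
    ) / n
    return {"SR": 100 * sr, "OSR": 100 * osr, "SPL": 100 * spl, "NE": ne}


if __name__ == "__main__":
    demo = [
        Episode(True, True, 10.0, 8.0, 1.0),
        Episode(False, True, 5.0, 5.0, 4.0),
    ]
    print(summarize(demo))
```

Because SPL can never exceed SR, the small gap between the two headline numbers (51.14% vs 49.22%) indicates that successful episodes follow paths close to the reference length.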
## Comparison vs VLNVerse paper baselines
RyWorld was run inside the official `vlnverse_emr` framework on the same `coarse/val_unseen`
split; baseline numbers are taken from the VLNVerse paper ([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):
| Method | SR ↑ | SPL ↑ | Δ vs RyWorld (SR / SPL, pp) |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |
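
The Δ column is the baseline value minus the RyWorld value, in percentage points, for SR and then SPL. A minimal sketch that recomputes it from the numbers in the table (the variable and function names are illustrative only):

```python
from typing import Dict, Tuple

# (SR %, SPL %) copied from the comparison table.
RYWORLD: Tuple[float, float] = (51.14, 49.22)
BASELINES: Dict[str, Tuple[float, float]] = {
    "CMA (VLN-CE)": (32.15, 29.06),
    "Seq2Seq": (31.91, 29.68),
    "HNR": (36.02, 33.67),
    "RDP": (41.61, 37.53),
    "GAMA (paper SOTA)": (42.45, 38.89),
}


def delta_vs_ryworld(baseline: Tuple[float, float]) -> Tuple[float, float]:
    """Baseline minus RyWorld, in percentage points, for (SR, SPL)."""
    return tuple(round(b - r, 2) for b, r in zip(baseline, RYWORLD))


if __name__ == "__main__":
    for name, scores in BASELINES.items():
        d_sr, d_spl = delta_vs_ryworld(scores)
        print(f"{name:<20} {d_sr:+.2f} / {d_spl:+.2f}")
```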
## Architecture
```
Inputs:
  - RGB 256x256 (Isaac live or pre-rendered training video)
  - Instruction text (formal variant)
  - Proprio history N=8 keyframes
    (body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)])
  - Previous action class history
    (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
  - Discrete head xattn: 4-way CE
    (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
  - Stop head xattn: BCE-with-pos_weight
    soft target stop_proximity = exp(-d/tau)
    tau=4.33 aligned to success_radius=3 m

Backbone:  OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
           vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```
Detailed architecture & training in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
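
As an illustration of the proprio input, the sketch below shows one plausible way to turn a sequence of world-frame poses into the `[dx, dy, cos(dtheta), sin(dtheta)]` features listed above. It is an assumption, not the repository's code: the pose layout, the choice to express each delta in the frame of the earlier keyframe, and the "no motion" padding are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical pose layout: (x, y, theta) in the world frame, theta in radians.
Pose = Tuple[float, float, float]


def body_frame_delta(prev: Pose, curr: Pose) -> List[float]:
    """Motion from prev to curr, expressed in the body frame of prev."""
    x0, y0, th0 = prev
    x1, y1, th1 = curr
    wx, wy = x1 - x0, y1 - y0
    c, s = math.cos(th0), math.sin(th0)
    dx = c * wx + s * wy    # displacement along the previous heading
    dy = -s * wx + c * wy   # displacement to the left of the previous heading
    dth = th1 - th0
    # cos/sin encoding avoids the discontinuity at +/- pi.
    return [dx, dy, math.cos(dth), math.sin(dth)]


def proprio_history(poses: Sequence[Pose], n: int = 8) -> List[List[float]]:
    """Last n keyframe deltas, left-padded with 'no motion' rows (assumed)."""
    deltas = [body_frame_delta(a, b) for a, b in zip(poses, poses[1:])]
    deltas = deltas[-n:]
    pad = [[0.0, 0.0, 1.0, 0.0] for _ in range(n - len(deltas))]
    return pad + deltas


if __name__ == "__main__":
    path = [(0.0, 0.0, math.pi / 2), (0.0, 1.0, math.pi / 2), (0.0, 1.0, math.pi)]
    for row in proprio_history(path):
        print([round(v, 3) for v in row])
```

Each row would then be mapped to a 1024-dimensional embedding by the ProprioProjector before entering the backbone.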
## Per-segment performance
SR broken down by reference path length (`shortest_path_length`):
| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0, 5) | 151 | 66.9% | 2.55 |
| [ 5, 8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |
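
The drop on long paths (12 m and above) is the dominant remaining gap; the
[18, 30) bucket holds only 2 episodes, so its 50.0% SR is not statistically
meaningful. Closing the gap likely requires either training-time long-horizon
planning supervision or a larger `forward_distance` per high-level action.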
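
As a consistency check, the per-segment rows can be recombined into episode-weighted averages and compared with the headline figures. The sketch below uses only the numbers in the table; small differences are expected because the per-segment values are rounded.

```python
# (label, episodes, SR %, NE m) copied from the per-segment table.
SEGMENTS = [
    ("[ 0,  5)", 151, 66.9, 2.55),
    ("[ 5,  8)", 360, 59.4, 3.07),
    ("[ 8, 12)", 226, 42.9, 4.33),
    ("[12, 18)", 96, 14.6, 6.58),
    ("[18, 30)", 2, 50.0, 4.55),
]

total_episodes = sum(n for _, n, _, _ in SEGMENTS)
# Episode-weighted means recover the overall SR and NE.
overall_sr = sum(n * sr for _, n, sr, _ in SEGMENTS) / total_episodes
overall_ne = sum(n * ne for _, n, _, ne in SEGMENTS) / total_episodes

if __name__ == "__main__":
    print(f"episodes: {total_episodes}")
    print(f"weighted SR: {overall_sr:.2f}%  (headline 51.14%)")
    print(f"weighted NE: {overall_ne:.3f} m (headline 3.727 m)")
```

The segments sum to the 835 benchmark episodes, and the weighted SR and NE land within rounding distance of the headline values.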
## Stop head behavior (151,740 chunk-positions)
| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |

`stop_threshold=0.95` was selected via a 4-point sweep (0.88 / 0.92 / 0.95 / 0.97)
on a 30-episode smoke subset before the full run. Thresholds of 0.97 and above
cause overshoot regressions on the long-path segment, and 0.95 performed best
in that sweep.
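
The pathA / pathB split in the table suggests a two-stage stop rule: the discrete head may pick Stop on its own, and otherwise the stop head can override when its probability crosses the threshold. The sketch below is one reading of that rule, not the repository's code; the function name, the order of the checks, and the use of `>=` rather than `>` are assumptions.

```python
from typing import Sequence, Tuple

STOP = 0  # action ids from this card: 0=Stop, 1=Forward, 2=TurnLeft, 3=TurnRight


def select_action(
    action_logits: Sequence[float],
    stop_prob: float,
    stop_threshold: float = 0.95,
) -> Tuple[int, str]:
    """Return (action id, which path produced it)."""
    best = max(range(len(action_logits)), key=lambda i: action_logits[i])
    if best == STOP:
        return STOP, "pathA"    # discrete head chose Stop by itself
    if stop_prob >= stop_threshold:
        return STOP, "pathB"    # stop head overrides the discrete head
    return best, "no-stop"


if __name__ == "__main__":
    print(select_action([2.0, 0.5, 0.1, 0.1], stop_prob=0.40))
    print(select_action([0.1, 2.0, 0.1, 0.1], stop_prob=0.96))
    print(select_action([0.1, 2.0, 0.1, 0.1], stop_prob=0.90))
```

Under this reading, a p90 stop probability of 0.897 sits below the 0.95 threshold, which is consistent with the override firing on fewer than 1% of chunk positions.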
## How to use
### 1. Load the checkpoint
```python
import sys

import torch
from omegaconf import OmegaConf

# Make the source repo importable (adjust to your local checkout).
sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Merge the base config with the 4x A100 overlay; later files take precedence.
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.train(False)  # inference mode
```
### 2. Evaluate on VLNVerse
```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES
bash scripts/eval/run_eval_structured.sh \
--ckpt ckpt_step0015000_final.pt \
--tag eval_replicate \
--stop-thr 0.95
```
See `scripts/eval/run_eval_structured.sh` for the eval pipeline, which records `meta.json` and `per_episode.csv` and appends to `docs/eval_ledger.jsonl`.
## Training data
- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Training ran on 4x A100 80 GB GPUs with `chunk_size=4` multi-step CE supervision
plus StopHead BCE (`pos_weight=5.0`, `tau=4.33`).
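
To make the StopHead objective concrete, the sketch below computes the soft target `stop_proximity = exp(-d/tau)` and a positively weighted binary cross-entropy on a single logit. It is written in plain Python for readability and is an assumption about the loss shape, not the training code; the weighting follows the usual `pos_weight` convention (as in `torch.nn.BCEWithLogitsLoss`), and the function names are hypothetical. With `tau=4.33` (approximately 3 / ln 2), the target works out to about 0.5 at the 3 m success radius.

```python
import math


def stop_proximity(distance_m: float, tau: float = 4.33) -> float:
    """Soft stop target: 1.0 at the goal, decaying with distance."""
    return math.exp(-distance_m / tau)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(
    logit: float, target: float, pos_weight: float = 5.0
) -> float:
    """-(pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))."""
    log_p = -softplus(-logit)        # log(sigmoid(x))
    log_not_p = -softplus(logit)     # log(1 - sigmoid(x))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)


if __name__ == "__main__":
    for d in (0.0, 1.0, 3.0, 6.0, 12.0):
        y = stop_proximity(d)
        loss = weighted_bce_with_logits(0.0, y)
        print(f"d={d:5.1f} m  target={y:.3f}  loss@logit=0: {loss:.3f}")
```

The weight on the positive term penalizes under-predicting stop proximity more heavily than over-predicting it, which may help given that Stop is a rare event per chunk position.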
## Files in this repo
| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |
## Citation
```bibtex
@misc{ryworld2026,
title = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
author = {{wei.tao, RUYi Dynamics}},
year = {2026},
month = may,
day = {13},
url = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```
If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:
```bibtex
@article{vlnverse2025,
title = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
author = {Sihao Yu and Yuxuan Zhang and others},
journal = {arXiv preprint arXiv:2512.19021},
year = {2025}
}
```
## License
Apache-2.0 (model weights & code).
Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).