---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
  - vision-language-navigation
  - VLN
  - VLNVerse
  - InternVL
  - robot-navigation
  - sim2real
  - embodied-AI
model-index:
  - name: RyWorld VLN - Stage 1 Discrete (step 15000)
    results:
      - task:
          type: vision-language-navigation
          name: VLN coarse/val_unseen
        dataset:
          name: VLNVerse coarse/val_unseen
          type: Eyz/VLNVerse_data
          split: val_unseen
        metrics:
          - type: success_rate
            value: 51.14
            name: Success Rate (%)
          - type: spl
            value: 49.22
            name: SPL (%)
          - type: oracle_success_rate
            value: 64.79
            name: Oracle Success Rate (%)
          - type: navigation_error
            value: 3.727
            name: Navigation Error (m)
          - type: ndtw
            value: 0.9445
            name: nDTW
---

# RyWorld VLN — Stage 1 Discrete (step 15000)

**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. It is trained on the VLNVerse coarse and fine
training splits and evaluated on the official 835-episode `coarse/val_unseen`
benchmark using the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.

## Headline result

On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:

| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
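
For readers unfamiliar with the metrics, the sketch below shows how SR and SPL are conventionally computed from per-episode records, using the standard SPL definition (success weighted by the ratio of shortest-path length to the longer of the two paths). The `Episode` record and its field names are hypothetical and are not the column names of `per_episode.csv`; the four toy episodes are illustrative, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record layout, for illustration only.
    success: bool         # agent stopped within the success radius
    path_length: float    # metres actually travelled by the agent
    shortest_path: float  # shortest-path length from start to goal, in metres


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.path_length, e.shortest_path)
    return total / len(episodes)


# Toy episodes (assumed values, all lengths strictly positive).
episodes = [
    Episode(success=True, path_length=5.0, shortest_path=5.0),
    Episode(success=True, path_length=10.0, shortest_path=5.0),
    Episode(success=False, path_length=4.0, shortest_path=8.0),
    Episode(success=True, path_length=6.0, shortest_path=6.0),
]
print(success_rate(episodes), spl(episodes))
```

Because SPL can never exceed SR, a small gap between the two (51.14% vs 49.22% here) indicates that successful episodes follow paths close to the shortest one.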

## Comparison vs VLNVerse paper baselines

Reproduced inside the official `vlnverse_emr` framework on the same `coarse/val_unseen`
split. Baseline numbers are taken from the VLNVerse paper ([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):

| Method | SR ↑ | SPL ↑ | Δ vs RyWorld (SR / SPL, pp) |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% |  −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% |  **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |

## Architecture

```
Inputs:                              Outputs (per chunk position, chunk_size=4):
- RGB 256x256 (Isaac live or         - Discrete head xattn: 4-way CE
  pre-rendered training video)         (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Instruction text (formal variant)  - Stop head xattn: BCE-with-pos_weight
- Proprio history N=8 keyframes        soft target stop_proximity = exp(-d/tau)
  (body-frame deltas [dx, dy,          tau=4.33 aligned to success_radius=3 m
   cos(dtheta), sin(dtheta)])
- Previous action class history
  (decision-point keyframe selector)

Backbone: OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
          vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```

The detailed architecture and training setup are described in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
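
As an illustration of the proprio input format, the sketch below computes one body-frame delta `[dx, dy, cos(dtheta), sin(dtheta)]` between two consecutive keyframe poses. The `(x, y, theta)` world-frame pose layout, the counter-clockwise angle convention, and the function name `body_frame_delta` are assumptions made for this example; the repository may use different conventions.

```python
import math


def body_frame_delta(prev, curr):
    """Express the motion from `prev` to `curr` in the body frame of `prev`.

    Both poses are assumed to be (x, y, theta) in a world frame,
    with theta in radians, measured counter-clockwise.
    """
    x0, y0, t0 = prev
    x1, y1, t1 = curr
    wx, wy = x1 - x0, y1 - y0
    # Rotate the world-frame displacement by -t0 into the previous body frame.
    dx = math.cos(t0) * wx + math.sin(t0) * wy
    dy = -math.sin(t0) * wx + math.cos(t0) * wy
    dtheta = t1 - t0
    # Encoding the heading change as (cos, sin) avoids the wrap-around at +/- pi.
    return [dx, dy, math.cos(dtheta), math.sin(dtheta)]


# Agent faces +y and moves 1 m straight ahead: pure forward motion in body frame.
forward = body_frame_delta((0.0, 0.0, math.pi / 2), (0.0, 1.0, math.pi / 2))
# Agent turns left by 90 degrees in place.
turn = body_frame_delta((0.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2))
print(forward, turn)
```

A history of N=8 such 4-vectors would then be what the ProprioProjector maps to the 1024-dimensional embedding.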

## Per-segment performance

SR broken down by reference path length (`shortest_path_length`):

| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0,  5) | 151 | 66.9% | 2.55 |
| [ 5,  8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) |  96 | 14.6% | 6.58 |
| [18, 30) |   2 | 50.0% | 4.55 |

The drop on long paths (≥12 m) is the dominant remaining gap; the [18, 30) bin
holds only 2 episodes, so its 50.0% SR carries little statistical weight.
Closing the gap likely requires either training-time long-horizon planning
supervision or a larger `forward_distance` per high-level action.

## Stop head behavior (151,740 chunk-positions)

| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |

`stop_threshold=0.95` was selected via a 4-point sweep (0.88/0.92/0.95/0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause
overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot.
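
One way to read the pathA / pathB rows is as a two-stage stop rule: stop if the discrete head's argmax is Stop, otherwise stop if the stop head's probability crosses the threshold. The sketch below encodes that reading. The function name, the precedence of pathA over pathB, and the use of `>=` for the comparison are assumptions, not confirmed details of the released code.

```python
STOP, FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3


def decide_action(action_logits, stop_prob, stop_threshold=0.95):
    """Return (action, path), where path is "A", "B", or None.

    action_logits: four scores from the discrete head.
    stop_prob:     probability from the separate stop head.
    """
    argmax = max(range(len(action_logits)), key=action_logits.__getitem__)
    if argmax == STOP:
        return STOP, "A"   # natural stop: discrete head chose Stop
    if stop_prob >= stop_threshold:
        return STOP, "B"   # threshold override by the stop head
    return argmax, None    # keep moving


# At the median stop_prob (0.752) the override does not fire.
print(decide_action([0.1, 2.0, 0.3, 0.2], stop_prob=0.752))
print(decide_action([0.1, 2.0, 0.3, 0.2], stop_prob=0.96))
print(decide_action([3.0, 1.0, 0.0, 0.0], stop_prob=0.10))
```

Under this reading, a threshold of 0.95 sits above the reported p90 of 0.897, which is consistent with the override firing on well under 1% of chunk-positions.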

## How to use

### 1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

# Make the source repo importable (adjust the path to your checkout).
sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Merge the base training config with the 4x A100 overlay.
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode (equivalent to model.train(False))
```

### 2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse  # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
  --ckpt ckpt_step0015000_final.pt \
  --tag eval_replicate \
  --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline. It records `meta.json` and `per_episode.csv`, and appends an entry to `docs/eval_ledger.jsonl`.

## Training data

- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
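
The stop-head supervision can be illustrated in a few lines. The soft target `exp(-d/tau)` is taken from the architecture summary; interpreting "aligned to success_radius=3 m" as "the target equals about 0.5 at d = 3 m" (that is, tau ≈ 3 / ln 2) is an inference from the numbers, not a statement from the source. The loss below follows the usual positive-weighted binary cross-entropy on logits, written in plain Python so it runs without PyTorch; function names are hypothetical.

```python
import math

TAU = 4.33          # from the model card
POS_WEIGHT = 5.0    # from the model card
SUCCESS_RADIUS = 3.0


def stop_proximity(distance_m, tau=TAU):
    """Soft stop target: 1.0 at the goal, decaying with distance."""
    return math.exp(-distance_m / tau)


def weighted_bce_with_logits(logit, target, pos_weight=POS_WEIGHT):
    """-(pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))."""
    # Numerically stable log(sigmoid(x)).
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    log_one_minus_sig = log_sig - logit  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1.0 - target) * log_one_minus_sig)


at_radius = stop_proximity(SUCCESS_RADIUS)
print(round(at_radius, 3))                    # close to 0.5
print(round(SUCCESS_RADIUS / math.log(2), 2))  # 4.33
print(weighted_bce_with_logits(0.0, at_radius))
```

With this choice of tau, a stop probability above 0.5 corresponds roughly to "inside the success radius", which gives the inference-time `stop_threshold` a direct geometric meaning.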

## Files in this repo

| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |

## Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  month  = {5},
  day    = {13},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:

```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```

## License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).