---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
- vision-language-navigation
- VLN
- VLNVerse
- InternVL
- robot-navigation
- sim2real
- embodied-AI
model-index:
- name: RyWorld VLN Stage 1 Discrete (step 15000)
results:
- task:
type: vision-language-navigation
name: VLN coarse/val_unseen
dataset:
name: VLNVerse coarse/val_unseen
type: Eyz/VLNVerse_data
split: val_unseen
metrics:
- type: success_rate
value: 51.14
name: Success Rate (%)
- type: spl
value: 49.22
name: SPL (%)
- type: oracle_success_rate
value: 64.79
name: Oracle Success Rate (%)
- type: navigation_error
value: 3.727
name: Navigation Error (m)
- type: ndtw
value: 0.9445
name: nDTW
---
# RyWorld VLN — Stage 1 Discrete (step 15000)
**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training
set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using
the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.
## Headline result
On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:
| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
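SR and SPL follow the standard VLN definitions (success within the 3 m radius; SPL weights success by path efficiency). A minimal sketch of how the headline numbers aggregate from per-episode records; the field names here are assumptions, not the exact `per_episode.csv` schema:
```python
def aggregate_sr_spl(episodes, success_radius=3.0):
    """Standard VLN metrics from per-episode records.

    Each record is assumed to carry:
      nav_error   - final distance to goal (m)
      path_length - trajectory length actually taken (m)
      shortest    - shortest-path length to goal (m)
    """
    sr = spl = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= success_radius          # SR: binary success
        sr += success
        # SPL weights success by efficiency: shortest / max(taken, shortest)
        spl += success * ep["shortest"] / max(ep["path_length"], ep["shortest"])
    n = len(episodes)
    return 100 * sr / n, 100 * spl / n

print(aggregate_sr_spl([
    {"nav_error": 1.2, "path_length": 7.0, "shortest": 6.0},
    {"nav_error": 4.8, "path_length": 5.0, "shortest": 5.5},
]))  # -> (50.0, ~42.9)
```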
## Comparison vs VLNVerse paper baselines
Our result was produced inside the official `vlnverse_emr` framework on the same `coarse/val_unseen`
split; baseline numbers are taken from the VLNVerse paper ([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):
| Method | SR ↑ | SPL ↑ | Δ (SR / SPL) vs RyWorld |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |
## Architecture
```
Inputs:
  - RGB 256x256 (Isaac live, or pre-rendered training video)
  - Instruction text (formal variant)
  - Proprio history, N=8 keyframes: body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)]
  - Previous-action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
  - Discrete head (xattn): 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
  - Stop head (xattn): BCE with pos_weight, soft target stop_proximity = exp(-d/tau),
    tau=4.33 aligned to success_radius=3 m

Backbone:  OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
           vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024-d embedding)
```
Detailed architecture and training notes are in `docs/RYWORLD_ARCHITECTURE.md` in the source repo.
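The stop-head supervision is compact enough to restate in code. A minimal sketch, assuming a per-chunk-position distance-to-goal `d`; the function and variable names are illustrative, not the repo's actual API. Note that tau = 4.33 ≈ 3/ln 2, so the soft target crosses 0.5 exactly at the 3 m success radius:
```python
import torch
import torch.nn.functional as F

TAU = 4.33         # ~ 3 / ln(2): target crosses 0.5 at the 3 m success radius
POS_WEIGHT = 5.0   # up-weights the rare near-goal positives

def stop_head_loss(stop_logits, dist_to_goal):
    """BCE against the soft target stop_proximity = exp(-d / tau)."""
    target = torch.exp(-dist_to_goal / TAU)
    return F.binary_cross_entropy_with_logits(
        stop_logits, target, pos_weight=torch.tensor(POS_WEIGHT)
    )

# At d = 3 m (the success radius) the soft target is ~0.5:
print(torch.exp(torch.tensor(-3.0 / TAU)))  # tensor(0.5001)
```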
## Per-segment performance
SR broken down by reference path length (`shortest_path_length`):
| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0, 5) | 151 | 66.9% | 2.55 |
| [ 5, 8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |
The drop on long paths (>12 m) is the dominant remaining gap; addressing it
likely requires either training-time long-horizon planning supervision or
larger `forward_distance` per high-level action.
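The breakdown above can be regenerated from the released per-episode artifacts; a sketch assuming `per_episode.csv` carries `shortest_path_length`, `success`, and `nav_error` columns (the column names are assumptions):
```python
import pandas as pd

df = pd.read_csv("eval_results/per_episode.csv")
# Same half-open buckets as the table: [0,5), [5,8), [8,12), [12,18), [18,30)
df["segment"] = pd.cut(df["shortest_path_length"], [0, 5, 8, 12, 18, 30], right=False)
print(df.groupby("segment", observed=True).agg(
    n=("success", "size"),
    SR=("success", lambda s: 100 * s.mean()),
    NE=("nav_error", "mean"),
))
```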
## Stop head behavior (151,740 chunk-positions)
| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |
`stop_threshold=0.95` was selected via a 4-point sweep (0.88/0.92/0.95/0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause
overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot.
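The pathA/pathB split corresponds to the two ways a Stop can fire at inference time. A sketch of the decision rule as described above (the exact integration with the eval loop lives in the repo; names here are illustrative):
```python
STOP = 0  # action ids: 0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight

def decide_action(action_logits, stop_prob, stop_threshold=0.95):
    """Combine the discrete head and the stop head at one chunk position."""
    action = int(action_logits.argmax())
    if action == STOP:
        return STOP, "pathA"   # natural stop: argmax == Stop (2.51%)
    if stop_prob >= stop_threshold:
        return STOP, "pathB"   # threshold override (0.68%)
    return action, "no-stop"   # continue navigating (96.81%)
```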
## How to use
### 1. Load the checkpoint
```python
import sys

import torch
from omegaconf import OmegaConf

sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")

from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Production config = base stage-1 config + 4x A100 overlay
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
# strict_model=False tolerates missing/unexpected keys during load
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```
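A quick post-load sanity check using only plain PyTorch (no repo-specific APIs), e.g. to confirm the frozen-tower / LoRA split took effect:
```python
n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_total / 1e9:.2f}B params total, {n_train / 1e6:.1f}M trainable")
# Expect a small trainable fraction: vision tower frozen, LoRA r=8 + mlp1 trainable
```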
### 2. Evaluate on VLNVerse
```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES
bash scripts/eval/run_eval_structured.sh \
--ckpt ckpt_step0015000_final.pt \
--tag eval_replicate \
--stop-thr 0.95
```
See `scripts/eval/run_eval_structured.sh` for the full eval pipeline; it records `meta.json` and `per_episode.csv` and appends a summary line to `docs/eval_ledger.jsonl`.
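To compare runs later, the ledger is plain JSONL and can be read back with standard tooling; a sketch, with field names assumed rather than taken from the actual ledger schema:
```python
import json

with open("docs/eval_ledger.jsonl") as f:
    runs = [json.loads(line) for line in f if line.strip()]

for run in runs:
    # Field names are assumptions; inspect one ledger line to confirm.
    print(run.get("tag"), run.get("success_rate"), run.get("spl"))
```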
## Training data
- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text
Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
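The body-frame proprio deltas listed above follow from consecutive world-frame poses by rotating the translation into the previous body frame; a minimal sketch (the repo's decision-point keyframe selection is not reproduced here):
```python
import math

def body_frame_delta(pose_prev, pose_curr):
    """Proprio feature [dx, dy, cos(dtheta), sin(dtheta)] for one keyframe step.

    Poses are (x, y, theta) in the world frame; rotating the translation into
    the previous body frame makes the feature viewpoint-invariant.
    """
    x0, y0, th0 = pose_prev
    x1, y1, th1 = pose_curr
    wx, wy = x1 - x0, y1 - y0
    dx = math.cos(th0) * wx + math.sin(th0) * wy   # forward component
    dy = -math.sin(th0) * wx + math.cos(th0) * wy  # lateral component
    dth = th1 - th0
    return [dx, dy, math.cos(dth), math.sin(dth)]

# One step forward, then a 90-degree left turn:
print(body_frame_delta((0, 0, 0), (1, 0, math.pi / 2)))  # [1.0, 0.0, ~0, 1.0]
```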
## Files in this repo
| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |
## Citation
```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```
If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:
```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```
## License
Apache-2.0 (model weights & code).
Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).