---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
- vision-language-navigation
- VLN
- VLNVerse
- InternVL
- robot-navigation
- sim2real
- embodied-AI
model-index:
- name: RyWorld VLN — Stage 1 Discrete (step 15000)
  results:
  - task:
      type: vision-language-navigation
      name: VLN coarse/val_unseen
    dataset:
      name: VLNVerse coarse/val_unseen
      type: Eyz/VLNVerse_data
      split: val_unseen
    metrics:
    - type: success_rate
      value: 51.14
      name: Success Rate (%)
    - type: spl
      value: 49.22
      name: SPL (%)
    - type: oracle_success_rate
      value: 64.79
      name: Oracle Success Rate (%)
    - type: navigation_error
      value: 3.727
      name: Navigation Error (m)
    - type: ndtw
      value: 0.9445
      name: nDTW
---

# RyWorld VLN — Stage 1 Discrete (step 15000)

**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training
set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using
the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.

## Headline result

On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:

| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |

## Comparison vs VLNVerse paper baselines

Our numbers were obtained inside the official `vlnverse_emr` framework on the same
`coarse/val_unseen` split. Baseline numbers are taken from the VLNVerse paper
([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):

| Method | SR ↑ | SPL ↑ | Δ vs RyWorld |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |

## Architecture

```
Inputs:
  - RGB 256x256 (Isaac live or pre-rendered training video)
  - Instruction text (formal variant)
  - Proprio history, N=8 keyframes
    (body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)])
  - Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
  - Discrete head xattn: 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
  - Stop head xattn: BCE-with-pos_weight on the soft target
    stop_proximity = exp(-d/tau), tau=4.33 aligned to success_radius=3 m

Backbone:  OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
           vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```

Detailed architecture & training in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
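
For concreteness, a minimal sketch of what the proprio pathway does is shown below, assuming a simple two-layer MLP. The class name, hidden width, and activation are illustrative guesses rather than the repo's actual `ProprioProjector`; only the input layout (N=8 keyframes of `[dx, dy, cos(dtheta), sin(dtheta)]`) and the 1024-d output come from the description above.

```python
import torch
import torch.nn as nn

class ProprioProjectorSketch(nn.Module):
    """Illustrative stand-in for the ProprioProjector connector.

    Maps an N=8 keyframe history of body-frame deltas
    [dx, dy, cos(dtheta), sin(dtheta)] to d_model=1024 embeddings,
    one token per keyframe. Hidden width and activation are assumptions.
    """

    def __init__(self, proprio_dim: int = 4, d_model: int = 1024, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, proprio_hist: torch.Tensor) -> torch.Tensor:
        # proprio_hist: (batch, 8, 4) keyframe deltas -> (batch, 8, 1024) proprio tokens
        return self.net(proprio_hist)

tokens = ProprioProjectorSketch()(torch.randn(1, 8, 4))
print(tokens.shape)  # torch.Size([1, 8, 1024])
```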
|
|
## Per-segment performance

SR broken down by reference path length (`shortest_path_length`):

| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0, 5) | 151 | 66.9% | 2.55 |
| [ 5, 8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |

The drop on long paths (>12 m) is the dominant remaining gap; closing it
likely requires either training-time long-horizon planning supervision or a
larger `forward_distance` per high-level action.

## Stop head behavior (151,740 chunk positions)

| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |

`stop_threshold=0.95` was selected via a 4-point sweep (0.88 / 0.92 / 0.95 / 0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause
overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot.
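
Read together with the table, the two fire paths correspond to a stop rule roughly like the sketch below. This is an assumed reconstruction for illustration, not the repo's inference code; the argument names (`action_logits`, `stop_prob`) are placeholders.

```python
import torch

STOP, FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3

def decide_action(action_logits: torch.Tensor,
                  stop_prob: float,
                  stop_threshold: float = 0.95) -> int:
    """Combine the discrete head and the stop head at one chunk position.

    pathA: the 4-way head itself picks Stop (argmax == 0).
    pathB: the stop head overrides when stop_prob >= stop_threshold,
           even though the 4-way head preferred a motion action.
    """
    action = int(torch.argmax(action_logits).item())
    if action == STOP:               # pathA fire (natural stop)
        return STOP
    if stop_prob >= stop_threshold:  # pathB fire (threshold override)
        return STOP
    return action                    # no-stop: Forward / TurnLeft / TurnRight

# The motion head prefers Forward, but the stop head is confident enough to override.
print(decide_action(torch.tensor([0.1, 2.0, 0.3, 0.2]), stop_prob=0.97))  # 0 (Stop)
```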
|
|
## How to use

### 1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

# Make the training repo importable before pulling in the ryworld modules.
sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Base config + production overlay (the same pair used for training).
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```

### 2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse   # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
    --ckpt ckpt_step0015000_final.pt \
    --tag eval_replicate \
    --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline: it records `meta.json` and `per_episode.csv` for each run and appends to `docs/eval_ledger.jsonl`.
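
To re-aggregate the released `eval_results/` artifacts without rerunning Isaac Sim, a small script along these lines is enough. The column names (`success`, `shortest_path_length`, `trajectory_length`, `navigation_error`) are assumptions about the `per_episode.csv` header, so check the file and adjust; SPL follows the standard definition.

```python
import csv

def aggregate(per_episode_csv: str) -> dict:
    """Recompute SR / SPL / NE from a per-episode results file.

    SPL_i = success_i * shortest_i / max(shortest_i, traveled_i)
    Column names are assumptions; adjust to the actual CSV header.
    """
    sr = spl = ne = 0.0
    n = 0
    with open(per_episode_csv, newline="") as f:
        for row in csv.DictReader(f):
            success = float(row["success"])
            shortest = float(row["shortest_path_length"])
            traveled = float(row["trajectory_length"])
            sr += success
            spl += success * shortest / max(shortest, traveled, 1e-6)
            ne += float(row["navigation_error"])
            n += 1
    return {"SR": 100 * sr / n, "SPL": 100 * spl / n, "NE": ne / n, "episodes": n}

# print(aggregate("eval_results/per_episode.csv"))
```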
|
|
## Training data

- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
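
The two objectives combine roughly as in the sketch below. Tensor names, shapes, and the relative weighting between the terms are assumptions; `chunk_size=4`, the 4-way action set, `pos_weight=5.0`, and the soft stop target `exp(-d/tau)` with `tau=4.33` come from the description above (with that tau the target crosses ~0.5 at the 3 m success radius).

```python
import torch
import torch.nn.functional as F

def stage1_loss(action_logits: torch.Tensor,   # (B, 4, 4): chunk_size=4 positions x 4 actions
                action_targets: torch.Tensor,  # (B, 4) int64 labels in {0, 1, 2, 3}
                stop_logits: torch.Tensor,     # (B, 4) raw StopHead logits
                dist_to_goal: torch.Tensor,    # (B, 4) remaining distance to goal in meters
                tau: float = 4.33,
                pos_weight: float = 5.0,
                stop_weight: float = 1.0) -> torch.Tensor:
    """Multi-step CE on the 4-way action head + BCE-with-pos_weight on the StopHead."""
    # Cross-entropy over every chunk position.
    ce = F.cross_entropy(action_logits.flatten(0, 1), action_targets.flatten())
    # Soft stop target: ~1 at the goal, ~0.5 at d = 3 m (success radius), -> 0 far away.
    stop_target = torch.exp(-dist_to_goal / tau)
    bce = F.binary_cross_entropy_with_logits(
        stop_logits, stop_target,
        pos_weight=torch.tensor(pos_weight, device=stop_logits.device))
    return ce + stop_weight * bce
```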

## Files in this repo

| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard `meta.json`, `per_episode.csv`, `server.log.gz` with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |
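
To peek inside the checkpoint without the training repo on your path, a plain `torch.load` is usually enough; how the state dict is packaged (top-level keys, whether optimizer state is included) is an assumption here, so treat this purely as an inspection helper.

```python
import torch

# weights_only=False is needed if the checkpoint pickles config objects
# alongside tensors (an assumption about how it is packaged).
ckpt = torch.load("ckpt_step0015000_final.pt", map_location="cpu", weights_only=False)

if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])
    # If the model weights live under a "model" key, list a few parameter shapes.
    state = ckpt.get("model", ckpt)
    for name, value in list(state.items())[:5]:
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(name, shape)
```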
|
|
## Citation

```bibtex
@misc{ryworld2026,
  title = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year = {2026},
  url = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:

```bibtex
@article{vlnverse2025,
  title = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year = {2025}
}
```

## License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).
|
|