---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
- vision-language-navigation
- VLN
- VLNVerse
- InternVL
- robot-navigation
- sim2real
- embodied-AI
model-index:
- name: RyWorld VLN — Stage 1 Discrete (step 15000)
  results:
  - task:
      type: vision-language-navigation
      name: VLN coarse/val_unseen
    dataset:
      name: VLNVerse coarse/val_unseen
      type: Eyz/VLNVerse_data
      split: val_unseen
    metrics:
    - type: success_rate
      value: 51.14
      name: Success Rate (%)
    - type: spl
      value: 49.22
      name: SPL (%)
    - type: oracle_success_rate
      value: 64.79
      name: Oracle Success Rate (%)
    - type: navigation_error
      value: 3.727
      name: Navigation Error (m)
    - type: ndtw
      value: 0.9445
      name: nDTW
---

# RyWorld VLN — Stage 1 Discrete (step 15000)

**Vision-language navigation policy** built on InternVL3.5-1B with a separate StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.

## Headline result

On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:

| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |

## Comparison vs VLNVerse paper baselines

Reproduced inside the official `vlnverse_emr` framework on the same `coarse/val_unseen` split. Baseline numbers are from the VLNVerse paper ([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):

| Method | SR ↑ | SPL ↑ | Δ vs RyWorld (SR / SPL) |
|--------|------|-------|-------------------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |

## Architecture

```
Inputs:
- RGB 256x256 (Isaac live or pre-rendered training video)
- Instruction text (formal variant)
- Proprio history, N=8 keyframes
  (body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)])
- Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
- Discrete head xattn: 4-way CE
  (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Stop head xattn: BCE-with-pos_weight,
  soft target stop_proximity = exp(-d/tau),
  tau=4.33 aligned to success_radius=3 m

Backbone:  OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
           vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```

Detailed architecture and training notes are in `docs/RYWORLD_ARCHITECTURE.md` of the source repo. A minimal sketch of the stop-head soft target is given below, after the per-segment table.

## Per-segment performance

SR broken down by reference path length (`shortest_path_length`):

| Path length (m) | n | SR | NE (m) |
|-----------------|-----|-------|------|
| [0, 5) | 151 | 66.9% | 2.55 |
| [5, 8) | 360 | 59.4% | 3.07 |
| [8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |

The drop on long paths (>12 m) is the dominant remaining gap; closing it likely requires either training-time long-horizon planning supervision or a larger `forward_distance` per high-level action.
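## Stop supervision sketch

To make the stop supervision concrete, here is a minimal PyTorch sketch of the soft target and loss described in the Architecture section. Only `tau=4.33`, `pos_weight=5.0`, and the 3 m success radius come from this card; the function names and tensor shapes are hypothetical. Note that `exp(-3/4.33) ≈ 0.50`, so the soft target crosses 0.5 exactly at the success radius, which is what "tau=4.33 aligned to success_radius=3 m" means.

```python
import torch
import torch.nn.functional as F

# Values from this card; everything else is a hypothetical reconstruction.
TAU = 4.33         # exp(-3 / 4.33) ~= 0.5: target crosses 0.5 at the 3 m success radius
POS_WEIGHT = 5.0   # up-weights near-goal positives in the BCE

def stop_soft_target(dist_to_goal_m: torch.Tensor) -> torch.Tensor:
    """Soft stop label in (0, 1]: 1.0 at the goal, decaying exponentially with distance."""
    return torch.exp(-dist_to_goal_m / TAU)

def stop_head_loss(stop_logits: torch.Tensor, dist_to_goal_m: torch.Tensor) -> torch.Tensor:
    """BCE-with-logits against the soft target (sketch of the StopHead objective)."""
    target = stop_soft_target(dist_to_goal_m)
    return F.binary_cross_entropy_with_logits(
        stop_logits, target, pos_weight=torch.tensor(POS_WEIGHT)
    )

# At 0 m, 3 m, and 10 m from the goal: targets ~= 1.00, 0.50, 0.10
print(stop_soft_target(torch.tensor([0.0, 3.0, 10.0])))
```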
## Stop head behavior (151,740 chunk positions)

| Statistic | Value |
|-----------|-------|
| `stop_prob` median | 0.752 |
| `stop_prob` p90 | 0.897 |
| Path A fire (argmax == Stop, natural) | 2.51% |
| Path B fire (threshold override) | 0.68% |
| No stop | 96.81% |

`stop_threshold=0.95` was selected via a 4-point sweep (0.88 / 0.92 / 0.95 / 0.97) on a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot. (A sketch of the two stop-firing paths is given in the appendix at the end of this card.)

## How to use

### 1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")

from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Merge the base training config with the production overlay.
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)

model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```

### 2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse   # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
    --ckpt ckpt_step0015000_final.pt \
    --tag eval_replicate \
    --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline (it records `meta.json` and `per_episode.csv`, and appends to `docs/eval_ledger.jsonl`).

## Training data

- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision plus StopHead BCE (`pos_weight=5.0`, `tau=4.33`).

## Files in this repo

| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard `meta.json`, `per_episode.csv`, `server.log.gz` with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |

## Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying benchmark paper:

```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```

## License

Apache-2.0 (model weights & code). Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the [VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).
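## Appendix: stop decision rule (sketch)

For readers replicating the eval: a minimal sketch of the two stop-firing paths ("Path A" / "Path B") reported in the stop-head statistics above. The action indices and `stop_threshold=0.95` come from this card; the function itself and its exact placement in the eval loop are a hypothetical reconstruction, not the repo's actual code.

```python
import torch

STOP, FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3
STOP_THRESHOLD = 0.95  # value used for the headline eval

def decide_action(action_logits: torch.Tensor, stop_prob: torch.Tensor) -> int:
    """Two-path stop rule (hypothetical reconstruction).

    Path A: the discrete head itself picks Stop (argmax == Stop).
    Path B: the StopHead overrides when stop_prob >= threshold.
    Otherwise: execute the discrete head's argmax action.
    """
    action = int(action_logits.argmax())
    if action == STOP:                      # Path A: natural stop
        return STOP
    if float(stop_prob) >= STOP_THRESHOLD:  # Path B: threshold override
        return STOP
    return action
```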