---
license: apache-2.0
base_model: OpenGVLab/InternVL3_5-1B
pipeline_tag: robotics
library_name: pytorch
tags:
  - vision-language-navigation
  - VLN
  - VLNVerse
  - InternVL
  - robot-navigation
  - sim2real
  - embodied-AI
model-index:
  - name: RyWorld VLN — Stage 1 Discrete (step 15000)
    results:
      - task:
          type: vision-language-navigation
          name: VLN coarse/val_unseen
        dataset:
          name: VLNVerse coarse/val_unseen
          type: Eyz/VLNVerse_data
          split: val_unseen
        metrics:
          - type: success_rate
            value: 51.14
            name: Success Rate (%)
          - type: spl
            value: 49.22
            name: SPL (%)
          - type: oracle_success_rate
            value: 64.79
            name: Oracle Success Rate (%)
          - type: navigation_error
            value: 3.727
            name: Navigation Error (m)
          - type: ndtw
            value: 0.9445
            name: nDTW
---

# RyWorld VLN — Stage 1 Discrete (step 15000)

**Vision-language navigation policy** built on InternVL3.5-1B with a separate
StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training
set; evaluated on the official `coarse/val_unseen` 835-episode benchmark using
the [vlnverse_emr](https://github.com/sihaoevery/vlnverse) evaluation framework.

## Headline result

On the full **VLNVerse coarse/val_unseen (835 episodes)** with `stop_threshold = 0.95`:

| Metric | Value |
|--------|-------|
| **Success Rate (SR)** | **51.14%** |
| **SPL** | **49.22%** |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
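For readers unfamiliar with these metrics, the sketch below shows the standard VLN definitions of SR, OSR, SPL and NE. It is not taken from `vlnverse_emr`: the `Episode` fields and the `summarize` helper are hypothetical names, and the 3 m radius is the `success_radius` quoted in the Architecture section. The framework's exact conventions (for example `<` versus `<=` at the radius) may differ.

```python
from dataclasses import dataclass

SUCCESS_RADIUS_M = 3.0  # success_radius quoted in the Architecture section


@dataclass
class Episode:  # hypothetical container, not a vlnverse_emr type
    final_dist_m: float    # distance from the stop position to the goal
    min_dist_m: float      # closest distance to the goal at any step
    path_len_m: float      # length of the path the agent travelled
    shortest_len_m: float  # shortest-path length from start to goal


def summarize(episodes):
    """Average SR, OSR, SPL (all in %) and NE (in metres) over episodes."""
    n = len(episodes)
    sr = osr = spl = ne = 0.0
    for ep in episodes:
        success = ep.final_dist_m <= SUCCESS_RADIUS_M  # assumed: inclusive
        sr += success
        osr += ep.min_dist_m <= SUCCESS_RADIUS_M
        ne += ep.final_dist_m
        if success:
            # SPL weights each success by path efficiency in (0, 1].
            spl += ep.shortest_len_m / max(ep.path_len_m, ep.shortest_len_m)
    return {
        "SR": 100.0 * sr / n,
        "OSR": 100.0 * osr / n,
        "SPL": 100.0 * spl / n,
        "NE": ne / n,
    }
```

Because SPL multiplies each success by an efficiency term of at most 1, it can never exceed SR; the small gap between the two values above indicates that successful trajectories stay close to the shortest path.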

## Comparison vs VLNVerse paper baselines

RyWorld is evaluated inside the official `vlnverse_emr` framework on the same `coarse/val_unseen`
split. Baseline numbers are taken from the VLNVerse paper ([arXiv:2512.19021](https://arxiv.org/abs/2512.19021), Table 3):

| Method | SR ↑ | SPL ↑ | Δ vs RyWorld (SR / SPL, pp) |
|--------|------|-------|--------------|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% |  −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% |  **−8.69 / −10.33** |
| **RyWorld @ thr=0.95 (this model)** | **51.14%** | **49.22%** | — |
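The Δ column is the baseline value minus the RyWorld value, in percentage points, for SR and SPL respectively. A small sketch that recomputes it from the table (the `delta` helper is a hypothetical name):

```python
RYWORLD = (51.14, 49.22)  # (SR, SPL) in percent, from the table above

BASELINES = {  # (SR, SPL) in percent, from the table above
    "CMA (VLN-CE)": (32.15, 29.06),
    "Seq2Seq": (31.91, 29.68),
    "HNR": (36.02, 33.67),
    "RDP": (41.61, 37.53),
    "GAMA": (42.45, 38.89),
}


def delta(baseline, ours=RYWORLD):
    """Baseline minus ours, in percentage points, rounded to 2 decimals."""
    return tuple(round(b - o, 2) for b, o in zip(baseline, ours))


for name, scores in BASELINES.items():
    d_sr, d_spl = delta(scores)
    print(f"{name:14s} SR {d_sr:+.2f} pp   SPL {d_spl:+.2f} pp")
```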

## Architecture

```
Inputs:                              Outputs (per chunk position, chunk_size=4):
- RGB 256x256 (Isaac live or         - Discrete head xattn: 4-way CE
  pre-rendered training video)         (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Instruction text (formal variant)  - Stop head xattn: BCE-with-pos_weight
- Proprio history N=8 keyframes        soft target stop_proximity = exp(-d/tau)
  (body-frame deltas [dx, dy,          tau=4.33 aligned to success_radius=3 m
   cos(dtheta), sin(dtheta)])
- Previous action class history
  (decision-point keyframe selector)

Backbone: OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
          vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024 embedding)
```

The architecture and training setup are described in detail in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
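As an illustration of the proprio input, the sketch below computes one body-frame delta `[dx, dy, cos(dtheta), sin(dtheta)]` between two consecutive keyframe poses. It is an assumption-laden sketch, not the repo's code: the `(x, y, theta)` pose layout, the choice of the earlier keyframe as the reference frame, and the `body_frame_delta` name are all hypothetical.

```python
import math


def body_frame_delta(prev_pose, curr_pose):
    """Express the motion from prev_pose to curr_pose in prev_pose's frame.

    Poses are (x, y, theta) in a world frame, theta in radians (assumed).
    Returns [dx, dy, cos(dtheta), sin(dtheta)].
    """
    x0, y0, th0 = prev_pose
    x1, y1, th1 = curr_pose
    wx, wy = x1 - x0, y1 - y0  # displacement in the world frame
    # Rotate the world displacement by -th0 to get body-frame components.
    dx = math.cos(th0) * wx + math.sin(th0) * wy
    dy = -math.sin(th0) * wx + math.cos(th0) * wy
    dth = th1 - th0
    # Encoding the heading change as (cos, sin) avoids the wrap-around
    # discontinuity at +/- pi.
    return [dx, dy, math.cos(dth), math.sin(dth)]


# A robot facing world +y that moves 1 m along +y has moved 1 m "forward".
print(body_frame_delta((0.0, 0.0, math.pi / 2), (0.0, 1.0, math.pi / 2)))
```

A history of N=8 such 4-vectors would then be what a module like ProprioProjector maps to the 1024-dimensional embedding.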

## Per-segment performance

SR broken down by reference path length (`shortest_path_length`):

| Path length (m) | n | SR | NE (m) |
|-----------------|---|----|--------|
| [ 0,  5) | 151 | 66.9% | 2.55 |
| [ 5,  8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) |  96 | 14.6% | 6.58 |
| [18, 30) |   2 | 50.0% | 4.55 |

The drop on long paths (12 m and above) is the dominant remaining gap; addressing it
likely requires either training-time long-horizon planning supervision or a
larger `forward_distance` per high-level action. The [18, 30) bin contains only
2 episodes, so its SR is not statistically meaningful.
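The binning can be sketched as follows, together with a consistency check that the per-bin rows add up to the headline result. The bin edges and the `(n, SR)` pairs are copied from the table; the `bucket` helper is a hypothetical name and not part of the eval scripts.

```python
from bisect import bisect_right

EDGES = [0, 5, 8, 12, 18, 30]  # bin edges in metres, from the table above


def bucket(shortest_path_length):
    """Return the half-open bin (lo, hi) with lo <= length < hi."""
    i = bisect_right(EDGES, shortest_path_length) - 1
    if i < 0 or i >= len(EDGES) - 1:
        raise ValueError("path length outside the tabulated range")
    return (EDGES[i], EDGES[i + 1])


# (n, SR %) per bin, copied from the table above
TABLE = {
    (0, 5): (151, 66.9),
    (5, 8): (360, 59.4),
    (8, 12): (226, 42.9),
    (12, 18): (96, 14.6),
    (18, 30): (2, 50.0),
}

total_n = sum(n for n, _ in TABLE.values())
overall_sr = sum(n * sr for n, sr in TABLE.values()) / total_n
print(total_n, round(overall_sr, 2))
```

The bin counts sum to 835, and the count-weighted SR matches the headline 51.14% up to the rounding of the per-bin values.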

## Stop head behavior (151,740 chunk-positions)

| Statistic | Value |
|-----------|-------|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |

`stop_threshold=0.95` was selected with a 4-point sweep (0.88, 0.92, 0.95, 0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) caused
overshoot regressions on the long-path segment, so 0.95 was the best setting in this sweep.
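One way to read the two stop paths in the table is sketched below: path A fires when the discrete head's argmax is already Stop, and path B fires when the StopHead probability reaches the threshold and overrides a non-Stop action. This is an interpretation of the table labels, not the repo's implementation; the `decide` function, its return format, and the use of `>=` at the threshold are assumptions.

```python
STOP = 0  # action ids: 0=Stop, 1=Forward, 2=TurnLeft, 3=TurnRight


def decide(action_logits, stop_prob, stop_threshold=0.95):
    """Pick an action for one chunk position and report which path fired.

    action_logits: 4 scores from the discrete head.
    stop_prob:     probability from the StopHead, in [0, 1].
    Returns (action_id, path) with path in {"pathA", "pathB", "no-stop"}.
    """
    best = max(range(len(action_logits)), key=lambda i: action_logits[i])
    if best == STOP:
        return STOP, "pathA"   # discrete head chose Stop on its own
    if stop_prob >= stop_threshold:
        return STOP, "pathB"   # StopHead overrides a non-Stop action
    return best, "no-stop"


print(decide([2.0, 1.0, 0.0, 0.0], stop_prob=0.10))
print(decide([0.0, 3.0, 0.0, 0.0], stop_prob=0.96))
print(decide([0.0, 3.0, 0.0, 0.0], stop_prob=0.75))
```

Under this reading, a median `stop_prob` of 0.752 and a p90 of 0.897 both sit below 0.95, which is consistent with the override firing on well under 1% of chunk positions.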

## How to use

### 1. Load the checkpoint

```python
import torch
from omegaconf import OmegaConf
import sys
sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.train(False)  # inference mode
```

### 2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse  # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
  --ckpt ckpt_step0015000_final.pt \
  --tag eval_replicate \
  --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline (it records `meta.json` and `per_episode.csv`, and appends to `docs/eval_ledger.jsonl`).

## Training data

- **VLNVerse coarse + fine train** (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4x A100 80 GB with `chunk_size=4` multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
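The StopHead objective can be illustrated with a few lines of plain Python. The soft target `exp(-d/tau)` and the values `tau=4.33` and `pos_weight=5.0` come from this card; the function names are hypothetical, and the loss follows the per-element formula documented for `torch.nn.BCEWithLogitsLoss` with `pos_weight`, which is assumed (not confirmed) to match what the training code uses.

```python
import math

TAU = 4.33        # from this card
POS_WEIGHT = 5.0  # from this card


def stop_proximity(dist_to_goal_m, tau=TAU):
    """Soft stop target in (0, 1]: 1 at the goal, decaying with distance."""
    return math.exp(-dist_to_goal_m / tau)


def weighted_bce_with_logits(logit, target, pos_weight=POS_WEIGHT):
    """-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].

    Not hardened against overflow for very large |logit|.
    """
    log_p = -math.log1p(math.exp(-logit))     # log(sigmoid(x))
    log_not_p = -math.log1p(math.exp(logit))  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)


for d in (0.0, 3.0, 6.0, 12.0):
    print(f"d = {d:4.1f} m -> target {stop_proximity(d):.3f}")
```

Note that 3 / ln 2 ≈ 4.33, so with this `tau` the target is approximately 0.5 exactly at the 3 m success radius. That is one plausible meaning of "aligned to success_radius", though the card does not state it explicitly.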

## Files in this repo

| File | Description |
|------|-------------|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard `meta.json`, `per_episode.csv`, `server.log.gz` with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |

## Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  month  = may,
  day    = {13},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying
benchmark paper:

```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```

## License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the
[VLNVerse repo](https://github.com/sihaoevery/vlnverse) for details).