ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
Paper: arXiv:2509.01584
Part of the ANIMA Perception Suite by Robot Flow Labs.
Project THOR is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the Symmetric Two-view Association (STA) frontend from the ViSTA-SLAM paper.
| Property | Value |
|---|---|
| Input | Two RGB frames, (B, 3, 224, 224) each |
| Output | Quaternion (B,4), Translation (B,3), Pointmap (B,224,224,3) |
| Parameters | ~12.4M (ResNet-18 backbone) |
| Intrinsics | None required (intrinsic-free design) |
| Best epoch | 2 |
| Best val loss | 0.764781 |
| Training | 200 epochs, AdamW, lr=1.5e-5, bf16, NVIDIA L4 |
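The exact input preprocessing is not documented in this card; a minimal sketch is shown below, assuming frames are already 224x224 RGB and that ImageNet normalization statistics apply (a common choice for a ResNet-18 backbone, but an assumption here):

```python
import numpy as np

# Assumed ImageNet statistics (hypothetical for this model).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(frame_hwc_uint8: np.ndarray) -> np.ndarray:
    """Map an HxWx3 uint8 RGB frame to a (1, 3, H, W) float32 batch."""
    x = frame_hwc_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # channel-wise normalize
    return np.transpose(x, (2, 0, 1))[None]         # HWC -> (1, C, H, W)

frame = np.zeros((224, 224, 3), dtype=np.uint8)
print(preprocess(frame).shape)  # (1, 3, 224, 224)
```

Resizing to 224x224 is omitted; check the training configuration (`configs/training.toml`) before relying on these statistics.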
The STA model uses a symmetric encoder that processes two consecutive RGB frames through shared weights, producing a relative pose (unit quaternion plus translation) and a dense local pointmap.
A Sim(3) pose graph backend handles global consistency and scale-drift correction.
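A Sim(3) transform acts on a point as x → s·R·x + t (scale s, rotation R, translation t), and the backend composes such transforms along the pose graph. A minimal numpy sketch of that composition rule (illustrative only, not the actual backend code):

```python
import numpy as np

def sim3_compose(s1, R1, t1, s2, R2, t2):
    """Compose two Sim(3) transforms T1*T2, where T acts as x -> s*R@x + t."""
    return s1 * s2, R1 @ R2, s1 * (R1 @ t2) + t1

def sim3_apply(s, R, t, x):
    return s * (R @ x) + t

# Composing then applying equals applying the two transforms in sequence:
s1, R1, t1 = 2.0, np.eye(3), np.array([1.0, 0.0, 0.0])
s2, R2, t2 = 0.5, np.eye(3), np.array([0.0, 1.0, 0.0])
x = np.array([1.0, 2.0, 3.0])
s, R, t = sim3_compose(s1, R1, t1, s2, R2, t2)
lhs = sim3_apply(s, R, t, x)
rhs = sim3_apply(s1, R1, t1, sim3_apply(s2, R2, t2, x))
print(np.allclose(lhs, rhs))  # True
```

The explicit scale factor is what lets the backend correct monocular scale drift, which a plain SE(3) graph cannot represent.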
| Format | File | Size | Use Case |
|---|---|---|---|
| PyTorch (.pth) | pytorch/thor_sta_v1.pth | 49.6 MB | Training, fine-tuning |
| SafeTensors | pytorch/thor_sta_v1.safetensors | 49.5 MB | Fast loading, safe |
| ONNX (opset 17) | onnx/thor_sta_v1.onnx | 6.7 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/thor_sta_v1_fp16.trt | 6.3 MB | Edge deployment (Jetson/L4) |
| TensorRT FP32 | tensorrt/thor_sta_v1_fp32.trt | 11.4 MB | Full precision inference |
```python
import torch

from anima_thor.models.sta_model import STAConfig, STAModel

# Load from this repository
config = STAConfig()
model = STAModel(config)
ckpt = torch.load("pytorch/thor_sta_v1.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# Inference
img_a = torch.randn(1, 3, 224, 224)  # current frame
img_b = torch.randn(1, 3, 224, 224)  # previous frame
with torch.no_grad():
    output = model(img_a, img_b)

print(output.quaternion.shape)   # (1, 4)
print(output.translation.shape)  # (1, 3)
print(output.pointmap.shape)     # (1, 224, 224, 3)
```
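The per-pair outputs are relative poses, so building a trajectory means chaining them. A hedged sketch of that accumulation, assuming a (w, x, y, z) quaternion convention (the model's actual convention is not stated in this card):

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix. Convention assumed."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def accumulate(rel_poses):
    """Chain relative (quaternion, translation) pairs into world-frame poses."""
    R_w, t_w = np.eye(3), np.zeros(3)
    trajectory = [(R_w, t_w)]
    for q, t in rel_poses:
        R = quat_to_rotmat(np.asarray(q, dtype=np.float64))
        # New world pose: rotate/translate the relative step into the world frame.
        R_w, t_w = R_w @ R, R_w @ np.asarray(t, dtype=np.float64) + t_w
        trajectory.append((R_w, t_w))
    return trajectory

# Two identity rotations, each stepping 1 unit along +z:
traj = accumulate([((1, 0, 0, 0), (0, 0, 1))] * 2)
print(traj[-1][1])  # [0. 0. 2.]
```

Naive chaining drifts; in the full system the Sim(3) pose graph backend, not this loop, produces the globally consistent trajectory.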
```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "onnx/thor_sta_v1.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img_a = np.random.randn(1, 3, 224, 224).astype(np.float32)
img_b = np.random.randn(1, 3, 224, 224).astype(np.float32)
quaternion, translation, pointmap = sess.run(
    None, {"img_a": img_a, "img_b": img_b}
)
```
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)
with open("tensorrt/thor_sta_v1_fp16.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_input_shape("img_a", (1, 3, 224, 224))
context.set_input_shape("img_b", (1, 3, 224, 224))
# ... allocate buffers and run inference
```
| Module | Dependency | Topic |
|---|---|---|
| BALDUR | Semantic mapping | Pointmap → voxel grid |
| HEIMDALL | Hierarchical planning | Pose stream @ 30 Hz |
| HERMOD | Exploration | Coverage map |
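For the BALDUR-style pointmap-to-voxel-grid handoff, a minimal quantization sketch is shown below; the actual BALDUR interface is not specified here, and `pointmap_to_voxels` and `voxel_size` are illustrative names:

```python
import numpy as np

def pointmap_to_voxels(pointmap, voxel_size=0.05):
    """Quantize an (H, W, 3) pointmap into unique occupied voxel indices."""
    pts = pointmap.reshape(-1, 3)
    idx = np.floor(pts / voxel_size).astype(np.int64)  # metric -> grid coords
    return np.unique(idx, axis=0)                       # deduplicate occupancy

pm = np.zeros((224, 224, 3), dtype=np.float32)  # degenerate pointmap at origin
print(pointmap_to_voxels(pm).shape)  # (1, 3)
```

Deduplicating indices rather than storing all 224x224 points keeps the downstream map size bounded by scene volume instead of frame count.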
```
README.md                        # This file
paper.pdf                        # ViSTA-SLAM paper (arXiv:2509.01584)
TRAINING_REPORT.md               # Full training report with metrics
anima_module.yaml                # ANIMA module manifest
pytorch/thor_sta_v1.pth          # PyTorch state dict
pytorch/thor_sta_v1.safetensors  # SafeTensors
onnx/thor_sta_v1.onnx            # ONNX opset 17
tensorrt/thor_sta_v1_fp16.trt    # TensorRT FP16
tensorrt/thor_sta_v1_fp32.trt    # TensorRT FP32
checkpoints/best.pth             # Best checkpoint (resume training)
configs/training.toml            # Training configuration
logs/training_history.json       # Epoch-by-epoch metrics (200 epochs)
```
```bibtex
@article{zhang2025vistaslam,
  title   = {ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association},
  author  = {Zhang, Ganlin and Qian, Shenhan and Wang, Xi and Cremers, Daniel},
  journal = {arXiv preprint arXiv:2509.01584},
  year    = {2025},
}
```
Apache 2.0 (Robot Flow Labs / AIFLOW LABS LIMITED)