THOR — ViSTA-SLAM STA Model

Part of the ANIMA Perception Suite by Robot Flow Labs.

Project THOR is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the Symmetric Two-view Association (STA) frontend from the ViSTA-SLAM paper.

Paper

  • Title: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
  • Authors: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
  • arXiv: 2509.01584
  • Published: 1 September 2025
  • PDF: paper.pdf (included in this repo)

Model Summary

  • Input: two RGB frames, (B, 3, 224, 224) each
  • Output: quaternion (B, 4), translation (B, 3), pointmap (B, 224, 224, 3)
  • Parameters: ~12.4M (ResNet-18 backbone)
  • Intrinsics: none required (intrinsic-free design)
  • Best epoch: 2 (val loss 0.764781)
  • Training: 200 epochs, AdamW, lr=1.5e-5, bf16, NVIDIA L4

Architecture

The STA model uses a symmetric encoder that processes two consecutive RGB frames through shared weights, producing:

  1. Pose head β€” relative SE(3) camera transformation (quaternion + translation)
  2. Pointmap head β€” dense local 3D pointmap in normalised image coordinates

A Sim(3) pose graph backend handles global consistency and scale-drift correction.
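Downstream consumers of the pose head typically need the quaternion and translation assembled into a 4x4 SE(3) matrix. A minimal NumPy sketch, assuming a (w, x, y, z) quaternion convention (the repo's actual convention may differ and should be checked against `anima_thor`):

```python
import numpy as np

def pose_to_matrix(q: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 SE(3) matrix from a quaternion (w, x, y, z) and a translation."""
    w, x, y, z = q / np.linalg.norm(q)  # normalise: the head output need not be unit length
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R  # rotation block
    T[:3, 3] = t   # translation column
    return T
```

Relative transforms in this form can be composed by matrix multiplication, which is the representation a pose-graph backend usually operates on.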

Exported Formats

  • PyTorch (.pth): pytorch/thor_sta_v1.pth, 49.6 MB, for training and fine-tuning
  • SafeTensors: pytorch/thor_sta_v1.safetensors, 49.5 MB, for fast and safe loading
  • ONNX (opset 17): onnx/thor_sta_v1.onnx, 6.7 MB, for cross-platform inference
  • TensorRT FP16: tensorrt/thor_sta_v1_fp16.trt, 6.3 MB, for edge deployment (Jetson/L4)
  • TensorRT FP32: tensorrt/thor_sta_v1_fp32.trt, 11.4 MB, for full-precision inference

Usage

import torch
from anima_thor.models.sta_model import STAConfig, STAModel

# Load from this repository
config = STAConfig()
model = STAModel(config)

ckpt = torch.load("pytorch/thor_sta_v1.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# Inference
img_a = torch.randn(1, 3, 224, 224)   # current frame
img_b = torch.randn(1, 3, 224, 224)   # previous frame

with torch.no_grad():
    output = model(img_a, img_b)

print(output.quaternion.shape)   # (1, 4)
print(output.translation.shape)  # (1, 3)
print(output.pointmap.shape)     # (1, 224, 224, 3)
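The random tensors above are placeholders; real frames must be converted to the same layout. A minimal preprocessing sketch in NumPy, covering only shape and dtype handling (any mean/std normalisation applied during training is not shown and would need to match configs/training.toml):

```python
import numpy as np

def to_batch(frame_u8: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB frame (already resized to 224x224) -> (1, 3, 224, 224) float32 in [0, 1]."""
    x = frame_u8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = np.transpose(x, (2, 0, 1))           # HWC -> CHW
    return x[None]                           # add batch dimension
```

Wrap the result with `torch.from_numpy` before passing it to the model, or feed it directly to the ONNX/TensorRT runtimes, which take NumPy arrays.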

ONNX inference

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "onnx/thor_sta_v1.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

img_a = np.random.randn(1, 3, 224, 224).astype(np.float32)
img_b = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Outputs come back in export order: quaternion, translation, pointmap
quaternion, translation, pointmap = sess.run(
    None, {"img_a": img_a, "img_b": img_b}
)

TensorRT inference

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

with open("tensorrt/thor_sta_v1_fp16.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_input_shape("img_a", (1, 3, 224, 224))
context.set_input_shape("img_b", (1, 3, 224, 224))
# ... allocate buffers and run inference

Downstream Contracts (ANIMA Wave-6)

  • BALDUR (semantic mapping): pointmap → voxel grid
  • HEIMDALL (hierarchical planning): pose stream at 30 Hz
  • HERMOD (exploration): coverage map
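The BALDUR contract consumes the pointmap as a voxel grid. A toy occupancy-voxelisation sketch, assuming points in a normalised [0, 1) cube and a hypothetical grid resolution (BALDUR's real interface is defined in its own module):

```python
import numpy as np

def voxelize(points: np.ndarray, res: int = 32) -> np.ndarray:
    """Mark occupied cells of a res^3 boolean grid from an (N, 3) array of points in [0, 1)."""
    idx = np.clip((points * res).astype(int), 0, res - 1)  # point -> cell index
    grid = np.zeros((res, res, res), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A (224, 224, 3) pointmap flattens directly into this interface:
# grid = voxelize(pointmap.reshape(-1, 3))
```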

Files

README.md                          # This file
paper.pdf                          # ViSTA-SLAM paper (arXiv:2509.01584)
TRAINING_REPORT.md                 # Full training report with metrics
anima_module.yaml                  # ANIMA module manifest
pytorch/thor_sta_v1.pth            # PyTorch state dict
pytorch/thor_sta_v1.safetensors    # SafeTensors
onnx/thor_sta_v1.onnx              # ONNX opset 17
tensorrt/thor_sta_v1_fp16.trt      # TensorRT FP16
tensorrt/thor_sta_v1_fp32.trt      # TensorRT FP32
checkpoints/best.pth               # Best checkpoint (resume training)
configs/training.toml              # Training configuration
logs/training_history.json         # Epoch-by-epoch metrics (200 epochs)

Training

  • Hardware: NVIDIA L4 (23GB VRAM)
  • Framework: PyTorch 2.10 + CUDA 12.8
  • Config: See configs/training.toml
  • Report: See TRAINING_REPORT.md

Citation

@article{zhang2025vistaslam,
  title   = {ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association},
  author  = {Zhang, Ganlin and Qian, Shenhan and Wang, Xi and Cremers, Daniel},
  journal = {arXiv preprint arXiv:2509.01584},
  year    = {2025},
}

License

Apache 2.0 β€” Robot Flow Labs / AIFLOW LABS LIMITED
