ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
Paper: arXiv:2509.01584
Part of the ANIMA Perception Suite by Robot Flow Labs.
Project THOR is ANIMA Wave-6's Tier-1 Foundation SLAM module, implementing the Symmetric Two-view Association (STA) frontend from the ViSTA-SLAM paper.
| Property | Value |
|---|---|
| Input | Two RGB frames, (B, 3, 224, 224) each |
| Output | Quaternion (B,4), Translation (B,3), Pointmap (B,224,224,3) |
| Parameters | ~12.4M (ResNet-18 backbone) |
| Intrinsics | None required (intrinsic-free design) |
| Best epoch | 2 |
| Best val loss | 0.764781 |
| Training | 200 epochs, AdamW, lr=1.5e-5, bf16, NVIDIA L4 |
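The exact input preprocessing is not documented in this card; a minimal sketch is shown below, assuming frames are already 224x224 RGB and that ImageNet normalization statistics apply (a common choice for a ResNet-18 backbone, but an assumption here):

```python
import numpy as np

# Assumed ImageNet statistics (hypothetical for this model).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(frame_hwc_uint8: np.ndarray) -> np.ndarray:
    """Map an HxWx3 uint8 RGB frame to a (1, 3, H, W) float32 batch."""
    x = frame_hwc_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # channel-wise normalize
    return np.transpose(x, (2, 0, 1))[None]         # HWC -> (1, C, H, W)

frame = np.zeros((224, 224, 3), dtype=np.uint8)
print(preprocess(frame).shape)  # (1, 3, 224, 224)
```

Resizing to 224x224 is omitted; check the training configuration (`configs/training.toml`) before relying on these statistics.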
The STA model uses a symmetric encoder that processes two consecutive RGB frames through shared weights, producing a relative pose (unit quaternion plus translation) and a dense local pointmap.
A Sim(3) pose graph backend handles global consistency and scale-drift correction.
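A Sim(3) transform acts on a point as x → s·R·x + t (scale s, rotation R, translation t), and the backend composes such transforms along the pose graph. A minimal numpy sketch of that composition rule (illustrative only, not the actual backend code):

```python
import numpy as np

def sim3_compose(s1, R1, t1, s2, R2, t2):
    """Compose two Sim(3) transforms T1*T2, where T acts as x -> s*R@x + t."""
    return s1 * s2, R1 @ R2, s1 * (R1 @ t2) + t1

def sim3_apply(s, R, t, x):
    return s * (R @ x) + t

# Composing then applying equals applying the two transforms in sequence:
s1, R1, t1 = 2.0, np.eye(3), np.array([1.0, 0.0, 0.0])
s2, R2, t2 = 0.5, np.eye(3), np.array([0.0, 1.0, 0.0])
x = np.array([1.0, 2.0, 3.0])
s, R, t = sim3_compose(s1, R1, t1, s2, R2, t2)
lhs = sim3_apply(s, R, t, x)
rhs = sim3_apply(s1, R1, t1, sim3_apply(s2, R2, t2, x))
print(np.allclose(lhs, rhs))  # True
```

The explicit scale factor is what lets the backend correct monocular scale drift, which a plain SE(3) graph cannot represent.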
| Format | File | Size | Use Case |
|---|---|---|---|
| PyTorch (.pth) | pytorch/thor_sta_v1.pth | 49.6 MB | Training, fine-tuning |
| SafeTensors | pytorch/thor_sta_v1.safetensors | 49.5 MB | Fast loading, safe |
| ONNX (opset 17) | onnx/thor_sta_v1.onnx | 6.7 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/thor_sta_v1_fp16.trt | 6.3 MB | Edge deployment (Jetson/L4) |
| TensorRT FP32 | tensorrt/thor_sta_v1_fp32.trt | 11.4 MB | Full precision inference |
```python
import torch

from anima_thor.models.sta_model import STAConfig, STAModel

# Load from this repository
config = STAConfig()
model = STAModel(config)
ckpt = torch.load("pytorch/thor_sta_v1.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# Inference
img_a = torch.randn(1, 3, 224, 224)  # current frame
img_b = torch.randn(1, 3, 224, 224)  # previous frame
with torch.no_grad():
    output = model(img_a, img_b)

print(output.quaternion.shape)   # (1, 4)
print(output.translation.shape)  # (1, 3)
print(output.pointmap.shape)     # (1, 224, 224, 3)
```
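The per-pair outputs are relative poses, so building a trajectory means chaining them. A hedged sketch of that accumulation, assuming a (w, x, y, z) quaternion convention (the model's actual convention is not stated in this card):

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix. Convention assumed."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def accumulate(rel_poses):
    """Chain relative (quaternion, translation) pairs into world-frame poses."""
    R_w, t_w = np.eye(3), np.zeros(3)
    trajectory = [(R_w, t_w)]
    for q, t in rel_poses:
        R = quat_to_rotmat(np.asarray(q, dtype=np.float64))
        # New world pose: rotate/translate the relative step into the world frame.
        R_w, t_w = R_w @ R, R_w @ np.asarray(t, dtype=np.float64) + t_w
        trajectory.append((R_w, t_w))
    return trajectory

# Two identity rotations, each stepping 1 unit along +z:
traj = accumulate([((1, 0, 0, 0), (0, 0, 1))] * 2)
print(traj[-1][1])  # [0. 0. 2.]
```

Naive chaining drifts; in the full system the Sim(3) pose graph backend, not this loop, produces the globally consistent trajectory.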
```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "onnx/thor_sta_v1.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img_a = np.random.randn(1, 3, 224, 224).astype(np.float32)
img_b = np.random.randn(1, 3, 224, 224).astype(np.float32)
quaternion, translation, pointmap = sess.run(
    None, {"img_a": img_a, "img_b": img_b}
)
```
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)
with open("tensorrt/thor_sta_v1_fp16.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_input_shape("img_a", (1, 3, 224, 224))
context.set_input_shape("img_b", (1, 3, 224, 224))
# ... allocate buffers and run inference
```
| Module | Dependency | Topic |
|---|---|---|
| BALDUR | Semantic mapping | Pointmap → voxel grid |
| HEIMDALL | Hierarchical planning | Pose stream @ 30 Hz |
| HERMOD | Exploration | Coverage map |
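For the BALDUR-style pointmap-to-voxel-grid handoff, a minimal quantization sketch is shown below; the actual BALDUR interface is not specified here, and `pointmap_to_voxels` and `voxel_size` are illustrative names:

```python
import numpy as np

def pointmap_to_voxels(pointmap, voxel_size=0.05):
    """Quantize an (H, W, 3) pointmap into unique occupied voxel indices."""
    pts = pointmap.reshape(-1, 3)
    idx = np.floor(pts / voxel_size).astype(np.int64)  # metric -> grid coords
    return np.unique(idx, axis=0)                       # deduplicate occupancy

pm = np.zeros((224, 224, 3), dtype=np.float32)  # degenerate pointmap at origin
print(pointmap_to_voxels(pm).shape)  # (1, 3)
```

Deduplicating indices rather than storing all 224x224 points keeps the downstream map size bounded by scene volume instead of frame count.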
```
README.md                        # This file
paper.pdf                        # ViSTA-SLAM paper (arXiv:2509.01584)
TRAINING_REPORT.md               # Full training report with metrics
anima_module.yaml                # ANIMA module manifest
pytorch/thor_sta_v1.pth          # PyTorch state dict
pytorch/thor_sta_v1.safetensors  # SafeTensors
onnx/thor_sta_v1.onnx            # ONNX opset 17
tensorrt/thor_sta_v1_fp16.trt    # TensorRT FP16
tensorrt/thor_sta_v1_fp32.trt    # TensorRT FP32
checkpoints/best.pth             # Best checkpoint (resume training)
configs/training.toml            # Training configuration
logs/training_history.json       # Epoch-by-epoch metrics (200 epochs)
```
```bibtex
@article{zhang2025vistaslam,
  title   = {ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association},
  author  = {Zhang, Ganlin and Qian, Shenhan and Wang, Xi and Cremers, Daniel},
  journal = {arXiv preprint arXiv:2509.01584},
  year    = {2025},
}
```
Apache 2.0 (Robot Flow Labs / AIFLOW LABS LIMITED)