VIS-OCCANY — ANIMA 3D Occupancy Prediction Module

Part of the ANIMA Intelligence Compiler Suite by Robot Flow Labs.

Paper

OccAny: Generalized Unconstrained Urban 3D Occupancy Prediction (CVPR 2026)
Anh-Quan Cao, Tuan-Hung Vu — Valeo AI

Architecture

OccAny predicts dense 3D occupancy from RGB camera inputs without LiDAR supervision. The ANIMA implementation uses:

  • DINOv2-Small/14 frozen encoder (384-dim patch tokens)
  • 6-layer transformer decoder (384-dim, 6 heads) for geometry prediction
  • Prediction heads: global/local pointmaps, confidence, poses, SAM-style features
  • Novel-view rendering with TTVA (Test-Time View Augmentation)
  • CUDA-optimized trilinear voxelization (scatter_add, ~11x faster than the pure-Python path)
  • Multi-loss training: pointmap L1 + voxel BCE + feature distillation
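The trilinear voxelization step above can be sketched in pure PyTorch. This is an illustrative splat of points into a dense grid using `scatter_add_`; the function name, grid layout, and the assumption that coordinates are already in voxel units are ours, not the module's API (the CUDA kernel fuses the same logic).

```python
import torch

def trilinear_voxelize(points, grid_size):
    """Splat points into a (D, H, W) grid with trilinear weights via scatter_add.

    points: (N, 3) float coordinates in voxel units, roughly inside the grid.
    Each point distributes a total weight of 1 over the 8 surrounding voxels.
    """
    D = H = W = grid_size
    base = points.floor().long()      # lower corner of each point's cell
    frac = points - points.floor()    # fractional offset in [0, 1)
    grid = torch.zeros(D * H * W, dtype=points.dtype)
    # Visit the 8 corners of the enclosing cell
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = base + torch.tensor([dx, dy, dz])
                w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                     * (frac[:, 1] if dy else 1 - frac[:, 1])
                     * (frac[:, 2] if dz else 1 - frac[:, 2]))
                # Drop corners that fall outside the grid
                ok = (corner >= 0).all(1) & (corner < grid_size).all(1)
                idx = (corner[ok, 0] * H + corner[ok, 1]) * W + corner[ok, 2]
                grid.scatter_add_(0, idx, w[ok])
    return grid.view(D, H, W)
```

Because `scatter_add_` handles duplicate indices atomically on GPU, the same pattern runs unchanged on CUDA tensors; the custom kernel mainly removes the 8-corner Python loop.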

Benchmarks (Paper Targets)

Benchmark                  Metric  Paper  ANIMA Target
SemanticKITTI (sequence)   IoU     25.91  >= 24.5
SemanticKITTI (monocular)  IoU     24.03  >= 22.5
Occ3D-nuScenes (surround)  IoU     34.15  >= 32.0

Exported Formats

Format          File                               Size    Use Case
PyTorch (.pth)  pytorch/vis_occany_v1.pth          108 MB  Training, fine-tuning
SafeTensors     pytorch/vis_occany_v1.safetensors  108 MB  Fast loading, safe
ONNX            onnx/vis_occany_v1.onnx            66 MB   Cross-platform inference
TensorRT FP32   tensorrt/vis_occany_v1_fp32.trt    67 MB   Full-precision inference
TensorRT FP16   tensorrt/vis_occany_v1_fp16.trt    35 MB   Edge deployment (Jetson/L4)
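For the ONNX export, a minimal inference sketch with ONNX Runtime looks like the following. The ImageNet normalization constants and the 518×518 input size are assumptions based on DINOv2 defaults (patch size 14); inspect the exported graph for the real input name and shape.

```python
import os
import numpy as np

# ImageNet statistics, as used by DINOv2 preprocessing (assumed here)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_u8):
    """HxWx3 uint8 image -> 1x3xHxW float32 batch (H, W multiples of 14)."""
    x = rgb_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    return x.transpose(2, 0, 1)[None]

if os.path.exists("onnx/vis_occany_v1.onnx"):
    import onnxruntime as ort
    sess = ort.InferenceSession("onnx/vis_occany_v1.onnx",
                                providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]  # check the graph for the actual input name/shape
    image = preprocess(np.zeros((518, 518, 3), dtype=np.uint8))
    outputs = sess.run(None, {inp.name: image})
```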

Usage

import torch
from safetensors.torch import load_file

# Load weights
state = load_file("pytorch/vis_occany_v1.safetensors")

# Or load into the full model
from anima_vis_occany.model.reconstruction import ReconstructionStage

model = ReconstructionStage(hidden_dim=384, decoder_depth=6, decoder_heads=6)
# Keep only the reconstruction-stage weights and strip their prefix
recon_state = {
    k.replace("reconstruction.", "", 1): v
    for k, v in state.items()
    if k.startswith("reconstruction.")
}
model.load_state_dict(recon_state)
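The .pth export loads the same way via torch.load; passing weights_only=True (PyTorch >= 1.13) restricts unpickling to tensors. Sketched here with a throwaway checkpoint, since the real file lives under pytorch/ in the repo:

```python
import os
import tempfile
import torch

# Demo with a throwaway checkpoint; substitute pytorch/vis_occany_v1.pth
path = os.path.join(tempfile.gettempdir(), "demo_ckpt.pth")
torch.save({"w": torch.ones(2)}, path)

# weights_only=True refuses arbitrary pickled objects (safer default)
state = torch.load(path, map_location="cpu", weights_only=True)
```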

Training

  • Hardware: NVIDIA L4 (23GB VRAM)
  • Framework: PyTorch 2.11 + CUDA 12.8
  • Precision: bf16 mixed precision (AMP)
  • Optimizer: AdamW, lr=3e-4, weight_decay=0.01
  • Scheduler: Cosine warmup (5% warmup steps)
  • Batch: 32 × 4 gradient accumulation = 128 effective
  • Data: KITTI cached voxels + DINOv2 features + point clouds (7,481 samples)
  • Config: See configs/training.toml
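The optimizer and schedule rows above translate into a few lines of PyTorch. This is a generic sketch, not the module's training loop: total_steps and the tiny placeholder model are ours, while lr, weight decay, and the 5% linear warmup into cosine decay follow the table.

```python
import math
import torch

def warmup_cosine(step, total_steps, warmup_frac=0.05):
    """LR multiplier: linear warmup over the first 5% of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(4, 4)  # placeholder for the real reconstruction stage
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: warmup_cosine(s, total_steps=10_000))
```

Call `sched.step()` once per optimizer step; with 4-step gradient accumulation that means once per effective batch of 128, not per micro-batch.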

API Endpoints

Endpoint  Method  Description
/health   GET     Module health status
/ready    GET     Checkpoint readiness
/info     GET     Module metadata
/infer    POST    Run 3D occupancy inference
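A minimal client for the /infer endpoint might look like this. The {"images": [...]} payload layout, the base64 encoding, and the localhost:8000 port are all guesses; query GET /info for the module's actual input contract.

```python
import json
from urllib import request

def build_infer_payload(images_b64):
    """Encode a list of base64 camera frames as the JSON request body.
    The {"images": [...]} layout is assumed, not taken from the served schema."""
    return json.dumps({"images": images_b64}).encode()

def infer(images_b64, host="http://localhost:8000"):
    """POST the frames to /infer and decode the JSON response."""
    req = request.Request(f"{host}/infer", data=build_infer_payload(images_b64),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```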

Docker

docker compose -f docker-compose.serve.yml --profile serve up -d

License

Apache 2.0 — Robot Flow Labs / AIFLOW LABS LIMITED
