Use with the Transformers library
# Load model directly
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("Spacewanderer8263/Proxy3D-8B", torch_dtype="auto")

Proxy3D-8B

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Proxy3D-8B is a vision-language model (VLM) specialized in 3D scene understanding and spatial reasoning. It is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct using the Proxy3D method, which produces compact yet comprehensive 3D proxy representations for the vision modality to overcome the limitations of standard 2D pipelines.

Model Description

Spatial intelligence in vision-language models (VLMs) is crucial for reasoning in 3D environments. Proxy3D addresses this by extracting scene features using semantic and geometric encoders from video frames, then performing semantic-aware clustering to obtain a set of proxies in 3D space.
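
The clustering step can be illustrated with a minimal sketch. All names, shapes, and the plain k-means used here are illustrative assumptions, not the actual Proxy3D implementation: fused per-point features are grouped into K clusters, and each proxy is the mean feature of its cluster.

```python
import numpy as np

def cluster_to_proxies(features, k=16, iters=10, seed=0):
    """Group N point features (N, D) into k proxy vectors (k, D) with
    simple k-means -- a stand-in for semantic-aware clustering."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        # Recompute each center as the mean of its assigned features.
        for j in range(k):
            if (assign == j).any():
                centers[j] = features[assign == j].mean(0)
    return centers

# Example: 1024 fused semantic+geometric features of dim 32 -> 16 proxies.
feats = np.random.default_rng(1).normal(size=(1024, 32))
proxies = cluster_to_proxies(feats, k=16)
print(proxies.shape)  # (16, 32)
```

The point of the compression is visible in the shapes: 16 proxy tokens stand in for 1024 per-point features when they are fed to the language model.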

By utilizing these compact proxy representations, the model achieves state-of-the-art performance in 3D visual question answering (VQA), visual grounding, and general spatial intelligence benchmarks while maintaining high efficiency.

Training Procedure

The model was trained using a four-stage progressive training pipeline that builds up spatial reasoning skills, from initial image-text alignment to complex 3D reasoning on the SpaceSpan dataset.

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning rate: 5e-06
  • Train batch size: 8
  • Total train batch size: 128
  • Optimizer: adamw_torch (betas=(0.9,0.999), epsilon=1e-08)
  • LR scheduler type: cosine
  • LR scheduler warmup ratio: 0.1
  • Number of epochs: 1.0
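
The hyperparameters above can be reconstructed into a concrete schedule; the total step count and device count below are hypothetical examples (the card does not state them), but the warmup ratio, peak learning rate, and batch sizes are taken from the list.

```python
import math

PEAK_LR = 5e-06
WARMUP_RATIO = 0.1

def cosine_lr(step, total_steps, peak=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay to 0."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Total batch 128 with per-device batch 8 implies 16 gradient-accumulation
# steps on a single GPU (or, e.g., 4 steps across 4 GPUs).
grad_accum_single_gpu = 128 // 8
print(grad_accum_single_gpu)   # 16
print(cosine_lr(0, 1000))      # 0.0   (start of warmup)
print(cosine_lr(100, 1000))    # 5e-06 (peak, end of 10% warmup)
```

After the warmup, the learning rate follows a half-cosine from the peak down to zero at the final step.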

Framework Versions

  • Transformers 4.55.0
  • PyTorch 2.6.0+cu118
  • Datasets 3.1.0
  • Tokenizers 0.21.1

Usage

Running this model requires a specific environment setup and custom configuration files to handle the Qwen2VLBEVForConditionalGeneration architecture. Please refer to the Setup section of the GitHub repository for detailed instructions on how to install and run inference.

Citation

If you find Proxy3D useful for your research, please cite:

@article{proxy3d2026,
  title={Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment},
  author={Jiang, Jerry and Sun, Haowen and Gudovskiy, Denis and Nakata, Yohei and Okuno, Tomoyuki and Keutzer, Kurt and Zheng, Wenzhao},
  journal={arXiv preprint arXiv:2605.08064},
  year={2026}
}

Acknowledgements

This work builds upon several excellent repositories, including Qwen2.5-VL, LLaMAFactory, and GPT4Scene.
