Spacewanderer8263 and nielsr (HF Staff) committed
Commit 18b4fe8 · 1 Parent(s): 97648eb

Improve model card and add metadata (#1)


- Improve model card and add metadata (c71734f39a506b2ad2564eb7e514077a902cced2)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +46 -32
README.md CHANGED
@@ -1,61 +1,75 @@
  ---
- library_name: transformers
- license: other
  base_model: Qwen/Qwen2.5-VL-7B-Instruct
  tags:
  - llama-factory
  - full
  - generated_from_trainer
  model-index:
  - name: Proxy3D-8B
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # Proxy3D-8B

- This model is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) on the SpaceSpan_318K_masked dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

  The following hyperparameters were used during training:
- - learning_rate: 5e-06
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 128
- - total_eval_batch_size: 64
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1.0
-
- ### Training results
-
-
- ### Framework versions

  - Transformers 4.55.0
  - Pytorch 2.6.0+cu118
  - Datasets 3.1.0
  - Tokenizers 0.21.1
  ---
  base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ library_name: transformers
+ license: apache-2.0
  tags:
  - llama-factory
  - full
  - generated_from_trainer
+ - spatial-intelligence
+ - 3d-vision
+ pipeline_tag: video-text-to-text
  model-index:
  - name: Proxy3D-8B
    results: []
  ---

  # Proxy3D-8B

+ [**Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment**](https://huggingface.co/papers/2605.08064)

+ Proxy3D-8B is a vision-language model (VLM) specialized in 3D scene understanding and spatial reasoning. It is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) trained with the **Proxy3D** method, which builds compact yet comprehensive 3D proxy representations for the vision modality to overcome the limitations of standard 2D pipelines.

+ - **Paper:** [arXiv:2605.08064](https://huggingface.co/papers/2605.08064)
+ - **Project Page:** [wzzheng.net/Proxy3D](https://wzzheng.net/Proxy3D)
+ - **GitHub Repository:** [Spacedreamer2384/Proxy3D](https://github.com/Spacedreamer2384/Proxy3D)
+ - **Dataset:** [SpaceSpan-318K](https://huggingface.co/datasets/Spacewanderer8263/Proxy3D-SpaceSpan-318K)

+ ## Model Description

+ Spatial intelligence in vision-language models (VLMs) is crucial for reasoning about 3D environments. Proxy3D addresses this by extracting scene features from video frames with semantic and geometric encoders, then performing semantic-aware clustering to obtain a set of proxies in 3D space (a minimal sketch of this idea follows).
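To make the proxy construction concrete, here is a minimal, hypothetical sketch of semantic-aware clustering: per-patch semantic features and their back-projected 3D positions are jointly clustered, and each cluster centroid becomes one proxy. This illustrates the general idea only and is not the paper's implementation; the function name `build_proxies`, the proxy count, and the `lambda_pos` weighting are all assumptions.

```python
# Illustrative sketch only -- NOT the Proxy3D reference implementation.
# Assumes per-patch semantic features and their 3D positions are given;
# `num_proxies` and `lambda_pos` are invented knobs for this example.
import numpy as np
from sklearn.cluster import KMeans

def build_proxies(feats: np.ndarray, xyz: np.ndarray,
                  num_proxies: int = 256, lambda_pos: float = 0.5):
    """Jointly cluster semantic features with 3D positions.

    feats: (N, D) semantic features from the vision encoder.
    xyz:   (N, 3) 3D patch positions (e.g., from depth back-projection).
    Returns (num_proxies, D) proxy features and (num_proxies, 3) centers.
    """
    # Normalize each modality so neither dominates the distance metric.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-6)
    p = (xyz - xyz.mean(axis=0)) / (xyz.std(axis=0) + 1e-6)
    joint = np.concatenate([f, lambda_pos * p], axis=1)

    labels = KMeans(n_clusters=num_proxies, n_init=10).fit_predict(joint)

    # Each proxy summarizes one cluster: mean feature and mean 3D position.
    proxy_feats = np.stack([feats[labels == k].mean(axis=0) for k in range(num_proxies)])
    proxy_xyz = np.stack([xyz[labels == k].mean(axis=0) for k in range(num_proxies)])
    return proxy_feats, proxy_xyz
```

Passing the resulting `(proxy_feats, proxy_xyz)` pair to the language model in place of thousands of raw patch tokens is what keeps the representation compact.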
+ By utilizing these compact proxy representations, the model achieves state-of-the-art performance on 3D visual question answering (VQA), visual grounding, and general spatial-intelligence benchmarks while maintaining high efficiency.

+ ## Training Procedure

+ The model was trained with a four-stage progressive pipeline that develops spatial reasoning skills, ranging from initial image-text alignment to complex 3D reasoning on the **SpaceSpan** dataset. A hedged `TrainingArguments` sketch follows the hyperparameter list below.

+ ### Training Hyperparameters

  The following hyperparameters were used during training:
+ - **Learning rate:** 5e-06
+ - **Train batch size:** 8 (per device)
+ - **Total train batch size:** 128
+ - **Optimizer:** adamw_torch (betas=(0.9, 0.999), epsilon=1e-08)
+ - **LR scheduler type:** cosine
+ - **LR scheduler warmup ratio:** 0.1
+ - **Number of epochs:** 1.0
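As referenced above, the listed values map onto plain `transformers` `TrainingArguments` roughly as follows. This is a hedged sketch, not the actual LLaMA-Factory configuration used for the run; `output_dir`, `bf16`, and the gradient-accumulation breakdown (128 = 8 per device × 8 GPUs × 2 steps, taken from the earlier card revision) are assumptions or inferences.

```python
# Hedged sketch: the hyperparameters above expressed as transformers
# TrainingArguments. The real run used LLaMA-Factory, not this script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="proxy3d-8b",        # assumed path, not from the card
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # 8 GPUs x 8 x 2 = 128 total (earlier card revision)
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,                        # from the earlier card revision
    bf16=True,                      # assumption: typical for 7B-scale VLM training
)
```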
 
 
 
 
 
 
 
 
+ ### Framework Versions

  - Transformers 4.55.0
  - Pytorch 2.6.0+cu118
  - Datasets 3.1.0
  - Tokenizers 0.21.1
+
+ ## Usage
+
+ Running this model requires a specific environment setup and custom configuration files to handle the `Qwen2VLBEVForConditionalGeneration` architecture. Refer to the [Setup section of the GitHub repository](https://github.com/Spacedreamer2384/Proxy3D#%EF%B8%8F-setup) for detailed installation and inference instructions; a hedged loading sketch follows.
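As rough orientation only, loading would follow the usual remote-code pattern sketched below, assuming the checkpoint exposes its custom modeling code via `trust_remote_code`. The repo id is a hypothetical placeholder, and the official, supported path is the GitHub setup above.

```python
# Hedged sketch only: the supported setup is documented in the GitHub repo.
# Assumes the checkpoint exposes its custom Qwen2VLBEVForConditionalGeneration
# class via trust_remote_code; the repo id is a hypothetical placeholder.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "Spacewanderer8263/Proxy3D-8B"  # placeholder, not verified

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick bf16/fp16 automatically where available
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)
```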
+
+ ## Citation
+
+ If you find Proxy3D useful for your research, please cite:
+
+ ```bibtex
+ @article{proxy3d2026,
+   title={Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment},
+   author={Jiang, Jerry and Sun, Haowen and Gudovskiy, Denis and Nakata, Yohei and Okuno, Tomoyuki and Keutzer, Kurt and Zheng, Wenzhao},
+   journal={arXiv preprint arXiv:2605.08064},
+   year={2026}
+ }
+ ```
+
+ ## Acknowledgements
+
+ This work builds upon several excellent repositories, including [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LLaMAFactory](https://github.com/hiyouga/LLaMAFactory), and [GPT4Scene](https://github.com/Qi-Zhangyang/GPT4Scene-and-VLN-R1).