Improve model card for AtomThink-LLaVA1.5-7B

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +103 -9
README.md CHANGED
@@ -1,16 +1,102 @@
  ---
- datasets:
- - Quinn777/AMATH-SFT
  base_model:
  - liuhaotian/llava-v1.5-7b
  ---
- # Model Card for AtomThink-LlamaV

- The model is post-trained based on llava-v1.5-7b and the AtomThink framework, and can be used to solve complex multimodal mathematical problems.

- # Citation
- If you use this dataset in your research, please cite:
- ```text
  @article{xiang2025can,
  title={Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?},
  author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Cai, Kaixin and Yin, Yiyang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and others},
@@ -26,5 +112,13 @@ If you use this dataset in your research, please cite:
  }
  ```

- # License
- The checkpoint is released under the Apache 2.0 license. Please ensure proper attribution when using this checkpoint.
  ---
  base_model:
  - liuhaotian/llava-v1.5-7b
+ datasets:
+ - Quinn777/AMATH-SFT
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - mathematical-reasoning
+ - llava
+ - slow-thinking
  ---

+ # Model Card for AtomThink-LLaVA1.5-7B
+
+ **AtomThink-LLaVA1.5-7B** is post-trained from `llava-v1.5-7b` with the AtomThink framework and is designed to solve complex multimodal mathematical problems by incorporating the notion of "slow thinking".
+
+ It was presented in the paper [**AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning**](https://huggingface.co/papers/2411.11930).
+
+ The official code repository can be found on [GitHub](https://github.com/Quinn777/AtomThink).
+
+ ## Overview
+
+ In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of different complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which comprises minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we further design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3%.
+
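+ As a purely conceptual sketch of module (iii), policy-guided multi-turn inference can be viewed as a step-level best-of-N search: at each turn the policy proposes several candidate atomic steps, a process-level scorer ranks them, and the best one is appended to the running chain. The snippet below only illustrates this control flow; it is **not** the repository's implementation, and `propose_next_steps` / `score_step` are hypothetical stand-ins for the fine-tuned MLLM and a process supervision model.
+
+ ```python
+ import random
+
+ def propose_next_steps(question, chain, n=4):
+     # Hypothetical placeholder for the policy (the fine-tuned MLLM) sampling candidate steps.
+     return [f"Step {len(chain) + 1}, candidate {i}" for i in range(n)]
+
+ def score_step(question, chain, step):
+     # Hypothetical placeholder for a process supervision model scoring a single step.
+     return random.random()
+
+ def atomic_inference(question, max_steps=5, candidates_per_turn=4):
+     chain = []  # atomic reasoning steps accepted so far
+     for _ in range(max_steps):
+         candidates = propose_next_steps(question, chain, n=candidates_per_turn)
+         best = max(candidates, key=lambda s: score_step(question, chain, s))
+         chain.append(best)  # keep the highest-scoring candidate and continue
+     return chain
+
+ print(atomic_inference("Example question"))
+ ```
+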
+ ### Key Features
+
+ - 🧠 Introduces **GPT-o1**-style reasoning via long CoT for complex multimodal mathematical tasks.
+ - 🛠️ Combines a CoT annotation engine, atomic-step fine-tuning, and policy search strategies to enhance reasoning.
+ - 📊 Provides an atomic capability metric to assess the quality of each reasoning step.
+ - ⚡ Demonstrates a test-time scaling law in MLLMs.
+ - 📈 Achieves state-of-the-art performance on multimodal mathematical reasoning tasks.
+
+ <p align="center"> <img src="https://github.com/Quinn777/AtomThink/raw/main/figures/framework.png" alt="AtomThink Framework" width="800"> </p>
+ *AtomThink Framework: Overview of the four key modules for structured reasoning.*
+
+ <p align="center"> <img src="https://github.com/Quinn777/AtomThink/raw/main/figures/fig1.png" alt="Comparison with structured and unstructured reasoning models" width="800"> </p>
+ *Comparison with structured and unstructured reasoning models. AtomThink autonomously generates dynamic structures and lengths based on the type of problem. For text-dominant questions, it bypasses image captioning and extracts information directly from the question stem. For low-difficulty problems, it uses fewer tokens than o1-like models.*
+
+ <p align="center"> <img src="https://github.com/Quinn777/AtomThink/raw/main/figures/fig2.png" alt="Complexity comparison" width="800"> </p>
+ *Comparison of the average response length of AtomThink-LlamaV across benchmarks of different complexity. (a) As tasks become more challenging, the model proactively uses more tokens. (b) The proportion of longer CoTs containing more atomic steps increases in the outputs. A higher level signifies increased difficulty. The performance decline of the AtomThink models is narrower than that of LLaVA1.5 and LlamaV.*
+
+ ## Usage
+
+ You can use the model with the `transformers` library. Below is a simplified example of how to load and use the model for inference. For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/Quinn777/AtomThink).

+ First, ensure you have the necessary libraries installed:
+ ```bash
+ pip install transformers accelerate pillow requests
+ ```
+
+ ```python
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
+ from PIL import Image
+ import requests
+
+ # Load the processor and model
+ model_id = "Quinn777/AtomThink-LLaVA1.5-7B"
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = LlavaForConditionalGeneration.from_pretrained(model_id)
+
+ # Example image (replace with your own image path or URL)
+ # For a local image:
+ # image = Image.open("path/to/your/image.png").convert("RGB")
+ # Or from a URL:
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
+ image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+
+ # Example prompt for a mathematical question (LLaVA-1.5 conversation format)
+ prompt = "USER: <image>\nSolve this mathematical problem step-by-step. ASSISTANT:"
+
+ # Prepare inputs
+ inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+ # Generate a response with greedy decoding
+ # Adjust generation arguments to your task, e.g., max_new_tokens
+ output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+
+ # Decode and print the output
+ generated_text = processor.decode(output[0], skip_special_tokens=True)
+ print(generated_text)
+
+ # Example of a simple image description (to show versatility)
+ prompt_simple = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
+ inputs_simple = processor(text=prompt_simple, images=image, return_tensors="pt")
+ output_simple = model.generate(**inputs_simple, max_new_tokens=50, do_sample=False)
+ decoded_output_simple = processor.decode(output_simple[0], skip_special_tokens=True)
+ print(decoded_output_simple)
+ ```
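+
+ If a CUDA GPU is available, the same example can be run in half precision to reduce memory usage. This is a minimal sketch that assumes the checkpoint loads with the standard `transformers` LLaVA classes used above; adjust the dtype and device placement to your hardware.
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
+ from PIL import Image
+ import requests
+
+ model_id = "Quinn777/AtomThink-LLaVA1.5-7B"
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Load the weights in float16 and place them on the GPU.
+ model = LlavaForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.float16
+ ).to("cuda")
+
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
+ image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+
+ prompt = "USER: <image>\nSolve this mathematical problem step-by-step. ASSISTANT:"
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
+
+ # Greedy decoding; increase max_new_tokens for long atomic-step reasoning chains.
+ output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+ print(processor.decode(output[0], skip_special_tokens=True))
+ ```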
+
+ ## Citation
+
+ If you use this model or the associated dataset in your research, please cite:
+ ```bibtex
  @article{xiang2025can,
  title={Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?},
  author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Cai, Kaixin and Yin, Yiyang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and others},
 
  }
  ```

+ ## License
+
+ This project is licensed under the [MIT License](https://github.com/Quinn777/AtomThink/blob/main/LICENSE). Please ensure proper attribution when using this checkpoint.
+
+ ## Acknowledgement
+
+ We would like to thank the following repositories for their contributions:
+ - [hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): This library was used for training.
+ - [bklieger-groq/g1](https://github.com/bklieger-groq/g1): This library was used for data processing.
+ - [openreasoner/openr](https://github.com/openreasoner/openr): This tool was helpful for deploying the process supervision model.