Improve model card: Add metadata, links, features, and usage example
This PR significantly enhances the model card for `AtomThink-LlamaV` by:
- Adding `pipeline_tag: image-text-to-text` and `library_name: transformers` to the metadata, improving discoverability and integration with the Hugging Face ecosystem.
- Correcting the `license` to `mit`, as specified in the official GitHub repository.
- Including direct links to the paper, project page, and GitHub repository for easy access to related resources.
- Adding an overview of the model from the paper's abstract and highlighting its key features.
- Providing a runnable Python code snippet for quick inference using the `transformers` library.
- Removing the irrelevant "File information" section, as per documentation guidelines.
These updates make the model card more informative, user-friendly, and compliant with Hugging Face Hub standards.
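To sanity-check the metadata changes listed above, a small script along these lines could confirm that the front matter carries the expected keys. This is a minimal sketch and not part of the PR: `parse_front_matter` is a hypothetical helper that handles only the flat `key: value` pairs and simple lists this card uses, and a real check would rely on a full YAML parser.

```python
# Hypothetical helper: parses only the simple front matter used in this card
# (flat "key: value" pairs and "- item" lists), not general YAML.
def parse_front_matter(card_text):
    lines = card_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the front matter
            break
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: attach it to the most recent key if that key holds a list
            if current_key is not None and isinstance(metadata[current_key], list):
                metadata[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value means list items follow on the next lines
            metadata[current_key] = value if value else []
    return metadata


# Front matter as it appears on the new side of this diff
CARD = """---
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
datasets:
- Quinn777/AMATH-SFT
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

# Model Card for AtomThink-LlamaV
"""

metadata = parse_front_matter(CARD)
missing = [k for k in ("license", "pipeline_tag", "library_name") if k not in metadata]
print(metadata["pipeline_tag"], missing)
```

An empty `missing` list indicates that the three fields this PR touches are all present.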
|
@@ -1,16 +1,111 @@
|
|
| 1 |
---
|
| 2 |
-
datasets:
|
| 3 |
-
- Quinn777/AMATH-SFT
|
| 4 |
base_model:
|
| 5 |
- meta-llama/Llama-3.2-11B-Vision-Instruct
|
| 6 |
---
|
| 7 |
# Model Card for AtomThink-LlamaV
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
@article{xiang2025can,
|
| 15 |
title={Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?},
|
| 16 |
author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Cai, Kaixin and Yin, Yiyang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and others},
|
|
@@ -26,5 +121,11 @@ If you use this dataset in your research, please cite:
|
|
| 26 |
}
|
| 27 |
```
|
| 28 |
|
| 29 |
-
# License
|
| 30 |
-
|
| 1 |
---
|
| 2 |
base_model:
|
| 3 |
- meta-llama/Llama-3.2-11B-Vision-Instruct
|
| 4 |
+
datasets:
|
| 5 |
+
- Quinn777/AMATH-SFT
|
| 6 |
+
license: mit
|
| 7 |
+
pipeline_tag: image-text-to-text
|
| 8 |
+
library_name: transformers
|
| 9 |
---
|
| 10 |
+
|
| 11 |
# Model Card for AtomThink-LlamaV
|
| 12 |
|
| 13 |
+
This model, **AtomThink-LlamaV**, is a multimodal large language model (MLLM) post-trained from `Llama-3.2-11B-Vision-Instruct` using the **AtomThink** framework. It is designed to tackle complex multimodal mathematical reasoning problems by incorporating "slow thinking" through atomic step reasoning.
|
| 14 |
+
|
| 15 |
+
AtomThink introduces a novel paradigm called Self-structured Chain of Thought (SCoT), which uses minimal semantic atomic steps. This allows the model to adaptively apply different levels of reasoning for varying task complexities, generating cognitive CoT structures for complex tasks while avoiding "overthinking" for simpler ones.
|
| 16 |
+
|
| 17 |
+
<p align="center">
|
| 18 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/logo.jpg" alt="Project Logo" width="400">
|
| 19 |
+
</p>
|
| 20 |
+
|
| 21 |
+
- **Paper**: [AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning](https://huggingface.co/papers/2411.11930)
|
| 22 |
+
- **Project Page**: [https://quinn777.github.io/AtomThink/](https://quinn777.github.io/AtomThink/)
|
| 23 |
+
- **Code**: [https://github.com/Quinn777/AtomThink](https://github.com/Quinn777/AtomThink)
|
| 24 |
+
|
| 25 |
+
## Abstract
|
| 26 |
+
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of different complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which comprises minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we further design a novel AtomThink framework with four key modules, including (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3%.
|
| 27 |
+
|
| 28 |
+
## Key Features
|
| 29 |
+
|
| 30 |
+
* 🧠 Introduces **GPT-o1** style reasoning via long Chain of Thought (CoT) for complex multimodal mathematical tasks.
|
| 31 |
+
* 🛠️ Combines a CoT annotation engine, atomic step fine-tuning, and policy search strategies to enhance reasoning.
|
| 32 |
+
* 📊 Provides a capability evaluation metric that assesses the quality of each reasoning step.
|
| 33 |
+
* ⚡ Demonstrates a test-time scaling law in MLLMs.
|
| 34 |
+
* 📈 Achieves state-of-the-art performance in multimodal mathematical reasoning tasks, with over 10% average accuracy gains on MathVista and MathVerse.
|
| 35 |
+
|
| 36 |
+
<p align="center">
|
| 37 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/framework.png" alt="AtomThink Framework" width="800">
|
| 38 |
+
</p>
|
| 39 |
+
<p align="center">
|
| 40 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/fig1.png" alt="Comparison with structured and unstructured reasoning models" width="800">
|
| 41 |
+
</p>
|
| 42 |
+
Comparison with structured and unstructured reasoning models. Our model autonomously generates reasoning structures of dynamic form and length based on the type of problem. For text-dominant questions, as shown on the left, we bypass image captioning and extract information directly from the question stem. For the low-difficulty problem on the right, we use fewer tokens than an o1-like model.
|
| 43 |
+
|
| 44 |
+
<p align="center">
|
| 45 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/fig2.png" alt="Comparison of average response length in AtomThink-LlamaV" width="800">
|
| 46 |
+
</p>
|
| 47 |
+
Comparison of the average response length in AtomThink-LlamaV over benchmarks with different complexity. (a) As tasks become more challenging, the model proactively uses more tokens. (b) The proportion of longer CoT outputs containing more atomic steps increases. A higher level signifies increased difficulty. The performance decline of AtomThink models is narrower (-20.4% vs. -30.7% in LLaVA1.5, -30% vs. -43.0% in LlamaV).
|
| 48 |
+
|
| 49 |
+
## Usage
|
| 50 |
+
|
| 51 |
+
This model can be loaded and used with the `transformers` library for multimodal reasoning tasks. Ensure `trust_remote_code=True` is set for proper loading of custom model architectures.
|
| 52 |
|
| 53 |
+
```python
import requests
import torch
from PIL import Image
# transformers has no AutoModelForConditionalGeneration class; AutoModelForImageTextToText
# (available in recent releases) matches the card's image-text-to-text pipeline tag.
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and processor
model_id = "Quinn777/AtomThink-Llama-3.2-Vision-11B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your GPU
    device_map="auto",
    trust_remote_code=True,  # Only takes effect if the repository ships custom modeling files
)

# Example: Multimodal mathematical reasoning
# You can replace this with your own image path or URL
# For demonstration, we use an image from the original GitHub repository
image_url = "https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/case1.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

prompt = "Given the image, describe the geometric shape and calculate its perimeter if each unit is 1cm."

# The image itself is passed to the processor below; the message only marks its position
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
]

# Apply the chat template, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the special tokens, so do not add them a second time
inputs = processor(
    images=image, text=text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

# Generate response (do_sample=False selects greedy decoding for deterministic output)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

print(response)
```
|
| 91 |
+
|
| 92 |
+
## Case Study
|
| 93 |
+
|
| 94 |
+
The original project repository provides atomic samples in the AMATH dataset, showcasing the model's step-by-step reasoning:
|
| 95 |
+
|
| 96 |
+
Example 1:
|
| 97 |
+
<p align="center">
|
| 98 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/case1.png" alt="Case Study Example 1" width="800">
|
| 99 |
+
</p>
|
| 100 |
+
|
| 101 |
+
Example 2:
|
| 102 |
+
<p align="center">
|
| 103 |
+
<img src="https://raw.githubusercontent.com/Quinn777/AtomThink/main/figures/case2.png" alt="Case Study Example 2" width="800">
|
| 104 |
+
</p>
|
| 105 |
+
|
| 106 |
+
## Citation
|
| 107 |
+
If you use this model or the associated dataset in your research, please cite:
|
| 108 |
+
```bibtex
|
| 109 |
@article{xiang2025can,
|
| 110 |
title={Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?},
|
| 111 |
author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Cai, Kaixin and Yin, Yiyang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and others},
|
|
|
|
| 121 |
}
|
| 122 |
```
|
| 123 |
|
| 124 |
+
## License
|
| 125 |
+
This project is licensed under the [MIT License](https://github.com/Quinn777/AtomThink/blob/main/LICENSE). Please ensure proper attribution when using this checkpoint.
|
| 126 |
+
|
| 127 |
+
## Acknowledgement
|
| 128 |
+
We would like to thank the following repositories for their contributions:
|
| 129 |
+
- [hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): This library was used for training.
|
| 130 |
+
- [bklieger-groq/g1](https://github.com/bklieger-groq/g1): This library was used for data processing.
|
| 131 |
+
- [openreasoner/openr](https://github.com/openreasoner/openr): This tool was helpful for deploying the process supervision model.
|
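One detail worth noting for the usage snippet in this diff: for decoder-only models, `generate` returns the prompt token ids followed by the newly generated ones, so decoding the full output repeats the prompt in the printed response. The sketch below shows the slicing idea on plain Python lists. `strip_prompt_tokens` is a hypothetical helper and the token ids are made up; with real tensors the equivalent would presumably be `output_ids[:, inputs["input_ids"].shape[-1]:]` before calling `batch_decode`.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Toy example with made-up token ids: a 3-token prompt followed by 2 new tokens
prompt_ids = [101, 7, 8]
generated = [prompt_ids + [42, 43]]

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)
```

Whether to apply this in the model card is a judgment call: keeping the prompt in the output can be useful for debugging, while stripping it gives a cleaner response for end users.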