This repository provides Asagi-2B, a large-scale Japanese Vision & Language Model (VLM).
Asagi-2B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.
A significant portion of the training data was synthesized with models such as the Japanese large language model CALM3-22B-Chat and the English Vision & Language Model Phi3.5-vision-instruct.
Importantly, we do not use LLMs whose license terms restrict the use of their outputs (e.g., GPT-4) to synthesize the training data.
Note: ROIS (Ours) is a dataset newly crawled from the web specifically for this project.
It consists of image and raw-text pairs, which are used to synthesize the training data.
## Evaluation
We evaluated our models on Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500, using the eval-mm library.
Models marked with "†" are not trained on GPT-generated data.
Bold numbers indicate the best performance among all models; underlined numbers indicate the best among models not trained on GPT-generated data.
| Model | LM Size | Heron-Bench (LLM, %) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | JA-VLM-Bench-In-the-Wild (LLM, /5.0) | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM, /5.0) |
|---|---|---|---|---|---|---|
| Japanese InstructBLIP Alpha† | 7B | 14.0 | 20.8 | 2.42 | - | - |
| Japanese Stable VLM† | 7B | 24.2 | 23.3 | 2.47 | - | - |
| LLaVA-CALM2-SigLIP† | 7B | 43.3 | 47.2 | 3.15 | 17.4 | 3.21 |
| Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 41.4 | 2.92 | 23.5 | 2.96 |
| VILA-jp | 13B | 57.2 | **52.3** | 3.69 | 16.2 | 3.62 |
| Asagi-2B† | 1.8B | 44.7 | 48.8 | 3.26 | 53.7 | 3.69 |
| Asagi-4B† | 3.7B | 49.3 | 49.6 | 3.38 | 55.6 | 3.78 |
| Asagi-8B† | 7.2B | 54.7 | 49.4 | <u>3.45</u> | 56.43 | **<u>3.84</u>** |
| Asagi-14B† | 13B | <u>55.8</u> | <u>50.8</u> | 3.44 | **<u>56.8</u>** | **<u>3.84</u>** |
| GPT-4o | - | **87.6** | 37.6 | **3.85** | 12.1 | 3.58 |
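The ROUGE-L columns above measure longest-common-subsequence (LCS) overlap between a model's answer and the reference answer. Below is a minimal sketch of the metric over pre-tokenized text; it is not the eval-mm implementation (which, among other things, must tokenize Japanese text first), and the `beta` default of 1.2 is an assumption borrowed from common ROUGE-L implementations.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: list[str], reference: list[str], beta: float = 1.2) -> float:
    """ROUGE-L F-score over token lists (beta weights recall over precision)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, a candidate sharing two of three tokens in order with the reference scores 2/3, and an exact match scores 1.0.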
## Risks and Limitations

The models released here are at an early stage of our research and development, and they have not been tuned to ensure that their outputs align with human intent and safety considerations.