# COMPASS-VLM Phase 1
Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the Phase 1 checkpoint of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.
Developed by Atsushi Yanagisawa and Genshin Kakimoto as part of the FT-LLM 2026 free-form task.
- 📦 Code: github.com/AtsushiYanaigsawa768/Compass
- 📚 Collection: Yana/compass
- 📝 Blog (EN): atsushiyanaigsawa768.github.io/mysite/en/blog/compass
## Model Details
| Item | Value |
|---|---|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see License) |
## Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Role in Phase 1 |
|---|---|---|
| Vision Encoder | google/siglip2-so400m-patch14-384 | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21.5M params | Trainable in both stages |
| LLM | llm-jp/llm-jp-4-8b-instruct (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
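Given the dimensions above, the projector can be sketched in PyTorch as follows. This is a minimal illustration; the class and attribute names are hypothetical, not taken from the COMPASS repository.

```python
import torch
import torch.nn as nn

# Minimal sketch of the Phase 1 MLP projector described above: it maps
# SigLIP-v2 patch features (1152-dim) into the LLM embedding space
# (4096-dim). Names here are illustrative, not the repository's.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(vision_tokens)

projector = MLPProjector()
out = projector(torch.randn(1, 729, 1152))  # patch count depends on resolution
```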
## Training Procedure
Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.
### Stage 1-1 — Image Caption Pretraining
- Goal: Align vision tokens with the LLM embedding space.
- Trainable: MLP projector only.
- Datasets:
- STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
- Yana/ft-llm-2026-ocr-dataset
- Learning rate: 1e-3 · Epochs: 2 · Effective batch size: 128
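The Stage 1-1 freezing pattern can be sketched as below, with toy `nn.Linear` stand-ins for the real components; only the projector's parameters reach the optimizer.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components (the real ones are SigLIP-v2,
# the MLP projector, and LLM-JP-4-8B); sizes here are arbitrary.
vision_encoder = nn.Linear(8, 8)
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 8)

# Stage 1-1 recipe: freeze everything except the projector.
for frozen in (vision_encoder, llm):
    for p in frozen.parameters():
        p.requires_grad = False

# The optimizer only sees the projector (lr = 1e-3 as in the recipe above).
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

n_trainable = sum(
    1
    for module in (vision_encoder, projector, llm)
    for p in module.parameters()
    if p.requires_grad
)
print(n_trainable)  # 2: the projector's weight and bias
```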
### Stage 1-2 — Visual Instruction Tuning
- Goal: Enable VQA and instruction following in Japanese.
- Trainable: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6).
- Datasets:
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation (~90k on Visual Genome images)
- SakanaAI/JA-VG-VQA-500
- Learning rate: 2e-5 · Epochs: 1 · Effective batch size: 128
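The LoRA settings (r = 64, α = 128) correspond to a low-rank update h = Wx + (α/r)·BAx with W frozen. A hand-rolled sketch follows; real training code would typically use a library such as PEFT, so this only illustrates the math.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer with r = 64, alpha = 128."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # pretrained weights stay frozen
        self.scaling = alpha / r           # 128 / 64 = 2.0
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096))
x = torch.randn(2, 4096)
y = layer(x)  # identical to layer.base(x) until the adapter is trained
```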
### Common Hyperparameters
| Parameter | Value |
|---|---|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
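The effective batch size of 128 follows from per-device batch × gradient accumulation steps (per GPU; data parallelism multiplies it further). A minimal sketch of the accumulation pattern, using a toy model:

```python
import torch
import torch.nn as nn

per_device_batch = 2
grad_accum_steps = 64
effective_batch = per_device_batch * grad_accum_steps  # 2 * 64 = 128

# Toy model/optimizer just to illustrate the update pattern; lr and
# weight decay values are taken from the tables above.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

optimizer.zero_grad()
for _ in range(grad_accum_steps):
    x = torch.randn(per_device_batch, 4)
    loss = model(x).pow(2).mean()
    (loss / grad_accum_steps).backward()  # average over micro-batches
optimizer.step()  # one optimizer update per 128 effective samples
```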
Training uses NCCL and supports torchrun, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。
### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}
### 応答:
{assistant_answer}<|eos|>
```
Special tokens:

| Token | Purpose |
|---|---|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<\|eos\|>` | End-of-sequence token appended to each response |
Typical prompts used during training:

- Stage 1-1 caption prompt: この画像を端的に説明してください。("Please briefly describe this image.")
- Stage 1-2 VQA prompt: この画像を見て、質問に答えてください。("Look at this image and answer the question.")
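A single-turn prompt in this format can be assembled with a small helper. The function below is hypothetical (the repository's own chat template is authoritative, and exact whitespace may differ):

```python
# Hypothetical helper assembling the single-turn LLM-JP v4 instruct
# prompt shown above; whitespace may differ from the repository's
# canonical template.
def build_prompt(user_question: str) -> str:
    return (
        "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n"
        "### 指示:\n"
        "<image>\n"
        "この画像を見て、質問に答えてください。\n"
        f"{user_question}\n"
        "### 応答:\n"
    )

# "What is shown in this image?"
prompt = build_prompt("この画像には何が写っていますか？")
```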
## Intended Use

### Direct Use
- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)
### Downstream Use
This checkpoint is specifically intended to be continued into:
- Phase 2 — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → Yana/compass-vlm-phase2
- Phase 3 — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → Yana/compass-vlm
### Out-of-Scope Use
- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.
## Evaluation

Phase 1 is evaluated qualitatively, by inspecting automatically generated raw outputs on:
- STAIR Captions License ID 5 held-out samples
- OCR held-out samples from the training OCR corpus
Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.
## Limitations and Biases
- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has not received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.
## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the `phase1/` directory in the GitHub repository.
## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.
## License
This model is released under the Apache License 2.0.
Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
## Acknowledgements

Built on top of outstanding open-source work, including LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, and ja-vg-vqa.