# COMPASS-VLM Phase 1
Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the Phase 1 checkpoint of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.
Developed by Atsushi Yanagisawa and Genshin Kakimoto as part of the FT-LLM 2026 free-form task.
- 📦 Code: github.com/AtsushiYanaigsawa768/Compass
- 📚 Collection: Yana/compass
- 📝 Blog (EN): atsushiyanaigsawa768.github.io/mysite/en/blog/compass
## Model Details
| Item | Value |
|---|---|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see License) |
## Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Role in Phase 1 |
|---|---|---|
| Vision Encoder | google/siglip2-so400m-patch14-384 | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21.5M params | Trainable in both stages |
| LLM | llm-jp/llm-jp-4-8b-instruct (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
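Given the dimensions above, the projector can be sketched in PyTorch as follows. This is a minimal illustration; the class and attribute names are hypothetical, not taken from the COMPASS repository.

```python
import torch
import torch.nn as nn

# Minimal sketch of the Phase 1 MLP projector described above: it maps
# SigLIP-v2 patch features (1152-dim) into the LLM embedding space
# (4096-dim). Names here are illustrative, not the repository's.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(vision_tokens)

projector = MLPProjector()
out = projector(torch.randn(1, 729, 1152))  # patch count depends on resolution
```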
## Training Procedure
Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.
### Stage 1-1 — Image Caption Pretraining
- Goal: Align vision tokens with the LLM embedding space.
- Trainable: MLP projector only.
- Datasets:
- STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
- Yana/ft-llm-2026-ocr-dataset
- Learning rate: 1e-3 · Epochs: 2 · Effective batch size: 128
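The Stage 1-1 freezing pattern can be sketched as below, with toy `nn.Linear` stand-ins for the real components; only the projector's parameters reach the optimizer.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components (the real ones are SigLIP-v2,
# the MLP projector, and LLM-JP-4-8B); sizes here are arbitrary.
vision_encoder = nn.Linear(8, 8)
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 8)

# Stage 1-1 recipe: freeze everything except the projector.
for frozen in (vision_encoder, llm):
    for p in frozen.parameters():
        p.requires_grad = False

# The optimizer only sees the projector (lr = 1e-3 as in the recipe above).
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

n_trainable = sum(
    1
    for module in (vision_encoder, projector, llm)
    for p in module.parameters()
    if p.requires_grad
)
print(n_trainable)  # 2: the projector's weight and bias
```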
### Stage 1-2 — Visual Instruction Tuning
- Goal: Enable VQA and instruction following in Japanese.
- Trainable: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6).
- Datasets:
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation (~90k on Visual Genome images)
- SakanaAI/JA-VG-VQA-500
- Learning rate: 2e-5 · Epochs: 1 · Effective batch size: 128
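The LoRA settings (r = 64, α = 128) correspond to a low-rank update h = Wx + (α/r)·BAx with W frozen. A hand-rolled sketch follows; real training code would typically use a library such as PEFT, so this only illustrates the math.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer with r = 64, alpha = 128."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # pretrained weights stay frozen
        self.scaling = alpha / r           # 128 / 64 = 2.0
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096))
x = torch.randn(2, 4096)
y = layer(x)  # identical to layer.base(x) until the adapter is trained
```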
### Common Hyperparameters
| Parameter | Value |
|---|---|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
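The effective batch size of 128 follows from per-device batch × gradient accumulation steps (per GPU; data parallelism multiplies it further). A minimal sketch of the accumulation pattern, using a toy model:

```python
import torch
import torch.nn as nn

per_device_batch = 2
grad_accum_steps = 64
effective_batch = per_device_batch * grad_accum_steps  # 2 * 64 = 128

# Toy model/optimizer just to illustrate the update pattern; lr and
# weight decay values are taken from the tables above.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

optimizer.zero_grad()
for _ in range(grad_accum_steps):
    x = torch.randn(per_device_batch, 4)
    loss = model(x).pow(2).mean()
    (loss / grad_accum_steps).backward()  # average over micro-batches
optimizer.step()  # one optimizer update per 128 effective samples
```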
Training uses NCCL and supports torchrun, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。
### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}
### 応答:
{assistant_answer}<|eos|>
```
Special tokens:

| Token | Purpose |
|---|---|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<\|eos\|>` | End-of-sequence token appended to each response |
Typical prompts used during training:

- Stage 1-1 caption prompt: この画像を端的に説明してください。("Please briefly describe this image.")
- Stage 1-2 VQA prompt: この画像を見て、質問に答えてください。("Look at this image and answer the question.")
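A single-turn prompt in this format can be assembled with a small helper. The function below is hypothetical (the repository's own chat template is authoritative, and exact whitespace may differ):

```python
# Hypothetical helper assembling the single-turn LLM-JP v4 instruct
# prompt shown above; whitespace may differ from the repository's
# canonical template.
def build_prompt(user_question: str) -> str:
    return (
        "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n"
        "### 指示:\n"
        "<image>\n"
        "この画像を見て、質問に答えてください。\n"
        f"{user_question}\n"
        "### 応答:\n"
    )

# "What is shown in this image?"
prompt = build_prompt("この画像には何が写っていますか？")
```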
## Intended Use

### Direct Use
- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)
### Downstream Use
This checkpoint is specifically intended to be continued into:
- Phase 2 — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → Yana/compass-vlm-phase2
- Phase 3 — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → Yana/compass-vlm
### Out-of-Scope Use
- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.
## Evaluation

Phase 1 is evaluated qualitatively, by inspecting automatically generated raw outputs on:
- STAIR Captions License ID 5 held-out samples
- OCR held-out samples from the training OCR corpus
Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.
## Limitations and Biases
- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has not received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.
## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the `phase1/` directory in the GitHub repository.
## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.
## License
This model is released under the Apache License 2.0.
Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
## Acknowledgements

Built on top of outstanding open-source work, including LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, and ja-vg-vqa.