COMPASS-VLM Phase 1

Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the Phase 1 checkpoint of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.

Developed by Atsushi Yanagisawa and Genshin Kakimoto as part of the FT-LLM 2026 free-form task.


Model Details

| Item | Value |
|---|---|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see License) |

Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Role in Phase 1 |
|---|---|---|
| Vision Encoder | google/siglip2-so400m-patch14-384 | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~8M params | Trainable in both stages |
| LLM | llm-jp/llm-jp-4-8b-instruct (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
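The projector in the table above can be sketched in a few lines of PyTorch. This is a minimal illustration using the stated dimensions (1152 → 4096 → 4096 with a GELU in between); the class name and patch count are assumptions, not the released code.

```python
import torch
from torch import nn

# Minimal sketch of the Phase 1 MLP projector; the class name and
# forward signature are illustrative, dimensions are from the table above.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # 1152 -> 4096
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # 4096 -> 4096
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, 1152) from the vision encoder
        return self.proj(vision_tokens)

projector = MLPProjector()
out = projector(torch.randn(1, 729, 1152))  # 729 = 27x27 patch grid (assumed)
print(tuple(out.shape))
```

The projector maps each vision token into the LLM's embedding width, so its output can be spliced directly into the text-token sequence.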

Training Procedure

Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.

Stage 1-1 — Image Caption Pretraining

  • Goal: Align vision tokens with the LLM embedding space.
  • Trainable: MLP projector only.
  • Datasets:
  • Learning rate: 1e-3 · Epochs: 2 · Effective batch size: 128
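The Stage 1-1 setup (projector trainable, everything else frozen) amounts to standard `requires_grad` toggling. A toy sketch, with small stand-in modules instead of the real SigLIP-v2 encoder and LLM-JP backbone:

```python
import torch
from torch import nn

# Toy stand-ins for the three components; the real modules are the SigLIP-v2
# encoder, the MLP projector, and LLM-JP-4-8B (shapes here are illustrative).
vision_encoder = nn.Linear(1152, 1152)
projector = nn.Sequential(nn.Linear(1152, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 4096)

# Stage 1-1: freeze everything except the projector, so only the
# vision-to-LLM alignment is learned.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False
for p in projector.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
frozen_ok = all(not p.requires_grad
                for m in (vision_encoder, llm) for p in m.parameters())
print(frozen_ok, trainable > 0)
```

In Stage 1-2 the same pattern extends to unfreezing the vision encoder (at its smaller learning rate) and attaching LoRA adapters to the LLM.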

Stage 1-2 — Visual Instruction Tuning

Common Hyperparameters

| Parameter | Value |
|---|---|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
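These values combine via the usual product rule: effective batch size = per-device batch × gradient-accumulation steps × number of GPUs. A quick check, assuming a single GPU (as in the single-H100 recommendation below):

```python
per_device_batch = 2
grad_accum_steps = 64
num_gpus = 1  # assumption: single H100 as recommended below

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # matches the Stage 1-1 effective batch size of 128
```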

Training uses NCCL and supports torchrun, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.


Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}

### 応答:
{assistant_answer}<|eos|>
```

Special tokens:

| Token | Purpose |
|---|---|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<\|eos\|>` | End-of-sequence token terminating the assistant response |

Typical prompts used during training:

  • Stage 1-1 caption prompt: この画像を端的に説明してください。 ("Please briefly describe this image.")
  • Stage 1-2 VQA prompt: この画像を見て、質問に答えてください。 ("Look at this image and answer the question.")
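Assembling the template above into an inference prompt is plain string formatting. A sketch, where `build_prompt` is a hypothetical helper rather than part of the released code:

```python
# LLM-JP v4 instruct template as shown above; build_prompt is a
# hypothetical helper, not part of the released code.
TEMPLATE = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n"
    "<image>\n"
    "この画像を見て、質問に答えてください。\n"
    "{user_question}\n\n"
    "### 応答:\n"
)

def build_prompt(user_question: str) -> str:
    return TEMPLATE.format(user_question=user_question)

prompt = build_prompt("この図の売上高はいくらですか？")
print(prompt)
```

The model then generates the assistant answer after `### 応答:` and terminates it with `<|eos|>`.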

Intended Use

Direct Use

  • Japanese image captioning
  • Japanese visual question answering (VQA)
  • Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)

Downstream Use

This checkpoint is specifically intended to be continued into:

  • Phase 2 — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → Yana/compass-vlm-phase2
  • Phase 3 — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → Yana/compass-vlm

Out-of-Scope Use

  • High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
  • Generation of factual claims without verification; the model can hallucinate.
  • Use in languages other than Japanese and English; such use has not been evaluated.

Evaluation

Phase 1 is evaluated qualitatively by inspecting automatically generated raw outputs on:

  • STAIR Captions License ID 5 held-out samples
  • OCR held-out samples from the training OCR corpus

Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.


Limitations and Biases

  • The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
  • The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
  • OCR quality on small-font or low-resolution documents is limited.
  • This Phase 1 checkpoint has not received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.

How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

For the full inference pipeline (image preprocessing with SigLIP-v2, <image> token expansion, and AnyRes handling), please refer to the phase1/ directory in the GitHub repository.
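The `<image>` token expansion mentioned above can be illustrated in isolation: before embedding, the single placeholder is replaced by one slot per projected vision token (729 here, an assumed 27×27 SigLIP patch grid; the helper and slot string are illustrative, not the repository's API).

```python
IMAGE_TOKEN = "<image>"
NUM_VISION_TOKENS = 729  # assumption: 27x27 patch grid at 384px input

def expand_image_tokens(token_list, image_token=IMAGE_TOKEN,
                        num_vision_tokens=NUM_VISION_TOKENS):
    """Replace each <image> placeholder with num_vision_tokens slots.

    In the real pipeline the slots are filled with projected SigLIP
    embeddings; here they are plain strings for illustration.
    """
    expanded = []
    for tok in token_list:
        if tok == image_token:
            expanded.extend(["<vision_slot>"] * num_vision_tokens)
        else:
            expanded.append(tok)
    return expanded

tokens = ["### 指示:", "<image>", "この画像を見て、質問に答えてください。"]
expanded = expand_image_tokens(tokens)
print(len(expanded))  # 2 text tokens + 729 vision slots = 731
```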


Citation

If you use this model, please cite the COMPASS project:

@misc{compass2026,
  title  = {COMPASS: Development of a Japanese Financial VLM through
            Integration of Reasoning Enhancement and Document Comprehension},
  author = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year   = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note   = {FT-LLM 2026 free-form task}
}

Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.


License

This model is released under the Apache License 2.0.

Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.

Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.


Acknowledgements

Built on top of outstanding open-source work, including LLaVA-1.5 / LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, and ja-vg-vqa.
