Uni-MuMER
This model is a Uni-MuMER fine-tuned version of Qwen3.5-2B for Handwritten Mathematical Expression Recognition (HMER): converting images of handwritten mathematical expressions into LaTeX.
Uni-MuMER uses unified multi-task fine-tuning with three auxiliary tasks (Tree-CoT, Error-Driven Learning, Symbol Counting) to inject domain-specific knowledge into a general-purpose vision-language model.
Paper: Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition (NeurIPS 2025 Spotlight)
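The three auxiliary tasks share a single instruction-tuning format, so one annotated image can yield several supervision targets. The sketch below illustrates the idea; the task prompts, field names, and helper function are assumptions for illustration, not the exact strings in the released training data (see the repo for those).

```python
# Hypothetical sketch of Uni-MuMER-style multi-task supervision: one image of
# a handwritten expression is expanded into several instruction/answer pairs.
# Prompt wording and field names are illustrative assumptions, not the exact
# strings used in the released dataset.

def build_multitask_samples(image_path: str, latex: str, symbol_counts: dict) -> list[dict]:
    """Expand a single (image, LaTeX) pair into the main task plus three auxiliary tasks."""
    counts_text = ", ".join(f"{sym}: {n}" for sym, n in sorted(symbol_counts.items()))
    return [
        # Main task: direct image-to-LaTeX recognition.
        {"task": "recognition",
         "prompt": "Write the LaTeX for this handwritten expression.",
         "image": image_path, "answer": latex},
        # Tree-CoT: reason over the expression's structure before answering.
        {"task": "tree_cot",
         "prompt": "Describe the expression tree, then give the LaTeX.",
         "image": image_path, "answer": latex},
        # Error-driven learning: correct a perturbed draft prediction.
        {"task": "error_correction",
         "prompt": "The draft LaTeX below may contain errors. Correct it.",
         "image": image_path, "answer": latex},
        # Symbol counting: auxiliary grounding of the visual symbols.
        {"task": "symbol_counting",
         "prompt": "Count each symbol in the image.",
         "image": image_path, "answer": counts_text},
    ]

samples = build_multitask_samples("ex1.png", r"\frac{a}{b}", {"a": 1, "b": 1, "frac": 1})
print(len(samples))  # one image -> four supervision targets
```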
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-2B |
| Architecture | Qwen3.5 (Gated DeltaNet + Gated Attention hybrid, unified multimodal) |
| Parameters | 2.2B |
| Training Data | Uni-MuMER-Data (~1.6M samples, 26 subsets) |
| Fine-tuning | Full (all parameters, including vision encoder) |
| Precision | bf16 |
| Template | qwen3_5_nothink (via LLaMA-Factory) |

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Learning Rate | 1e-5 (cosine schedule, 10% warmup) |
| Effective Batch Size | 512 (8 GPUs x 4 per-device x 16 accumulation) |
| DeepSpeed | ZeRO-3 |
| Image Max Pixels | 262,144 (512x512) |
| Cutoff Length | 2048 tokens |
| Hardware | 8x NVIDIA A100 80GB |
| Training Time | ~21 hours |
| Final Train Loss | 0.0298 |
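Since training goes through LLaMA-Factory, the table above maps naturally onto a training config. The YAML below is a sketch assembled from those values using standard LLaMA-Factory field names; treat it as an assumption, not the released training recipe.

```yaml
# Sketch of a LLaMA-Factory full-finetuning config matching the table above.
# Field names follow LLaMA-Factory conventions; paths and exact values are
# assumptions, not the authors' released config.
model_name_or_path: Qwen/Qwen3.5-2B
template: qwen3_5_nothink
finetuning_type: full
freeze_vision_tower: false       # vision encoder is trained too
deepspeed: examples/deepspeed/ds_z3_config.json
bf16: true
num_train_epochs: 1.0
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
per_device_train_batch_size: 4   # x 8 GPUs x 16 accumulation = 512 effective
gradient_accumulation_steps: 16
cutoff_len: 2048
image_max_pixels: 262144         # 512 x 512
```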

Expression recognition rate (ExpRate) on standard benchmarks, compared against the other Uni-MuMER variants:

| Dataset | Samples | Uni-MuMER-3B | This Model | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|---|
| CROHME 2014 | 986 | 82.25% | 83.98% | 82.56% | 83.27% | 82.35% |
| CROHME 2016 | 1,147 | 78.29% | 81.17% | 78.20% | 78.55% | 79.34% |
| CROHME 2019 | 1,199 | 79.82% | 80.15% | 75.98% | 79.40% | 78.98% |
| CROHME 2023 Test | 2,300 | 69.52% | 69.43% | 66.74% | 70.96% | 69.17% |
| CROHME 2023 Val | 555 | 68.11% | 69.91% | 67.75% | 70.63% | 66.67% |
| HME100K Test | 24,607 | 69.50% | 70.43% | 70.02% | 69.31% | 69.79% |
| Im2LaTeXv2 Test | 10,118 | 76.99% | 77.82% | 77.20% | 77.11% | 77.46% |
| MathWriting Test | 7,643 | 53.03% | 51.84% | 54.32% | 50.66% | 53.15% |
| Average | | 72.19% | 73.09% | 71.60% | 72.49% | 72.11% |

CDM (Character Detection Matching) scores for this model:

| Dataset | CDM F1 | CDM ExpRate |
|---|---|---|
| CROHME 2014 | 0.9690 | 87.10% |
| CROHME 2016 | 0.9690 | 82.70% |
| CROHME 2019 | 0.9690 | 82.70% |
| HME100K Test | 0.9720 | 73.60% |
| Im2LaTeXv2 Test | 0.9950 | 93.70% |
| MathWriting Test | 0.9510 | 70.70% |
| Average | 0.9700 | 81.15% |
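ExpRate is exact-match accuracy between predicted and ground-truth LaTeX after normalization. The repo's eval scripts define the actual normalization rules; the sketch below uses a deliberately simplified rule (strip all whitespace) to show the shape of the metric.

```python
# Minimal sketch of ExpRate (expression recognition rate): the fraction of
# predictions whose LaTeX exactly matches the reference after normalization.
# Stripping ALL whitespace is a simplified assumption -- the repo's eval
# scripts implement the real normalization.

def normalize(latex: str) -> str:
    """Remove whitespace so spacing differences don't count as errors."""
    return "".join(latex.split())

def exprate(preds: list[str], refs: list[str]) -> float:
    """Exact-match rate over normalized prediction/reference pairs."""
    assert len(preds) == len(refs) and refs
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = [r"\frac{a}{b}", r"x^{2}+1", r"\sqrt{y }"]
refs  = [r"\frac{a}{b}", r"x^{2} + 1", r"\sqrt{y}"]
print(exprate(preds, refs))  # -> 1.0: all pairs match once spacing is removed
```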
```python
from vllm import LLM, SamplingParams

# Load the model with vLLM
model = LLM(
    model="phxember/Uni-MuMER-Qwen3.5-2B",
    trust_remote_code=True,
    max_model_len=2048,
)
sampling_params = SamplingParams(temperature=0.2, top_p=0.8, max_tokens=2048)
# See https://github.com/BFlameSwift/Uni-MuMER for full inference scripts
```
```bash
git clone https://github.com/BFlameSwift/Uni-MuMER.git
cd Uni-MuMER
bash eval/eval_all.sh -m phxember/Uni-MuMER-Qwen3.5-2B
```
Requirements:

- transformers >= 5.0.0 (required for Qwen3.5 architecture support)
- vllm >= 0.19.0 for inference (note: vLLM 0.19.0 requires transformers < 5, so use separate environments for training and inference)

| Model | Base | Avg ExpRate | HuggingFace |
|---|---|---|---|
| Uni-MuMER-Qwen2.5-VL-3B | Qwen2.5-VL-3B | 72.19% | Link |
| Uni-MuMER-Qwen2.5-VL-7B | Qwen2.5-VL-7B | - | Link |
| Uni-MuMER-Qwen3.5-2B | Qwen3.5-2B | 73.09% | This model |
| Uni-MuMER-Qwen3.5-4B | Qwen3.5-4B | 71.60% | Link |
| Uni-MuMER-Qwen3-VL-2B | Qwen3-VL-2B | 72.49% | Link |
| Uni-MuMER-Qwen3-VL-4B | Qwen3-VL-4B | 72.11% | Link |
```bibtex
@article{li2025unimumer,
  title   = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
  author  = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Wei, Baole and Zhou, Yuxuan and Gao, Liangcai},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.23566},
}
```