# Uni-MuMER-Qwen3-VL-2B

This model is a Uni-MuMER fine-tune of Qwen3-VL-2B-Instruct for Handwritten Mathematical Expression Recognition (HMER): converting images of handwritten mathematical expressions into LaTeX.

Uni-MuMER uses unified multi-task fine-tuning with three auxiliary tasks (Tree-CoT, Error-Driven Learning, Symbol Counting) to inject domain-specific knowledge into a general-purpose vision-language model.
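The unified format can be pictured as instruction-tuning samples that share one image but differ per task. The sketch below is purely illustrative: the field names, prompt schema, and target strings are assumptions, not the paper's exact data format.

```python
# Illustrative sketch of the unified multi-task data format.
# Field names, task labels, and targets are assumptions for illustration,
# not the paper's exact schema.
image = "crohme/sample_001.png"  # hypothetical path

samples = [
    # Core HMER task: image -> LaTeX.
    {"image": image, "task": "hmer",
     "target": r"\frac{a}{b} + c"},
    # Tree-CoT: reason over the expression's structure tree before the LaTeX.
    {"image": image, "task": "tree_cot",
     "target": r"root: + ; left: frac(a, b) ; right: c ; latex: \frac{a}{b} + c"},
    # Error-Driven Learning: correct a perturbed candidate transcription.
    {"image": image, "task": "error_driven",
     "candidate": r"\frac{a}{6} + c",
     "target": r"\frac{a}{b} + c"},
    # Symbol Counting: enumerate symbols as an auxiliary supervision signal.
    {"image": image, "task": "symbol_counting",
     "target": r"a:1 b:1 c:1 +:1 \frac:1"},
]
```

All four sample types are trained jointly in a single fine-tuning run, which is what lets the auxiliary tasks inject structural knowledge into the base model.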

Paper: *Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition* (NeurIPS 2025 Spotlight)

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-2B-Instruct |
| Architecture | Qwen3-VL (dedicated vision-language model with ViT encoder) |
| Parameters | 2.1B |
| Training Data | Uni-MuMER-Data (~1.6M samples, 26 subsets) |
| Fine-tuning | Full (all parameters, including vision encoder) |
| Precision | bf16 |
| Template | qwen3_vl_nothink (via LLaMA-Factory) |

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Learning Rate | 1e-5 (cosine schedule, 10% warmup) |
| Effective Batch Size | 512 (8 GPUs × 4 per-device × 16 accumulation) |
| DeepSpeed | ZeRO-3 |
| Image Max Pixels | 262,144 (512×512) |
| Cutoff Length | 2048 tokens |
| Hardware | 8× NVIDIA A100 80GB |
| Training Time | ~18 hours |
| Final Train Loss | 0.0433 |
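These hyperparameters roughly correspond to a LLaMA-Factory training file. The sketch below is a reconstruction from the table above: field names follow LLaMA-Factory conventions, values not stated in the table are omitted, and the exact recipe used for this checkpoint may differ.

```yaml
# Reconstructed sketch of a LLaMA-Factory config matching the table above.
# Field names follow LLaMA-Factory conventions; this is not the verified original.
model_name_or_path: Qwen/Qwen3-VL-2B-Instruct
template: qwen3_vl_nothink
finetuning_type: full
freeze_vision_tower: false       # vision encoder is trained too
deepspeed: examples/deepspeed/ds_z3_config.json
per_device_train_batch_size: 4
gradient_accumulation_steps: 16  # 8 GPUs x 4 x 16 = 512 effective
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
num_train_epochs: 1
cutoff_len: 2048
image_max_pixels: 262144
bf16: true
```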

## Benchmark Results

### ExpRate (Exact Match Rate)

| Dataset | Samples | Uni-MuMER-3B | Qwen3.5-2B | Qwen3.5-4B | This Model | Qwen3-VL-4B |
|---|---|---|---|---|---|---|
| CROHME 2014 | 986 | 82.25% | 83.98% | 82.56% | 83.27% | 82.35% |
| CROHME 2016 | 1,147 | 78.29% | 81.17% | 78.20% | 78.55% | 79.34% |
| CROHME 2019 | 1,199 | 79.82% | 80.15% | 75.98% | 79.40% | 78.98% |
| CROHME 2023 Test | 2,300 | 69.52% | 69.43% | 66.74% | 70.96% | 69.17% |
| CROHME 2023 Val | 555 | 68.11% | 69.91% | 67.75% | 70.63% | 66.67% |
| HME100K Test | 24,607 | 69.50% | 70.43% | 70.02% | 69.31% | 69.79% |
| Im2LaTeXv2 Test | 10,118 | 76.99% | 77.82% | 77.20% | 77.11% | 77.46% |
| MathWriting Test | 7,643 | 53.03% | 51.84% | 54.32% | 50.66% | 53.15% |
| **Average** | | 72.19% | 73.09% | 71.60% | 72.49% | 72.11% |

### CDM Metrics (Visual-Equivalence-Aware)

| Dataset | CDM F1 | CDM ExpRate |
|---|---|---|
| CROHME 2014 | 0.9700 | 86.50% |
| CROHME 2016 | 0.9660 | 81.50% |
| CROHME 2019 | 0.9700 | 82.60% |
| HME100K Test | 0.9710 | 72.40% |
| Im2LaTeXv2 Test | 0.9940 | 92.60% |
| MathWriting Test | 0.9480 | 69.00% |
| **Average** | 0.9692 | 80.35% |

## Usage

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load the fine-tuned model (assumes a GPU with enough memory for ~2B params).
model = LLM(
    model="phxember/Uni-MuMER-Qwen3-VL-2B",
    trust_remote_code=True,
    max_model_len=2048,
)

# Low temperature keeps the LaTeX output near-deterministic.
sampling_params = SamplingParams(temperature=0.2, top_p=0.8, max_tokens=2048)

# See https://github.com/BFlameSwift/Uni-MuMER for full inference scripts.
```
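For a single image, the request can be expressed as an OpenAI-style multimodal chat message and passed to vLLM's chat interface. The helper below only builds the message payload; the instruction text and image path are illustrative assumptions, and the exact prompt used for training may differ (see the repository's inference scripts).

```python
# Build an OpenAI-style multimodal message for vLLM's chat API.
# The instruction wording and image URL are illustrative assumptions.
def build_hmer_message(image_url: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text",
                 "text": "Convert this handwritten mathematical expression to LaTeX."},
            ],
        }
    ]

messages = build_hmer_message("file:///path/to/expression.png")
# Then run, e.g.: outputs = model.chat(messages, sampling_params)
```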

### Batch Evaluation

```shell
git clone https://github.com/BFlameSwift/Uni-MuMER.git
cd Uni-MuMER
bash eval/eval_all.sh -m phxember/Uni-MuMER-Qwen3-VL-2B
```

## Related Models

| Model | Base | Avg ExpRate | HuggingFace |
|---|---|---|---|
| Uni-MuMER-Qwen2.5-VL-3B | Qwen2.5-VL-3B | 72.19% | Link |
| Uni-MuMER-Qwen2.5-VL-7B | Qwen2.5-VL-7B | - | Link |
| Uni-MuMER-Qwen3.5-2B | Qwen3.5-2B | 73.09% | Link |
| Uni-MuMER-Qwen3.5-4B | Qwen3.5-4B | 71.60% | Link |
| Uni-MuMER-Qwen3-VL-2B | Qwen3-VL-2B | 72.49% | This model |
| Uni-MuMER-Qwen3-VL-4B | Qwen3-VL-4B | 72.11% | Link |

## Citation

```bibtex
@article{li2025unimumer,
  title = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
  author = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Wei, Baole and Zhou, Yuxuan and Gao, Liangcai},
  year = {2025},
  journal = {arXiv preprint arXiv:2505.23566},
}
```