Uni-MuMER
This model is a Uni-MuMER fine-tuned version of Qwen3.5-2B for Handwritten Mathematical Expression Recognition (HMER): converting images of handwritten mathematical expressions into LaTeX.
Uni-MuMER uses unified multi-task fine-tuning with three auxiliary tasks (Tree-CoT, Error-Driven Learning, Symbol Counting) to inject domain-specific knowledge into a general-purpose vision-language model.
Paper: Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition (NeurIPS 2025 Spotlight)
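The three auxiliary tasks share a single instruction-tuning format, so one annotated image can yield several supervision targets. The sketch below illustrates the idea; the task prompts, field names, and helper function are assumptions for illustration, not the exact strings in the released training data (see the repo for those).

```python
# Hypothetical sketch of Uni-MuMER-style multi-task supervision: one image of
# a handwritten expression is expanded into several instruction/answer pairs.
# Prompt wording and field names are illustrative assumptions, not the exact
# strings used in the released dataset.

def build_multitask_samples(image_path: str, latex: str, symbol_counts: dict) -> list[dict]:
    """Expand a single (image, LaTeX) pair into the main task plus three auxiliary tasks."""
    counts_text = ", ".join(f"{sym}: {n}" for sym, n in sorted(symbol_counts.items()))
    return [
        # Main task: direct image-to-LaTeX recognition.
        {"task": "recognition",
         "prompt": "Write the LaTeX for this handwritten expression.",
         "image": image_path, "answer": latex},
        # Tree-CoT: reason over the expression's structure before answering.
        {"task": "tree_cot",
         "prompt": "Describe the expression tree, then give the LaTeX.",
         "image": image_path, "answer": latex},
        # Error-driven learning: correct a perturbed draft prediction.
        {"task": "error_correction",
         "prompt": "The draft LaTeX below may contain errors. Correct it.",
         "image": image_path, "answer": latex},
        # Symbol counting: auxiliary grounding of the visual symbols.
        {"task": "symbol_counting",
         "prompt": "Count each symbol in the image.",
         "image": image_path, "answer": counts_text},
    ]

samples = build_multitask_samples("ex1.png", r"\frac{a}{b}", {"a": 1, "b": 1, "frac": 1})
print(len(samples))  # one image -> four supervision targets
```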
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-2B |
| Architecture | Qwen3.5 (Gated DeltaNet + Gated Attention hybrid, unified multimodal) |
| Parameters | 2.2B |
| Training Data | Uni-MuMER-Data (~1.6M samples, 26 subsets) |
| Fine-tuning | Full (all parameters, including vision encoder) |
| Precision | bf16 |
| Template | qwen3_5_nothink (via LLaMA-Factory) |

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Learning Rate | 1e-5 (cosine schedule, 10% warmup) |
| Effective Batch Size | 512 (8 GPUs x 4 per-device x 16 accumulation) |
| DeepSpeed | ZeRO-3 |
| Image Max Pixels | 262,144 (512x512) |
| Cutoff Length | 2048 tokens |
| Hardware | 8x NVIDIA A100 80GB |
| Training Time | ~21 hours |
| Final Train Loss | 0.0298 |
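Since training goes through LLaMA-Factory, the table above maps naturally onto a training config. The YAML below is a sketch assembled from those values using standard LLaMA-Factory field names; treat it as an assumption, not the released training recipe.

```yaml
# Sketch of a LLaMA-Factory full-finetuning config matching the table above.
# Field names follow LLaMA-Factory conventions; paths and exact values are
# assumptions, not the authors' released config.
model_name_or_path: Qwen/Qwen3.5-2B
template: qwen3_5_nothink
finetuning_type: full
freeze_vision_tower: false       # vision encoder is trained too
deepspeed: examples/deepspeed/ds_z3_config.json
bf16: true
num_train_epochs: 1.0
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
per_device_train_batch_size: 4   # x 8 GPUs x 16 accumulation = 512 effective
gradient_accumulation_steps: 16
cutoff_len: 2048
image_max_pixels: 262144         # 512 x 512
```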

Expression recognition rate (ExpRate) on standard benchmarks, compared against the other Uni-MuMER variants:

| Dataset | Samples | Uni-MuMER-3B | This Model | Qwen3.5-4B | Qwen3-VL-2B | Qwen3-VL-4B |
|---|---|---|---|---|---|---|
| CROHME 2014 | 986 | 82.25% | 83.98% | 82.56% | 83.27% | 82.35% |
| CROHME 2016 | 1,147 | 78.29% | 81.17% | 78.20% | 78.55% | 79.34% |
| CROHME 2019 | 1,199 | 79.82% | 80.15% | 75.98% | 79.40% | 78.98% |
| CROHME 2023 Test | 2,300 | 69.52% | 69.43% | 66.74% | 70.96% | 69.17% |
| CROHME 2023 Val | 555 | 68.11% | 69.91% | 67.75% | 70.63% | 66.67% |
| HME100K Test | 24,607 | 69.50% | 70.43% | 70.02% | 69.31% | 69.79% |
| Im2LaTeXv2 Test | 10,118 | 76.99% | 77.82% | 77.20% | 77.11% | 77.46% |
| MathWriting Test | 7,643 | 53.03% | 51.84% | 54.32% | 50.66% | 53.15% |
| Average | | 72.19% | 73.09% | 71.60% | 72.49% | 72.11% |

CDM (Character Detection Matching) scores for this model:

| Dataset | CDM F1 | CDM ExpRate |
|---|---|---|
| CROHME 2014 | 0.9690 | 87.10% |
| CROHME 2016 | 0.9690 | 82.70% |
| CROHME 2019 | 0.9690 | 82.70% |
| HME100K Test | 0.9720 | 73.60% |
| Im2LaTeXv2 Test | 0.9950 | 93.70% |
| MathWriting Test | 0.9510 | 70.70% |
| Average | 0.9700 | 81.15% |
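ExpRate is exact-match accuracy between predicted and ground-truth LaTeX after normalization. The repo's eval scripts define the actual normalization rules; the sketch below uses a deliberately simplified rule (strip all whitespace) to show the shape of the metric.

```python
# Minimal sketch of ExpRate (expression recognition rate): the fraction of
# predictions whose LaTeX exactly matches the reference after normalization.
# Stripping ALL whitespace is a simplified assumption -- the repo's eval
# scripts implement the real normalization.

def normalize(latex: str) -> str:
    """Remove whitespace so spacing differences don't count as errors."""
    return "".join(latex.split())

def exprate(preds: list[str], refs: list[str]) -> float:
    """Exact-match rate over normalized prediction/reference pairs."""
    assert len(preds) == len(refs) and refs
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = [r"\frac{a}{b}", r"x^{2}+1", r"\sqrt{y }"]
refs  = [r"\frac{a}{b}", r"x^{2} + 1", r"\sqrt{y}"]
print(exprate(preds, refs))  # -> 1.0: all pairs match once spacing is removed
```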
```python
from vllm import LLM, SamplingParams

# Load the model with vLLM
model = LLM(
    model="phxember/Uni-MuMER-Qwen3.5-2B",
    trust_remote_code=True,
    max_model_len=2048,
)
sampling_params = SamplingParams(temperature=0.2, top_p=0.8, max_tokens=2048)
# See https://github.com/BFlameSwift/Uni-MuMER for full inference scripts
```
```bash
git clone https://github.com/BFlameSwift/Uni-MuMER.git
cd Uni-MuMER
bash eval/eval_all.sh -m phxember/Uni-MuMER-Qwen3.5-2B
```
Requirements:

- transformers >= 5.0.0 (required for Qwen3.5 architecture support)
- vllm >= 0.19.0 for inference (note: vLLM 0.19.0 requires transformers < 5, so use separate environments for training and inference)

| Model | Base | Avg ExpRate | HuggingFace |
|---|---|---|---|
| Uni-MuMER-Qwen2.5-VL-3B | Qwen2.5-VL-3B | 72.19% | Link |
| Uni-MuMER-Qwen2.5-VL-7B | Qwen2.5-VL-7B | - | Link |
| Uni-MuMER-Qwen3.5-2B | Qwen3.5-2B | 73.09% | This model |
| Uni-MuMER-Qwen3.5-4B | Qwen3.5-4B | 71.60% | Link |
| Uni-MuMER-Qwen3-VL-2B | Qwen3-VL-2B | 72.49% | Link |
| Uni-MuMER-Qwen3-VL-4B | Qwen3-VL-4B | 72.11% | Link |
```bibtex
@article{li2025unimumer,
  title   = {Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition},
  author  = {Li, Yu and Jiang, Jin and Zhu, Jianhua and Peng, Shuai and Wei, Baole and Zhou, Yuxuan and Gao, Liangcai},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.23566},
}
```