omnes-flores (technology preview)
omnes-flores is a unified NLP framework for LLMs consisting of three components:
- The LS component takes documents as input and outputs the results of the language identification and sentence segmentation tasks. Corresponding model: omnes-flores-40-lang-41-treebank-v0-ls.
- The WX component takes a sentence and its language, and outputs the results of the word segmentation and language-specific part-of-speech tagging tasks. Corresponding model: omnes-flores-40-lang-41-treebank-v0-wx.
- The UD component takes a sentence, its language, and its constituent word list, and outputs the result of the dependency parsing task. Corresponding model: the one on this page.
By executing these three components in sequence with the omnes-flores Python library, you can obtain dependency parsing results matching the language of the input text simply by providing the text, whatever its language.
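The data flow above can be sketched as a three-stage chain. The function names and data shapes below are hypothetical stand-ins for illustration only, not the actual omnes-flores API; see the repository's Requirements and Install sections for real usage.

```python
# Illustrative sketch of the LS -> WX -> UD pipeline with stubbed components.
# Each stub mimics only the input/output contract described above.

def ls(document: str) -> list[tuple[str, str]]:
    """Language identification + sentence segmentation (stubbed).

    A real model detects the language and splits sentences; this stub
    pretends everything is English and splits on periods.
    """
    sentences = [s.strip() + "." for s in document.split(".") if s.strip()]
    return [("en", s) for s in sentences]

def wx(language: str, sentence: str) -> list[tuple[str, str]]:
    """Word segmentation + language-specific POS tagging (stubbed)."""
    return [(word, "X") for word in sentence.rstrip(".").split()]

def ud(language: str, sentence: str,
       words: list[tuple[str, str]]) -> list[tuple[int, str, int, str]]:
    """Dependency parsing (stubbed): rows of (id, form, head, deprel),
    attaching every non-root word to the first word."""
    return [(i + 1, form, 0 if i == 0 else 1, "root" if i == 0 else "dep")
            for i, (form, _pos) in enumerate(words)]

def parse(document: str) -> list[list[tuple[int, str, int, str]]]:
    """Chain the three components: raw text in, one tree per sentence out."""
    trees = []
    for language, sentence in ls(document):
        words = wx(language, sentence)
        trees.append(ud(language, sentence, words))
    return trees

trees = parse("Colorless green ideas sleep furiously.")
```

The point of the sketch is the contract between stages: LS decides language and sentence boundaries, WX consumes each (language, sentence) pair, and UD consumes the sentence plus WX's word list, so each model only needs the outputs of the stage before it.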
For details, please read the Requirements and Install sections in omnes-flores repository.
41 Treebanks Used for LoRA SFT
This model was trained using training data from 40 UD languages, consisting of 41 treebanks.
The Japanese word unit is the LUW (the NINJAL Long Unit Word segmentation criterion).
The following 40 UD treebanks, each of which has both a commercially usable license and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-40-lang-41-treebank-v0.
- UD_Armenian-ArmTDP, UD_Belarusian-HSE, UD_Bororo-BDT, UD_Chinese-GSD, UD_Chinese-GSDSimp, UD_Croatian-SET, UD_Czech-CAC, UD_Danish-DDT, UD_Dutch-Alpino, UD_English-EWT, UD_Estonian-EWT, UD_Finnish-TDT, UD_French-GSD, UD_German-GSD, UD_Haitian_Creole-Adolphe, UD_Hebrew-IAHLTwiki, UD_Icelandic-GC, UD_Indonesian-GSD, UD_Irish-IDT, UD_Japanese-GSDLUW, UD_Korean-Kaist, UD_Latvian-LVTB, UD_Lithuanian-ALKSNIS, UD_Naija-NSC, UD_Norwegian-Nynorsk, UD_Persian-PerDT, UD_Portuguese-Porttinari, UD_Romanian-RRT, UD_Russian-GSD, UD_Scottish_Gaelic-ARCOSG, UD_Serbian-SET, UD_Sindhi-Isra, UD_Slovak-SNK, UD_Slovenian-SSJ, UD_Spanish-GSD, UD_Swedish-Talbanken, UD_Thai-TUD, UD_Turkish-BOUN, UD_Ukrainian-ParlaMint, UD_Western_Armenian-ArmTDP
In addition, a proprietary treebank was used for training, which was specially licensed from the National Institute for Japanese Language and Linguistics exclusively for training this model.
- UD_Japanese-BCCWJLUW (excluding PN newspaper articles)
Evaluation Results
See the NLP2026 paper and its poster material (written in Japanese) for details.
Acknowledgements
This work was conducted as part of a collaborative research project between Recruit Co., Ltd. and the National Institute for Japanese Language and Linguistics.
Citations
You are encouraged to cite one of the following papers if you use omnes-flores models:
@inproceedings{matsuda-etal-2025-step,
title = "Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of {LLM}s",
author = "Matsuda, Hiroshi and
Ma, Chunpeng and
Asahara, Masayuki",
editor = "Sagae, Kenji and
Oepen, Stephan",
booktitle = "Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)",
month = aug,
year = "2025",
address = "Ljubljana, Slovenia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.iwpt-1.2/",
pages = "11--19",
ISBN = "979-8-89176-294-7",
abstract = "Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, and a simplified CoNLL-U like output format, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches."
}
@misc{matsuda2025stepbystepinstructionssimpletabular,
title={Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs},
author={Hiroshi Matsuda and Chunpeng Ma and Masayuki Asahara},
year={2025},
eprint={2506.09983},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09983},
}
Model tree for megagonlabs/omnes-flores-40-lang-41-treebank-v0
- Base model: google/gemma-2-9b