omnes-flores (technology preview)
omnes-flores is a unified NLP framework for LLMs consisting of three components:
- The LS component takes documents as input and outputs the results of the language identification and sentence segmentation tasks. Corresponding model: omnes-flores-40-lang-41-treebank-v0-ls.
- The WX component takes a sentence and its language, and outputs the results of the word segmentation and language-specific part-of-speech tagging tasks. Corresponding model: omnes-flores-40-lang-41-treebank-v0-wx.
- The UD component takes a sentence, its language, and its constituent word list, and outputs the result of the dependency parsing task. Corresponding model: the one on this page.
By executing these three components in sequence with the omnes-flores Python library, you can obtain dependency parsing results matching the language of the input text simply by providing the text, whatever its language.
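The data flow above can be sketched as a three-stage chain. The function names and data shapes below are hypothetical stand-ins for illustration only, not the actual omnes-flores API; see the repository's Requirements and Install sections for real usage.

```python
# Illustrative sketch of the LS -> WX -> UD pipeline with stubbed components.
# Each stub mimics only the input/output contract described above.

def ls(document: str) -> list[tuple[str, str]]:
    """Language identification + sentence segmentation (stubbed).

    A real model detects the language and splits sentences; this stub
    pretends everything is English and splits on periods.
    """
    sentences = [s.strip() + "." for s in document.split(".") if s.strip()]
    return [("en", s) for s in sentences]

def wx(language: str, sentence: str) -> list[tuple[str, str]]:
    """Word segmentation + language-specific POS tagging (stubbed)."""
    return [(word, "X") for word in sentence.rstrip(".").split()]

def ud(language: str, sentence: str,
       words: list[tuple[str, str]]) -> list[tuple[int, str, int, str]]:
    """Dependency parsing (stubbed): rows of (id, form, head, deprel),
    attaching every non-root word to the first word."""
    return [(i + 1, form, 0 if i == 0 else 1, "root" if i == 0 else "dep")
            for i, (form, _pos) in enumerate(words)]

def parse(document: str) -> list[list[tuple[int, str, int, str]]]:
    """Chain the three components: raw text in, one tree per sentence out."""
    trees = []
    for language, sentence in ls(document):
        words = wx(language, sentence)
        trees.append(ud(language, sentence, words))
    return trees

trees = parse("Colorless green ideas sleep furiously.")
```

The point of the sketch is the contract between stages: LS decides language and sentence boundaries, WX consumes each (language, sentence) pair, and UD consumes the sentence plus WX's word list, so each model only needs the outputs of the stage before it.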
For details, please read the Requirements and Install sections in omnes-flores repository.
41 Treebanks Used for LoRA SFT
This model was trained using training data from 40 UD languages, consisting of 41 treebanks.
The Japanese word unit is the LUW (the NINJAL Long Unit Word segmentation criterion).
The following 40 UD treebanks, each of which has both a commercially usable license and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-40-lang-41-treebank-v0.
- UD_Armenian-ArmTDP, UD_Belarusian-HSE, UD_Bororo-BDT, UD_Chinese-GSD, UD_Chinese-GSDSimp, UD_Croatian-SET, UD_Czech-CAC, UD_Danish-DDT, UD_Dutch-Alpino, UD_English-EWT, UD_Estonian-EWT, UD_Finnish-TDT, UD_French-GSD, UD_German-GSD, UD_Haitian_Creole-Adolphe, UD_Hebrew-IAHLTwiki, UD_Icelandic-GC, UD_Indonesian-GSD, UD_Irish-IDT, UD_Japanese-GSDLUW, UD_Korean-Kaist, UD_Latvian-LVTB, UD_Lithuanian-ALKSNIS, UD_Naija-NSC, UD_Norwegian-Nynorsk, UD_Persian-PerDT, UD_Portuguese-Porttinari, UD_Romanian-RRT, UD_Russian-GSD, UD_Scottish_Gaelic-ARCOSG, UD_Serbian-SET, UD_Sindhi-Isra, UD_Slovak-SNK, UD_Slovenian-SSJ, UD_Spanish-GSD, UD_Swedish-Talbanken, UD_Thai-TUD, UD_Turkish-BOUN, UD_Ukrainian-ParlaMint, UD_Western_Armenian-ArmTDP
In addition, a proprietary treebank was used for training, which was specially licensed from the National Institute for Japanese Language and Linguistics exclusively for training this model.
- UD_Japanese-BCCWJLUW (excluding PN newspaper articles)
Evaluation Results
See the NLP2026 paper and its poster material (written in Japanese) for details.
Acknowledgements
This work was conducted as part of a collaborative research project between Recruit Co., Ltd. and the National Institute for Japanese Language and Linguistics.
Citations
You are encouraged to cite one of the following papers if you use omnes-flores models:
@inproceedings{matsuda-etal-2025-step,
title = "Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of {LLM}s",
author = "Matsuda, Hiroshi and
Ma, Chunpeng and
Asahara, Masayuki",
editor = "Sagae, Kenji and
Oepen, Stephan",
booktitle = "Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)",
month = aug,
year = "2025",
address = "Ljubljana, Slovenia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.iwpt-1.2/",
pages = "11--19",
ISBN = "979-8-89176-294-7",
abstract = "Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, and a simplified CoNLL-U like output format, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches."
}
@misc{matsuda2025stepbystepinstructionssimpletabular,
title={Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs},
author={Hiroshi Matsuda and Chunpeng Ma and Masayuki Asahara},
year={2025},
eprint={2506.09983},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.09983},
}
Model tree for megagonlabs/omnes-flores-40-lang-41-treebank-v0
- Base model: google/gemma-2-9b