Update README.md
README.md
CHANGED
@@ -64,6 +64,104 @@ Unweighted average CER (%) and WER (%) on internal and official competition test
| **MEDUSA-4B 0.1** | **14.7** | **44.5** | 8.15 | 5.60 | 12.0 |
| **MEDUSA-9B 0.1** | **13.2** | **42.6** | 8.03 | **5.24** | **10.8** |
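
The table caption refers to unweighted averages of CER and WER. The README does not include the scoring script, so the following is only a minimal sketch under stated assumptions: CER and WER are taken to be Levenshtein edit distance divided by reference length (characters and whitespace-separated words respectively), and "unweighted" is taken to mean that each test subset counts equally regardless of its size. The subset names and sample strings are hypothetical.

```python
from statistics import mean


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Summed edit distance over summed reference length, in percent."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    length = sum(len(tokenize(r)) for r, _ in pairs)
    return 100.0 * errors / length


def cer(pairs):
    return error_rate(pairs, list)       # character-level tokens


def wer(pairs):
    return error_rate(pairs, str.split)  # whitespace-separated words


def unweighted_average(scores_by_subset):
    """Each subset contributes equally, whatever its number of lines."""
    return mean(scores_by_subset.values())


# Hypothetical (reference, hypothesis) pairs, grouped by test subset.
subsets = {
    "subset_a": [("en la cite", "en la cote")],
    "subset_b": [("dominus", "dominus"), ("et filius", "et filus")],
}
cer_by_subset = {name: cer(pairs) for name, pairs in subsets.items()}
wer_by_subset = {name: wer(pairs) for name, pairs in subsets.items()}
macro_cer = unweighted_average(cer_by_subset)
macro_wer = unweighted_average(wer_by_subset)
```

Averaging per-subset scores this way keeps a large subset from dominating the headline number, at the cost of giving small subsets the same influence as large ones.
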

---

## Training data

MEDUSA 0.1 is trained on two tiers of image–text data:

- **Gold**: paired image–text data following heterogeneous transcription conventions, used for visual adaptation across manuscript styles and editorial traditions.
- **Platinum**: image–text pairs aligned with the [CATMuS](https://catmus-guidelines.github.io/) diplomatic transcription guidelines, used for final specialization toward the target task.

The total training pool amounts to approximately **645,000 line-level image–text pairs** (423,092 Gold and 221,987 Platinum).

### Gold datasets

| Dataset | Language | Level | Lines |
|---|---|---|---|
| Original data | Multilingual | Page | 18,352 |
| COMETA [[1]](#ref-cometa) | Occitan | Page | 118,105 |
| Torino L-II-14 [[2]](#ref-torino) | Old French | Page | 36,823 |
| Tridis [[3]](#ref-tridis) | Multilingual | Line | 166,784 |
| FROC-MSS [[4]](#ref-froc) | Old French / Occitan | Page | 3,636 |
| iForal [[5]](#ref-iforal) | Latin / Old Portuguese | Page | 8,009 |
| DISTINGUO [[6]](#ref-distinguo) | Latin | Page | 15,190 |
| AMSMB [[7]](#ref-amsmb) | Latin / Catalan | Page | 3,369 |
| HTR-School-Vienna-2025 [[8]](#ref-vienna25) | Latin | Page | 7,477 |
| Paris Bible Project [[9]](#ref-pbp) | Latin | Page | 1,606 |
| St-Victor (M. Vernet) [[10]](#ref-stvictor) | Latin | Page | 10,736 |
| Wien ÖNB Cod. 2160 f. 164-184 [[11]](#ref-wien) | Latin | Page | 2,681 |
| Bifrost [[12]](#ref-bifrost) | Old Norse | Page | 873 |
| Klosterneuburg [[13]](#ref-klosterneuburg) | Middle High German | Page | 4,758 |
| Faithful transcriptions [[14]](#ref-faithful) | Multilingual | Page | 8,001 |
| StABS Ratsbücher [[15]](#ref-stabs) | Middle High German | Page | 8,371 |
| Inzigkofen [[16]](#ref-inzigkofen) | Middle High German | Page | 8,321 |
| **Total (Gold)** | | | **423,092** |

### Platinum datasets

| Dataset | Language | Level | Lines |
|---|---|---|---|
| Original data | Multilingual | Page | 10,506 |
| CMMHWR dataset [[17]](#ref-cmmhwr) | Multilingual | Page | 149,741 |
| CATMuS Medieval [[18]](#ref-catmus) | Middle Dutch, Old English | Line | 47,084 |
| GATMUZA [[19]](#ref-gatmuza) | Occitan | Page | 2,117 |
| TranscriboQuest 2025 [[20]](#ref-transcriboquest) | Multilingual | Page | 1,278 |
| HTR-School-Vienna-02 [[21]](#ref-vienna02) | Old Czech | Page | 1,336 |
| Padeřov-Bible [[22]](#ref-paderov) | Old Czech | Page | 7,177 |
| 2024–medieval-czech [[23]](#ref-czech24) | Old Czech | Page | 2,748 |
| **Total (Platinum)** | | | **221,987** |
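
As a quick consistency check, the per-dataset line counts in the two tables can be summed and compared with the stated tier totals and the approximate pool size. The sketch below only restates figures from the tables; the variable names are illustrative.

```python
# Line counts copied from the Gold table.
gold = {
    "Original data": 18_352,
    "COMETA": 118_105,
    "Torino L-II-14": 36_823,
    "Tridis": 166_784,
    "FROC-MSS": 3_636,
    "iForal": 8_009,
    "DISTINGUO": 15_190,
    "AMSMB": 3_369,
    "HTR-School-Vienna-2025": 7_477,
    "Paris Bible Project": 1_606,
    "St-Victor (M. Vernet)": 10_736,
    "Wien ÖNB Cod. 2160 f. 164-184": 2_681,
    "Bifrost": 873,
    "Klosterneuburg": 4_758,
    "Faithful transcriptions": 8_001,
    "StABS Ratsbücher": 8_371,
    "Inzigkofen": 8_321,
}

# Line counts copied from the Platinum table.
platinum = {
    "Original data": 10_506,
    "CMMHWR dataset": 149_741,
    "CATMuS Medieval": 47_084,
    "GATMUZA": 2_117,
    "TranscriboQuest 2025": 1_278,
    "HTR-School-Vienna-02": 1_336,
    "Padeřov-Bible": 7_177,
    "2024–medieval-czech": 2_748,
}

gold_total = sum(gold.values())
platinum_total = sum(platinum.values())
pool_total = gold_total + platinum_total

print(f"Gold:     {gold_total:,}")
print(f"Platinum: {platinum_total:,}")
print(f"Pool:     {pool_total:,}")
```

The sums match the **Total** rows (423,092 and 221,987), and their combined value of 645,079 is consistent with the "approximately 645,000" figure given for the training pool.
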

### References

<a id="ref-cometa"></a>[1] Wiedner, M. *COMETA : Corpus de l'occitan médiéval comparatif et annoté*. https://zenodo.org/records/15300719

<a id="ref-torino"></a>[2] Camps, J.-B., O'Connor, P. *Torino_L-II-14: HTR Training Dataset for the manuscript Turin, Biblioteca nazionale universitaria, MS L. II. 14* (2024). https://github.com/RESCAPE-Biblissima/Torino_L-II-14

<a id="ref-tridis"></a>[3] Torres, S. *Tridis* (revision e8d811f) (2025). https://doi.org/10.57967/hf/5001

<a id="ref-froc"></a>[4] Camps, J.-B. *FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models* (2018). https://github.com/Jean-Baptiste-Camps/FROC-MSS

<a id="ref-iforal"></a>[5] Projet iForal. *iForal Dataset: Medieval Portuguese Manuscripts HTR Data*. https://github.com/Arch-W/iForal-Dataset

<a id="ref-distinguo"></a>[6] Burghart, M., Yatsyk, S. *DISTINGUO: Ground truth for handwritten text recognition (HTR) on collections of distinctions (late 13th to late 15th century)* (2024). https://doi.org/10.34847/NKL.48AD8B8D

<a id="ref-amsmb"></a>[7] Coll Ardanuy, M., Cuadrada, C., Sarobe, R. *A Dataset for Handwritten Text Recognition in Medieval Notarial Charters Written on Parchment* (2025). https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/0VB0MC

<a id="ref-vienna25"></a>[8] Odstrčilík, J. et al. *HTR Winter School in Vienna 2025 – Late Medieval Latin Group: Ground Truth Dataset for Late-Medieval Latin Scripts* (2025).

<a id="ref-pbp"></a>[9] Wrisley, D., The Paris Bible Project, Gueville, E. *parisbible/ground_truth: Ground truth v1.0.0 for the Paris Bible Project* (Feb 2023). https://doi.org/10.5281/zenodo.7653691

<a id="ref-stvictor"></a>[10] Vernet, M. *Saint Victor MS dataset (abbreviated and expanded ALTO)* (Jan 2023). https://doi.org/10.5281/zenodo.7510410

<a id="ref-wien"></a>[11] Attwood et al. *Wien ÖNB cod. 2160 f. 164-184 ground truth from HTR Winter School 2022* (Dec 2022). https://doi.org/10.5281/zenodo.7537204

<a id="ref-bifrost"></a>[12] Kapitan, K.A., Vidal-Gorène, C. *Crossing the Bifrost: Towards an open access FAIR HTR model for Old Norse manuscripts* (May 2025). https://doi.org/10.5281/zenodo.15366896

<a id="ref-klosterneuburg"></a>[13] Berger, M. et al. *Klosterneuburg, Stiftsbibl., Cod. 48 – Ground Truth: Initial release* (Dec 2022). https://doi.org/10.5281/zenodo.7466928

<a id="ref-faithful"></a>[14] Eichenberger, N., Suwelack, H. *Faithful Transcriptions Data Set: TEI/XML-encoded transcriptions of medieval theological manuscripts* (Oct 2021). https://doi.org/10.5281/zenodo.5582483

<a id="ref-stabs"></a>[15] Hodel, T., Schoch, D., Dängeli, P. *Handwritten text recognition ground truth set: StABS Ratsbücher O10, Urfehdenbuch X* (Aug 2021). https://doi.org/10.5281/zenodo.5153263

<a id="ref-inzigkofen"></a>[16] Eichenberger, N. *Transcriptions from medieval manuscripts related to the Augustinian canonesses in Inzigkofen* (Dec 2025). https://doi.org/10.5281/zenodo.17978574

<a id="ref-cmmhwr"></a>[17] Clérice, T., Kiessling, B. *ICDAR 2026 Competition on Multilingual Medieval Handwriting Recognition* (Jan 2026). https://doi.org/10.5281/zenodo.18270331

<a id="ref-catmus"></a>[18] Clérice, T. et al. *CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond* (Feb 2024). https://inria.hal.science/hal-04453952

<a id="ref-gatmuza"></a>[19] Camps, J.-B. *Lo GAT MUZA: CATMuS-conformant HTR Ground-Truth Data for Medieval Occitan* (2026). https://github.com/LostMa-ERC/gatmuza

<a id="ref-transcriboquest"></a>[20] McDonough, C. et al. *TranscriboQuest 2025 Medieval Vernacular Religious Texts* (Sep 2025). https://doi.org/10.5281/zenodo.17062963

<a id="ref-vienna02"></a>[21] Veličkaitė, V. et al. *HTR Winter School 2025 – Medieval Czech – Biblioteka Jagiellonska BJ Rkp 441 IV* (Dec 2025).

<a id="ref-paderov"></a>[22] Michalcová, A. et al. *Padeřov-Bible-handwriting-ground-truth: Initial release* (Dec 2022). https://doi.org/10.5281/zenodo.7467034

<a id="ref-czech24"></a>[23] Plechatý, M. et al. *HTR Winter School 2024 – Medieval Czech – Prague Bible (1488)* (Dec 2024).

---
## Intended use