Commit 37efd06 (verified) by TheoMoins · 1 parent: f9c780a

Update README.md

Files changed (1): README.md (+98 -0)
@@ -64,6 +64,104 @@ Unweighted average CER (%) and WER (%) on internal and official competition test
  | **MEDUSA-4B 0.1** | **14.7** | **44.5** | 8.15 | 5.60 | 12.0 |
  | **MEDUSA-9B 0.1** | **13.2** | **42.6** | 8.03 | **5.24** | **10.8** |
 
+
+ ---
+
+ ## Training data
+
+ MEDUSA 0.1 is trained on two tiers of image–text data:
+
+ - **Gold** — paired image–text data following heterogeneous transcription conventions, used for visual adaptation across manuscript styles and editorial traditions.
+ - **Platinum** — image–text pairs aligned with the [CATMuS](https://catmus-guidelines.github.io/) diplomatic transcription guidelines, used for final specialization toward the target task.
+
+ The total training pool amounts to approximately **645,000 line-level image–text pairs**.
+
+ ### Gold datasets
+
+ | Dataset | Language | Level | Lines |
+ |---|---|---|---|
+ | Original data | Multilingual | Page | 18,352 |
+ | COMETA [[1]](#ref-cometa) | Occitan | Page | 118,105 |
+ | Torino L-II-14 [[2]](#ref-torino) | Old French | Page | 36,823 |
+ | Tridis [[3]](#ref-tridis) | Multilingual | Line | 166,784 |
+ | FROC-MSS [[4]](#ref-froc) | Old French / Occitan | Page | 3,636 |
+ | iForal [[5]](#ref-iforal) | Latin / Old Portuguese | Page | 8,009 |
+ | DISTINGUO [[6]](#ref-distinguo) | Latin | Page | 15,190 |
+ | AMSMB [[7]](#ref-amsmb) | Latin / Catalan | Page | 3,369 |
+ | HTR-School-Vienna-2025 [[8]](#ref-vienna25) | Latin | Page | 7,477 |
+ | Paris Bible Project [[9]](#ref-pbp) | Latin | Page | 1,606 |
+ | St-Victor (M. Vernet) [[10]](#ref-stvictor) | Latin | Page | 10,736 |
+ | Wien ÖNB Cod. 2160 f. 164-184 [[11]](#ref-wien) | Latin | Page | 2,681 |
+ | Bifrost [[12]](#ref-bifrost) | Old Norse | Page | 873 |
+ | Klosterneuburg [[13]](#ref-klosterneuburg) | Middle High German | Page | 4,758 |
+ | Faithful transcriptions [[14]](#ref-faithful) | Multilingual | Page | 8,001 |
+ | StABS Ratsbücher [[15]](#ref-stabs) | Middle High German | Page | 8,371 |
+ | Inzigkofen [[16]](#ref-inzigkofen) | Middle High German | Page | 8,321 |
+ | **Total (Gold)** | | | **423,092** |
+
+ ### Platinum datasets
+
+ | Dataset | Language | Level | Lines |
+ |---|---|---|---|
+ | Original data | Multilingual | Page | 10,506 |
+ | CMMHWR dataset [[17]](#ref-cmmhwr) | Multilingual | Page | 149,741 |
+ | CATMuS Medieval [[18]](#ref-catmus) | Middle Dutch, Old English | Line | 47,084 |
+ | GATMUZA [[19]](#ref-gatmuza) | Occitan | Page | 2,117 |
+ | TranscriboQuest 2025 [[20]](#ref-transcriboquest) | Multilingual | Page | 1,278 |
+ | HTR-School-Vienna-02 [[21]](#ref-vienna02) | Old Czech | Page | 1,336 |
+ | Padeřov-Bible [[22]](#ref-paderov) | Old Czech | Page | 7,177 |
+ | 2024–medieval-czech [[23]](#ref-czech24) | Old Czech | Page | 2,748 |
+ | **Total (Platinum)** | | | **221,987** |
+
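Review note: the two table totals and the "approximately 645,000" figure can be cross-checked with a short script. This is only a sketch and is not part of the committed README; the variable names are illustrative, and the counts are copied by hand from the Gold and Platinum tables above.

```python
# Per-dataset line counts, copied in table order from the diff above.
gold_lines = [
    18_352, 118_105, 36_823, 166_784, 3_636, 8_009, 15_190, 3_369, 7_477,
    1_606, 10_736, 2_681, 873, 4_758, 8_001, 8_371, 8_321,
]
platinum_lines = [10_506, 149_741, 47_084, 2_117, 1_278, 1_336, 7_177, 2_748]

# Sum each tier, then the whole training pool.
gold_total = sum(gold_lines)
platinum_total = sum(platinum_lines)
pool_total = gold_total + platinum_total

print(f"Gold: {gold_total:,}")          # Gold: 423,092
print(f"Platinum: {platinum_total:,}")  # Platinum: 221,987
print(f"Pool: {pool_total:,}")          # Pool: 645,079
```

Both sums match the **Total** rows, and the combined 645,079 lines agree with the rounded figure given under "Training data".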
+ ### References
+
+ <a id="ref-cometa"></a>[1] Wiedner, M. *COMETA : Corpus de l'occitan médiéval comparatif et annoté*. https://zenodo.org/records/15300719
+
+ <a id="ref-torino"></a>[2] Camps, J.-B., O'Connor, P. *Torino_L-II-14: HTR Training Dataset for the manuscript Turin, Biblioteca nazionale universitaria, MS L. II. 14* (2024). https://github.com/RESCAPE-Biblissima/Torino_L-II-14
+
+ <a id="ref-tridis"></a>[3] Torres, S. *Tridis* (revision e8d811f) (2025). https://doi.org/10.57967/hf/5001
+
+ <a id="ref-froc"></a>[4] Camps, J.-B. *FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models* (2018). https://github.com/Jean-Baptiste-Camps/FROC-MSS
+
+ <a id="ref-iforal"></a>[5] Projet iForal. *iForal Dataset: Medieval Portuguese Manuscripts HTR Data*. https://github.com/Arch-W/iForal-Dataset
+
+ <a id="ref-distinguo"></a>[6] Burghart, M., Yatsyk, S. *DISTINGUO: Ground truth for handwritten text recognition (HTR) on collections of distinctions (late 13th to late 15th century)* (2024). https://doi.org/10.34847/NKL.48AD8B8D
+
+ <a id="ref-amsmb"></a>[7] Coll Ardanuy, M., Cuadrada, C., Sarobe, R. *A Dataset for Handwritten Text Recognition in Medieval Notarial Charters Written on Parchment* (2025). https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/0VB0MC
+
+ <a id="ref-vienna25"></a>[8] Odstrčilík, J. et al. *HTR Winter School in Vienna 2025 – Late Medieval Latin Group: Ground Truth Dataset for Late-Medieval Latin Scripts* (2025).
+
+ <a id="ref-pbp"></a>[9] Wrisley, D., The Paris Bible Project, Gueville, E. *parisbible/ground_truth: Ground truth v1.0.0 for the Paris Bible Project* (Feb 2023). https://doi.org/10.5281/zenodo.7653691
+
+ <a id="ref-stvictor"></a>[10] Vernet, M. *Saint Victor MS dataset (abbreviated and expanded ALTO)* (Jan 2023). https://doi.org/10.5281/zenodo.7510410
+
+ <a id="ref-wien"></a>[11] Attwood et al. *Wien ÖNB cod. 2160 f. 164-184 ground truth from HTR Winter School 2022* (Dec 2022). https://doi.org/10.5281/zenodo.7537204
+
+ <a id="ref-bifrost"></a>[12] Kapitan, K.A., Vidal-Gorène, C. *Crossing the Bifrost: Towards an open access FAIR HTR model for Old Norse manuscripts* (May 2025). https://doi.org/10.5281/zenodo.15366896
+
+ <a id="ref-klosterneuburg"></a>[13] Berger, M. et al. *Klosterneuburg, Stiftsbibl., Cod. 48 – Ground Truth: Initial release* (Dec 2022). https://doi.org/10.5281/zenodo.7466928
+
+ <a id="ref-faithful"></a>[14] Eichenberger, N., Suwelack, H. *Faithful Transcriptions Data Set: TEI/XML-encoded transcriptions of medieval theological manuscripts* (Oct 2021). https://doi.org/10.5281/zenodo.5582483
+
+ <a id="ref-stabs"></a>[15] Hodel, T., Schoch, D., Dängeli, P. *Handwritten text recognition ground truth set: StABS Ratsbücher O10, Urfehdenbuch X* (Aug 2021). https://doi.org/10.5281/zenodo.5153263
+
+ <a id="ref-inzigkofen"></a>[16] Eichenberger, N. *Transcriptions from medieval manuscripts related to the Augustinian canonesses in Inzigkofen* (Dec 2025). https://doi.org/10.5281/zenodo.17978574
+
+ <a id="ref-cmmhwr"></a>[17] Clérice, T., Kiessling, B. *ICDAR 2026 Competition on Multilingual Medieval Handwriting Recognition* (Jan 2026). https://doi.org/10.5281/zenodo.18270331
+
+ <a id="ref-catmus"></a>[18] Clérice, T. et al. *CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond* (Feb 2024). https://inria.hal.science/hal-04453952
+
+ <a id="ref-gatmuza"></a>[19] Camps, J.-B. *Lo GAT MUZA: CATMuS-conformant HTR Ground-Truth Data for Medieval Occitan* (2026). https://github.com/LostMa-ERC/gatmuza
+
+ <a id="ref-transcriboquest"></a>[20] McDonough, C. et al. *TranscriboQuest 2025 Medieval Vernacular Religious Texts* (Sep 2025). https://doi.org/10.5281/zenodo.17062963
+
+ <a id="ref-vienna02"></a>[21] Veličkaitė, V. et al. *HTR Winter School 2025 – Medieval Czech – Biblioteka Jagiellonska BJ Rkp 441 IV* (Dec 2025).
+
+ <a id="ref-paderov"></a>[22] Michalcová, A. et al. *Padeřov-Bible-handwriting-ground-truth: Initial release* (Dec 2022). https://doi.org/10.5281/zenodo.7467034
+
+ <a id="ref-czech24"></a>[23] Plechatý, M. et al. *HTR Winter School 2024 – Medieval Czech – Prague Bible (1488)* (Dec 2024).
+
+
  ---
 
  ## Intended use