---
language:
- he
pipeline_tag: token-classification
tags:
- Transformers
- PyTorch
---

<!-- Provide a quick summary of what the model is/does. -->

## MenakBERT

MenakBERT is a Hebrew diacritizer that predicts diacritical marks in a seq2seq fashion. Its backbone is a Hebrew BERT-style masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).

### Model Description

This model takes tau/tavbert-he and adds a three-headed classification head that outputs three sequences, corresponding to three types of Hebrew niqqud (diacritics).
It was fine-tuned on the dataset generously provided by Elazar Gershuni of Nakdimon.

- **Developed by:** Jacob Gidron, Ido Cohen and Idan Pinto
- **Model type:** BERT
- **Language:** Hebrew
- **Finetuned from model:** tau/tavbert-he

<!-- ### Model Sources [optional] -->

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/jacobgidron/MenakBert
<!-- - **Paper [optional]:** [More Information Needed] -->
<!-- - **Demo [optional]:** [More Information Needed] -->

## Use

The model expects undotted Hebrew text, which may contain numbers and punctuation.

The output is three sequences of diacritical marks, corresponding to:

1. The dot distinguishing the letters Shin and Sin.
2. The dagesh: a dot in the center of a letter that in some cases changes the letter's pronunciation and in other cases marks emphasis or gemination.
3. All the remaining marks, used mostly for vocalization.

Each output sequence has the same length as the input, with each mark corresponding to the character at the same position in the input.

The provided script weaves the sequences together.
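
For illustration, the weaving step can be sketched as follows. This is a simplified stand-in for the repository's script, not its actual code: each input character is emitted followed by its predicted marks, using standard Unicode Hebrew combining points, with an empty string meaning "no mark at this position".

```python
# Simplified sketch of weaving the three predicted mark sequences back into
# the input text. Combining marks follow their base character in Unicode.

def weave(text, sin_marks, dagesh_marks, niqqud_marks):
    """Interleave each character with its shin/sin dot, dagesh, and niqqud."""
    assert len(text) == len(sin_marks) == len(dagesh_marks) == len(niqqud_marks)
    out = []
    for ch, sin, dagesh, niqqud in zip(text, sin_marks, dagesh_marks, niqqud_marks):
        out.append(ch + sin + dagesh + niqqud)
    return "".join(out)

# Unicode Hebrew points used in the example below.
SHIN_DOT = "\u05c1"  # distinguishes Shin from Sin
DAGESH = "\u05bc"    # the dot in the center of a letter
QAMATS = "\u05b8"    # a vocalization mark
HOLAM = "\u05b9"     # a vocalization mark

# Example: dot the word "שלום" (shalom).
word = "שלום"
dotted = weave(
    word,
    [SHIN_DOT, "", "", ""],  # head 1: shin/sin dot
    ["", "", "", ""],        # head 2: dagesh (none in this word)
    [QAMATS, "", HOLAM, ""], # head 3: vocalization marks
)
```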

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
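
Until the card is filled in, here is a rough sketch of the intended flow, assuming the MenakBert class from the repository above and a locally fine-tuned checkpoint. The class import, checkpoint path, and output order are illustrative assumptions, not a published API.

```python
# Sketch only: assumes the MenakBert model class and a trained checkpoint
# from https://github.com/jacobgidron/MenakBert, plus the `transformers`
# and `torch` packages. The model is not a stock token-classification
# AutoModel, since it has three output heads.

def load_and_diacritize(text, checkpoint_path):
    """Return the three per-character mark-id sequences for `text`."""
    import torch                             # lazy imports: optional deps
    from transformers import AutoTokenizer
    from MenakBert import MenakBert          # hypothetical import from the repo

    tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-he")
    model = MenakBert.from_pretrained(checkpoint_path)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Three heads: shin/sin dot, dagesh, and vocalization marks.
        sin_logits, dagesh_logits, niqqud_logits = model(**enc)
    # One predicted mark id per input character, per head.
    return [t.argmax(dim=-1) for t in (sin_logits, dagesh_logits, niqqud_logits)]
```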

### Training Data

The backbone, tau/tavbert-he, was trained on the Hebrew section of OSCAR (Ortiz, 2019): about 10 GB of text, 20 million sentences.
The fine-tuning was done on the Nakdimon dataset, which can be found at https://github.com/elazarg/hebrew_diacritized and contains 274,436 dotted Hebrew tokens across 413 documents.
For more information see https://arxiv.org/abs/2105.05209

<!-- #### Metrics -->

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

<!-- [More Information Needed] -->

<!-- ### Results -->

<!-- [More Information Needed] -->

## Model Card Contact

Ido Cohen - its.ido@gmail.com