DanielDDDS committed on
Commit 76df709 · verified · 1 Parent(s): 824ce9f

dictabert-large P1_add_weights — relaxed F1=62.6%

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab.txt filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,121 +1,35 @@
 ---
-license: mit
-datasets:
-- DanielDDDS/recipe-modifications-v2
-language:
-- he
-metrics:
-- f1
-- precision
-- recall
-base_model:
-- dicta-il/dictabert
-pipeline_tag: token-classification
+language: he
 tags:
-- NER
-- Hebrew
+- token-classification
+- hebrew
 - recipe
-- CRF
-- DictaBERT
+- ner
+license: mit
 ---
-================================================================================
-README — DanielDDDs/hebrew-recipe-modification-ner
-Model Repository
-https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
-================================================================================
-
-OVERVIEW
---------
-A fine-tuned DictaBERT + CRF model for Named Entity Recognition of recipe
-modifications in Hebrew YouTube cooking comments. The model identifies spans
-where commenters describe substitutions, quantity changes, technique changes,
-and additions to recipes. This checkpoint (P10) is the best-performing
-configuration in a progressive training series evaluated against both human-
-annotated gold examples and silver-labeled data.
-
---------------------------------------------------------------------------------
-FILE MANIFEST
---------------------------------------------------------------------------------
-
-best_model.pt          DictaBERT + CRF model weights.
-                       Progressive training config: P10.
-                       Gold F1 : 47.35% (P 43.94%, R 51.33%)
-                       Silver F1: 56.05% (P 56.52%, R 55.58%)
-
-id2label.json          Integer ID → label string mapping.
-                       Keys: 0 … 4 → O, I-SUBSTITUTION,
-                       I-QUANTITY, I-TECHNIQUE, I-ADDITION
-
-label2id.json          Label string → integer ID mapping
-                       (reverse of id2label.json).
-
-training_summary.json  Final training run metrics and
-                       hyperparameters for the P10 configuration.
-
-evaluation/
-  gold_results.json    Evaluation on the 496-example human gold set.
-                       F1 : 47.35%
-                       P  : 43.94%
-                       R  : 51.33%
-
-  silver_results.json  Evaluation on the silver-labeled test set.
-                       F1 : 56.05%
-                       P  : 56.52%
-                       R  : 55.58%
-
---------------------------------------------------------------------------------
-MODEL ARCHITECTURE
---------------------------------------------------------------------------------
-Base encoder  : DictaBERT (Hebrew BERT trained by the Dicta Institute)
-Decoder       : Conditional Random Field (CRF) layer
-Tagging scheme: IO (no B- prefix; spans are contiguous I- sequences)
-Training data : processed/train_merged.jsonl from the companion dataset repo
-                (thread-aware tokenization, merged silver + guided splits)
-
---------------------------------------------------------------------------------
-LABEL SCHEMA
---------------------------------------------------------------------------------
-O               Not a recipe modification span
-I-SUBSTITUTION  Ingredient or component substitution
-I-QUANTITY      Quantity or measurement change
-I-TECHNIQUE     Cooking technique change
-I-ADDITION      Addition of a new ingredient or step
-
---------------------------------------------------------------------------------
-PERFORMANCE SUMMARY
---------------------------------------------------------------------------------
-
-Evaluation set       Precision  Recall   F1
--------------------  ---------  -------  -------
-Gold (496 examples)  43.94 %    51.33 %  47.35 %
-Silver (test split)  56.52 %    55.58 %  56.05 %
-
-Baseline reference: teacher model upper-bound metrics are available in
-the companion dataset repo at evaluation/teacher_upper_bound.json.
-
---------------------------------------------------------------------------------
-USAGE NOTES
---------------------------------------------------------------------------------
-• Load best_model.pt with a DictaBERT + CRF inference wrapper.
-• Use id2label.json / label2id.json to map model outputs to span types.
-• Input text should be tokenized consistently with the DictaBERT tokenizer
-  used during training (see training_summary.json for tokenizer details).
-• The model was developed for naturally occurring Hebrew cooking discourse;
-  performance on formal recipe text may differ.
-
---------------------------------------------------------------------------------
-COMPANION DATASET
---------------------------------------------------------------------------------
-DanielDDDs/recipe-modifications-v2
-https://huggingface.co/datasets/DanielDDDs/recipe-modifications-v2
-
-Contains raw comment threads, silver labels, gold annotations, full
-processed splits, and all ablation / P-series training summaries.
-
---------------------------------------------------------------------------------
-CITATION / CONTACT
---------------------------------------------------------------------------------
-Repository owner : DanielDDDs
-Hugging Face URL : https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
 
-================================================================================
+# Hebrew Recipe Modification NER
+
+DictaBERT-large fine-tuned for recipe modification extraction from Hebrew YouTube comments.
+Trained with class weighting (P1) on silver labels from a 3-pass LLM teacher pipeline (v2).
+
+## Labels
+- `B/I-SUBSTITUTION` — ingredient substitution
+- `B/I-ADDITION` — ingredient addition
+- `B/I-QUANTITY` — quantity change
+- `B/I-TECHNIQUE` — technique change
+
+## Usage
+```python
+from transformers import pipeline
+pipe = pipeline("token-classification",
+                model="DanielDDDS/hebrew-recipe-modification-ner",
+                aggregation_strategy="simple")
+pipe("אפשר להחליף חמאה בשמן קוקוס")  # "Can butter be replaced with coconut oil?"
+```
+
+## Performance (corrected gold test set, n=496, 38 spans)
+- Exact Entity F1: 25.5%
+- Relaxed Entity F1: 62.6%
+- Model: DictaBERT-large + linear head, class weights (P1)
+- Beats LLM teacher on relaxed F1 (teacher: 48.4%)
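The new README switches from the old IO tagging scheme to B-/I- prefixed labels, which the pipeline's `aggregation_strategy="simple"` merges into entity spans. As an illustration of that merging step, here is a minimal self-contained BIO decoder (a conceptual sketch, not the transformers implementation):

```python
def bio_to_spans(labels):
    """Collapse a BIO label sequence into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, ent = None, None
    for i, lab in enumerate(labels):
        # A B- tag, or an I- tag whose type differs from the open span, starts a new span.
        if lab.startswith("B-") or (lab.startswith("I-") and ent != lab[2:]):
            if ent is not None:
                spans.append((start, i, ent))
            start, ent = i, lab[2:]
        elif lab == "O":
            if ent is not None:
                spans.append((start, i, ent))
            start, ent = None, None
    if ent is not None:  # close a span that runs to the end of the sequence
        spans.append((start, len(labels), ent))
    return spans

# Labels as in this model's config.json
print(bio_to_spans(["O", "B-SUBSTITUTION", "I-SUBSTITUTION", "O", "B-QUANTITY"]))
# [(1, 3, 'SUBSTITUTION'), (4, 5, 'QUANTITY')]
```

Note the decoder also tolerates a span that opens with a bare I- tag, which matters when a model trained on silver labels emits imperfect BIO sequences.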
config.json ADDED
@@ -0,0 +1,48 @@
+{
+  "_name_or_path": "models/checkpoints/dictabert_large/P1_add_weights/best_model",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "O",
+    "1": "B-SUBSTITUTION",
+    "2": "I-SUBSTITUTION",
+    "3": "B-QUANTITY",
+    "4": "I-QUANTITY",
+    "5": "B-TECHNIQUE",
+    "6": "I-TECHNIQUE",
+    "7": "B-ADDITION",
+    "8": "I-ADDITION"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "label2id": {
+    "B-ADDITION": 7,
+    "B-QUANTITY": 3,
+    "B-SUBSTITUTION": 1,
+    "B-TECHNIQUE": 5,
+    "I-ADDITION": 8,
+    "I-QUANTITY": 4,
+    "I-SUBSTITUTION": 2,
+    "I-TECHNIQUE": 6,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "newmodern": true,
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 128000
+}
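The config declares 9 classes: `O` plus B-/I- pairs for the four modification types, with `label2id` as the exact inverse of `id2label`. A quick sanity check of that relationship (dict values copied from the config above):

```python
# id2label as declared in config.json (integer keys instead of JSON string keys)
id2label = {
    0: "O",
    1: "B-SUBSTITUTION", 2: "I-SUBSTITUTION",
    3: "B-QUANTITY",     4: "I-QUANTITY",
    5: "B-TECHNIQUE",    6: "I-TECHNIQUE",
    7: "B-ADDITION",     8: "I-ADDITION",
}

# label2id must be the inverse mapping
label2id = {label: idx for idx, label in id2label.items()}

# 9 classes = O + (B, I) pairs for the 4 modification types
assert len(id2label) == 9
assert all(id2label[label2id[lab]] == lab for lab in label2id)
```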
model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ff700a3be2787e08af27a6585dddba67ede1fa30ea8187ee11da28b6e781953e
+size 1735723004
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "[BLANK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
vocab.txt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
+size 1500244
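Both `model.safetensors` and `vocab.txt` are committed as Git LFS pointer files in the three-line key-value format shown above. A minimal parser for that format (an illustrative sketch; Git LFS itself handles these files):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict; 'size' is converted to int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "<key> <value>"
        fields[key] = int(value) if key == "size" else value
    return fields

# The vocab.txt pointer committed above
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
size 1500244
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 1500244
```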