DanielDDDS committed on
Commit 824ce9f · verified · 1 Parent(s): fe0dff7

Create README.md

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ license: mit
+ datasets:
+ - DanielDDDS/recipe-modifications-v2
+ language:
+ - he
+ metrics:
+ - f1
+ - precision
+ - recall
+ base_model:
+ - dicta-il/dictabert
+ pipeline_tag: token-classification
+ tags:
+ - NER
+ - Hebrew
+ - recipe
+ - CRF
+ - DictaBERT
+ ---
+ ================================================================================
+ README: DanielDDDs/hebrew-recipe-modification-ner
+ Model Repository
+ https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
+ ================================================================================
+
+ OVERVIEW
+ --------
+ A fine-tuned DictaBERT + CRF model for Named Entity Recognition of recipe
+ modifications in Hebrew YouTube cooking comments. The model identifies spans
+ where commenters describe substitutions, quantity changes, technique changes,
+ and additions to recipes. This checkpoint (P10) is the best-performing
+ configuration in a progressive training series, evaluated against both
+ human-annotated gold examples and silver-labeled data.
+
+ --------------------------------------------------------------------------------
+ FILE MANIFEST
+ --------------------------------------------------------------------------------
+
+ best_model.pt             DictaBERT + CRF model weights.
+                           Progressive training config: P10.
+                           Gold F1  : 47.35% (P 43.94%, R 51.33%)
+                           Silver F1: 56.05% (P 56.52%, R 55.58%)
+
+ id2label.json             Integer ID → label string mapping.
+                           Keys: 0 … 4 → O, I-SUBSTITUTION,
+                           I-QUANTITY, I-TECHNIQUE, I-ADDITION
+
+ label2id.json             Label string → integer ID mapping
+                           (reverse of id2label.json).
+
+ training_summary.json     Final training run metrics and
+                           hyperparameters for the P10 configuration.
+
+ evaluation/
+   gold_results.json       Evaluation on the 496-example human gold set.
+                           F1 : 47.35%
+                           P  : 43.94%
+                           R  : 51.33%
+
+   silver_results.json     Evaluation on the silver-labeled test set.
+                           F1 : 56.05%
+                           P  : 56.52%
+                           R  : 55.58%
+
+ --------------------------------------------------------------------------------
+ MODEL ARCHITECTURE
+ --------------------------------------------------------------------------------
+ Base encoder   : DictaBERT (Hebrew BERT trained by the Dicta Institute)
+ Decoder        : Conditional Random Field (CRF) layer
+ Tagging scheme : IO (no B- prefix; spans are contiguous I- sequences)
+ Training data  : processed/train_merged.jsonl from the companion dataset repo
+                  (thread-aware tokenization, merged silver + guided splits)
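
To illustrate what a CRF decoder does on top of per-token encoder scores, here
is a minimal pure-Python sketch of Viterbi decoding for a linear-chain CRF. It
is not the repository's implementation: the function name `viterbi_decode`, the
label order in `LABELS`, and all scores below are assumptions made up for the
example.

```python
from typing import List, Sequence

# Label order assumed from the id2label.json description in this README.
LABELS = ["O", "I-SUBSTITUTION", "I-QUANTITY", "I-TECHNIQUE", "I-ADDITION"]


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label-ID path for one sequence.

    emissions[t][j]   : score of label j at token t (e.g. from the encoder).
    transitions[i][j] : score of moving from label i to label j.
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    # best[j] = score of the best path that ends in label j at the current token
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_labels):
            # Choose the previous label that maximizes the path score into j.
            prev = max(range(num_labels),
                       key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Backtrack from the best final label to recover the full path.
    path = [max(range(num_labels), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores, invented for illustration only.
toy_emissions = [[2.0, 0.1, 0.0, 0.0, 0.0],
                 [0.2, 0.1, 1.5, 0.0, 0.0],
                 [0.3, 0.0, 1.0, 0.0, 0.2]]
toy_transitions = [[0.0] * 5 for _ in range(5)]
print([LABELS[i] for i in viterbi_decode(toy_emissions, toy_transitions)])
```

With all-zero transition scores the decoder reduces to a per-token argmax; a
trained CRF learns non-zero transitions, which is what lets it prefer
contiguous I- runs over isolated tags.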
+
+ --------------------------------------------------------------------------------
+ LABEL SCHEMA
+ --------------------------------------------------------------------------------
+ O                Not a recipe modification span
+ I-SUBSTITUTION   Ingredient or component substitution
+ I-QUANTITY       Quantity or measurement change
+ I-TECHNIQUE      Cooking technique change
+ I-ADDITION       Addition of a new ingredient or step
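
Because the scheme is IO rather than BIO, a span is simply a maximal run of
identical I- tags. The sketch below shows one way to turn a tag sequence into
typed spans; the helper name `io_tags_to_spans` and the example tags are
hypothetical and not part of this repository.

```python
from typing import List, Optional, Tuple


def io_tags_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group contiguous identical I- tags into (type, start, end) spans.

    Indices are token positions; `end` is exclusive.
    """
    spans: List[Tuple[str, int, int]] = []
    start = 0
    current: Optional[str] = None
    # A trailing "O" sentinel flushes a span that runs to the end of the input.
    for i, tag in enumerate(tags + ["O"]):
        label = tag[2:] if tag.startswith("I-") else None
        if label != current:
            if current is not None:
                spans.append((current, start, i))
            start, current = i, label
    return spans


# Hypothetical tag sequence for a five-token comment.
example_tags = ["O", "I-QUANTITY", "I-QUANTITY", "O", "I-ADDITION"]
print(io_tags_to_spans(example_tags))
```

One consequence of dropping the B- prefix: two adjacent modifications of the
same type cannot be told apart and are merged into a single span.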
+
+ --------------------------------------------------------------------------------
+ PERFORMANCE SUMMARY
+ --------------------------------------------------------------------------------
+
+ Evaluation set         Precision   Recall    F1
+ -------------------    ---------   ------    ------
+ Gold (496 examples)    43.94%      51.33%    47.35%
+ Silver (test split)    56.52%      55.58%    56.05%
+
+ Baseline reference: teacher model upper-bound metrics are available in
+ the companion dataset repo at evaluation/teacher_upper_bound.json.
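
As a quick sanity check, the reported F1 values are consistent with the
harmonic mean of the reported precision and recall. The snippet below
recomputes them; the helper name `f1_score` is chosen for this example, and the
check says nothing about how the evaluation script itself aggregates scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above.
reported = {
    "gold": (43.94, 51.33, 47.35),
    "silver": (56.52, 55.58, 56.05),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.2f}, reported F1 = {f1:.2f}")
```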
+
+ --------------------------------------------------------------------------------
+ USAGE NOTES
+ --------------------------------------------------------------------------------
+ • Load best_model.pt with a DictaBERT + CRF inference wrapper.
+ • Use id2label.json / label2id.json to map model outputs to span types.
+ • Input text should be tokenized consistently with the DictaBERT tokenizer
+   used during training (see training_summary.json for tokenizer details).
+ • The model was developed for naturally occurring Hebrew cooking discourse;
+   performance on formal recipe text may differ.
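
The label-mapping step can be sketched as follows. JSON object keys are always
strings, so the IDs need converting back to integers after loading. The JSON
literal here is an assumption reconstructed from the label list in this README
(the real id2label.json may be laid out differently), and `predicted_ids`
stands in for hypothetical decoder output.

```python
import json

# Assumed content of id2label.json, reconstructed from this README.
ID2LABEL_JSON = (
    '{"0": "O", "1": "I-SUBSTITUTION", "2": "I-QUANTITY", '
    '"3": "I-TECHNIQUE", "4": "I-ADDITION"}'
)


def load_id2label(raw: str) -> dict:
    """Parse the mapping and convert JSON's string keys to integer IDs."""
    return {int(key): label for key, label in json.loads(raw).items()}


id2label = load_id2label(ID2LABEL_JSON)
# label2id.json is described as the reverse mapping, so it can be derived.
label2id = {label: idx for idx, label in id2label.items()}

predicted_ids = [0, 2, 2, 0, 4]  # hypothetical model output for five tokens
predicted_tags = [id2label[i] for i in predicted_ids]
print(predicted_tags)
```

To read the actual file instead of the inline string, pass the result of
`open("id2label.json", encoding="utf-8").read()` to `load_id2label`.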
+
+ --------------------------------------------------------------------------------
+ COMPANION DATASET
+ --------------------------------------------------------------------------------
+ DanielDDDs/recipe-modifications-v2
+ https://huggingface.co/datasets/DanielDDDs/recipe-modifications-v2
+
+ Contains raw comment threads, silver labels, gold annotations, full
+ processed splits, and all ablation / P-series training summaries.
+
+ --------------------------------------------------------------------------------
+ CITATION / CONTACT
+ --------------------------------------------------------------------------------
+ Repository owner : DanielDDDs
+ Hugging Face URL : https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
+
+ ================================================================================