DanielDDDS committed on
Commit 76df709 · verified · 1 Parent(s): 824ce9f

dictabert-large P1_add_weights — relaxed F1=62.6%

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab.txt filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,121 +1,35 @@
 ---
-license: mit
-datasets:
-- DanielDDDS/recipe-modifications-v2
-language:
-- he
-metrics:
-- f1
-- precision
-- recall
-base_model:
-- dicta-il/dictabert
-pipeline_tag: token-classification
+language: he
 tags:
-- NER
-- Hebrew
+- token-classification
+- hebrew
 - recipe
-- CRF
-- DictaBERT
+- ner
+license: mit
 ---
-================================================================================
-README — DanielDDDs/hebrew-recipe-modification-ner
-Model Repository
-https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
-================================================================================
-
-OVERVIEW
---------
-A fine-tuned DictaBERT + CRF model for Named Entity Recognition of recipe
-modifications in Hebrew YouTube cooking comments. The model identifies spans
-where commenters describe substitutions, quantity changes, technique changes,
-and additions to recipes. This checkpoint (P10) is the best-performing
-configuration in a progressive training series evaluated against both human-
-annotated gold examples and silver-labeled data.
-
---------------------------------------------------------------------------------
-FILE MANIFEST
---------------------------------------------------------------------------------
-
-best_model.pt          DictaBERT + CRF model weights.
-                       Progressive training config: P10.
-                       Gold F1 : 47.35% (P 43.94%, R 51.33%)
-                       Silver F1: 56.05% (P 56.52%, R 55.58%)
-
-id2label.json          Integer ID → label string mapping.
-                       Keys: 0 … 4 → O, I-SUBSTITUTION,
-                       I-QUANTITY, I-TECHNIQUE, I-ADDITION
-
-label2id.json          Label string → integer ID mapping
-                       (reverse of id2label.json).
-
-training_summary.json  Final training run metrics and
-                       hyperparameters for the P10 configuration.
-
-evaluation/
-  gold_results.json    Evaluation on the 496-example human gold set.
-                       F1 : 47.35%
-                       P  : 43.94%
-                       R  : 51.33%
-
-  silver_results.json  Evaluation on the silver-labeled test set.
-                       F1 : 56.05%
-                       P  : 56.52%
-                       R  : 55.58%
-
---------------------------------------------------------------------------------
-MODEL ARCHITECTURE
---------------------------------------------------------------------------------
-Base encoder  : DictaBERT (Hebrew BERT trained by the Dicta Institute)
-Decoder       : Conditional Random Field (CRF) layer
-Tagging scheme: IO (no B- prefix; spans are contiguous I- sequences)
-Training data : processed/train_merged.jsonl from the companion dataset repo
-                (thread-aware tokenization, merged silver + guided splits)
-
---------------------------------------------------------------------------------
-LABEL SCHEMA
---------------------------------------------------------------------------------
-O               Not a recipe modification span
-I-SUBSTITUTION  Ingredient or component substitution
-I-QUANTITY      Quantity or measurement change
-I-TECHNIQUE     Cooking technique change
-I-ADDITION      Addition of a new ingredient or step
-
---------------------------------------------------------------------------------
-PERFORMANCE SUMMARY
---------------------------------------------------------------------------------
-
-Evaluation set       Precision  Recall   F1
--------------------  ---------  -------  -------
-Gold (496 examples)  43.94 %    51.33 %  47.35 %
-Silver (test split)  56.52 %    55.58 %  56.05 %
-
-Baseline reference: teacher model upper-bound metrics are available in
-the companion dataset repo at evaluation/teacher_upper_bound.json.
-
---------------------------------------------------------------------------------
-USAGE NOTES
---------------------------------------------------------------------------------
-• Load best_model.pt with a DictaBERT + CRF inference wrapper.
-• Use id2label.json / label2id.json to map model outputs to span types.
-• Input text should be tokenized consistently with the DictaBERT tokenizer
-  used during training (see training_summary.json for tokenizer details).
-• The model was developed for naturally occurring Hebrew cooking discourse;
-  performance on formal recipe text may differ.
-
---------------------------------------------------------------------------------
-COMPANION DATASET
---------------------------------------------------------------------------------
-DanielDDDs/recipe-modifications-v2
-https://huggingface.co/datasets/DanielDDDs/recipe-modifications-v2
-
-Contains raw comment threads, silver labels, gold annotations, full
-processed splits, and all ablation / P-series training summaries.
-
---------------------------------------------------------------------------------
-CITATION / CONTACT
---------------------------------------------------------------------------------
-Repository owner : DanielDDDs
-Hugging Face URL : https://huggingface.co/DanielDDDs/hebrew-recipe-modification-ner
 
-================================================================================
+# Hebrew Recipe Modification NER
+
+DictaBERT-large fine-tuned for recipe modification extraction from Hebrew YouTube comments.
+Trained with class weighting (P1) on silver labels from a 3-pass LLM teacher pipeline (v2).
+
+## Labels
+- `B/I-SUBSTITUTION` — ingredient substitution
+- `B/I-ADDITION` — ingredient addition
+- `B/I-QUANTITY` — quantity change
+- `B/I-TECHNIQUE` — technique change
+
+## Usage
+```python
+from transformers import pipeline
+pipe = pipeline("token-classification",
+                model="DanielDDDS/hebrew-recipe-modification-ner",
+                aggregation_strategy="simple")
+pipe("אפשר להחליף חמאה בשמן קוקוס")  # "Can butter be replaced with coconut oil?"
+```
+
+## Performance (corrected gold test set, n=496, 38 spans)
+- Exact Entity F1: 25.5%
+- Relaxed Entity F1: 62.6%
+- Model: DictaBERT-large + linear head, class weights (P1)
+- Beats LLM teacher on relaxed F1 (teacher: 48.4%)
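The new README switches from the old IO tagging scheme to B-/I- prefixed labels, which the pipeline's `aggregation_strategy="simple"` merges into entity spans. As an illustration of that merging step, here is a minimal self-contained BIO decoder (a conceptual sketch, not the transformers implementation):

```python
def bio_to_spans(labels):
    """Collapse a BIO label sequence into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, ent = None, None
    for i, lab in enumerate(labels):
        # A B- tag, or an I- tag whose type differs from the open span, starts a new span.
        if lab.startswith("B-") or (lab.startswith("I-") and ent != lab[2:]):
            if ent is not None:
                spans.append((start, i, ent))
            start, ent = i, lab[2:]
        elif lab == "O":
            if ent is not None:
                spans.append((start, i, ent))
            start, ent = None, None
    if ent is not None:  # close a span that runs to the end of the sequence
        spans.append((start, len(labels), ent))
    return spans

# Labels as in this model's config.json
print(bio_to_spans(["O", "B-SUBSTITUTION", "I-SUBSTITUTION", "O", "B-QUANTITY"]))
# [(1, 3, 'SUBSTITUTION'), (4, 5, 'QUANTITY')]
```

Note the decoder also tolerates a span that opens with a bare I- tag, which matters when a model trained on silver labels emits imperfect BIO sequences.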
config.json ADDED
@@ -0,0 +1,48 @@
+{
+  "_name_or_path": "models/checkpoints/dictabert_large/P1_add_weights/best_model",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "O",
+    "1": "B-SUBSTITUTION",
+    "2": "I-SUBSTITUTION",
+    "3": "B-QUANTITY",
+    "4": "I-QUANTITY",
+    "5": "B-TECHNIQUE",
+    "6": "I-TECHNIQUE",
+    "7": "B-ADDITION",
+    "8": "I-ADDITION"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "label2id": {
+    "B-ADDITION": 7,
+    "B-QUANTITY": 3,
+    "B-SUBSTITUTION": 1,
+    "B-TECHNIQUE": 5,
+    "I-ADDITION": 8,
+    "I-QUANTITY": 4,
+    "I-SUBSTITUTION": 2,
+    "I-TECHNIQUE": 6,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "newmodern": true,
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 128000
+}
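The config declares 9 classes: `O` plus B-/I- pairs for the four modification types, with `label2id` as the exact inverse of `id2label`. A quick sanity check of that relationship (dict values copied from the config above):

```python
# id2label as declared in config.json (integer keys instead of JSON string keys)
id2label = {
    0: "O",
    1: "B-SUBSTITUTION", 2: "I-SUBSTITUTION",
    3: "B-QUANTITY",     4: "I-QUANTITY",
    5: "B-TECHNIQUE",    6: "I-TECHNIQUE",
    7: "B-ADDITION",     8: "I-ADDITION",
}

# label2id must be the inverse mapping
label2id = {label: idx for idx, label in id2label.items()}

# 9 classes = O + (B, I) pairs for the 4 modification types
assert len(id2label) == 9
assert all(id2label[label2id[lab]] == lab for lab in label2id)
```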
model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ff700a3be2787e08af27a6585dddba67ede1fa30ea8187ee11da28b6e781953e
+size 1735723004
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "[BLANK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
vocab.txt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
+size 1500244
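Both `model.safetensors` and `vocab.txt` are committed as Git LFS pointer files in the three-line key-value format shown above. A minimal parser for that format (an illustrative sketch; Git LFS itself handles these files):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict; 'size' is converted to int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "<key> <value>"
        fields[key] = int(value) if key == "size" else value
    return fields

# The vocab.txt pointer committed above
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
size 1500244
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 1500244
```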