kwoncho committed (verified) · Commit 7d773c6 · Parent: 9a6e612

Upload official epoch2 TIMEX model

README.md ADDED
@@ -0,0 +1,139 @@
---
language:
- ko
license: other
library_name: transformers
base_model: jhgan/ko-sroberta-multitask
tags:
- token-classification
- named-entity-recognition
- timex
- korean
metrics:
- f1
pipeline_tag: token-classification
model-index:
- name: ko-sroberta-korean-time-expression-classifier
  results:
  - task:
      type: token-classification
      name: Korean TIMEX3 Detection
    dataset:
      name: 158.시간 표현 탐지 데이터
      type: private
      split: Validation
    metrics:
    - type: f1
      name: Entity F1
      value: 0.8266074116550786
    - type: precision
      name: Entity Precision
      value: 0.8264533883728931
    - type: recall
      name: Entity Recall
      value: 0.8267614923575464
---

# Korean Time Expression Classifier

This model detects Korean TIMEX3 time expressions using BIO token-classification labels.

The backbone is [`jhgan/ko-sroberta-multitask`](https://huggingface.co/jhgan/ko-sroberta-multitask), fine-tuned on `158.시간 표현 탐지 데이터` (a Korean time-expression detection dataset) for four TIMEX3 entity types:

- `DATE`
- `TIME`
- `DURATION`
- `SET`

## Intended Use

Use this model to identify Korean time expressions in sentences or utterances. It predicts token-level BIO labels and can be used through the Hugging Face `token-classification` pipeline.

This is an experimental model trained for TIMEX3 span detection. It does not extract EVENT or TLINK annotations.
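
For a quick look at the raw token-level BIO output (before any entity aggregation), a minimal sketch along these lines should work; it relies only on standard `transformers` APIs and reuses the example sentence from the Usage section below:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "kwoncho/ko-sroberta-korean-time-expression-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "매주 토요일 저녁에 회의를 합니다."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# One BIO label per subword token (special tokens included).
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred_id])
```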

## Training Data

The model was trained on the official `Training` split and evaluated on the official `Validation` split of `158.시간 표현 탐지 데이터`.

Training/evaluation preprocessing (a span-alignment sketch follows this list):

- Unsupported, empty, malformed, or unalignable TIMEX3 spans are excluded.
- Records whose TIMEX3 span would be truncated by `max_length=256` are excluded.
- TIMEX-free records are retained as negative examples.
- JSON `text` fields are used as the source text.
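
The repository's actual preprocessing code is not reproduced here, but the alignment and truncation rules above can be illustrated with a fast tokenizer's offset mapping (the `bio_labels` helper and its span format are hypothetical):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhgan/ko-sroberta-multitask")

def bio_labels(text, spans, max_length=256):
    """Map character-level TIMEX3 spans to token-level BIO labels.

    spans: list of (start_char, end_char, type) tuples, e.g. (3, 6, "DATE").
    Returns None when a span cannot be aligned or would be cut off by
    max_length, mirroring the exclusion rules listed above (illustrative only).
    """
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True)
    offsets = enc["offset_mapping"]
    labels = ["O"] * len(offsets)
    for start, end, ent_type in spans:
        # Tokens whose character range overlaps the span; (0, 0) special tokens are skipped.
        idxs = [i for i, (s, e) in enumerate(offsets) if e > s and s < end and e > start]
        if not idxs or max(offsets[i][1] for i in idxs) < end:
            return None  # unalignable or truncated span -> drop the record
        labels[idxs[0]] = f"B-{ent_type}"
        for i in idxs[1:]:
            labels[i] = f"I-{ent_type}"
    return labels
```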
64
+
65
+ ## Training Configuration
66
+
67
+ ```bash
68
+ python -m time_expression_classifier.train_token_classifier \
69
+ --data-root "158.시간 표현 탐지 데이터" \
70
+ --model-name jhgan/ko-sroberta-multitask \
71
+ --output-dir outputs/official_epoch2 \
72
+ --split-mode official \
73
+ --epochs 2 \
74
+ --learning-rate 3e-5 \
75
+ --batch-size 16 \
76
+ --max-length 256
77
+ ```
78
+
79
+ Key settings:
80
+
81
+ | setting | value |
82
+ | --- | --- |
83
+ | backbone | `jhgan/ko-sroberta-multitask` |
84
+ | epochs | 2 |
85
+ | learning rate | 3e-5 |
86
+ | batch size | 16 |
87
+ | max length | 256 |
88
+ | weight decay | 0.01 |
89
+ | warmup ratio | 0.06 |
90
+ | seed | 42 |
91
+
92
+ ## Evaluation
93
+
94
+ Metrics are entity-level exact match on the official `Validation` split.
95
+
96
+ | metric | value |
97
+ | --- | ---: |
98
+ | entity precision | 0.8265 |
99
+ | entity recall | 0.8268 |
100
+ | entity F1 | 0.8266 |
101
+ | token accuracy | 0.9899 |
102
+ | eval loss | 0.0350 |
103
+
104
+ Per-label entity-level results:
105
+
106
+ | label | precision | recall | F1 | support |
107
+ | --- | ---: | ---: | ---: | ---: |
108
+ | DATE | 0.8495 | 0.8367 | 0.8430 | 23422 |
109
+ | TIME | 0.7933 | 0.8033 | 0.7983 | 3665 |
110
+ | DURATION | 0.7848 | 0.8247 | 0.8042 | 6810 |
111
+ | SET | 0.7107 | 0.6910 | 0.7007 | 974 |
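
Entity-level exact-match scores of this kind are commonly computed with `seqeval`; whether the training script uses that exact library is not shown here, but the following sketch reproduces the metric definition (the gold and predicted BIO sequences are made-up toy data):

```python
# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# One list of BIO tags per sentence; an entity counts only on an exact span + type match.
y_true = [["O", "B-DATE", "I-DATE", "O", "B-TIME", "O"]]
y_pred = [["O", "B-DATE", "I-DATE", "O", "B-DURATION", "O"]]

print(f1_score(y_true, y_pred))                          # micro-averaged entity F1
print(classification_report(y_true, y_pred, digits=4))   # per-label precision/recall/F1/support
```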

## Usage

```python
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="kwoncho/ko-sroberta-korean-time-expression-classifier",
    aggregation_strategy="simple",
)

text = "매주 토요일 저녁에 회의를 합니다."
print(tagger(text))
```
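
With `aggregation_strategy="simple"`, the pipeline merges contiguous B-/I- tokens into entity spans, so each result carries an `entity_group`, a confidence `score`, and `start`/`end` character offsets. A short follow-up to pull out just the spans and their TIMEX3 types:

```python
for ent in tagger(text):
    print(text[ent["start"]:ent["end"]], ent["entity_group"], round(ent["score"], 3))
```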

## Limitations

- The model is sensitive to ambiguous time expressions such as `주` (week), `하루` (a day), `시간` (hour/time), `한달` (a month), `일주일` (one week), and `매일` (every day).
- `SET` is the lowest-performing label due to smaller support and ambiguity between repeated events and duration expressions.
- The model predicts TIMEX3 spans only. Normalization to calendar values is not included.
- Evaluation uses exact span match, so partial boundary differences count as errors.

## Reproducibility

Repository: `git@github.com:hyun2019/ko-sroberta-korean-time-expression-classifier.git`

The local release artifact is tracked as `models/official_epoch2` via DVC.
config.json ADDED
@@ -0,0 +1,50 @@
{
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-DATE",
    "2": "I-DATE",
    "3": "B-TIME",
    "4": "I-TIME",
    "5": "B-DURATION",
    "6": "I-DURATION",
    "7": "B-SET",
    "8": "I-SET"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-DATE": 1,
    "B-DURATION": 5,
    "B-SET": 7,
    "B-TIME": 3,
    "I-DATE": 2,
    "I-DURATION": 6,
    "I-SET": 8,
    "I-TIME": 4,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertTokenizer",
  "transformers_version": "4.57.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}
eval_metrics.json ADDED
@@ -0,0 +1,27 @@
{
  "eval_loss": 0.034969817847013474,
  "eval_precision": 0.8264533883728931,
  "eval_recall": 0.8267614923575464,
  "eval_f1": 0.8266074116550786,
  "eval_token_accuracy": 0.9898756337293201,
  "eval_label_date_precision": 0.8494581707845688,
  "eval_label_date_recall": 0.8366919989753223,
  "eval_label_date_f1": 0.8430267572915771,
  "eval_label_date_support": 23422,
  "eval_label_time_precision": 0.7933171651845864,
  "eval_label_time_recall": 0.8032742155525239,
  "eval_label_time_f1": 0.7982646420824295,
  "eval_label_time_support": 3665,
  "eval_label_duration_precision": 0.7847959754052544,
  "eval_label_duration_recall": 0.824669603524229,
  "eval_label_duration_f1": 0.8042388658169841,
  "eval_label_duration_support": 6810,
  "eval_label_set_precision": 0.7106652587117213,
  "eval_label_set_recall": 0.6909650924024641,
  "eval_label_set_f1": 0.7006767308693389,
  "eval_label_set_support": 974,
  "eval_runtime": 98.5723,
  "eval_samples_per_second": 455.422,
  "eval_steps_per_second": 28.466,
  "epoch": 2.0
}
label_map.json ADDED
@@ -0,0 +1,24 @@
{
  "label_to_id": {
    "O": 0,
    "B-DATE": 1,
    "I-DATE": 2,
    "B-TIME": 3,
    "I-TIME": 4,
    "B-DURATION": 5,
    "I-DURATION": 6,
    "B-SET": 7,
    "I-SET": 8
  },
  "id_to_label": {
    "0": "O",
    "1": "B-DATE",
    "2": "I-DATE",
    "3": "B-TIME",
    "4": "I-TIME",
    "5": "B-DURATION",
    "6": "I-DURATION",
    "7": "B-SET",
    "8": "I-SET"
  }
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:382f64854160157ffd0fca9a33ac26b46d5db8e97aab11f62ef973c101a2fcfc
size 440161684
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,67 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "[CLS]",
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "eos_token": "[SEP]",
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 128,
  "model_max_length": 512,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
train.log ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fdf542d0e058e1e436f8de44dbf531267f84de401b3bf7b6fe2ba56108bbd3af
size 5841
vocab.txt ADDED
The diff for this file is too large to render. See raw diff