---
language:
  - ko
license: other
library_name: transformers
base_model: jhgan/ko-sroberta-multitask
tags:
  - token-classification
  - named-entity-recognition
  - timex
  - korean
metrics:
  - f1
pipeline_tag: token-classification
model-index:
  - name: ko-sroberta-korean-time-expression-classifier
    results:
      - task:
          type: token-classification
          name: Korean TIMEX3 Detection
        dataset:
          name: 158.시간 표현 탐지 데이터
          type: private
          split: Validation
        metrics:
          - type: f1
            name: Entity F1
            value: 0.8266074116550786
          - type: precision
            name: Entity Precision
            value: 0.8264533883728931
          - type: recall
            name: Entity Recall
            value: 0.8267614923575464
---

# Korean Time Expression Classifier

This model detects Korean TIMEX3 time expressions with BIO token classification labels.

The backbone is [`jhgan/ko-sroberta-multitask`](https://huggingface.co/jhgan/ko-sroberta-multitask), fine-tuned on `158.시간 표현 탐지 데이터` for four TIMEX3 entity types:

- `DATE`
- `TIME`
- `DURATION`
- `SET`
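
With four entity types, a BIO scheme gives nine labels: one `O` plus a `B-`/`I-` pair per type. The sketch below builds such a label set. The ordering is an assumption made for illustration; the `id2label` mapping stored with the published model takes precedence.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["DATE", "TIME", "DURATION", "SET"]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types.

    The ordering is an assumption for illustration only.
    """
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}

print(LABELS)
```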

## Intended Use

Use this model to identify Korean time expressions in sentences or utterances. It predicts token-level BIO labels and can be used through the Hugging Face `token-classification` pipeline.

This is an experimental model trained for TIMEX3 span detection. It does not extract EVENT or TLINK annotations.

## Training Data

The model was trained on the official `Training` split and evaluated on the official `Validation` split of `158.시간 표현 탐지 데이터`.

Training/evaluation preprocessing:

- Unsupported, empty, malformed, or unalignable TIMEX3 spans are excluded.
- Records whose TIMEX3 span would be truncated by `max_length=256` are excluded.
- TIMEX-free records are retained as negative examples.
- JSON `text` fields are used as the source text.
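
The filtering rules above can be sketched as a single alignment function. This is not the repository's preprocessing code: `spans_to_bio` is a hypothetical helper, it assumes character offsets per token (such as a fast tokenizer's offset mapping with special tokens removed), and it treats `max_tokens` as a plain token budget, whereas the real `max_length=256` may also count special tokens.

```python
SUPPORTED_TYPES = {"DATE", "TIME", "DURATION", "SET"}


def spans_to_bio(offsets, spans, max_tokens=256):
    """Return one BIO label per kept token, or None if the record is dropped.

    offsets: (start, end) character offsets per token, end exclusive.
    spans:   (start, end, type) character spans, end exclusive.
    """
    kept = offsets[:max_tokens]
    labels = ["O"] * len(kept)
    for start, end, timex_type in spans:
        if timex_type not in SUPPORTED_TYPES or start >= end:
            continue  # unsupported or empty span: skip the span, keep the record
        covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        if not covered:
            continue  # no token overlaps the span: treated as unalignable
        if covered[-1] >= len(kept):
            return None  # span would be truncated: drop the whole record
        labels[covered[0]] = f"B-{timex_type}"
        for i in covered[1:]:
            labels[i] = f"I-{timex_type}"
    return labels


# Hand-written offsets for "매주 토요일 저녁에" (illustrative, not real tokenizer output).
offsets = [(0, 2), (3, 6), (7, 9), (9, 10)]
print(spans_to_bio(offsets, [(0, 6, "SET"), (7, 9, "TIME")]))
print(spans_to_bio(offsets, []))  # TIMEX-free record stays as a negative example
```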

## Training Configuration

```bash
python -m time_expression_classifier.train_token_classifier \
  --data-root "158.시간 표현 탐지 데이터" \
  --model-name jhgan/ko-sroberta-multitask \
  --output-dir outputs/official_epoch2 \
  --split-mode official \
  --epochs 2 \
  --learning-rate 3e-5 \
  --batch-size 16 \
  --max-length 256
```

Key settings:

| setting | value |
| --- | --- |
| backbone | `jhgan/ko-sroberta-multitask` |
| epochs | 2 |
| learning rate | 3e-5 |
| batch size | 16 |
| max length | 256 |
| weight decay | 0.01 |
| warmup ratio | 0.06 |
| seed | 42 |
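
To make the warmup ratio concrete, the sketch below derives step counts from the settings in the table. It assumes a single device, no gradient accumulation, and a warmup length rounded up from `warmup_ratio * total_steps`; the training-set size used in the example is hypothetical, since this card does not state it.

```python
import math

# Values copied from the table above.
CONFIG = {
    "epochs": 2,
    "learning_rate": 3e-5,
    "batch_size": 16,
    "max_length": 256,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "seed": 42,
}


def schedule_steps(num_examples, config):
    """Return (total optimizer steps, warmup steps) under the stated assumptions."""
    steps_per_epoch = math.ceil(num_examples / config["batch_size"])
    total_steps = steps_per_epoch * config["epochs"]
    warmup_steps = math.ceil(total_steps * config["warmup_ratio"])
    return total_steps, warmup_steps


# 10_000 is a made-up example count, not the size of the actual training split.
total, warmup = schedule_steps(10_000, CONFIG)
print(total, warmup)
```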

## Evaluation

Metrics are entity-level exact match on the official `Validation` split.

| metric | value |
| --- | ---: |
| entity precision | 0.8265 |
| entity recall | 0.8268 |
| entity F1 | 0.8266 |
| token accuracy | 0.9899 |
| eval loss | 0.0350 |

Per-label entity-level results:

| label | precision | recall | F1 | support |
| --- | ---: | ---: | ---: | ---: |
| DATE | 0.8495 | 0.8367 | 0.8430 | 23422 |
| TIME | 0.7933 | 0.8033 | 0.7983 | 3665 |
| DURATION | 0.7848 | 0.8247 | 0.8042 | 6810 |
| SET | 0.7107 | 0.6910 | 0.7007 | 974 |
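
For reference, entity-level exact match can be sketched as follows: a predicted entity counts as correct only if its start, end, and type all equal a gold entity. This is an illustrative implementation, not the evaluation script behind the numbers above; in particular, it ignores an `I-` tag that does not continue an open entity, which some evaluation libraries handle differently.

```python
def extract_entities(labels):
    """Collect (start, end, type) token spans from one BIO sequence, end exclusive."""
    entities, start, current = [], None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel flushes the last span
        continues = label.startswith("I-") and label[2:] == current
        if current is not None and not continues:
            entities.append((start, i, current))
            start, current = None, None
        if label.startswith("B-"):
            start, current = i, label[2:]
    return set(entities)


def entity_prf(gold_seqs, pred_seqs):
    """Micro-averaged exact-match precision, recall, and F1 over all sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_set, pred_set = extract_entities(gold), extract_entities(pred)
        tp += len(gold_set & pred_set)
        n_gold += len(gold_set)
        n_pred += len(pred_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-SET", "I-SET", "B-TIME", "O"]]
# The SET span is one token too short, so it counts as an error under exact match.
pred = [["B-SET", "O", "B-TIME", "O"]]
print(entity_prf(gold, pred))
```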

## Usage

```python
from transformers import pipeline

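# Load the fine-tuned tagger; "simple" aggregation merges sub-word tokens
# into entity groups with character offsets.
tagger = pipeline(
    "token-classification",
    model="kwoncho/ko-sroberta-korean-time-expression-classifier",
    aggregation_strategy="simple",
)

# "We have a meeting every Saturday evening."
text = "매주 토요일 저녁에 회의를 합니다."
print(tagger(text))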
```
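
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below turns such a list into typed spans sliced from the original text. `to_spans` and its `min_score` threshold are hypothetical helpers, and the sample predictions are hand-written to show the shape of the data; they are not actual output from this model.

```python
def to_spans(text, predictions, min_score=0.5):
    """Convert aggregated pipeline predictions into sorted, typed text spans.

    min_score is an illustrative confidence cutoff, not a recommended value.
    """
    spans = []
    for prediction in predictions:
        if prediction["score"] < min_score:
            continue
        start, end = prediction["start"], prediction["end"]
        spans.append(
            {
                "type": prediction["entity_group"],
                "text": text[start:end],  # slice the source text, not the detokenized word
                "start": start,
                "end": end,
                "score": float(prediction["score"]),
            }
        )
    return sorted(spans, key=lambda span: span["start"])


text = "매주 토요일 저녁에 회의를 합니다."
# Hand-written predictions for illustration only (not real model output).
mock_predictions = [
    {"entity_group": "TIME", "score": 0.91, "word": "저녁", "start": 7, "end": 9},
    {"entity_group": "SET", "score": 0.88, "word": "매주 토요일", "start": 0, "end": 6},
    {"entity_group": "DATE", "score": 0.20, "word": "회의", "start": 11, "end": 13},
]

for span in to_spans(text, mock_predictions):
    print(span["type"], span["text"], span["start"], span["end"])
```

Slicing `text[start:end]` rather than using `word` keeps the original spacing, since the pipeline's `word` field is reconstructed from tokens.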

## Limitations

- The model is sensitive to ambiguous time expressions such as `주`, `하루`, `시간`, `한달`, `일주일`, and `매일`.
- `SET` is the lowest-performing label due to smaller support and ambiguity between repeated events and duration expressions.
- The model predicts TIMEX3 spans only. Normalization to calendar values is not included.
- Evaluation uses exact span match, so partial boundary differences count as errors.

## Reproducibility

Repository: `git@github.com:hyun2019/ko-sroberta-korean-time-expression-classifier.git`

The local release artifact is tracked as `models/official_epoch2` via DVC.