---
language:
  - ko
license: other
library_name: transformers
base_model: jhgan/ko-sroberta-multitask
tags:
  - token-classification
  - named-entity-recognition
  - timex
  - korean
metrics:
  - f1
pipeline_tag: token-classification
model-index:
  - name: ko-sroberta-korean-time-expression-classifier
    results:
      - task:
          type: token-classification
          name: Korean TIMEX3 Detection
        dataset:
          name: 158.시간 표현 탐지 데이터
          type: private
          split: Validation
        metrics:
          - type: f1
            name: Entity F1
            value: 0.8266074116550786
          - type: precision
            name: Entity Precision
            value: 0.8264533883728931
          - type: recall
            name: Entity Recall
            value: 0.8267614923575464
---

# Korean Time Expression Classifier

This model detects Korean TIMEX3 time expressions with BIO token classification labels.

The backbone is [`jhgan/ko-sroberta-multitask`](https://huggingface.co/jhgan/ko-sroberta-multitask), fine-tuned on `158.시간 표현 탐지 데이터` (Time Expression Detection Data) for four TIMEX3 entity types:

- `DATE`
- `TIME`
- `DURATION`
- `SET`

## Intended Use

Use this model to identify Korean time expressions in sentences or utterances. It predicts token-level BIO labels and can be used through the Hugging Face `token-classification` pipeline.

This is an experimental model trained for TIMEX3 span detection. It does not extract EVENT or TLINK annotations.
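
As a minimal sketch of how token-level BIO labels map to entity spans, the function below groups a tag sequence into `(type, start, end)` token ranges. The tag strings (`B-DATE`, `I-DATE`, and so on) follow the usual BIO naming and are an assumption here; the model's `id2label` config holds the actual names. Stray `I-` tags are dropped, which is only one of several possible policies.

```python
from typing import List, Optional, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start_token, end_token_exclusive)."""
    spans: List[Tuple[str, int, int]] = []
    current_type: Optional[str] = None
    start = 0
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first.
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = label[2:], i
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the currently open span.
            continue
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span (if any) and ignore this token.
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type = None
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hypothetical tag sequence for five tokens, not actual model output.
print(bio_to_spans(["B-SET", "I-SET", "B-TIME", "O", "O"]))
```

The token ranges still need to be mapped back to character offsets, which the `token-classification` pipeline does when an aggregation strategy is set.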

## Training Data

The model was trained on the official `Training` split and evaluated on the official `Validation` split of `158.시간 표현 탐지 데이터`.

Training/evaluation preprocessing:

- Unsupported, empty, malformed, or unalignable TIMEX3 spans are excluded.
- Records whose TIMEX3 span would be truncated by `max_length=256` are excluded.
- TIMEX-free records are retained as negative examples.
- JSON `text` fields are used as the source text.
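
The filtering rules above can be sketched as follows. This is an illustration under stated assumptions, not the repository's preprocessing code: the span format `(char_start, char_end, label)`, the helper names `filter_spans` and `keep_record`, and the example spans are all hypothetical. `token_offsets` stands for the character offsets of the tokens that survive truncation to `max_length`.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)
SUPPORTED = {"DATE", "TIME", "DURATION", "SET"}


def filter_spans(text: str, spans: List[Span]) -> List[Span]:
    """Drop unsupported, empty, or out-of-range spans."""
    kept = []
    for start, end, label in spans:
        if label not in SUPPORTED:
            continue  # unsupported type, e.g. EVENT
        if not (0 <= start < end <= len(text)):
            continue  # empty or malformed offsets
        kept.append((start, end, label))
    return kept


def keep_record(spans: List[Span], token_offsets: List[Tuple[int, int]]) -> bool:
    """Return False if any span extends past the last token kept after truncation."""
    if not spans:
        return True  # TIMEX-free records stay as negative examples
    if not token_offsets:
        return False
    last_covered = token_offsets[-1][1]
    return all(end <= last_covered for _, end, _ in spans)


text = "매주 토요일 저녁에 회의를 합니다."
# Made-up annotations for illustration only.
raw = [(0, 6, "SET"), (7, 9, "TIME"), (3, 3, "DATE"), (0, 2, "EVENT")]
spans = filter_spans(text, raw)
print(spans)
print(keep_record(spans, [(0, 2), (3, 6)]))  # second span falls outside the kept tokens
```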

## Training Configuration

```bash
python -m time_expression_classifier.train_token_classifier \
  --data-root "158.시간 표현 탐지 데이터" \
  --model-name jhgan/ko-sroberta-multitask \
  --output-dir outputs/official_epoch2 \
  --split-mode official \
  --epochs 2 \
  --learning-rate 3e-5 \
  --batch-size 16 \
  --max-length 256
```

Key settings:

| setting | value |
| --- | --- |
| backbone | `jhgan/ko-sroberta-multitask` |
| epochs | 2 |
| learning rate | 3e-5 |
| batch size | 16 |
| max length | 256 |
| weight decay | 0.01 |
| warmup ratio | 0.06 |
| seed | 42 |

## Evaluation

Metrics are entity-level exact match on the official `Validation` split.

| metric | value |
| --- | ---: |
| entity precision | 0.8265 |
| entity recall | 0.8268 |
| entity F1 | 0.8266 |
| token accuracy | 0.9899 |
| eval loss | 0.0350 |
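
For reference, entity-level exact-match scoring can be sketched as a set comparison: a prediction counts only if its label and both boundaries equal a gold entity. The function name `entity_prf` and the tuple layout are assumptions for illustration; the evaluation script used for the numbers above may differ in detail.

```python
from typing import Iterable, Tuple

Entity = Tuple[str, int, int]  # (label, start, end_exclusive)


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def entity_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entities."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Toy example: the second prediction is one position short, so it is an error.
gold = [("DATE", 0, 2), ("TIME", 3, 5)]
pred = [("DATE", 0, 2), ("TIME", 3, 4)]
print(entity_prf(gold, pred))

# The reported F1 is the harmonic mean of the reported precision and recall.
print(f1_from_pr(0.8264533883728931, 0.8267614923575464))
```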

Per-label entity-level results:

| label | precision | recall | F1 | support |
| --- | ---: | ---: | ---: | ---: |
| DATE | 0.8495 | 0.8367 | 0.8430 | 23422 |
| TIME | 0.7933 | 0.8033 | 0.7983 | 3665 |
| DURATION | 0.7848 | 0.8247 | 0.8042 | 6810 |
| SET | 0.7107 | 0.6910 | 0.7007 | 974 |
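
As a rough consistency check, and assuming that `support` counts gold entities per label and that the overall recall is micro-averaged, the support-weighted mean of the per-label recalls should land close to the overall recall. The sketch below uses the rounded values from the table, so small differences in later decimals are expected.

```python
# (label, recall, support) copied from the per-label table above.
rows = [
    ("DATE", 0.8367, 23422),
    ("TIME", 0.8033, 3665),
    ("DURATION", 0.8247, 6810),
    ("SET", 0.6910, 974),
]

total_support = sum(support for _, _, support in rows)
weighted_recall = sum(recall * support for _, recall, support in rows) / total_support

print(total_support, round(weighted_recall, 4))
```

The same shortcut does not apply to precision, which would need per-label counts of predicted entities rather than gold entities.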

## Usage

```python
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="kwoncho/ko-sroberta-korean-time-expression-classifier",
    aggregation_strategy="simple",
)

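# "We hold a meeting every Saturday evening."
text = "매주 토요일 저녁에 회의를 합니다."
# Each result is a dict with entity_group, score, word, start, and end.
print(tagger(text))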
```
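
A small post-processing sketch may help when consuming the pipeline output. With an aggregation strategy set, each result is a dict with `entity_group`, `score`, `word`, `start`, and `end`. The helper `to_spans` and its `min_score` threshold are hypothetical, and the mock results below are hand-written stand-ins rather than actual model predictions. The sketch assumes a fast tokenizer, so that `start` and `end` are populated.

```python
from typing import Dict, List, Tuple


def to_spans(
    text: str, entities: List[Dict], min_score: float = 0.0
) -> List[Tuple[str, str, int, int]]:
    """Convert pipeline results to (label, surface_text, start, end) tuples."""
    spans = []
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop low-confidence predictions
        start, end = ent["start"], ent["end"]
        # Slice the original text rather than using `word`, which is rebuilt
        # from tokens and may not match the source string exactly.
        spans.append((ent["entity_group"], text[start:end], start, end))
    return spans


text = "매주 토요일 저녁에 회의를 합니다."
# Hand-written stand-in for tagger(text); scores and spans are invented.
mock_entities = [
    {"entity_group": "SET", "score": 0.98, "word": "매주 토요일", "start": 0, "end": 6},
    {"entity_group": "TIME", "score": 0.41, "word": "저녁", "start": 7, "end": 9},
]
print(to_spans(text, mock_entities, min_score=0.5))
```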

## Limitations

- The model is sensitive to ambiguous time expressions such as `주` (week), `하루` (a day), `시간` (hour or time), `한달` (one month), `일주일` (one week), and `매일` (every day).
- `SET` is the lowest-performing label (F1 0.7007), due to its small support (974 validation entities) and ambiguity between repeated events and duration expressions.
- The model predicts TIMEX3 spans only. Normalization to calendar values is not included.
- Evaluation uses exact span match, so partial boundary differences count as errors.

## Reproducibility

Repository: `git@github.com:hyun2019/ko-sroberta-korean-time-expression-classifier.git`

The local release artifact is tracked as `models/official_epoch2` via DVC.