---
language:
- ko
license: other
library_name: transformers
base_model: jhgan/ko-sroberta-multitask
tags:
- token-classification
- named-entity-recognition
- timex
- korean
metrics:
- f1
pipeline_tag: token-classification
model-index:
- name: ko-sroberta-korean-time-expression-classifier
results:
- task:
type: token-classification
name: Korean TIMEX3 Detection
dataset:
name: 158.시간 표현 탐지 데이터
type: private
split: Validation
metrics:
- type: f1
name: Entity F1
value: 0.8266074116550786
- type: precision
name: Entity Precision
value: 0.8264533883728931
- type: recall
name: Entity Recall
value: 0.8267614923575464
---
# Korean Time Expression Classifier
This model detects Korean TIMEX3 time expressions via BIO token classification.
The backbone is [`jhgan/ko-sroberta-multitask`](https://huggingface.co/jhgan/ko-sroberta-multitask), fine-tuned on the `158.시간 표현 탐지 데이터` (Temporal Expression Detection Data) corpus for four TIMEX3 entity types:
- `DATE`
- `TIME`
- `DURATION`
- `SET`
## Intended Use
Use this model to identify Korean time expressions in sentences or utterances. It predicts token-level BIO labels and can be used through the Hugging Face `token-classification` pipeline.
This is an experimental model trained for TIMEX3 span detection. It does not extract EVENT or TLINK annotations.
## Training Data
The model was trained on the official `Training` split and evaluated on the official `Validation` split of `158.시간 표현 탐지 데이터`.
Training/evaluation preprocessing:
- Unsupported, empty, malformed, or unalignable TIMEX3 spans are excluded.
- Records whose TIMEX3 span would be truncated by `max_length=256` are excluded.
- TIMEX-free records are retained as negative examples.
- JSON `text` fields are used as the source text.
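The alignment and exclusion rules above can be sketched as a single helper that maps character-level TIMEX3 spans onto token-level BIO labels via a fast tokenizer's offset mapping. This is an illustrative sketch, not the project's actual preprocessing code; `spans_to_bio` and the exact exclusion logic are assumptions:

```python
def spans_to_bio(offsets, spans):
    """Map character-level TIMEX3 spans to token-level BIO labels.

    offsets: per-token (start, end) character offsets, as returned by a
             fast tokenizer with return_offsets_mapping=True
    spans:   list of (start, end, label) character spans
    Returns the BIO label list, or None when a span cannot be aligned
    (mirroring the exclusion of unalignable records described above).
    """
    labels = ["O"] * len(offsets)
    for start, end, label in spans:
        # Tokens overlapping the span; (0, 0) special tokens are skipped.
        idx = [i for i, (s, e) in enumerate(offsets)
               if s < end and e > start and s != e]
        if not idx:
            return None  # unalignable span -> record excluded
        labels[idx[0]] = f"B-{label}"
        for i in idx[1:]:
            labels[i] = f"I-{label}"
    return labels

# "내일 회의" with one token per word; specials carry (0, 0) offsets.
offsets = [(0, 0), (0, 2), (3, 5), (0, 0)]
print(spans_to_bio(offsets, [(0, 2, "DATE")]))  # ['O', 'B-DATE', 'O', 'O']
```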
## Training Configuration
```bash
python -m time_expression_classifier.train_token_classifier \
  --data-root "158.시간 표현 탐지 데이터" \
  --model-name jhgan/ko-sroberta-multitask \
  --output-dir outputs/official_epoch2 \
  --split-mode official \
  --epochs 2 \
  --learning-rate 3e-5 \
  --batch-size 16 \
  --max-length 256
```
Key settings:
| setting | value |
| --- | --- |
| backbone | `jhgan/ko-sroberta-multitask` |
| epochs | 2 |
| learning rate | 3e-5 |
| batch size | 16 |
| max length | 256 |
| weight decay | 0.01 |
| warmup ratio | 0.06 |
| seed | 42 |
## Evaluation
Metrics are entity-level exact match on the official `Validation` split.
| metric | value |
| --- | ---: |
| entity precision | 0.8265 |
| entity recall | 0.8268 |
| entity F1 | 0.8266 |
| token accuracy | 0.9899 |
| eval loss | 0.0350 |
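"Entity-level exact match" means a predicted span counts as correct only when both its boundaries and its label match a gold span exactly. A minimal, self-contained sketch of this scoring (hypothetical helper names, not the project's evaluation code, which may use a library such as seqeval):

```python
def bio_to_entities(labels):
    """Collect (label, start, end) entities from a BIO tag sequence."""
    entities, start = [], None
    for i, tag in enumerate(labels + ["O"]):  # sentinel "O" closes the last span
        if start is not None and not tag.startswith("I-"):
            entities.append((labels[start][2:], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return entities

def entity_f1(gold, pred):
    """Micro-averaged exact-match precision, recall, and F1 over entities."""
    g = set(bio_to_entities(gold))
    p = set(bio_to_entities(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-DATE", "I-DATE", "O", "B-TIME"]
pred = ["B-DATE", "I-DATE", "O", "O"]  # DATE matched exactly, TIME missed
print(entity_f1(gold, pred))  # (1.0, 0.5, 0.666...)
```

Note that partial overlaps score zero under this metric, which is why the Limitations section flags boundary differences as errors.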
Per-label entity-level results:
| label | precision | recall | F1 | support |
| --- | ---: | ---: | ---: | ---: |
| DATE | 0.8495 | 0.8367 | 0.8430 | 23422 |
| TIME | 0.7933 | 0.8033 | 0.7983 | 3665 |
| DURATION | 0.7848 | 0.8247 | 0.8042 | 6810 |
| SET | 0.7107 | 0.6910 | 0.7007 | 974 |
## Usage
```python
from transformers import pipeline
tagger = pipeline(
    "token-classification",
    model="kwoncho/ko-sroberta-korean-time-expression-classifier",
    aggregation_strategy="simple",
)
text = "매주 토요일 저녁에 회의를 합니다."
print(tagger(text))
```
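With `aggregation_strategy="simple"`, the pipeline returns one dict per merged entity, including `entity_group`, `score`, `start`, and `end` keys. A small post-processing sketch that turns such output into `(label, surface text)` pairs; the `preds` values below are made-up placeholders, not actual model output:

```python
def extract_spans(text, predictions, min_score=0.5):
    """Keep confident entities and slice their surface text from the input."""
    return [
        (p["entity_group"], text[p["start"]:p["end"]])
        for p in predictions
        if p["score"] >= min_score
    ]

text = "매주 토요일 저녁에 회의를 합니다."
# Hypothetical pipeline output for the sentence above.
preds = [
    {"entity_group": "SET", "score": 0.98, "start": 0, "end": 6},
    {"entity_group": "TIME", "score": 0.95, "start": 7, "end": 9},
]
print(extract_spans(text, preds))  # [('SET', '매주 토요일'), ('TIME', '저녁')]
```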
## Limitations
- The model is sensitive to ambiguous time expressions such as `주`, `하루`, `시간`, `한달`, `일주일`, and `매일`.
- `SET` is the lowest-performing label due to smaller support and ambiguity between repeated events and duration expressions.
- The model predicts TIMEX3 spans only. Normalization to calendar values is not included.
- Evaluation uses exact span match, so partial boundary differences count as errors.
## Reproducibility
Repository: `git@github.com:hyun2019/ko-sroberta-korean-time-expression-classifier.git`
The local release artifact is tracked as `models/official_epoch2` via DVC.