Korean Sentence Type Classification with ko-sroberta

This repository contains a Korean 4-way sentence type classifier built on top of jhgan/ko-sroberta-multitask.

Labels:

  • 사실형 (fact)
  • 추론형 (inference)
  • 예측형 (prediction)
  • 대화형 (dialogue)

This is a custom classifier built from:

  • backbone encoder: jhgan/ko-sroberta-multitask
  • pooling: mean pooling
  • head: dropout + linear classification layer
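A minimal sketch of the mean-pooling step (an illustration of averaging over non-padding tokens, not the project's exact code):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Toy check: two real tokens ([1,1] and [3,3]) and one padding token.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # → tensor([[2., 2.]])
```

The pooled vector then passes through dropout and a single linear layer producing four logits.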

Model Summary

Task:

  • single-label sentence classification

Input:

  • one Korean sentence

Output:

  • one of four sentence type labels

Label mapping:

  • 0: 사실형 (fact)
  • 1: 추론형 (inference)
  • 2: 예측형 (prediction)
  • 3: 대화형 (dialogue)
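In code, the mapping stored in label_mapping.json presumably corresponds to a pair of lookup tables like the following (English glosses added for readability):

```python
# id ↔ label tables; glosses: fact, inference, prediction, dialogue
ID2LABEL = {0: "사실형", 1: "추론형", 2: "예측형", 3: "대화형"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```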

Data

Primary source:

  • AI Hub 문장 유형(예측, 추론, 사실, 대화) 분류 데이터 (AI Hub sentence type classification dataset: prediction, inference, fact, dialogue)

Local experiment data were built from the official AI Hub labeling archives and flattened into sentence-level JSON files.
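A flattening step of this kind might look like the sketch below; the per-record field names (`sentence`, `type`) and directory layout are assumptions, not the actual AI Hub schema.

```python
import json
from pathlib import Path

def flatten_labeling_archive(src_dir: str, out_path: str) -> int:
    """Flatten per-file labeling JSON into one sentence-level JSON list.

    Assumes each source file holds a list of records with hypothetical
    `sentence` and `type` fields; returns the number of sentences written.
    """
    rows = []
    for fp in sorted(Path(src_dir).glob("**/*.json")):
        records = json.loads(fp.read_text(encoding="utf-8"))
        for record in records:
            rows.append({"sentence": record["sentence"], "label": record["type"]})
    Path(out_path).write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")
    return len(rows)
```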

Training split used in the final run:

  • all_train.json: 130,823 sentences

Holdout validation split used in the final run:

  • all_val.json: 17,644 sentences

Final holdout label support (sentences per label):

  • 사실형: 7,113
  • 추론형: 4,970
  • 예측형: 953
  • 대화형: 4,608

Training Procedure

Backbone:

  • jhgan/ko-sroberta-multitask

Optimization:

  • optimizer: AdamW
  • scheduler: linear warmup + linear decay
  • learning rate: 2e-5
  • weight decay: 0.01
  • warmup ratio: 0.1
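With these hyperparameters, the optimizer and scheduler setup would plausibly look like this sketch (the `build_optimizer` helper is illustrative, not the project's code):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, num_training_steps: int):
    """AdamW with linear warmup + linear decay, per the card's hyperparameters."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio 0.1
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```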

Regularization and stability:

  • max sequence length: 256
  • early stopping patience: 2
  • gradient clipping: 1.0
  • mixed precision training
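Gradient clipping interacts with mixed precision: gradients must be unscaled before the norm is clipped. A hedged sketch of one training step (the `train_step` helper and the batch layout are assumptions):

```python
import torch

def train_step(model, batch, optimizer, scaler, device_type="cpu"):
    """One step with AMP autocast, loss scaling, and grad-norm clipping at 1.0.

    Assumes `model(**batch)` returns the loss; create the scaler once with
    e.g. `scaler = torch.cuda.amp.GradScaler()` (enabled only on CUDA).
    """
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        loss = model(**batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                 # undo loss scaling
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```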

Imbalance handling:

  • class-weighted cross-entropy
  • WeightedRandomSampler
  • best checkpoint selection by macro F1
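The two imbalance mechanisms can be built from the label list alone; a minimal sketch using inverse-frequency weights (the helper name and weighting formula are assumptions, and labels are assumed to be dense integers 0..K-1):

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def build_imbalance_tools(labels):
    """Class-weighted cross-entropy loss plus a WeightedRandomSampler."""
    counts = Counter(labels)
    num_classes = len(counts)
    total = len(labels)
    # Inverse-frequency ("balanced") class weights; assumes every class appears.
    class_weights = torch.tensor(
        [total / (num_classes * counts[c]) for c in range(num_classes)],
        dtype=torch.float,
    )
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
    # Per-sample weights so minority-class sentences are drawn more often.
    sample_weights = [class_weights[y].item() for y in labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=total, replacement=True)
    return loss_fn, sampler
```

The sampler is passed to the training DataLoader in place of shuffling.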

Final holdout training run:

  • epochs: 8
  • batch size: 32
  • gradient accumulation: 1

Evaluation

5-fold cross-validation on all_train.json

  • mean accuracy: 0.8406
  • mean macro F1: 0.8397
  • mean weighted F1: 0.8410

Mean per-label F1:

  • 사실형: 0.8143
  • 추론형: 0.7319
  • 예측형: 0.8178
  • 대화형: 0.9948

Final holdout on all_val.json

  • accuracy: 0.8270
  • macro F1: 0.8153
  • weighted F1: 0.8283

Per-label holdout metrics:

Label     Precision   Recall    F1        Support
사실형    0.8399      0.7627    0.7994    7,113
추론형    0.6954      0.7740    0.7326    4,970
예측형    0.6983      0.7723    0.7334    953
대화형    0.9965      0.9946    0.9955    4,608
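Metrics of this shape can be reproduced with scikit-learn; a minimal sketch, not the project's evaluation script:

```python
from sklearn.metrics import classification_report, f1_score

def evaluate(y_true, y_pred, label_names):
    """Macro F1 plus a per-label precision/recall/F1/support table."""
    macro = f1_score(y_true, y_pred, average="macro")
    report = classification_report(y_true, y_pred, target_names=label_names, digits=4)
    return macro, report
```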

Intended Use

This model is intended for:

  • Korean sentence type classification research
  • internal annotation support
  • baseline comparison against larger backbones

This model is not intended as a general-purpose Korean sentence understanding model.

Limitations

  • This model was trained on a specific AI Hub sentence type taxonomy.
  • Performance is not uniform across labels.
  • 추론형 (inference) and 예측형 (prediction) remain harder than 대화형 (dialogue).
  • The model may not generalize cleanly outside the source domain mixture used in the AI Hub corpus.

Repository Contents

  • best.pt: trained classifier checkpoint
  • tokenizer/: tokenizer files
  • label_mapping.json: explicit label mapping
  • best_metrics.json: final reported holdout metrics
  • hf_export_config.json: export metadata
  • load_spec.json: custom loading notes

Loading Note

This repository contains a custom classification head checkpoint, not a plain AutoModelForSequenceClassification export.

To use it directly, you should load:

  • the tokenizer from tokenizer/
  • the checkpoint from best.pt
  • the model architecture from the local project code that defines the encoder, pooling, and classifier head

In other words, this repository is ready for publishing, but inference currently expects the accompanying custom code path rather than a stock Transformers sequence-classification class.
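The custom loading path can be sketched as follows. `SentenceTypeClassifier` is a stand-in for the project-local class, and the checkpoint key layout in `best.pt` is an assumption; only the backbone name, file names, and pooling scheme come from this card.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceTypeClassifier(nn.Module):
    """Stand-in for the project code: encoder + mean pooling + dropout + linear head."""
    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(self.dropout(pooled))

def load_classifier(tokenizer_dir="tokenizer/", ckpt_path="best.pt",
                    backbone="jhgan/ko-sroberta-multitask"):
    """Load the tokenizer, rebuild the architecture, and restore best.pt."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    encoder = AutoModel.from_pretrained(backbone)
    model = SentenceTypeClassifier(encoder, encoder.config.hidden_size)
    model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    model.eval()
    return tokenizer, model
```

At inference time, tokenize one sentence (truncated to 256 tokens), run the model, and map `logits.argmax(-1)` back through the label mapping.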
