# Korean Sentence Type Classification with ko-sroberta

This repository contains a Korean 4-way sentence type classifier built on top of jhgan/ko-sroberta-multitask.

Labels:
- 사실형 (factual)
- 추론형 (inference)
- 예측형 (prediction)
- 대화형 (conversational)

This is a custom classifier built from:
- backbone encoder: jhgan/ko-sroberta-multitask
- pooling: mean pooling
- head: dropout + linear classification layer
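The encoder + mean pooling + dropout + linear head stack can be sketched in PyTorch as follows. This is an illustrative sketch, not the project's actual code: the class name, method names, and dropout value are assumptions, and the encoder would typically come from `AutoModel.from_pretrained("jhgan/ko-sroberta-multitask")`.

```python
import torch
import torch.nn as nn


class SentenceTypeClassifier(nn.Module):
    """Hypothetical sketch: encoder + mean pooling + dropout + linear head."""

    def __init__(self, encoder, hidden_size: int, num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder  # e.g. a ko-sroberta AutoModel instance
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def mean_pool(self, last_hidden_state, attention_mask):
        # Average token embeddings, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = self.mean_pool(hidden, attention_mask)
        return self.classifier(self.dropout(pooled))
```

Mean pooling (rather than the CLS token) matches how ko-sroberta-multitask embeddings are normally used for sentence-level tasks.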
## Model Summary

Task:
- single-label sentence classification

Input:
- one Korean sentence

Output:
- one of four sentence type labels

Label mapping:
- 0: 사실형
- 1: 추론형
- 2: 예측형
- 3: 대화형
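In code, the mapping can be kept as two dicts (a sketch mirroring the mapping above; the actual serialized format of label_mapping.json may differ):

```python
# Integer id <-> label name, matching the mapping above.
ID2LABEL = {0: "사실형", 1: "추론형", 2: "예측형", 3: "대화형"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```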
## Data

Primary source:
- AI Hub, 문장 유형(예측, 추론, 사실, 대화) 분류 데이터 (sentence type classification data)

Local experiment data were built from the official AI Hub labeling archives and flattened into sentence-level JSON files.

Training split used in the final run:
- all_train.json: 130,823 sentences

Holdout validation split used in the final run:
- all_val.json: 17,644 sentences

Final holdout label supports:
- 사실형: 7,113
- 추론형: 4,970
- 예측형: 953
- 대화형: 4,608
## Training Procedure

Backbone:
- jhgan/ko-sroberta-multitask

Optimization:
- optimizer: AdamW
- scheduler: linear warmup + linear decay
- learning rate: 2e-5
- weight decay: 0.01
- warmup ratio: 0.1

Regularization and stability:
- max sequence length: 256
- early stopping patience: 2
- gradient clipping: 1.0
- mixed precision training
Imbalance handling:
- class-weighted cross-entropy
- WeightedRandomSampler
- best checkpoint selection by macro F1
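A minimal sketch of how class-weighted cross-entropy and a WeightedRandomSampler can be derived from the training labels. Inverse-frequency weighting is an assumption here; the project may scale its weights differently.

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_imbalance_tools(labels, num_classes=4):
    """Return a WeightedRandomSampler and a class-weighted CE loss (sketch)."""
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels, minlength=num_classes).float()
    # Inverse-frequency class weights (assumed scheme).
    class_weights = counts.sum() / (num_classes * counts)
    # One sampling weight per training example, so rare classes are drawn more often.
    sample_weights = class_weights[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
    return sampler, loss_fn
```

The sampler would be passed to the training DataLoader via its `sampler=` argument (which is mutually exclusive with `shuffle=True`).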
Final holdout training run:
- epochs: 8
- batch size: 32
- gradient accumulation: 1
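With Hugging Face Transformers, the optimizer and schedule listed above map onto roughly this setup (a sketch; the helper name `build_optim` is invented, and its defaults simply mirror the hyperparameters reported here):

```python
import torch
from transformers import get_linear_schedule_with_warmup


def build_optim(model, steps_per_epoch, epochs=8, lr=2e-5,
                weight_decay=0.01, warmup_ratio=0.1):
    """AdamW + linear warmup then linear decay over the full run (sketch)."""
    total_steps = steps_per_epoch * epochs
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(total_steps * warmup_ratio),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```

Each training step would additionally call `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before the optimizer step, with mixed precision handled by `torch.amp.autocast` plus a `GradScaler`.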
## Evaluation

5-fold cross-validation on all_train.json:
- mean accuracy: 0.8406
- mean macro F1: 0.8397
- mean weighted F1: 0.8410

Mean per-label F1:
- 사실형: 0.8143
- 추론형: 0.7319
- 예측형: 0.8178
- 대화형: 0.9948
Final holdout on all_val.json:
- accuracy: 0.8270
- macro F1: 0.8153
- weighted F1: 0.8283
Per-label holdout metrics:

| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 사실형 | 0.8399 | 0.7627 | 0.7994 | 7,113 |
| 추론형 | 0.6954 | 0.7740 | 0.7326 | 4,970 |
| 예측형 | 0.6983 | 0.7723 | 0.7334 | 953 |
| 대화형 | 0.9965 | 0.9946 | 0.9955 | 4,608 |
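The summary numbers above follow the standard scikit-learn definitions of accuracy, macro F1, and support-weighted F1; a sketch, with `y_true`/`y_pred` standing in for the real holdout labels and predictions:

```python
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, y_pred):
    """Accuracy plus macro- and support-weighted F1, as reported above."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }
```

Macro F1 averages the four per-label F1 scores equally, which is why the rare 예측형 label pulls it below weighted F1.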
## Intended Use
This model is intended for:
- Korean sentence type classification research
- internal annotation support
- baseline comparison against larger backbones
This model is not presented as a universal Korean sentence understanding model.
## Limitations

- This model was trained on a specific AI Hub sentence type taxonomy.
- Performance is not uniform across labels: 추론형 and 예측형 remain harder than 대화형.
- The model may not generalize cleanly outside the source domain mixture used in the AI Hub corpus.
## Repository Contents

- best.pt: trained classifier checkpoint
- tokenizer/: tokenizer files
- label_mapping.json: explicit label mapping
- best_metrics.json: final reported holdout metrics
- hf_export_config.json: export metadata
- load_spec.json: custom loading notes
## Loading Note

This repository contains a custom classification head checkpoint, not a plain AutoModelForSequenceClassification export.

To use it directly, load:
- the tokenizer from tokenizer/
- the checkpoint from best.pt
- the model architecture from the local project code that defines the encoder, pooling, and classifier head

In other words, this repository is ready for publishing, but inference currently expects the accompanying custom code path rather than a stock Transformers sequence-classification class.
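A hedged sketch of what that custom path looks like in practice. `SentenceTypeClassifier` is a stand-in for the project's real model class, and the state-dict layout of `best.pt` is an assumption; only the prediction helper below is concrete.

```python
import torch

# Assumed loading sequence (the real model class lives in the project code):
#
#   tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
#   encoder   = AutoModel.from_pretrained("jhgan/ko-sroberta-multitask")
#   model     = SentenceTypeClassifier(encoder, ...)          # local project code
#   model.load_state_dict(torch.load("best.pt", map_location="cpu"))
#   model.eval()


def predict_sentence_type(text, tokenizer, model, id2label, max_length=256):
    """Classify one sentence; `model` is assumed to return raw logits of shape (1, 4)."""
    batch = tokenizer([text], return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = model(batch["input_ids"], batch["attention_mask"])
    return id2label[int(logits.argmax(dim=-1))]
```

The 256-token truncation matches the max sequence length used during training.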