Korean Sentence Type Classification with ko-sroberta

This repository contains a Korean 4-way sentence type classifier built on top of jhgan/ko-sroberta-multitask.

Labels:

  • 사실형 (fact)
  • 추론형 (inference)
  • 예측형 (prediction)
  • 대화형 (dialogue)

This is a custom classifier built from:

  • backbone encoder: jhgan/ko-sroberta-multitask
  • pooling: mean pooling
  • head: dropout + linear classification layer
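A minimal sketch of the mean-pooling step (an illustration of averaging over non-padding tokens, not the project's exact code):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Toy check: two real tokens ([1,1] and [3,3]) and one padding token.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # → tensor([[2., 2.]])
```

The pooled vector then passes through dropout and a single linear layer producing four logits.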

Model Summary

Task:

  • single-label sentence classification

Input:

  • one Korean sentence

Output:

  • one of four sentence type labels

Label mapping:

  • 0: 사실형 (fact)
  • 1: 추론형 (inference)
  • 2: 예측형 (prediction)
  • 3: 대화형 (dialogue)
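In code, the mapping stored in label_mapping.json presumably corresponds to a pair of lookup tables like the following (English glosses added for readability):

```python
# id ↔ label tables; glosses: fact, inference, prediction, dialogue
ID2LABEL = {0: "사실형", 1: "추론형", 2: "예측형", 3: "대화형"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```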

Data

Primary source:

  • AI Hub 문장 유형(예측, 추론, 사실, 대화) 분류 데이터 (AI Hub sentence type classification dataset: prediction, inference, fact, dialogue)

Local experiment data were built from the official AI Hub labeling archives and flattened into sentence-level JSON files.
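A flattening step of this kind might look like the sketch below; the per-record field names (`sentence`, `type`) and directory layout are assumptions, not the actual AI Hub schema.

```python
import json
from pathlib import Path

def flatten_labeling_archive(src_dir: str, out_path: str) -> int:
    """Flatten per-file labeling JSON into one sentence-level JSON list.

    Assumes each source file holds a list of records with hypothetical
    `sentence` and `type` fields; returns the number of sentences written.
    """
    rows = []
    for fp in sorted(Path(src_dir).glob("**/*.json")):
        records = json.loads(fp.read_text(encoding="utf-8"))
        for record in records:
            rows.append({"sentence": record["sentence"], "label": record["type"]})
    Path(out_path).write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")
    return len(rows)
```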

Training split used in the final run:

  • all_train.json: 130,823 sentences

Holdout validation split used in the final run:

  • all_val.json: 17,644 sentences

Final holdout label support (sentences per label):

  • 사실형: 7,113
  • 추론형: 4,970
  • 예측형: 953
  • 대화형: 4,608

Training Procedure

Backbone:

  • jhgan/ko-sroberta-multitask

Optimization:

  • optimizer: AdamW
  • scheduler: linear warmup + linear decay
  • learning rate: 2e-5
  • weight decay: 0.01
  • warmup ratio: 0.1
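With these hyperparameters, the optimizer and scheduler setup would plausibly look like this sketch (the `build_optimizer` helper is illustrative, not the project's code):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, num_training_steps: int):
    """AdamW with linear warmup + linear decay, per the card's hyperparameters."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio 0.1
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```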

Regularization and stability:

  • max sequence length: 256
  • early stopping patience: 2
  • gradient clipping: 1.0
  • mixed precision training
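Gradient clipping interacts with mixed precision: gradients must be unscaled before the norm is clipped. A hedged sketch of one training step (the `train_step` helper and the batch layout are assumptions):

```python
import torch

def train_step(model, batch, optimizer, scaler, device_type="cpu"):
    """One step with AMP autocast, loss scaling, and grad-norm clipping at 1.0.

    Assumes `model(**batch)` returns the loss; create the scaler once with
    e.g. `scaler = torch.cuda.amp.GradScaler()` (enabled only on CUDA).
    """
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        loss = model(**batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                 # undo loss scaling
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```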

Imbalance handling:

  • class-weighted cross-entropy
  • WeightedRandomSampler
  • best checkpoint selection by macro F1
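The two imbalance mechanisms can be built from the label list alone; a minimal sketch using inverse-frequency weights (the helper name and weighting formula are assumptions, and labels are assumed to be dense integers 0..K-1):

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def build_imbalance_tools(labels):
    """Class-weighted cross-entropy loss plus a WeightedRandomSampler."""
    counts = Counter(labels)
    num_classes = len(counts)
    total = len(labels)
    # Inverse-frequency ("balanced") class weights; assumes every class appears.
    class_weights = torch.tensor(
        [total / (num_classes * counts[c]) for c in range(num_classes)],
        dtype=torch.float,
    )
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
    # Per-sample weights so minority-class sentences are drawn more often.
    sample_weights = [class_weights[y].item() for y in labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=total, replacement=True)
    return loss_fn, sampler
```

The sampler is passed to the training DataLoader in place of shuffling.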

Final holdout training run:

  • epochs: 8
  • batch size: 32
  • gradient accumulation: 1

Evaluation

5-fold cross-validation on all_train.json

  • mean accuracy: 0.8406
  • mean macro F1: 0.8397
  • mean weighted F1: 0.8410

Mean per-label F1:

  • 사실형: 0.8143
  • 추론형: 0.7319
  • 예측형: 0.8178
  • 대화형: 0.9948

Final holdout on all_val.json

  • accuracy: 0.8270
  • macro F1: 0.8153
  • weighted F1: 0.8283

Per-label holdout metrics:

Label     Precision   Recall    F1        Support
사실형    0.8399      0.7627    0.7994    7,113
추론형    0.6954      0.7740    0.7326    4,970
예측형    0.6983      0.7723    0.7334    953
대화형    0.9965      0.9946    0.9955    4,608
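Metrics of this shape can be reproduced with scikit-learn; a minimal sketch, not the project's evaluation script:

```python
from sklearn.metrics import classification_report, f1_score

def evaluate(y_true, y_pred, label_names):
    """Macro F1 plus a per-label precision/recall/F1/support table."""
    macro = f1_score(y_true, y_pred, average="macro")
    report = classification_report(y_true, y_pred, target_names=label_names, digits=4)
    return macro, report
```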

Intended Use

This model is intended for:

  • Korean sentence type classification research
  • internal annotation support
  • baseline comparison against larger backbones

This model is not intended as a general-purpose Korean sentence understanding model.

Limitations

  • This model was trained on a specific AI Hub sentence type taxonomy.
  • Performance is not uniform across labels.
  • 추론형 (inference) and 예측형 (prediction) remain harder than 대화형 (dialogue).
  • The model may not generalize cleanly outside the source domain mixture used in the AI Hub corpus.

Repository Contents

  • best.pt: trained classifier checkpoint
  • tokenizer/: tokenizer files
  • label_mapping.json: explicit label mapping
  • best_metrics.json: final reported holdout metrics
  • hf_export_config.json: export metadata
  • load_spec.json: custom loading notes

Loading Note

This repository contains a custom classification head checkpoint, not a plain AutoModelForSequenceClassification export.

To use it directly, you should load:

  • the tokenizer from tokenizer/
  • the checkpoint from best.pt
  • the model architecture from the local project code that defines the encoder, pooling, and classifier head

In other words, this repository is ready for publishing, but inference currently expects the accompanying custom code path rather than a stock Transformers sequence-classification class.
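The custom loading path can be sketched as follows. `SentenceTypeClassifier` is a stand-in for the project-local class, and the checkpoint key layout in `best.pt` is an assumption; only the backbone name, file names, and pooling scheme come from this card.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceTypeClassifier(nn.Module):
    """Stand-in for the project code: encoder + mean pooling + dropout + linear head."""
    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(self.dropout(pooled))

def load_classifier(tokenizer_dir="tokenizer/", ckpt_path="best.pt",
                    backbone="jhgan/ko-sroberta-multitask"):
    """Load the tokenizer, rebuild the architecture, and restore best.pt."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    encoder = AutoModel.from_pretrained(backbone)
    model = SentenceTypeClassifier(encoder, encoder.config.hidden_size)
    model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    model.eval()
    return tokenizer, model
```

At inference time, tokenize one sentence (truncated to 256 tokens), run the model, and map `logits.argmax(-1)` back through the label mapping.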
