Thai Job NER: Fine-tuned WangchanBERTa

Named Entity Recognition model for extracting structured HR data from informal Thai job postings (e.g., Facebook groups, Line chats). Fine-tuned from wangchanberta-base-att-spm-uncased (110M params).

Model Description

This model extracts 7 entity types from Thai job-related text:

Entity            Description           Examples
HARD_SKILL        Skills or procedures  ดูแลผู้สูงอายุ (elderly care), CPR, Python
PERSON            Names                 คุณสมชาย (Mr. Somchai), พี่แจน (P'Jan)
LOCATION          Places                สีลม (Silom), ลาดพร้าว (Lat Phrao), บางนา (Bang Na)
COMPENSATION      Pay amounts           18,000 บาท/เดือน (18,000 baht/month)
EMPLOYMENT_TERMS  Job structure         part-time, กะกลางวัน (day shift)
CONTACT           Phone, Line, email    081-234-5678, @care123
DEMOGRAPHIC       Age, gender           อายุ 25-40 (age 25-40), หญิง (female)

Usage

from transformers import pipeline

model_name = "chayuto/thai-job-ner-wangchanberta"
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

# "Hiring an elderly caregiver, Silom area, salary 18,000 baht, call 081-234-5678"
text = "รับสมัครคนดูแลผู้สูงอายุ ย่านสีลม เงินเดือน 18,000 บาท โทร 081-234-5678"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2%})")

Training

  • Base model: airesearch/wangchanberta-base-att-spm-uncased (CamemBERT architecture, 110M params)
  • Training data: 1,253 Thai job posts with synthetic silver labels from GPT-4o, fuzzy-aligned to IOB2 (dataset available on HuggingFace)
  • Hardware: Apple Silicon MPS backend, FP32
  • Hyperparameters: LR=3e-5, warmup=0.1, batch=8, grad_accum=2, 15 epochs, class-weighted loss, label smoothing=0.05
  • Training time: ~3 min 48 sec
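The class-weighted loss above counters the dominance of the O tag in NER data. The exact weighting scheme is not stated in this card; a common choice is inverse-frequency weights normalized to mean 1, sketched here with made-up tag counts:

```python
from collections import Counter

def inverse_frequency_weights(tag_counts: Counter) -> dict:
    """Weight each label by inverse frequency, normalized so weights average 1."""
    total = sum(tag_counts.values())
    raw = {tag: total / count for tag, count in tag_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {tag: w / mean for tag, w in raw.items()}

# Hypothetical tag counts: O dominates, entity tags are rare.
counts = Counter({"O": 9000, "B-CONTACT": 300, "I-CONTACT": 700})
weights = inverse_frequency_weights(counts)
# Rare tags receive larger weights, so entity errors cost more than O errors.
```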

Data Pipeline

Raw Thai text + GPT-4o entity extractions → fuzzy alignment with rapidfuzz + pythainlp TCC boundary snapping → subword token mapping via offset_mapping → IOB2-formatted HuggingFace Dataset.
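The final pipeline step (character spans → subword IOB2 tags) can be sketched without the real tokenizer. The offsets below are hard-coded to stand in for tokenizer output with return_offsets_mapping=True; any subword overlapping a gold character span gets a B-/I- tag:

```python
def spans_to_iob2(token_offsets, entity_spans):
    """Map character-level entity spans onto subword tokens as IOB2 tags.

    token_offsets: [(start, end), ...] per subword, as in offset_mapping.
    entity_spans:  [(start, end, label), ...] character spans from alignment.
    """
    tags = ["O"] * len(token_offsets)
    for ent_start, ent_end, label in entity_spans:
        began = False
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start < ent_end and tok_end > ent_start:  # span overlap test
                tags[i] = f"{'I' if began else 'B'}-{label}"
                began = True
    return tags

# Hypothetical offsets for 5 subwords; one CONTACT entity at chars 10-22.
offsets = [(0, 4), (4, 9), (10, 14), (14, 22), (23, 30)]
print(spans_to_iob2(offsets, [(10, 22, "CONTACT")]))
# -> ['O', 'O', 'B-CONTACT', 'I-CONTACT', 'O']
```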

Evaluation

Overall (Test Set, 126 examples)

Metric Score
F1 0.897
Precision 0.850
Recall 0.949

Per-Entity F1

Entity F1 Precision Recall
CONTACT 0.962 0.942 0.983
LOCATION 0.959 0.928 0.991
EMPLOYMENT_TERMS 0.926 0.870 0.990
PERSON 0.907 0.861 0.958
HARD_SKILL 0.903 0.873 0.936
DEMOGRAPHIC 0.875 0.827 0.928
COMPENSATION 0.764 0.673 0.884
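These scores are entity-level: a prediction counts only when both span and type match exactly, as seqeval computes them. A minimal stand-in for that micro precision/recall/F1 over sets of (start, end, type) spans, independent of seqeval:

```python
def entity_prf(gold_spans, pred_spans):
    """Micro precision/recall/F1 over exact-match (start, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 5, "PERSON"), (10, 22, "CONTACT"), (30, 36, "LOCATION")}
pred = {(0, 5, "PERSON"), (10, 22, "CONTACT"), (30, 35, "LOCATION")}  # off-by-one span
p, r, f1 = entity_prf(gold, pred)
# The mismatched LOCATION span counts against both precision and recall.
```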

Limitations

  • Trained on synthetic data; may underperform on real-world posts with heavy emoji usage, OCR errors, or extreme colloquialism
  • Thai-specific: limited English entity extraction capability
  • 512 token max sequence length
  • COMPENSATION has the lowest per-entity F1 (0.764); open-vocabulary types such as HARD_SKILL also have complex span boundaries
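For posts longer than the 512-token limit, a common workaround (not built into this card's pipeline call) is sliding-window chunking with overlap, then merging predictions per window. The window/stride numbers below are illustrative:

```python
def sliding_windows(n_tokens, window=510, stride=384):
    """Yield (start, end) token ranges covering a long sequence with overlap.

    window < 512 leaves room for special tokens; stride < window makes
    consecutive windows overlap, so entities straddling a boundary fall
    entirely inside at least one window.
    """
    start = 0
    while True:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride

windows = list(sliding_windows(1200))
# Three overlapping windows cover all 1200 token positions.
```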

Technical Notes

  • FP16 is broken on MPS; always use FP32 for Apple Silicon training
  • Uses offset_mapping to bypass WangchanBERTa's <_> space token misalignment in char_to_token()
  • Thai Character Cluster (TCC) boundary snapping prevents Unicode grapheme splitting during alignment
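The TCC snapping in the last note can be sketched without pythainlp: given cluster boundary indices (in practice produced by pythainlp's tcc segmentation; hard-coded here), a character index is snapped to the nearest legal boundary so an aligned span never splits a Thai character cluster:

```python
import bisect

def snap_to_boundary(index, boundaries):
    """Snap a character index to the nearest cluster boundary (ties go left)."""
    pos = bisect.bisect_left(boundaries, index)
    if pos == 0:
        return boundaries[0]
    if pos == len(boundaries):
        return boundaries[-1]
    left, right = boundaries[pos - 1], boundaries[pos]
    return left if index - left <= right - index else right

# Hypothetical TCC boundaries for a short Thai string.
boundaries = [0, 2, 3, 6, 8]
print(snap_to_boundary(5, boundaries))  # -> 6 (index 5 is closer to 6 than to 3)
```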

License

MIT
