XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation + POS Tagging

A joint model for Myanmar (Burmese) word segmentation and part-of-speech tagging, built on xlm-roberta-base with an asymmetric BiLSTM encoder and dual CRF heads.

Model Architecture

XLM-RoBERTa (768) + Position Embedding (64)
    ↓
Asymmetric BiLSTM
  Forward LSTM  : hidden=256
  Backward LSTM : hidden=512
  Concatenated  : 768
    ├── Head 1 – Word Segmentation CRF  (4 BIES labels: B, I, E, S)
    └── Head 2 – POS Tagging CRF        (68 labels: BIES × 17 UPOS tags)
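The asymmetric BiLSTM block in this diagram can be sketched in PyTorch as follows; the class name and exact wiring are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class AsymmetricBiLSTM(nn.Module):
    """Forward LSTM (hidden=256) + backward LSTM (hidden=512), concatenated to 768."""
    def __init__(self, input_dim=768 + 64, fwd_hidden=256, bwd_hidden=512):
        super().__init__()
        self.fwd = nn.LSTM(input_dim, fwd_hidden, batch_first=True)
        self.bwd = nn.LSTM(input_dim, bwd_hidden, batch_first=True)

    def forward(self, x):
        # x: (batch, seq, 832) = XLM-R hidden (768) + position embedding (64)
        out_f, _ = self.fwd(x)                        # left-to-right pass
        out_b, _ = self.bwd(torch.flip(x, dims=[1]))  # right-to-left pass
        out_b = torch.flip(out_b, dims=[1])           # re-align to original order
        return torch.cat([out_f, out_b], dim=-1)      # (batch, seq, 256 + 512 = 768)
```

The 768-dim output then feeds both CRF heads.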

Key design choices:

  • Position embedding encodes distance-from-end-of-sentence signals (dim=64)
  • Asymmetric BiLSTM: forward hidden=256, backward hidden=512 to capture stronger right-context for Myanmar syllable boundary detection
  • Dual CRF heads: separate CRF decoders for WS and POS to enforce label sequence constraints
  • Joint loss: loss = CRF_WS + CRF_POS + 0.3 * (CE_WS + CE_POS)
  • WeightedRandomSampler with 4x boost for rare POS tags (AUX, INTJ, SYM, PROPN, SCONJ, DET, X)
  • Trained with AMP (FP16 Mixed Precision) and DataParallel (2x Tesla T4)
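The 4x oversampling for rare tags might look like the following sketch; the per-sentence weighting scheme (boost any sentence containing at least one rare tag) is an assumption about the training code:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Rare tags listed above; sentences containing any of them get a 4x weight.
RARE_TAGS = {"AUX", "INTJ", "SYM", "PROPN", "SCONJ", "DET", "X"}

def make_sampler(sentence_tags):
    """sentence_tags: one set of UPOS tags per training sentence."""
    weights = [4.0 if tags & RARE_TAGS else 1.0 for tags in sentence_tags]
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```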

Dataset

| Split | Sentences | Syllables  |
|-------|-----------|------------|
| Train | 83,066    | ~2,498,960 |
| Val   | 10,383    | ~312,040   |
| Test  | 10,384    | ~313,305   |
| Total | 103,833   | 3,124,305  |

  • Split: 80/10/10 (random_state=42)
  • Input granularity: Syllable-level (word-level CoNLL-U → syllable BIES conversion)
  • POS tagset: Universal Dependencies UPOS (17 tags)
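The word-to-syllable BIES conversion can be illustrated with a short helper (assuming words are already split into syllables; a real pipeline would use a Myanmar syllable segmenter at this step):

```python
def to_bies(word_syllables):
    """Convert words (each a list of syllables) to a flat BIES label sequence."""
    labels = []
    for syls in word_syllables:
        if len(syls) == 1:
            labels.append("S")                  # single-syllable word
        else:
            labels.extend(["B"] + ["I"] * (len(syls) - 2) + ["E"])
    return labels

# A 1-syllable word followed by a 3-syllable word:
# to_bies([["ka"], ["ma", "la", "ti"]]) -> ["S", "B", "I", "E"]
```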

Labels

Word Segmentation (WS) β€” 4 labels

| Label | Meaning                      |
|-------|------------------------------|
| B     | Beginning syllable of a word |
| I     | Inside syllable of a word    |
| E     | Ending syllable of a word    |
| S     | Single-syllable word         |

POS Tags β€” 17 UPOS tags (each with B/I/E/S prefix = 68 total labels)

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
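The 68 POS labels are the four BIES prefixes crossed with the 17 UPOS tags; the exact `B-NOUN` naming convention here is an assumption about the format in pos_label2id.json:

```python
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

# Cross every UPOS tag with the B/I/E/S prefixes: 4 x 17 = 68 labels.
POS_LABELS = [f"{prefix}-{tag}" for tag in UPOS for prefix in "BIES"]
```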

Training Details

| Hyperparameter      | Value                           |
|---------------------|---------------------------------|
| Base model          | xlm-roberta-base                |
| Max sequence length | 128 tokens                      |
| Batch size          | 48 (per GPU)                    |
| Epochs              | 15 (early stopping, patience=3) |
| Optimizer           | AdamW                           |
| Gradient clipping   | max_norm=1.0                    |
| Precision           | FP16 (AMP)                      |
| Hardware            | 2x Tesla T4 (DataParallel)      |
| Training time       | ~5h 54m                         |
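A minimal AMP training step matching these settings might look like the sketch below; the model interface and the exact loss composition inside `model(**batch)` are assumptions about the training code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # no-op fallback on CPU-only machines

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # FP16 forward pass
        loss = model(**batch)              # assumed to return the joint loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```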

Evaluation Results

Test Set Performance

| Task                          | Precision | Recall | F1     |
|-------------------------------|-----------|--------|--------|
| Word Segmentation             | 0.9381    | 0.9451 | 0.9416 |
| POS Tagging                   | 0.8947    | 0.9097 | 0.9021 |
| Combined (0.4·WS + 0.6·POS)   | –         | –      | 0.9179 |
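The combined score is simply a weighted average of the two task F1 scores:

```python
ws_f1, pos_f1 = 0.9416, 0.9021
combined = 0.4 * ws_f1 + 0.6 * pos_f1  # = 0.9179
```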

Per-class POS F1 (Test Set)

| POS Tag   | Precision | Recall | F1     | Support |
|-----------|-----------|--------|--------|---------|
| ADJ       | 0.7993    | 0.8435 | 0.8208 | 4,576   |
| ADP       | 0.9474    | 0.9702 | 0.9586 | 24,631  |
| ADV       | 0.7989    | 0.8243 | 0.8114 | 3,114   |
| AUX       | 0.5986    | 0.6687 | 0.6317 | 1,775   |
| CCONJ     | 0.8528    | 0.8353 | 0.8440 | 3,600   |
| DET       | 0.9112    | 0.9067 | 0.9089 | 600     |
| INTJ      | 0.5915    | 0.8400 | 0.6942 | 50      |
| NOUN      | 0.8727    | 0.8849 | 0.8788 | 46,862  |
| NUM       | 0.9543    | 0.9685 | 0.9614 | 4,726   |
| PART      | 0.9281    | 0.9327 | 0.9304 | 42,569  |
| PRON      | 0.9444    | 0.9411 | 0.9428 | 4,227   |
| PROPN     | 0.7295    | 0.8153 | 0.7700 | 1,819   |
| PUNCT     | 0.9826    | 0.9893 | 0.9859 | 16,573  |
| SCONJ     | 0.7332    | 0.8365 | 0.7815 | 1,633   |
| SYM       | 0.8605    | 0.7551 | 0.8043 | 49      |
| VERB      | 0.8481    | 0.8649 | 0.8564 | 28,542  |
| X         | 0.5599    | 0.5470 | 0.5534 | 521     |
| micro avg | 0.8947    | 0.9097 | 0.9021 | 185,867 |
| macro avg | 0.8184    | 0.8485 | 0.8314 | 185,867 |

Repository Files

| File                | Description                                      |
|---------------------|--------------------------------------------------|
| best_model.pt       | Trained model weights (PyTorch state_dict)       |
| config.json         | Model configuration (label maps, num_labels, task) |
| ws_label2id.json    | Word segmentation label → id mapping             |
| ws_id2label.json    | Word segmentation id → label mapping             |
| pos_label2id.json   | POS label → id mapping (68 labels)               |
| pos_id2label.json   | POS id → label mapping (68 labels)               |
| model_metadata.json | Training metadata and final metrics              |

Usage

```python
import torch
import json
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Load label mappings
with open("ws_id2label.json") as f:
    ws_id2label = {int(k): v for k, v in json.load(f).items()}
with open("pos_id2label.json") as f:
    pos_id2label = {int(k): v for k, v in json.load(f).items()}

# Rebuild model (same architecture as training)
# ... (instantiate JointSegPosModel with same params)

# Load weights
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

# Inference: input is a list of Myanmar syllables
syllables = ["သ", "ာ", "း", "ထေ", "လ", "ော", "ကျ", "န်း"]
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding="max_length"
)

with torch.no_grad():
    ws_preds, pos_preds = model(encoding["input_ids"], encoding["attention_mask"])
```
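To turn predictions into words, the ids can be mapped back through ws_id2label and the syllables merged by BIES tag. This post-processing helper is a sketch, assuming the label sequence is already aligned to the syllables (special and padding tokens removed):

```python
def merge_words(syllables, ws_labels):
    """Group syllables into words using BIES word-segmentation labels."""
    words, current = [], []
    for syl, lab in zip(syllables, ws_labels):
        if lab == "S":                 # single-syllable word
            words.append(syl)
        elif lab == "B":               # start a new multi-syllable word
            current = [syl]
        elif lab == "I":               # continue the current word
            current.append(syl)
        else:                          # "E": close the current word
            current.append(syl)
            words.append("".join(current))
            current = []
    return words
```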

Citation

If you use this model, please cite:

```bibtex
@misc{sithu015_xlm_roberta_bilstm_crf_joint_2026,
  title  = {XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation and POS Tagging},
  author = {sithu015},
  year   = {2026},
  url    = {https://huggingface.co/sithu015/XLM-RoBERTa-BiLSTM-CRF-Joint}
}
```

License

Apache 2.0

Acknowledgements

  • Base model: xlm-roberta-base by Facebook AI
  • Dataset: Myanmar CoNLL-U corpus (103,833 sentences)
  • Training platform: Kaggle (2x Tesla T4 GPU)
  • Libraries: transformers, pytorch-crf, seqeval, torch