# XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation + POS Tagging

A joint model for Myanmar (Burmese) word segmentation and part-of-speech tagging, using a custom architecture built on top of xlm-roberta-base with an asymmetric BiLSTM and dual CRF heads.
## Model Architecture

```
XLM-RoBERTa (768) + Position Embedding (64)
                 ↓
        Asymmetric BiLSTM
          Forward LSTM : hidden=256
          Backward LSTM: hidden=512
          Concatenated : 768
                 ↓
├── Head 1 → Word Segmentation CRF (4 BIES labels: B, I, E, S)
└── Head 2 → POS Tagging CRF (68 labels: BIES × 17 UPOS tags)
```
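The asymmetric BiLSTM and the two emission heads above can be sketched in plain PyTorch. This is a minimal illustration, not the repository's implementation: the class and variable names are invented here, and the CRF layers (pytorch-crf in the full model) are omitted, so the heads below only produce the emission scores a CRF would consume.

```python
import torch
import torch.nn as nn

class AsymmetricBiLSTM(nn.Module):
    """Illustrative sketch: a forward LSTM (hidden=256) and a backward LSTM
    (hidden=512), run as two unidirectional LSTMs and concatenated to 768."""
    def __init__(self, input_dim=832, fwd_hidden=256, bwd_hidden=512):
        super().__init__()
        self.fwd = nn.LSTM(input_dim, fwd_hidden, batch_first=True)
        self.bwd = nn.LSTM(input_dim, bwd_hidden, batch_first=True)

    def forward(self, x):
        fwd_out, _ = self.fwd(x)                     # (B, T, 256)
        rev_out, _ = self.bwd(torch.flip(x, dims=[1]))  # run right-to-left
        bwd_out = torch.flip(rev_out, dims=[1])      # re-align with forward order
        return torch.cat([fwd_out, bwd_out], dim=-1)  # (B, T, 768)

# Dual emission heads; in the full model these feed two separate CRF layers.
ws_head = nn.Linear(768, 4)    # BIES word-segmentation emissions
pos_head = nn.Linear(768, 68)  # BIES x UPOS emissions

# Encoder output (768) + position embedding (64) = 832-dim input
x = torch.randn(2, 10, 832)
h = AsymmetricBiLSTM()(x)
ws_em, pos_em = ws_head(h), pos_head(h)
```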
Key design choices:

- Position embedding (dim=64) encodes a distance-from-end-of-sentence signal
- Asymmetric BiLSTM: forward hidden=256, backward hidden=512, to capture stronger right context for Myanmar syllable boundary detection
- Dual CRF heads: separate CRF decoders for WS and POS to enforce label-sequence constraints
- Joint loss: `loss = CRF_WS + CRF_POS + 0.3 * (CE_WS + CE_POS)`
- WeightedRandomSampler with a 4x boost for rare POS tags (AUX, INTJ, SYM, PROPN, SCONJ, DET, X)
- Trained with AMP (FP16 mixed precision) and DataParallel (2x Tesla T4)
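The rare-tag boost can be pictured as a per-sentence weight that would feed `torch.utils.data.WeightedRandomSampler`. The exact weighting scheme is not documented here, so the helper below is an assumption: a flat 4x weight for any sentence containing a rare tag.

```python
# Rare UPOS tags that receive boosted sampling weight (from the list above)
RARE_TAGS = {"AUX", "INTJ", "SYM", "PROPN", "SCONJ", "DET", "X"}

def sentence_weight(pos_tags, boost=4.0):
    """Hypothetical weighting: boost x weight if the sentence contains any
    rare tag, otherwise 1.0. These weights would feed WeightedRandomSampler."""
    return boost if any(tag in RARE_TAGS for tag in pos_tags) else 1.0

weights = [
    sentence_weight(["NOUN", "VERB", "PART"]),  # no rare tags
    sentence_weight(["PROPN", "ADP", "NOUN"]),  # contains PROPN
]
# weights -> [1.0, 4.0]
```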
## Dataset
| Split | Sentences | Syllables |
|---|---|---|
| Train | 83,066 | ~2,498,960 |
| Val | 10,383 | ~312,040 |
| Test | 10,384 | ~313,305 |
| Total | 103,833 | 3,124,305 |
- Split: 80/10/10 (random_state=42)
- Input granularity: syllable-level (word-level CoNLL-U → syllable BIES conversion)
- POS tagset: Universal Dependencies UPOS (17 tags)
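The word-level → syllable BIES conversion can be illustrated with a minimal helper. The function name and input format are assumptions; the actual preprocessing also attaches the matching BIES-prefixed POS label to each syllable.

```python
def word_to_bies(syllables_per_word):
    """Map each word (given as a list of syllables) to syllable-level
    BIES labels: S for single-syllable words, otherwise B ... I ... E."""
    labels = []
    for syls in syllables_per_word:
        if len(syls) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["I"] * (len(syls) - 2) + ["E"])
    return labels

# A 3-syllable word, a 1-syllable word, and a 2-syllable word:
labels = word_to_bies([["a", "b", "c"], ["d"], ["e", "f"]])
# labels -> ['B', 'I', 'E', 'S', 'B', 'E']
```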
## Labels

### Word Segmentation (WS): 4 labels
| Label | Meaning |
|---|---|
| B | Beginning syllable of a word |
| I | Inside syllable of a word |
| E | Ending syllable of a word |
| S | Single-syllable word |
### POS Tags: 17 UPOS tags (each with B/I/E/S prefix = 68 total labels)

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
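The 68-label POS inventory is the cross product of the four BIES prefixes and the 17 UPOS tags. The `B-NOUN`-style string format shown here is an assumption; the authoritative mapping ships in `pos_label2id.json`.

```python
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
BIES = ["B", "I", "E", "S"]

# One label per (prefix, tag) pair: 4 x 17 = 68
POS_LABELS = [f"{p}-{t}" for t in UPOS for p in BIES]
# len(POS_LABELS) -> 68
```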
## Training Details
| Hyperparameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Max sequence length | 128 tokens |
| Batch size | 48 (per GPU) |
| Epochs | 15 (early stopping patience=3) |
| Optimizer | AdamW |
| Gradient clipping | max_norm=1.0 |
| Precision | FP16 (AMP) |
| Hardware | 2x Tesla T4 (DataParallel) |
| Training time | ~5h 54m |
## Evaluation Results

### Test Set Performance
| Task | Precision | Recall | F1 |
|---|---|---|---|
| Word Segmentation | 0.9381 | 0.9451 | 0.9416 |
| POS Tagging | 0.8947 | 0.9097 | 0.9021 |
| Combined (0.4WS + 0.6POS) | β | β | 0.9179 |
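The combined score in the last row is a weighted average of the two task F1 scores:

```python
ws_f1, pos_f1 = 0.9416, 0.9021

# Combined score = 0.4 * WS F1 + 0.6 * POS F1
combined = 0.4 * ws_f1 + 0.6 * pos_f1
# round(combined, 4) -> 0.9179
```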
### Per-class POS F1 (Test Set)
| POS Tag | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ADJ | 0.7993 | 0.8435 | 0.8208 | 4,576 |
| ADP | 0.9474 | 0.9702 | 0.9586 | 24,631 |
| ADV | 0.7989 | 0.8243 | 0.8114 | 3,114 |
| AUX | 0.5986 | 0.6687 | 0.6317 | 1,775 |
| CCONJ | 0.8528 | 0.8353 | 0.8440 | 3,600 |
| DET | 0.9112 | 0.9067 | 0.9089 | 600 |
| INTJ | 0.5915 | 0.8400 | 0.6942 | 50 |
| NOUN | 0.8727 | 0.8849 | 0.8788 | 46,862 |
| NUM | 0.9543 | 0.9685 | 0.9614 | 4,726 |
| PART | 0.9281 | 0.9327 | 0.9304 | 42,569 |
| PRON | 0.9444 | 0.9411 | 0.9428 | 4,227 |
| PROPN | 0.7295 | 0.8153 | 0.7700 | 1,819 |
| PUNCT | 0.9826 | 0.9893 | 0.9859 | 16,573 |
| SCONJ | 0.7332 | 0.8365 | 0.7815 | 1,633 |
| SYM | 0.8605 | 0.7551 | 0.8043 | 49 |
| VERB | 0.8481 | 0.8649 | 0.8564 | 28,542 |
| X | 0.5599 | 0.5470 | 0.5534 | 521 |
| micro avg | 0.8947 | 0.9097 | 0.9021 | 185,867 |
| macro avg | 0.8184 | 0.8485 | 0.8314 | 185,867 |
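The macro average in the last row is the unweighted mean of the 17 per-class F1 scores (unlike the micro average, which is support-weighted):

```python
per_class_f1 = [0.8208, 0.9586, 0.8114, 0.6317, 0.8440, 0.9089, 0.6942,
                0.8788, 0.9614, 0.9304, 0.9428, 0.7700, 0.9859, 0.7815,
                0.8043, 0.8564, 0.5534]

# Macro F1 = simple mean over the 17 classes
macro_f1 = sum(per_class_f1) / len(per_class_f1)
# round(macro_f1, 4) -> 0.8314
```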
## Repository Files

| File | Description |
|---|---|
| `best_model.pt` | Trained model weights (PyTorch state_dict) |
| `config.json` | Model configuration (label maps, num_labels, task) |
| `ws_label2id.json` | Word segmentation label → id mapping |
| `ws_id2label.json` | Word segmentation id → label mapping |
| `pos_label2id.json` | POS label → id mapping (68 labels) |
| `pos_id2label.json` | POS id → label mapping (68 labels) |
| `model_metadata.json` | Training metadata and final metrics |
## Usage

```python
import torch
import json
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Load label mappings
with open("ws_id2label.json") as f:
    ws_id2label = {int(k): v for k, v in json.load(f).items()}
with open("pos_id2label.json") as f:
    pos_id2label = {int(k): v for k, v in json.load(f).items()}

# Rebuild model (same architecture as training)
# ... (instantiate JointSegPosModel with same params)

# Load weights
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

# Inference: input is a list of Myanmar syllables
syllables = [...]  # a sentence split into Myanmar syllable strings
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding="max_length",
)
with torch.no_grad():
    # Predicted label ids; map them back via ws_id2label / pos_id2label
    ws_preds, pos_preds = model(encoding["input_ids"], encoding["attention_mask"])
```
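The decoded BIES word-segmentation labels can be turned back into words by closing a word at every `E` or `S`. This is a minimal helper; the name `bies_to_words` is illustrative, not part of the repository.

```python
def bies_to_words(syllables, ws_labels):
    """Group syllables into words using BIES word-segmentation labels."""
    words, current = [], []
    for syl, lab in zip(syllables, ws_labels):
        current.append(syl)
        if lab in ("E", "S"):  # a word boundary closes here
            words.append("".join(current))
            current = []
    if current:  # tolerate a dangling B/I at sequence end
        words.append("".join(current))
    return words

words = bies_to_words(["a", "b", "c", "d"], ["B", "E", "S", "S"])
# words -> ['ab', 'c', 'd']
```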
## Citation

If you use this model, please cite:

```bibtex
@misc{sithu015_xlm_roberta_bilstm_crf_joint_2026,
  title  = {XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation and POS Tagging},
  author = {sithu015},
  year   = {2026},
  url    = {https://huggingface.co/sithu015/XLM-RoBERTa-BiLSTM-CRF-Joint}
}
```
## License
Apache 2.0
## Acknowledgements

- Base model: xlm-roberta-base by Facebook AI
- Dataset: Myanmar CoNLL-U corpus (103,833 sentences)
- Training platform: Kaggle (2x Tesla T4 GPUs)
- Libraries: `transformers`, `pytorch-crf`, `seqeval`, `torch`