# XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation + POS Tagging

A joint model for Myanmar (Burmese) word segmentation and part-of-speech tagging, using a custom architecture built on top of xlm-roberta-base with an asymmetric BiLSTM and dual CRF heads.
## Model Architecture

```
XLM-RoBERTa (768) + Position Embedding (64)
                 ↓
        Asymmetric BiLSTM
          Forward LSTM : hidden=256
          Backward LSTM: hidden=512
          Concatenated : 768
                 ↓
├── Head 1 → Word Segmentation CRF (4 BIES labels: B, I, E, S)
└── Head 2 → POS Tagging CRF (68 labels: BIES × 17 UPOS tags)
```
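The asymmetric BiLSTM and the two emission heads above can be sketched in plain PyTorch. This is a minimal illustration, not the repository's implementation: the class and variable names are invented here, and the CRF layers (pytorch-crf in the full model) are omitted, so the heads below only produce the emission scores a CRF would consume.

```python
import torch
import torch.nn as nn

class AsymmetricBiLSTM(nn.Module):
    """Illustrative sketch: a forward LSTM (hidden=256) and a backward LSTM
    (hidden=512), run as two unidirectional LSTMs and concatenated to 768."""
    def __init__(self, input_dim=832, fwd_hidden=256, bwd_hidden=512):
        super().__init__()
        self.fwd = nn.LSTM(input_dim, fwd_hidden, batch_first=True)
        self.bwd = nn.LSTM(input_dim, bwd_hidden, batch_first=True)

    def forward(self, x):
        fwd_out, _ = self.fwd(x)                     # (B, T, 256)
        rev_out, _ = self.bwd(torch.flip(x, dims=[1]))  # run right-to-left
        bwd_out = torch.flip(rev_out, dims=[1])      # re-align with forward order
        return torch.cat([fwd_out, bwd_out], dim=-1)  # (B, T, 768)

# Dual emission heads; in the full model these feed two separate CRF layers.
ws_head = nn.Linear(768, 4)    # BIES word-segmentation emissions
pos_head = nn.Linear(768, 68)  # BIES x UPOS emissions

# Encoder output (768) + position embedding (64) = 832-dim input
x = torch.randn(2, 10, 832)
h = AsymmetricBiLSTM()(x)
ws_em, pos_em = ws_head(h), pos_head(h)
```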
Key design choices:

- Position embedding (dim=64) encodes a distance-from-end-of-sentence signal
- Asymmetric BiLSTM: forward hidden=256, backward hidden=512, to capture stronger right context for Myanmar syllable boundary detection
- Dual CRF heads: separate CRF decoders for WS and POS to enforce label-sequence constraints
- Joint loss: `loss = CRF_WS + CRF_POS + 0.3 * (CE_WS + CE_POS)`
- WeightedRandomSampler with a 4x boost for rare POS tags (AUX, INTJ, SYM, PROPN, SCONJ, DET, X)
- Trained with AMP (FP16 mixed precision) and DataParallel (2x Tesla T4)
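The rare-tag boost can be pictured as a per-sentence weight that would feed `torch.utils.data.WeightedRandomSampler`. The exact weighting scheme is not documented here, so the helper below is an assumption: a flat 4x weight for any sentence containing a rare tag.

```python
# Rare UPOS tags that receive boosted sampling weight (from the list above)
RARE_TAGS = {"AUX", "INTJ", "SYM", "PROPN", "SCONJ", "DET", "X"}

def sentence_weight(pos_tags, boost=4.0):
    """Hypothetical weighting: boost x weight if the sentence contains any
    rare tag, otherwise 1.0. These weights would feed WeightedRandomSampler."""
    return boost if any(tag in RARE_TAGS for tag in pos_tags) else 1.0

weights = [
    sentence_weight(["NOUN", "VERB", "PART"]),  # no rare tags
    sentence_weight(["PROPN", "ADP", "NOUN"]),  # contains PROPN
]
# weights -> [1.0, 4.0]
```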
## Dataset
| Split | Sentences | Syllables |
|---|---|---|
| Train | 83,066 | ~2,498,960 |
| Val | 10,383 | ~312,040 |
| Test | 10,384 | ~313,305 |
| Total | 103,833 | 3,124,305 |
- Split: 80/10/10 (random_state=42)
- Input granularity: syllable-level (word-level CoNLL-U → syllable BIES conversion)
- POS tagset: Universal Dependencies UPOS (17 tags)
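The word-level → syllable BIES conversion can be illustrated with a minimal helper. The function name and input format are assumptions; the actual preprocessing also attaches the matching BIES-prefixed POS label to each syllable.

```python
def word_to_bies(syllables_per_word):
    """Map each word (given as a list of syllables) to syllable-level
    BIES labels: S for single-syllable words, otherwise B ... I ... E."""
    labels = []
    for syls in syllables_per_word:
        if len(syls) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["I"] * (len(syls) - 2) + ["E"])
    return labels

# A 3-syllable word, a 1-syllable word, and a 2-syllable word:
labels = word_to_bies([["a", "b", "c"], ["d"], ["e", "f"]])
# labels -> ['B', 'I', 'E', 'S', 'B', 'E']
```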
## Labels

### Word Segmentation (WS): 4 labels
| Label | Meaning |
|---|---|
| B | Beginning syllable of a word |
| I | Inside syllable of a word |
| E | Ending syllable of a word |
| S | Single-syllable word |
### POS Tags: 17 UPOS tags (each with B/I/E/S prefix = 68 total labels)

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
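The 68-label POS inventory is the cross product of the four BIES prefixes and the 17 UPOS tags. The `B-NOUN`-style string format shown here is an assumption; the authoritative mapping ships in `pos_label2id.json`.

```python
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
BIES = ["B", "I", "E", "S"]

# One label per (prefix, tag) pair: 4 x 17 = 68
POS_LABELS = [f"{p}-{t}" for t in UPOS for p in BIES]
# len(POS_LABELS) -> 68
```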
## Training Details
| Hyperparameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Max sequence length | 128 tokens |
| Batch size | 48 (per GPU) |
| Epochs | 15 (early stopping patience=3) |
| Optimizer | AdamW |
| Gradient clipping | max_norm=1.0 |
| Precision | FP16 (AMP) |
| Hardware | 2x Tesla T4 (DataParallel) |
| Training time | ~5h 54m |
## Evaluation Results

### Test Set Performance
| Task | Precision | Recall | F1 |
|---|---|---|---|
| Word Segmentation | 0.9381 | 0.9451 | 0.9416 |
| POS Tagging | 0.8947 | 0.9097 | 0.9021 |
| Combined (0.4WS + 0.6POS) | β | β | 0.9179 |
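The combined score in the last row is a weighted average of the two task F1 scores:

```python
ws_f1, pos_f1 = 0.9416, 0.9021

# Combined score = 0.4 * WS F1 + 0.6 * POS F1
combined = 0.4 * ws_f1 + 0.6 * pos_f1
# round(combined, 4) -> 0.9179
```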
### Per-class POS F1 (Test Set)
| POS Tag | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ADJ | 0.7993 | 0.8435 | 0.8208 | 4,576 |
| ADP | 0.9474 | 0.9702 | 0.9586 | 24,631 |
| ADV | 0.7989 | 0.8243 | 0.8114 | 3,114 |
| AUX | 0.5986 | 0.6687 | 0.6317 | 1,775 |
| CCONJ | 0.8528 | 0.8353 | 0.8440 | 3,600 |
| DET | 0.9112 | 0.9067 | 0.9089 | 600 |
| INTJ | 0.5915 | 0.8400 | 0.6942 | 50 |
| NOUN | 0.8727 | 0.8849 | 0.8788 | 46,862 |
| NUM | 0.9543 | 0.9685 | 0.9614 | 4,726 |
| PART | 0.9281 | 0.9327 | 0.9304 | 42,569 |
| PRON | 0.9444 | 0.9411 | 0.9428 | 4,227 |
| PROPN | 0.7295 | 0.8153 | 0.7700 | 1,819 |
| PUNCT | 0.9826 | 0.9893 | 0.9859 | 16,573 |
| SCONJ | 0.7332 | 0.8365 | 0.7815 | 1,633 |
| SYM | 0.8605 | 0.7551 | 0.8043 | 49 |
| VERB | 0.8481 | 0.8649 | 0.8564 | 28,542 |
| X | 0.5599 | 0.5470 | 0.5534 | 521 |
| micro avg | 0.8947 | 0.9097 | 0.9021 | 185,867 |
| macro avg | 0.8184 | 0.8485 | 0.8314 | 185,867 |
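The macro average in the last row is the unweighted mean of the 17 per-class F1 scores (unlike the micro average, which is support-weighted):

```python
per_class_f1 = [0.8208, 0.9586, 0.8114, 0.6317, 0.8440, 0.9089, 0.6942,
                0.8788, 0.9614, 0.9304, 0.9428, 0.7700, 0.9859, 0.7815,
                0.8043, 0.8564, 0.5534]

# Macro F1 = simple mean over the 17 classes
macro_f1 = sum(per_class_f1) / len(per_class_f1)
# round(macro_f1, 4) -> 0.8314
```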
## Repository Files

| File | Description |
|---|---|
| `best_model.pt` | Trained model weights (PyTorch state_dict) |
| `config.json` | Model configuration (label maps, num_labels, task) |
| `ws_label2id.json` | Word segmentation label → id mapping |
| `ws_id2label.json` | Word segmentation id → label mapping |
| `pos_label2id.json` | POS label → id mapping (68 labels) |
| `pos_id2label.json` | POS id → label mapping (68 labels) |
| `model_metadata.json` | Training metadata and final metrics |
## Usage

```python
import torch
import json
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Load label mappings
with open("ws_id2label.json") as f:
    ws_id2label = {int(k): v for k, v in json.load(f).items()}
with open("pos_id2label.json") as f:
    pos_id2label = {int(k): v for k, v in json.load(f).items()}

# Rebuild model (same architecture as training)
# ... (instantiate JointSegPosModel with same params)

# Load weights
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

# Inference: input is a list of Myanmar syllables
syllables = [...]  # a sentence split into Myanmar syllable strings
encoding = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding="max_length",
)
with torch.no_grad():
    # Predicted label ids; map them back via ws_id2label / pos_id2label
    ws_preds, pos_preds = model(encoding["input_ids"], encoding["attention_mask"])
```
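The decoded BIES word-segmentation labels can be turned back into words by closing a word at every `E` or `S`. This is a minimal helper; the name `bies_to_words` is illustrative, not part of the repository.

```python
def bies_to_words(syllables, ws_labels):
    """Group syllables into words using BIES word-segmentation labels."""
    words, current = [], []
    for syl, lab in zip(syllables, ws_labels):
        current.append(syl)
        if lab in ("E", "S"):  # a word boundary closes here
            words.append("".join(current))
            current = []
    if current:  # tolerate a dangling B/I at sequence end
        words.append("".join(current))
    return words

words = bies_to_words(["a", "b", "c", "d"], ["B", "E", "S", "S"])
# words -> ['ab', 'c', 'd']
```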
## Citation

If you use this model, please cite:

```bibtex
@misc{sithu015_xlm_roberta_bilstm_crf_joint_2026,
  title  = {XLM-RoBERTa-BiLSTM-CRF-Joint: Myanmar Word Segmentation and POS Tagging},
  author = {sithu015},
  year   = {2026},
  url    = {https://huggingface.co/sithu015/XLM-RoBERTa-BiLSTM-CRF-Joint}
}
```
## License
Apache 2.0
## Acknowledgements

- Base model: xlm-roberta-base by Facebook AI
- Dataset: Myanmar CoNLL-U corpus (103,833 sentences)
- Training platform: Kaggle (2x Tesla T4 GPUs)
- Libraries: `transformers`, `pytorch-crf`, `seqeval`, `torch`