---
language:
- oc
- multilingual
license: apache-2.0
base_model: bert-base-multilingual-cased
tags:
- masked-lm
- occitan
- multilingual
- domain-adaptation
datasets:
- custom
metrics:
- perplexity
model-index:
- name: mBERT-Occitan
  results:
  - task:
      type: masked-language-modeling
    dataset:
      name: Occitan Corpus
      type: custom
    metrics:
    - type: perplexity
      value: 9.52
---
# mBERT-Occitan

A multilingual BERT model fine-tuned for Medieval Occitan using a hybrid tokenization approach (mBERT + BPE).
## Model Description
This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:
- The original mBERT tokenizer vocabulary
- Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
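The model card does not publish the code used to learn these subword units, but the standard BPE procedure they come from can be sketched in plain Python: repeatedly count adjacent symbol pairs across a frequency-weighted corpus and merge the most frequent pair. The toy Occitan word counts and merge count below are invented for illustration only.

```python
from collections import Counter

def most_frequent_pair(word_freqs):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, word_freqs):
    # Replace every occurrence of the pair with its concatenation.
    spaced, joined = " ".join(pair), "".join(pair)
    return {word.replace(spaced, joined): freq for word, freq in word_freqs.items()}

def learn_bpe(word_freqs, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(word_freqs)
        word_freqs = merge_pair(pair, word_freqs)
        merges.append(pair)
    return merges

# Toy Occitan word counts, pre-split into characters (end-of-word marker omitted).
corpus = {"l o": 10, "t e m p s": 5, "b è l": 8, "u è i": 6}
merges = learn_bpe(corpus, 2)
```

Each learned merge corresponds to one new subword unit; the 419 Occitan-specific units were obtained the same way on the real corpus.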
## Training Details
- Base Model: bert-base-multilingual-cased
- Training Objective: Masked Language Modeling (MLM)
- MLM Probability: 15%
- Epochs: 10
- Batch Size: 32
- Learning Rate: 5e-5
- Max Sequence Length: 512
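The hyperparameters above map directly onto a standard `transformers` MLM fine-tuning setup. The following is a hedged configuration sketch, not the authors' actual script: the training texts, `output_dir` path, and dataset construction are placeholders.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Placeholder corpus; truncation enforces the 512-token maximum sequence length.
texts = ["Lo temps es bèl uèi."]
enc = tokenizer(texts, truncation=True, max_length=512)
train_dataset = [{k: v[i] for k, v in enc.items()} for i in range(len(texts))]

# 15% of tokens are masked for the MLM objective, as listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="mbert-occitan",          # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```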
## Performance
- Perplexity on Occitan validation set: 9.52
- Improvement over original mBERT: 98.99% reduction in perplexity (from 942.85 to 9.52)
- Improvement over standard mBERT fine-tuning: 8.8% lower perplexity (9.52 vs. 10.44)
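For reference, perplexity under the MLM objective is the exponential of the mean masked-token cross-entropy loss, and the percentage figures above follow directly from the reported values:

```python
import math

def perplexity(mean_ce_loss):
    # Perplexity is exp(mean cross-entropy loss) over the masked positions.
    return math.exp(mean_ce_loss)

def reduction(baseline, improved):
    # Relative reduction between a baseline and an improved perplexity.
    return (baseline - improved) / baseline

vs_mbert = reduction(942.85, 9.52)    # ≈ 0.9899, the 98.99% figure above
vs_finetune = reduction(10.44, 9.52)  # ≈ 0.088, the 8.8% figure above
```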
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Example usage
text = "Lo temps es bèl uèi."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
## Tokenization
The hybrid tokenizer combines:
- Original mBERT vocabulary (119,547 tokens)
- Additional Occitan-specific BPE subword units (419 tokens)
- Total vocabulary size: 119,966 tokens
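A minimal sketch of how such a hybrid vocabulary can be assembled: only BPE units not already in the base vocabulary are appended, receiving fresh ids after the existing ones. The token strings below are stand-ins, not the real vocabularies; with `transformers` this step is typically done via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
def extend_vocab(base_vocab, bpe_units):
    # Append only units absent from the base vocabulary,
    # assigning them ids after the existing ones.
    vocab = dict(base_vocab)
    for unit in bpe_units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab

base = {f"tok{i}": i for i in range(119_547)}      # stand-in for mBERT's vocabulary
occitan_units = [f"##oc{i}" for i in range(419)]   # stand-in for the learned BPE units
hybrid = extend_vocab(base, occitan_units)
```

The resulting size matches the 119,966-token total given above (119,547 + 419).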
## Limitations
- The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT
- Training was done on a limited Occitan corpus
- The model keeps the original mBERT architecture (12 layers, hidden size 768)
## License
Apache 2.0 (same as the base mBERT model)