---
language:
  - oc
  - multilingual
license: apache-2.0
base_model: bert-base-multilingual-cased
tags:
  - masked-lm
  - occitan
  - multilingual
  - domain-adaptation
datasets:
  - custom
metrics:
  - perplexity
model-index:
  - name: mBERT-Occitan
    results:
      - task:
          type: masked-language-modeling
        dataset:
          name: Occitan Corpus
          type: custom
        metrics:
          - type: perplexity
            value: 9.52
---

# mBERT-Occitan

A fine-tuned multilingual BERT model adapted to Medieval Occitan using a hybrid tokenization approach (mBERT + BPE).

## Model Description

This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:

- The original mBERT tokenizer vocabulary
- Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
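The extension script itself is not published with this card, but the core idea of the hybrid vocabulary can be sketched with toy stand-ins: the Occitan BPE units are appended *after* the base vocabulary, so every original mBERT token keeps its ID (in a real `transformers` workflow this would be followed by `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`).

```python
# Sketch of the hybrid-vocabulary idea with toy stand-ins.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "lo": 2, "temps": 3}  # stand-in for mBERT's 119,547 entries
occitan_units = ["##èl", "uèi", "##çon"]  # stand-in for the 419 Occitan BPE units

hybrid = dict(base_vocab)
for unit in occitan_units:
    if unit not in hybrid:
        # New IDs start where the base vocabulary ends, so existing
        # mBERT token IDs (and their embeddings) are left untouched.
        hybrid[unit] = len(hybrid)
```

Appending rather than interleaving is what lets the fine-tuned model reuse the pretrained embedding rows for all original tokens while learning fresh embeddings only for the new Occitan units.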

## Training Details

- **Base Model:** bert-base-multilingual-cased
- **Training Objective:** Masked Language Modeling (MLM)
- **MLM Probability:** 15%
- **Epochs:** 10
- **Batch Size:** 32
- **Learning Rate:** 5e-5
- **Max Sequence Length:** 512
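The training script is not included in the repository; a minimal `Trainer` setup reproducing the hyperparameters listed above might look like the following sketch (the dataset loading and `train_dataset` are placeholders, not the actual corpus):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# 15% of input tokens are masked, matching the MLM probability above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mbert-occitan",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

# train_dataset: the tokenized Occitan corpus, truncated/padded to max_length=512
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset)
# trainer.train()
```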

## Performance

- **Perplexity on the Occitan validation set:** 9.52
- **Improvement over original mBERT:** 98.99% reduction in perplexity (from 942.85 to 9.52)
- **Improvement over standard fine-tuning:** 8.8% lower perplexity than mBERT fine-tuned without the hybrid tokenizer (9.52 vs. 10.44)
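Perplexity here is the exponential of the mean masked-token cross-entropy loss, so the reported improvements can be checked directly (a sketch of the arithmetic only; the evaluation loop is not shown):

```python
import math

# Perplexity = exp(mean cross-entropy loss per predicted token)
def perplexity(mean_loss: float) -> float:
    return math.exp(mean_loss)

# Per-token losses implied by the reported perplexities
loss_adapted = math.log(9.52)     # hybrid-tokenizer model, ~2.25 nats
loss_original = math.log(942.85)  # off-the-shelf mBERT, ~6.85 nats

reduction = 1 - 9.52 / 942.85   # ~0.9899, the 98.99% figure above
improvement = 1 - 9.52 / 10.44  # ~0.088, the 8.8% figure above
```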

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Example usage
text = "Lo temps es bèl uèi."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
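To turn `outputs.logits` into predictions for a `[MASK]` position, one would take the top-scoring vocabulary IDs at that position (e.g. from `outputs.logits[0, mask_index].tolist()`) and map them back with `tokenizer.convert_ids_to_tokens`. The ranking step is just a sort by score, sketched here on a toy score vector rather than real model output:

```python
# Return the indices of the k highest scores, best first.
def top_k_ids(scores: list[float], k: int = 5) -> list[int]:
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy scores standing in for one [MASK] position over a 10-entry vocabulary.
toy_scores = [0.1, 2.0, -1.0, 3.5, 0.0, 1.2, -0.5, 0.3, 2.7, -2.0]
top_k_ids(toy_scores, 3)  # ids of the three largest scores: [3, 8, 1]
```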

## Tokenization

The hybrid tokenizer combines:

- Original mBERT vocabulary (119,547 tokens)
- Additional Occitan-specific BPE subword units (419 tokens)
- Total vocabulary size: 119,966 tokens

## Limitations

- The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT.
- Training was done on a limited Occitan corpus.
- The model keeps the original mBERT architecture (12 layers, hidden size 768).

## License

Apache 2.0 (same as the base mBERT model)