---
language:
  - oc
  - multilingual
license: apache-2.0
base_model: bert-base-multilingual-cased
tags:
  - masked-lm
  - occitan
  - multilingual
  - domain-adaptation
datasets:
  - custom
metrics:
  - perplexity
model-index:
  - name: mBERT-Occitan
    results:
      - task:
          type: masked-language-modeling
        dataset:
          name: Occitan Corpus
          type: custom
        metrics:
          - type: perplexity
            value: 9.52
---

# mBERT-Occitan

A fine-tuned multilingual BERT model adapted to Medieval Occitan using a hybrid tokenization approach (mBERT + BPE).

## Model Description

This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:

- The original mBERT tokenizer vocabulary
- Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
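The extension script itself is not published with this card, but the core idea of the hybrid vocabulary can be sketched with toy stand-ins: the Occitan BPE units are appended *after* the base vocabulary, so every original mBERT token keeps its ID (in a real `transformers` workflow this would be followed by `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`).

```python
# Sketch of the hybrid-vocabulary idea with toy stand-ins.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "lo": 2, "temps": 3}  # stand-in for mBERT's 119,547 entries
occitan_units = ["##èl", "uèi", "##çon"]  # stand-in for the 419 Occitan BPE units

hybrid = dict(base_vocab)
for unit in occitan_units:
    if unit not in hybrid:
        # New IDs start where the base vocabulary ends, so existing
        # mBERT token IDs (and their embeddings) are left untouched.
        hybrid[unit] = len(hybrid)
```

Appending rather than interleaving is what lets the fine-tuned model reuse the pretrained embedding rows for all original tokens while learning fresh embeddings only for the new Occitan units.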

## Training Details

- **Base Model:** bert-base-multilingual-cased
- **Training Objective:** Masked Language Modeling (MLM)
- **MLM Probability:** 15%
- **Epochs:** 10
- **Batch Size:** 32
- **Learning Rate:** 5e-5
- **Max Sequence Length:** 512
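The training script is not included in the repository; a minimal `Trainer` setup reproducing the hyperparameters listed above might look like the following sketch (the dataset loading and `train_dataset` are placeholders, not the actual corpus):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# 15% of input tokens are masked, matching the MLM probability above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mbert-occitan",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

# train_dataset: the tokenized Occitan corpus, truncated/padded to max_length=512
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset)
# trainer.train()
```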

## Performance

- **Perplexity on the Occitan validation set:** 9.52
- **Improvement over original mBERT:** 98.99% reduction in perplexity (from 942.85 to 9.52)
- **Improvement over standard fine-tuning:** 8.8% lower perplexity than mBERT fine-tuned without the hybrid tokenizer (9.52 vs. 10.44)
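Perplexity here is the exponential of the mean masked-token cross-entropy loss, so the reported improvements can be checked directly (a sketch of the arithmetic only; the evaluation loop is not shown):

```python
import math

# Perplexity = exp(mean cross-entropy loss per predicted token)
def perplexity(mean_loss: float) -> float:
    return math.exp(mean_loss)

# Per-token losses implied by the reported perplexities
loss_adapted = math.log(9.52)     # hybrid-tokenizer model, ~2.25 nats
loss_original = math.log(942.85)  # off-the-shelf mBERT, ~6.85 nats

reduction = 1 - 9.52 / 942.85   # ~0.9899, the 98.99% figure above
improvement = 1 - 9.52 / 10.44  # ~0.088, the 8.8% figure above
```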

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Example usage
text = "Lo temps es bèl uèi."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
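To turn `outputs.logits` into predictions for a `[MASK]` position, one would take the top-scoring vocabulary IDs at that position (e.g. from `outputs.logits[0, mask_index].tolist()`) and map them back with `tokenizer.convert_ids_to_tokens`. The ranking step is just a sort by score, sketched here on a toy score vector rather than real model output:

```python
# Return the indices of the k highest scores, best first.
def top_k_ids(scores: list[float], k: int = 5) -> list[int]:
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy scores standing in for one [MASK] position over a 10-entry vocabulary.
toy_scores = [0.1, 2.0, -1.0, 3.5, 0.0, 1.2, -0.5, 0.3, 2.7, -2.0]
top_k_ids(toy_scores, 3)  # ids of the three largest scores: [3, 8, 1]
```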

## Tokenization

The hybrid tokenizer combines:

- Original mBERT vocabulary (119,547 tokens)
- Additional Occitan-specific BPE subword units (419 tokens)
- Total vocabulary size: 119,966 tokens

## Limitations

- The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT.
- Training was done on a limited Occitan corpus.
- The model keeps the original mBERT architecture (12 layers, hidden size 768).

## License

Apache 2.0 (same as the base mBERT model)