BERT Finance Term Extractor (English)

A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.


🧠 Model Description

This model is fine-tuned from google-bert/bert-base-cased for domain-specific terminology extraction.

It performs token-level classification (NER-style) to identify financial terms in text. The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.


πŸ—οΈ Training Pipeline

The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.

Data Processing

  • Input format: CoNLL-style token-tag sequences
  • Sentences are split by blank lines
  • Labels are converted into integer IDs (label2id, id2label)
  • Automatic train/dev split using configurable ratio (dev_ratio=0.1)
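The loading steps above can be sketched as follows. This is a minimal illustration, not the actual training script: the two-column token/tag layout and the `B-TERM`/`O` tag names are assumptions.

```python
import random

def read_conll(text):
    """Split blank-line-separated sentences into (tokens, tags) pairs
    from a CoNLL-style two-column format (assumed layout)."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        tok, tag = line.split()
        tokens.append(tok)
        tags.append(tag)
    if tokens:                            # flush the last sentence
        sentences.append((tokens, tags))
    return sentences

def build_label_maps(sentences):
    """Convert the tag inventory into label2id / id2label mappings."""
    labels = sorted({t for _, tags in sentences for t in tags})
    label2id = {l: i for i, l in enumerate(labels)}
    id2label = {i: l for l, i in label2id.items()}
    return label2id, id2label

def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle and hold out a dev fraction (configurable ratio)."""
    data = sentences[:]
    random.Random(seed).shuffle(data)
    n_dev = max(1, int(len(data) * dev_ratio))
    return data[n_dev:], data[:n_dev]
```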

Tokenization & Label Alignment

  • Tokenizer: BertTokenizerFast
  • Tokenization uses is_split_into_words=True
  • Word-piece alignment handled via word_ids()
  • Special tokens assigned label -100 (ignored in loss)
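The alignment step can be sketched without the tokenizer itself by operating on the `word_ids()` output. Note that the card only states that special tokens receive `-100`; assigning `-100` to continuation word pieces as well (so only the first piece of each word is labeled) is one common convention, assumed here.

```python
def align_labels_with_tokens(word_ids, word_labels, label2id):
    """Align word-level labels to word pieces.

    word_ids: output of BatchEncoding.word_ids() -- None for special
    tokens, otherwise the index of the source word for each piece.
    Special tokens get -100 so the cross-entropy loss ignores them;
    continuation pieces also get -100 here (an assumed convention).
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                        # [CLS], [SEP], padding
        elif wid != previous:
            aligned.append(label2id[word_labels[wid]])  # first piece of a word
        else:
            aligned.append(-100)                        # continuation piece
        previous = wid
    return aligned
```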

⚙️ Training Details

  • Base model: google-bert/bert-base-cased
  • Task: Token Classification (NER-style)
  • Framework: Hugging Face Trainer

Training Arguments

  • learning_rate: 2e-5
  • batch_size: 16
  • num_train_epochs: 5
  • max_seq_length: 256
  • weight_decay: 0.01

Training Strategy

  • Evaluation: per epoch
  • Checkpoint saving: per epoch
  • Best model selection:
    • metric: F1 score
    • load_best_model_at_end=True
  • Logging:
    • TensorBoard enabled
    • logging every 10 steps

Hardware Optimization

  • Optional fp16 mixed precision
  • Multi-worker dataloading
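The hyperparameters and strategy above map onto a `TrainingArguments` configuration roughly like this. A sketch, not the exact training script: `output_dir`, the worker count, and version-dependent argument names (e.g. `evaluation_strategy`, renamed `eval_strategy` in newer transformers releases) are assumptions.

```python
from transformers import TrainingArguments

# max_seq_length=256 is applied at tokenization time, not here.
args = TrainingArguments(
    output_dir="finance-term-extractor",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",     # evaluate once per epoch
    save_strategy="epoch",           # checkpoint once per epoch
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best checkpoint selected by F1
    logging_steps=10,
    report_to="tensorboard",
    fp16=True,                       # optional mixed precision
    dataloader_num_workers=4,        # multi-worker dataloading (assumed count)
)
```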

📊 Evaluation

Evaluation is performed using the seqeval library.

Metrics:

  • F1 score (primary metric)
  • Full classification report (printed during training)

Example:

precision    recall  f1-score   support
...
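Before seqeval can score predictions, the `-100` positions have to be stripped and label ids mapped back to tag strings. A sketch of that post-processing step (the shapes and `id2label` contents are illustrative); the resulting nested lists are what `seqeval.metrics.f1_score` and `classification_report` expect:

```python
import numpy as np

def postprocess(logits, labels, id2label):
    """Drop ignored (-100) positions and map ids back to tag strings,
    producing the list-of-lists format seqeval's metrics consume."""
    preds = np.argmax(logits, axis=-1)
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(preds, labels):
        t_row, p_row = [], []
        for p, l in zip(pred_row, label_row):
            if l == -100:                 # special/ignored position
                continue
            t_row.append(id2label[l])
            p_row.append(id2label[int(p)])
        true_tags.append(t_row)
        pred_tags.append(p_row)
    return true_tags, pred_tags
```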

🎯 Intended Use

This model is suitable for:

  • Financial terminology extraction
  • Terminology preprocessing for translation systems
  • Supporting CAT (computer-assisted translation) tools
  • Domain-specific NLP pipelines

🚫 Out-of-Scope Use

This model is not intended for:

  • General-purpose NER tasks
  • Legal or compliance decision-making
  • Fully automated terminology validation without human review

🚀 Usage

from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple"
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))

🧾 Example

Input:
"The company issued convertible bonds and derivatives."

Output:
["convertible bonds", "derivatives"]
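With `aggregation_strategy="simple"`, the pipeline already merges word pieces into grouped entity dicts; a small hypothetical helper can reduce them to the term list shown above. The `entity_group` value and the score threshold are assumptions, not part of the model card.

```python
def extract_terms(entities, min_score=0.5):
    """Reduce token-classification pipeline output to an ordered,
    deduplicated list of term strings above a confidence threshold."""
    terms = []
    for ent in entities:
        if ent["score"] >= min_score and ent["word"] not in terms:
            terms.append(ent["word"])
    return terms
```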

⚠️ Limitations

  • Domain-specific: performance outside finance may degrade
  • Rare or unseen terms may not be recognized
  • Tokenization may split multi-word terms
  • Human validation is recommended

📜 License

This model is derived from data released under CC BY-NC 4.0.

✅ Non-commercial use allowed
❌ Commercial use prohibited without permission
✅ Attribution required

The base model google-bert/bert-base-cased is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.

πŸ™ Acknowledgements
Base model: google-bert/bert-base-cased
Dataset: WMT 2025 terminology resources
Framework: Hugging Face Transformers & Datasets
Metrics: seqeval