BERT Finance Term Extractor (English)

A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.


🧠 Model Description

This model is fine-tuned from google-bert/bert-base-cased for domain-specific terminology extraction.

It performs token-level classification (NER-style) to identify financial terms in text. The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.


πŸ—οΈ Training Pipeline

The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.

Data Processing

  • Input format: CoNLL-style token-tag sequences
  • Sentences are split by blank lines
  • Labels are converted into integer IDs (label2id, id2label)
  • Automatic train/dev split using configurable ratio (dev_ratio=0.1)
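The loading steps above can be sketched as follows. This is a minimal illustration, not the actual training script: the two-column token/tag layout and the `B-TERM`/`O` tag names are assumptions.

```python
import random

def read_conll(text):
    """Split blank-line-separated sentences into (tokens, tags) pairs
    from a CoNLL-style two-column format (assumed layout)."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        tok, tag = line.split()
        tokens.append(tok)
        tags.append(tag)
    if tokens:                            # flush the last sentence
        sentences.append((tokens, tags))
    return sentences

def build_label_maps(sentences):
    """Convert the tag inventory into label2id / id2label mappings."""
    labels = sorted({t for _, tags in sentences for t in tags})
    label2id = {l: i for i, l in enumerate(labels)}
    id2label = {i: l for l, i in label2id.items()}
    return label2id, id2label

def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle and hold out a dev fraction (configurable ratio)."""
    data = sentences[:]
    random.Random(seed).shuffle(data)
    n_dev = max(1, int(len(data) * dev_ratio))
    return data[n_dev:], data[:n_dev]
```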

Tokenization & Label Alignment

  • Tokenizer: BertTokenizerFast
  • Tokenization uses is_split_into_words=True
  • Word-piece alignment handled via word_ids()
  • Special tokens assigned label -100 (ignored in loss)
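The alignment step can be sketched without the tokenizer itself by operating on the `word_ids()` output. Note that the card only states that special tokens receive `-100`; assigning `-100` to continuation word pieces as well (so only the first piece of each word is labeled) is one common convention, assumed here.

```python
def align_labels_with_tokens(word_ids, word_labels, label2id):
    """Align word-level labels to word pieces.

    word_ids: output of BatchEncoding.word_ids() -- None for special
    tokens, otherwise the index of the source word for each piece.
    Special tokens get -100 so the cross-entropy loss ignores them;
    continuation pieces also get -100 here (an assumed convention).
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                        # [CLS], [SEP], padding
        elif wid != previous:
            aligned.append(label2id[word_labels[wid]])  # first piece of a word
        else:
            aligned.append(-100)                        # continuation piece
        previous = wid
    return aligned
```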

⚙️ Training Details

  • Base model: google-bert/bert-base-cased
  • Task: Token Classification (NER-style)
  • Framework: Hugging Face Trainer

Training Arguments

  • learning_rate: 2e-5
  • batch_size: 16
  • num_train_epochs: 5
  • max_seq_length: 256
  • weight_decay: 0.01

Training Strategy

  • Evaluation: per epoch
  • Checkpoint saving: per epoch
  • Best model selection:
    • metric: F1 score
    • load_best_model_at_end=True
  • Logging:
    • TensorBoard enabled
    • logging every 10 steps

Hardware Optimization

  • Optional fp16 mixed precision
  • Multi-worker dataloading
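The hyperparameters and strategy above map onto a `TrainingArguments` configuration roughly like this. A sketch, not the exact training script: `output_dir`, the worker count, and version-dependent argument names (e.g. `evaluation_strategy`, renamed `eval_strategy` in newer transformers releases) are assumptions.

```python
from transformers import TrainingArguments

# max_seq_length=256 is applied at tokenization time, not here.
args = TrainingArguments(
    output_dir="finance-term-extractor",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",     # evaluate once per epoch
    save_strategy="epoch",           # checkpoint once per epoch
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best checkpoint selected by F1
    logging_steps=10,
    report_to="tensorboard",
    fp16=True,                       # optional mixed precision
    dataloader_num_workers=4,        # multi-worker dataloading (assumed count)
)
```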

📊 Evaluation

Evaluation is performed using the seqeval library.

Metrics:

  • F1 score (primary metric)
  • Full classification report (printed during training)

Example:

precision    recall  f1-score   support
...
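Before seqeval can score predictions, the `-100` positions have to be stripped and label ids mapped back to tag strings. A sketch of that post-processing step (the shapes and `id2label` contents are illustrative); the resulting nested lists are what `seqeval.metrics.f1_score` and `classification_report` expect:

```python
import numpy as np

def postprocess(logits, labels, id2label):
    """Drop ignored (-100) positions and map ids back to tag strings,
    producing the list-of-lists format seqeval's metrics consume."""
    preds = np.argmax(logits, axis=-1)
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(preds, labels):
        t_row, p_row = [], []
        for p, l in zip(pred_row, label_row):
            if l == -100:                 # special/ignored position
                continue
            t_row.append(id2label[l])
            p_row.append(id2label[int(p)])
        true_tags.append(t_row)
        pred_tags.append(p_row)
    return true_tags, pred_tags
```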

🎯 Intended Use

This model is suitable for:

  • Financial terminology extraction
  • Terminology preprocessing for translation systems
  • Supporting CAT (computer-assisted translation) tools
  • Domain-specific NLP pipelines

🚫 Out-of-Scope Use

This model is not intended for:

  • General-purpose NER tasks
  • Legal or compliance decision-making
  • Fully automated terminology validation without human review

🚀 Usage

from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple"
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))

🧾 Example

Input:
"The company issued convertible bonds and derivatives."

Output:
["convertible bonds", "derivatives"]
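With `aggregation_strategy="simple"`, the pipeline already merges word pieces into grouped entity dicts; a small hypothetical helper can reduce them to the term list shown above. The `entity_group` value and the score threshold are assumptions, not part of the model card.

```python
def extract_terms(entities, min_score=0.5):
    """Reduce token-classification pipeline output to an ordered,
    deduplicated list of term strings above a confidence threshold."""
    terms = []
    for ent in entities:
        if ent["score"] >= min_score and ent["word"] not in terms:
            terms.append(ent["word"])
    return terms
```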

⚠️ Limitations

  • Domain-specific: performance outside finance may degrade
  • Rare or unseen terms may not be recognized
  • Tokenization may split multi-word terms
  • Human validation is recommended

📜 License

This model is derived from data released under CC BY-NC 4.0.

✅ Non-commercial use allowed
❌ Commercial use prohibited without permission
✅ Attribution required

The base model google-bert/bert-base-cased is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.

πŸ™ Acknowledgements
Base model: google-bert/bert-base-cased
Dataset: WMT 2025 terminology resources
Framework: Hugging Face Transformers & Datasets
Metrics: seqeval