# BERT Finance Term Extractor (English)
A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
## Model Description
This model is fine-tuned from `google-bert/bert-base-cased` for domain-specific terminology extraction.
It performs token-level (NER-style) classification to identify financial terms in text, and is designed particularly for translation workflows, terminology mining, and domain-specific NLP pipelines.
## Training Pipeline
The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
### Data Processing
- Input format: CoNLL-style token-tag sequences
- Sentences are split by blank lines
- Labels are converted into integer IDs (`label2id`, `id2label`)
- Automatic train/dev split using a configurable ratio (`dev_ratio=0.1`)
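The parsing and splitting steps above can be sketched as follows (a minimal illustration; the function names and the fixed seed are assumptions, not the model's actual code):

```python
import random

def read_conll(text):
    """Parse CoNLL-style token-tag lines into (tokens, tags) sentences.
    Sentences are separated by blank lines."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

def build_label_maps(sentences):
    """Build the label2id / id2label mappings from the observed tag set."""
    labels = sorted({t for _, tags in sentences for t in tags})
    label2id = {l: i for i, l in enumerate(labels)}
    id2label = {i: l for l, i in label2id.items()}
    return label2id, id2label

def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle and hold out a dev_ratio fraction of sentences for evaluation."""
    rng = random.Random(seed)
    idx = list(range(len(sentences)))
    rng.shuffle(idx)
    n_dev = max(1, int(len(idx) * dev_ratio))
    dev = [sentences[i] for i in idx[:n_dev]]
    train = [sentences[i] for i in idx[n_dev:]]
    return train, dev
```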
### Tokenization & Label Alignment
- Tokenizer: `BertTokenizerFast`
- Tokenization uses `is_split_into_words=True`
- Word-piece alignment handled via `word_ids()`
- Special tokens assigned label `-100` (ignored in the loss)
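The alignment logic can be sketched independently of the tokenizer: given the `word_ids()` sequence of an encoded sentence, word-level tag IDs are mapped onto word pieces, with `-100` on special tokens and (typically) on continuation pieces so they are ignored by the loss. The helper name below is hypothetical:

```python
def align_labels(word_ids, tag_ids, label_all_subwords=False):
    """Map word-level tag IDs onto word-piece positions.

    word_ids: output of BatchEncoding.word_ids() (None for special tokens).
    tag_ids:  one integer label per original word.
    Special tokens get -100; continuation pieces also get -100 unless
    label_all_subwords is set.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)          # [CLS], [SEP], padding
        elif wid != prev:
            aligned.append(tag_ids[wid])  # first piece of a word
        else:
            aligned.append(tag_ids[wid] if label_all_subwords else -100)
        prev = wid
    return aligned
```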
## Training Details
- Base model: `google-bert/bert-base-cased`
- Task: token classification (NER-style)
- Framework: Hugging Face `Trainer`
### Training Arguments
- learning_rate: 2e-5
- batch_size: 16
- num_train_epochs: 5
- max_seq_length: 256
- weight_decay: 0.01
### Training Strategy
- Evaluation: per epoch
- Checkpoint saving: per epoch
- Best model selection:
  - metric: F1 score
  - `load_best_model_at_end=True`
- Logging:
- TensorBoard enabled
- logging every 10 steps
### Hardware Optimization
- Optional fp16 mixed precision
- Multi-worker dataloading
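The hyperparameters and strategy above map onto Hugging Face `TrainingArguments` roughly as follows (a configuration sketch, not the model's actual script; the output path and worker count are placeholders, and older Transformers releases name `eval_strategy` as `evaluation_strategy`):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="epoch",            # evaluation per epoch
    save_strategy="epoch",            # checkpoint per epoch
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_steps=10,
    report_to=["tensorboard"],
    fp16=True,                        # optional mixed precision on GPU
    dataloader_num_workers=4,         # multi-worker dataloading
)
# Note: max_seq_length=256 is applied at tokenization time
# (truncation=True, max_length=256), not via TrainingArguments.
```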
## Evaluation
Evaluation is performed using the `seqeval` library.
Metrics:
- F1 score (primary metric)
- Full classification report (printed during training)
Example:

```text
              precision    recall  f1-score   support
...
```
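`seqeval` scores at the entity level rather than the token level: a prediction counts only if the full span and its type match. The same idea can be sketched in plain Python (hypothetical helpers, not `seqeval`'s implementation):

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes open span
        closes = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and etype != tag[2:])
        if closes and start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def entity_f1(y_true, y_pred):
    """Micro-averaged entity-level F1 over lists of tag sequences."""
    gold = {(s, sp) for s, tags in enumerate(y_true) for sp in bio_spans(tags)}
    pred = {(s, sp) for s, tags in enumerate(y_pred) for sp in bio_spans(tags)}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```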
## Intended Use
This model is suitable for:
- Financial terminology extraction
- Terminology preprocessing for translation systems
- Supporting computer-assisted translation (CAT) tools
- Domain-specific NLP pipelines
## Out-of-Scope Use
This model is not intended for:
- General-purpose NER tasks
- Legal or compliance decision-making
- Fully automated terminology validation without human review
## Usage
```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple",
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))
```
## Example
Input: `"The company issued convertible bonds and derivatives."`

Output: `["convertible bonds", "derivatives"]`
## Limitations
- Domain-specific: performance outside finance may degrade
- Rare or unseen terms may not be recognized
- Tokenization may split multi-word terms
- Human validation is recommended
## License
This model is derived from data released under CC BY-NC 4.0.
- ✅ Non-commercial use allowed
- ❌ Commercial use prohibited without permission
- ✅ Attribution required
The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits the non-commercial restrictions of its training data.
## Acknowledgements
- Base model: `google-bert/bert-base-cased`
- Dataset: WMT 2025 terminology resources
- Framework: Hugging Face Transformers & Datasets
- Metrics: seqeval