Log Entity Extractor (BERT-based Token Classifier)
A fine-tuned BERT model that extracts canonical attributes from log lines using token classification (an NER-style task). Created for my coursework.
Model Description
This model is based on bert-base-uncased and trained to perform token classification on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.
Use case: Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.
Model Details
- Base Model: bert-base-uncased
- Task: Token Classification (Named Entity Recognition)
- Training Data: Annotated log lines (character-level entity offsets)
- Input: Raw log text (string)
- Output: Per-token BIO labels → grouped entities as canonical attributes
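The step from per-token BIO labels to grouped entities can be sketched as follows. This is a minimal illustration, not the project's actual post-processing: the function name `group_bio` is hypothetical, and it assumes word-level tokens that can be rejoined with a single space.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged tokens into (label, text) entities.

    Hypothetical helper: assumes word-level tokens joined by single spaces.
    """
    entities: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the currently open entity, if any.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and drop this token.
            flush()
    flush()
    return entities


tokens = ["auth", "connection", "timed", "out", "from", "10.1.2.3"]
tags = ["B-service", "B-error_message", "I-error_message",
        "I-error_message", "O", "B-ip"]
entities = group_bio(tokens, tags)
print(entities)
```

A `B-` tag always opens a new entity, so two adjacent entities of the same type stay separate; that is the main reason to use BIO rather than plain per-token labels.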
Canonical Label Set
The model extracts attributes from these canonical fields:
| Field | Description |
|---|---|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Elapsed time of a request or operation |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |
Usage
Installation
pip install transformers torch
Python (Hugging Face Transformers)
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"
# Tokenize and run a forward pass (no gradients needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Get predicted label IDs
predicted_ids = torch.argmax(logits, dim=-1)
# Map back to label names
id2label = model.config.id2label
predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids]
print(predictions)
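The snippet above prints raw per-token BIO labels. Once these labels are grouped into entities, the extractor returns a structured JSON object with: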
- attributes: High-confidence extractions (dict of canonical_field → value)
- low_confidence_attributes: Below-threshold extractions
- attribute_confidence: Per-field confidence scores
- message: Original log text
- confidence: Overall prediction confidence (0-1)
- model_version: Model version string
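The split between attributes and low_confidence_attributes might look like the sketch below. The function name `split_by_confidence` and the 0.5 threshold are assumptions made for illustration; the actual threshold and the way per-field scores are computed are not stated here. The message, confidence, and model_version fields are left out for the same reason.

```python
from typing import Dict, Tuple


def split_by_confidence(
    extracted: Dict[str, Tuple[str, float]],
    threshold: float = 0.5,  # assumed value, not taken from the model
) -> Dict[str, dict]:
    """Route each (value, score) pair into a high- or low-confidence bucket."""
    attributes: Dict[str, str] = {}
    low_confidence: Dict[str, str] = {}
    scores: Dict[str, float] = {}
    for field, (value, score) in extracted.items():
        scores[field] = score
        if score >= threshold:
            attributes[field] = value
        else:
            low_confidence[field] = value
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_confidence,
        "attribute_confidence": scores,
    }


# Hypothetical per-field scores for one log line.
extracted = {
    "service": ("auth", 0.97),
    "ip": ("10.1.2.3", 0.91),
    "event": ("login", 0.42),
}
result = split_by_confidence(extracted)
print(result)
```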
Training
Dataset Format
Training data in JSONL format with character-offset annotations:
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
Dataset used: Aliph0th/logtheus-ml-ds
Fields:
- text: Raw log line (string)
- entities: List of entity annotations
  - start, end: Character-level offsets in text (0-indexed)
  - label: Canonical field name
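Character-offset annotations have to be converted to per-token BIO labels before training. The sketch below shows one way to do that, assuming exclusive end offsets and a list of token (start, end) offsets such as the one a fast tokenizer returns with `return_offsets_mapping=True`. The function name `spans_to_bio` is hypothetical, and a simple regex stands in for the BERT tokenizer so the example runs without downloading a model.

```python
import re
from typing import Dict, List, Tuple


def spans_to_bio(offsets: List[Tuple[int, int]], entities: List[Dict]) -> List[str]:
    """Assign a BIO tag to each token from character-level entity spans."""
    labels: List[str] = []
    previous = None  # index of the entity the previous token belonged to
    for tok_start, tok_end in offsets:
        current = None
        if tok_end > tok_start:  # zero-width offsets mark special tokens
            for i, ent in enumerate(entities):
                # Token overlaps the entity span [start, end).
                if tok_start < ent["end"] and tok_end > ent["start"]:
                    current = i
                    break
        if current is None:
            labels.append("O")
        else:
            prefix = "I" if current == previous else "B"
            labels.append(f"{prefix}-{entities[current]['label']}")
        previous = current
    return labels


text = "[auth] failed login from 10.1.2.3"
entities = [
    {"start": 1, "end": 5, "label": "service"},
    {"start": 25, "end": 33, "label": "ip"},
]
# Stand-in tokenizer: words and single punctuation marks with their offsets.
offsets = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]
labels = spans_to_bio(offsets, entities)
for (start, end), label in zip(offsets, labels):
    print(f"{text[start:end]!r:10} {label}")
```

Slicing `text[start:end]` for each annotation is also a cheap way to catch off-by-one offsets before training.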
Training Procedure
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8
# 2. Train model
python training/train_token_classifier.py \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--output-dir artifacts/model_v1 \
--base-model bert-base-uncased \
--epochs 5 \
--batch-size 16
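The preparation step (deduplicate, then split into train and validation sets) could be sketched like this. It is not the code in scripts/process_data.py; the function name, the seed, and the reading of `--p 0.8` as the training fraction are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def dedup_and_split(
    records: List[Dict],
    train_fraction: float = 0.8,  # assumed meaning of the --p flag
    seed: int = 42,               # arbitrary seed for a reproducible shuffle
) -> Tuple[List[Dict], List[Dict]]:
    """Drop records with duplicate text, shuffle, and split train/val."""
    seen = set()
    unique: List[Dict] = []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# 15 records, only 10 distinct texts.
records = [{"text": f"line {i % 10}", "entities": []} for i in range(15)]
train, val = dedup_and_split(records)
print(len(train), len(val))
```

Deduplicating before the split keeps identical log lines from landing in both sets, which would otherwise inflate the validation score.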
Hyperparameters:
- Learning rate: 3e-5
- Batch size: 16 (per device)
- Epochs: 5 (with early stopping by F1)
- Optimizer: AdamW
- Weight decay: 0.01
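Early stopping needs a single F1 number per epoch. Whether the training script scores tokens or whole entities is not stated here, so the sketch below assumes entity-level exact-match F1, where a prediction counts only if its span and label both match. The function name `entity_f1` and the tuple layout are hypothetical.

```python
from typing import Set, Tuple

# (example_id, start, end, label)
Entity = Tuple[str, int, int, str]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1 over entity spans (assumed metric, see lead-in)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {("1", 1, 5, "service"), ("1", 29, 32, "user_id"), ("1", 38, 46, "ip")}
predicted = {("1", 1, 5, "service"), ("1", 38, 46, "ip"), ("1", 7, 13, "event")}
score = entity_f1(gold, predicted)
print(round(score, 3))
```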
Limitations
- English logs only: Trained on ASCII/UTF-8 log text in English
- Format dependency: Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing
Contact & Support
For issues, questions, or contributions, please visit:
- Repository: https://github.com/Aliph0th/logtheus-ml
- Issues: https://github.com/Aliph0th/logtheus-ml/issues
Acknowledgments