Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model for extracting canonical attributes from log lines via token classification (an NER-style task). Created as part of my coursework.

Model Description

This model is based on bert-base-uncased and trained to perform token classification on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

Use case: Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

Model Details

  • Base Model: bert-base-uncased
  • Parameters: ~110M (F32 safetensors)
  • Task: Token Classification (Named Entity Recognition)
  • Training Data: Annotated log lines (character-level entity offsets)
  • Input: Raw log text (string)
  • Output: Per-token BIO labels → grouped entities as canonical attributes

Canonical Label Set

The model extracts attributes from these canonical fields:

  • service: Application or service name (e.g., "auth", "api")
  • level: Log level (e.g., "info", "error", "warn")
  • timestamp: Timestamp or date reference
  • environment: Deployment environment (e.g., "prod", "staging")
  • event: Event type or action (e.g., "login", "request")
  • error_message: Human-readable error message
  • status_code: HTTP or service status code
  • duration: Duration of the operation or request
  • ip: IP address (client or server)
  • method: HTTP method (GET, POST, etc.)
  • path: URL path or resource path
  • useragent: User-Agent header
  • hostname: Server hostname
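As a concrete illustration, the hypothetical extraction below maps one access-log line onto several of the canonical fields (the line and all extracted values are made up for this example):

```python
# Hypothetical extraction result for one access-log line (values illustrative).
log_line = "10.1.2.3 - GET /api/users 200 12ms host=web-01 env=prod"

expected_attributes = {
    "ip": "10.1.2.3",
    "method": "GET",
    "path": "/api/users",
    "status_code": "200",
    "duration": "12ms",
    "hostname": "web-01",
    "environment": "prod",
}

# Every extracted key must come from the canonical label set above.
canonical_fields = {
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
}
assert set(expected_attributes) <= canonical_fields
```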

Usage

Installation

pip install transformers torch

Python (Hugging Face Transformers)

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run a forward pass (no gradients needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted label ID for each token
predicted_ids = torch.argmax(logits, dim=-1)

# Map IDs back to label names (one BIO label per subword token)
id2label = model.config.id2label
predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids]
print(predictions)
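The snippet above only prints per-token BIO labels; turning them into attributes requires merging B-/I- spans back into text. A minimal grouping sketch (standalone, operating on word-level (token, label) pairs rather than the tokenizer's subword offsets) might look like:

```python
def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into a {field: value} attribute dict.

    tokens: word-level strings; labels: matching BIO tags such as
    "B-service" / "I-service" / "O".
    """
    attributes = {}
    field, parts = None, []

    def flush():
        nonlocal field, parts
        if field is not None:
            attributes[field] = " ".join(parts)
        field, parts = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()                      # a new entity starts
            field, parts = label[2:], [token]
        elif label.startswith("I-") and field == label[2:]:
            parts.append(token)          # continue the open entity
        else:                            # "O" or a stray I- tag closes it
            flush()
    flush()
    return attributes

tokens = ["[", "auth", "]", "failed", "login", "for", "user", "123"]
labels = ["O", "B-service", "O", "B-event", "I-event", "O", "O", "B-user_id"]
print(group_bio(tokens, labels))
# {'service': 'auth', 'event': 'failed login', 'user_id': '123'}
```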

The raw snippet yields per-token BIO labels; grouping them into entities (for example, via the project's inference wrapper) produces a structured JSON object with:

  • attributes: High-confidence extractions (dict of canonical_field → value)
  • low_confidence_attributes: Below-threshold extractions
  • attribute_confidence: Per-field confidence scores
  • message: Original log text
  • confidence: Overall prediction confidence (0-1)
  • model_version: Model version string
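Assuming a post-processing wrapper shaped like the field list above, a response for the sample line from the usage snippet might look like this (all values, and the presence of an error_code field, are illustrative):

```python
# Hypothetical structured output; confidence values are made up.
result = {
    "attributes": {"service": "auth", "user_id": "123", "ip": "10.1.2.3"},
    "low_confidence_attributes": {"error_code": "E401"},
    "attribute_confidence": {
        "service": 0.99, "user_id": 0.97, "ip": 0.98, "error_code": 0.41,
    },
    "message": "[auth] failed login for user 123 from 10.1.2.3 code=E401",
    "confidence": 0.84,
    "model_version": "v1",
}
```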

Training

Dataset Format

Training data in JSONL format with character-offset annotations:

{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}

Dataset used: Aliph0th/logtheus-ml-ds

Fields:

  • text: Raw log line (string)
  • entities: List of entity annotations
    • start, end: Character-level offsets in text (0-indexed)
    • label: Canonical field name
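Because annotations are character offsets, it is worth checking that each span actually slices the intended substring before training; a minimal validation sketch (pure stdlib, not part of the repo's scripts) could be:

```python
import json

def validate_record(line):
    """Check that every entity span is in bounds and non-empty."""
    record = json.loads(line)
    text = record["text"]
    for ent in record["entities"]:
        start, end = ent["start"], ent["end"]
        assert 0 <= start < end <= len(text), f"bad span {start}:{end}"
        assert text[start:end].strip(), f"empty span for {ent['label']}"
    return record

line = ('{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3",'
        '"entities":[{"start":1,"end":5,"label":"service"},'
        '{"start":29,"end":32,"label":"user_id"},'
        '{"start":38,"end":46,"label":"ip"}]}')
record = validate_record(line)
print([record["text"][e["start"]:e["end"]] for e in record["entities"]])
# ['auth', '123', '10.1.2.3']
```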

Training Procedure

# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8
# 2. Train model
python training/train_token_classifier.py \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir artifacts/model_v1 \
  --base-model bert-base-uncased \
  --epochs 5 \
  --batch-size 16

Hyperparameters:

  • Learning rate: 3e-5
  • Batch size: 16 (per device)
  • Epochs: 5 (with early stopping by F1)
  • Optimizer: AdamW
  • Weight decay: 0.01
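Assuming the training script uses the Hugging Face Trainer (the script lives in the repo, so the mapping below is a sketch of its likely configuration, not its exact interface), the hyperparameters above translate roughly to:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="artifacts/model_v1",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,            # AdamW is the Trainer's default optimizer
    eval_strategy="epoch",        # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping by F1
    metric_for_best_model="f1",
)
```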

Limitations

  • English logs only: Trained on ASCII/UTF-8 log text in English
  • Format dependency: Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing
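For highly custom formats, a light normalization pass before inference can help. The sketch below is purely hypothetical: it assumes one pipe-delimited format and rewrites it into the bracketed style seen in the examples above; the delimiter, field order, and regex are assumptions, not model requirements.

```python
import re

def normalize_pipe_log(line):
    """Rewrite 'auth|ERROR|failed login' into '[auth] error: failed login'.

    Illustrative preprocessing for one assumed custom format; unknown
    formats pass through unchanged.
    """
    match = re.match(r"^(?P<service>[\w-]+)\|(?P<level>\w+)\|(?P<rest>.*)$", line)
    if not match:
        return line  # leave unrecognized lines untouched
    return f"[{match['service']}] {match['level'].lower()}: {match['rest']}"

print(normalize_pipe_log("auth|ERROR|failed login for user 123"))
# [auth] error: failed login for user 123
```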

Contact & Support

For issues, questions, or contributions, please visit the model repository on Hugging Face (Aliph0th/logtheus-ml).

