Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model for extracting canonical attributes from log lines via token classification (an NER-style task). Created as part of my coursework.

Model Description

This model is based on bert-base-uncased and trained to perform token classification on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

Use case: Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

Model Details

  • Base Model: bert-base-uncased
  • Parameters: ~110M (F32 safetensors)
  • Task: Token Classification (Named Entity Recognition)
  • Training Data: Annotated log lines (character-level entity offsets)
  • Input: Raw log text (string)
  • Output: Per-token BIO labels → grouped entities as canonical attributes

Canonical Label Set

The model extracts attributes from these canonical fields:

  • service: Application or service name (e.g., "auth", "api")
  • level: Log level (e.g., "info", "error", "warn")
  • timestamp: Timestamp or date reference
  • environment: Deployment environment (e.g., "prod", "staging")
  • event: Event type or action (e.g., "login", "request")
  • error_message: Human-readable error message
  • status_code: HTTP or service status code
  • duration: Duration of the operation or request
  • ip: IP address (client or server)
  • method: HTTP method (GET, POST, etc.)
  • path: URL path or resource path
  • useragent: User-Agent header
  • hostname: Server hostname
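As a concrete illustration, the hypothetical extraction below maps one access-log line onto several of the canonical fields (the line and all extracted values are made up for this example):

```python
# Hypothetical extraction result for one access-log line (values illustrative).
log_line = "10.1.2.3 - GET /api/users 200 12ms host=web-01 env=prod"

expected_attributes = {
    "ip": "10.1.2.3",
    "method": "GET",
    "path": "/api/users",
    "status_code": "200",
    "duration": "12ms",
    "hostname": "web-01",
    "environment": "prod",
}

# Every extracted key must come from the canonical label set above.
canonical_fields = {
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
}
assert set(expected_attributes) <= canonical_fields
```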

Usage

Installation

pip install transformers torch

Python (Hugging Face Transformers)

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run a forward pass (no gradients needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted label ID for each token
predicted_ids = torch.argmax(logits, dim=-1)

# Map IDs back to label names (one BIO label per subword token)
id2label = model.config.id2label
predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids]
print(predictions)
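The snippet above only prints per-token BIO labels; turning them into attributes requires merging B-/I- spans back into text. A minimal grouping sketch (standalone, operating on word-level (token, label) pairs rather than the tokenizer's subword offsets) might look like:

```python
def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into a {field: value} attribute dict.

    tokens: word-level strings; labels: matching BIO tags such as
    "B-service" / "I-service" / "O".
    """
    attributes = {}
    field, parts = None, []

    def flush():
        nonlocal field, parts
        if field is not None:
            attributes[field] = " ".join(parts)
        field, parts = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()                      # a new entity starts
            field, parts = label[2:], [token]
        elif label.startswith("I-") and field == label[2:]:
            parts.append(token)          # continue the open entity
        else:                            # "O" or a stray I- tag closes it
            flush()
    flush()
    return attributes

tokens = ["[", "auth", "]", "failed", "login", "for", "user", "123"]
labels = ["O", "B-service", "O", "B-event", "I-event", "O", "O", "B-user_id"]
print(group_bio(tokens, labels))
# {'service': 'auth', 'event': 'failed login', 'user_id': '123'}
```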

The raw snippet yields per-token BIO labels; grouping them into entities (for example, via the project's inference wrapper) produces a structured JSON object with:

  • attributes: High-confidence extractions (dict of canonical_field → value)
  • low_confidence_attributes: Below-threshold extractions
  • attribute_confidence: Per-field confidence scores
  • message: Original log text
  • confidence: Overall prediction confidence (0-1)
  • model_version: Model version string
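Assuming a post-processing wrapper shaped like the field list above, a response for the sample line from the usage snippet might look like this (all values, and the presence of an error_code field, are illustrative):

```python
# Hypothetical structured output; confidence values are made up.
result = {
    "attributes": {"service": "auth", "user_id": "123", "ip": "10.1.2.3"},
    "low_confidence_attributes": {"error_code": "E401"},
    "attribute_confidence": {
        "service": 0.99, "user_id": 0.97, "ip": 0.98, "error_code": 0.41,
    },
    "message": "[auth] failed login for user 123 from 10.1.2.3 code=E401",
    "confidence": 0.84,
    "model_version": "v1",
}
```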

Training

Dataset Format

Training data in JSONL format with character-offset annotations:

{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}

Dataset used: Aliph0th/logtheus-ml-ds

Fields:

  • text: Raw log line (string)
  • entities: List of entity annotations
    • start, end: Character-level offsets in text (0-indexed)
    • label: Canonical field name
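Because annotations are character offsets, it is worth checking that each span actually slices the intended substring before training; a minimal validation sketch (pure stdlib, not part of the repo's scripts) could be:

```python
import json

def validate_record(line):
    """Check that every entity span is in bounds and non-empty."""
    record = json.loads(line)
    text = record["text"]
    for ent in record["entities"]:
        start, end = ent["start"], ent["end"]
        assert 0 <= start < end <= len(text), f"bad span {start}:{end}"
        assert text[start:end].strip(), f"empty span for {ent['label']}"
    return record

line = ('{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3",'
        '"entities":[{"start":1,"end":5,"label":"service"},'
        '{"start":29,"end":32,"label":"user_id"},'
        '{"start":38,"end":46,"label":"ip"}]}')
record = validate_record(line)
print([record["text"][e["start"]:e["end"]] for e in record["entities"]])
# ['auth', '123', '10.1.2.3']
```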

Training Procedure

# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8
# 2. Train model
python training/train_token_classifier.py \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir artifacts/model_v1 \
  --base-model bert-base-uncased \
  --epochs 5 \
  --batch-size 16

Hyperparameters:

  • Learning rate: 3e-5
  • Batch size: 16 (per device)
  • Epochs: 5 (with early stopping by F1)
  • Optimizer: AdamW
  • Weight decay: 0.01
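Assuming the training script uses the Hugging Face Trainer (the script lives in the repo, so the mapping below is a sketch of its likely configuration, not its exact interface), the hyperparameters above translate roughly to:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="artifacts/model_v1",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,            # AdamW is the Trainer's default optimizer
    eval_strategy="epoch",        # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping by F1
    metric_for_best_model="f1",
)
```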

Limitations

  • English logs only: Trained on ASCII/UTF-8 log text in English
  • Format dependency: Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing
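For highly custom formats, a light normalization pass before inference can help. The sketch below is purely hypothetical: it assumes one pipe-delimited format and rewrites it into the bracketed style seen in the examples above; the delimiter, field order, and regex are assumptions, not model requirements.

```python
import re

def normalize_pipe_log(line):
    """Rewrite 'auth|ERROR|failed login' into '[auth] error: failed login'.

    Illustrative preprocessing for one assumed custom format; unknown
    formats pass through unchanged.
    """
    match = re.match(r"^(?P<service>[\w-]+)\|(?P<level>\w+)\|(?P<rest>.*)$", line)
    if not match:
        return line  # leave unrecognized lines untouched
    return f"[{match['service']}] {match['level'].lower()}: {match['rest']}"

print(normalize_pipe_log("auth|ERROR|failed login for user 123"))
# [auth] error: failed login for user 123
```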

Contact & Support

For issues, questions, or contributions, please visit the model repository on Hugging Face (Aliph0th/logtheus-ml).

