---
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- ner
- bert
- logs
datasets:
- Aliph0th/logtheus-ml-ds
---

# Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model for extracting canonical attributes from log lines via token classification (an NER-style task). Created as part of my coursework.

## Model Description

This model is based on `bert-base-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

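As a concrete illustration of the BIO scheme, here is a hand-labelled example over whitespace tokens (illustrative only, not model output; the model itself labels BERT subword tokens, so real label sequences are longer):

```python
# Hand-labelled BIO example over whitespace tokens (illustrative only;
# the model operates on BERT subword tokens).
log_line = "[auth] failed login for user 123 from 10.1.2.3"
tokens = log_line.split()
labels = ["B-service", "O", "B-event", "O", "O", "B-user_id", "O", "B-ip"]

for tok, lab in zip(tokens, labels):
    print(f"{tok:12} {lab}")
```

Tokens outside any entity get `O`; the first token of an entity gets `B-<field>` and continuation tokens get `I-<field>`.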
## Model Details

- **Base Model:** `bert-base-uncased`
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Annotated log lines (character-level entity offsets)
- **Input:** Raw log text (string)
- **Output:** Per-token BIO labels → grouped entities as canonical attributes

## Canonical Label Set

The model extracts attributes for these canonical fields:

| Field | Description |
|-------|-------------|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Duration of the request or operation |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |

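Under a standard BIO scheme, this table implies a label space of 27 tags: one `O` plus a `B-`/`I-` pair per field. A hypothetical reconstruction follows (the authoritative mapping ships in the model config as `id2label`):

```python
# Hypothetical reconstruction of the BIO label space; the authoritative
# mapping is model.config.id2label / label2id.
FIELDS = [
    "service", "level", "timestamp", "environment", "event", "error_message",
    "status_code", "duration", "ip", "method", "path", "useragent", "hostname",
]
labels = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
print(len(labels))  # 27
```
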
## Usage

### Installation

```bash
pip install transformers torch
```

### Python (Hugging Face Transformers)

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run the forward pass
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get predicted label IDs and map them back to BIO label names
predicted_ids = torch.argmax(logits, dim=-1)
id2label = model.config.id2label
predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids]
print(predictions)
```

The snippet above prints per-token BIO labels. The surrounding Logtheus pipeline groups these into a structured JSON object with:

- `attributes`: High-confidence extractions (dict of canonical_field → value)
- `low_confidence_attributes`: Below-threshold extractions
- `attribute_confidence`: Per-field confidence scores
- `message`: Original log text
- `confidence`: Overall prediction confidence (0-1)
- `model_version`: Model version string

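Grouping per-token BIO labels into attributes can be sketched as follows. This is a minimal helper over whitespace tokens for clarity; with subword tokenization you would merge word pieces first, and `group_bio` is illustrative, not part of the released model:

```python
def group_bio(tokens, labels):
    """Merge per-token BIO labels into a dict of canonical_field -> value.

    Illustrative sketch: assumes whitespace tokens and keeps the last
    span per field; real pipelines merge subword pieces first.
    """
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity starts here
            if current:
                entities.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)        # continuation of the open entity
        else:                             # "O" or a stray "I-" tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return {field: " ".join(parts) for field, parts in entities}

tokens = "[auth] failed login for user 123 from 10.1.2.3".split()
labels = ["B-service", "O", "B-event", "O", "O", "B-user_id", "O", "B-ip"]
print(group_bio(tokens, labels))
# → {'service': '[auth]', 'event': 'login', 'user_id': '123', 'ip': '10.1.2.3'}
```
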
## Training

### Dataset Format

Training data is in JSONL format with character-offset annotations:

```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":28,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
```

Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)

**Fields:**

- `text`: Raw log line (string)
- `entities`: List of entity annotations
  - `start`, `end`: Character-level offsets into `text` (0-indexed, end-exclusive)
  - `label`: Canonical field name

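Converting these character-offset annotations into per-token BIO labels might look like this sketch. A simple whitespace tokenizer keeps the idea visible; actual training would align to BERT subword tokens, e.g. via the tokenizer's `return_offsets_mapping`:

```python
def to_bio(text, entities):
    """Illustrative char-offset -> BIO conversion over whitespace tokens."""
    # Recover (token, start, end) triples from the raw text
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((tok, start, start + len(tok)))
        pos = start + len(tok)

    labels = ["O"] * len(spans)
    for ent in entities:
        first = True
        for i, (_, s, e) in enumerate(spans):
            if s < ent["end"] and e > ent["start"]:  # token overlaps the span
                labels[i] = ("B-" if first else "I-") + ent["label"]
                first = False
    return labels

record = {
    "text": "[auth] failed login for user 123 from 10.1.2.3",
    "entities": [
        {"start": 1, "end": 5, "label": "service"},
        {"start": 28, "end": 32, "label": "user_id"},
        {"start": 38, "end": 46, "label": "ip"},
    ],
}
print(to_bio(record["text"], record["entities"]))
# → ['B-service', 'O', 'O', 'O', 'O', 'B-user_id', 'O', 'B-ip']
```
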
### Training Procedure

```bash
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8

# 2. Train the model
python training/train_token_classifier.py \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir artifacts/model_v1 \
  --base-model bert-base-uncased \
  --epochs 5 \
  --batch-size 16
```
124
+
125
+ **Hyperparameters:**
126
+ - Learning rate: 3e-5
127
+ - Batch size: 16 (per device)
128
+ - Epochs: 5 (with early stopping by F1)
129
+ - Optimizer: AdamW
130
+ - Weight decay: 0.01
131
+
## Limitations

- **English logs only:** Trained on ASCII/UTF-8 log text in English
- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing

## Contact & Support

For issues, questions, or contributions, please visit:

- **Repository:** https://github.com/Aliph0th/logtheus-ml
- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues

## Acknowledgments

- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)