| --- |
| language: |
| - en |
| - id |
| tags: |
| - bert |
| - text-classification |
| - token-classification |
| - cybersecurity |
| - fill-mask |
| - named-entity-recognition |
| base_model: google-bert/bert-base-cased |
| library_name: transformers |
| --- |
| |
| # bert-base-cybersecurity |
|
|
| ## 1. Model Details |
|
|
| **Model description** |
| `bert-base-cybersecurity` is a BERT model fine-tuned for cybersecurity text classification tasks (e.g., threat detection, incident-report triage, malicious vs. benign content). |
|
|
| - Model type: fine-tuned lightweight BERT variant |
| - Languages: English and Indonesian |
| - Fine-tuned from: `google-bert/bert-base-cased` |
| - Status: **Early version**, trained on 1 of 237,628 planned records (**0.00%**). |
|
|
| **Model sources** |
| - Base model: [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) |
| - Data: Cybersecurity Data |
|
|
| ## 2. Uses |
|
|
| ### Direct use |
| Use this model to classify cybersecurity-related text, for example to flag whether a message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat. |
|
|
| ### Downstream use |
| - Embedding extraction for clustering or anomaly detection in security logs. |
| - As a component in pipelines for phishing detection, malicious-email filtering, or incident triage. |
| - As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard). |
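
To make the embedding-extraction use case more concrete, here is a minimal sketch. It assumes mean pooling over the encoder's last hidden state, which is a common choice but not something this card prescribes, and the helper names `mean_pool` and `embed` are hypothetical. `embed` downloads the checkpoint on first call, so only the pooling step runs offline.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts


def embed(texts, model_id="codechrl/bert-base-cybersecurity"):
    """Hypothetical helper: one vector per input text (assumes mean pooling)."""
    # Imported lazily so that mean_pool can be used without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no classification head
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden.numpy(), batch["attention_mask"].numpy())
```

The resulting vectors can be passed to any clustering or anomaly-detection routine. Given how little fine-tuning data this version has seen, they will likely behave much like embeddings from the base model.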
|
|
| ### Out-of-scope use |
| - Not meant for high-stakes automated blocking decisions without human review. |
| - Not optimized for languages other than English and Indonesian. |
| - Not tested for non-cybersecurity domains or out-of-distribution data. |
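
One way to respect the human-review point above is to treat the model's output as a suggestion attached to an analyst queue rather than as a trigger for blocking. The sketch below is illustrative only: the `triage` helper, the action names, and the `0.9` threshold are all assumptions, not part of this model or its tooling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, threshold=0.9):
    """Route a prediction to an analyst; neither branch blocks anything.

    threshold is a hypothetical cut-off and would need calibration on
    held-out data before being relied on.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    # Confident predictions are shown to the analyst as a suggestion;
    # everything else is reviewed without a suggested label.
    action = "review_with_suggestion" if confidence >= threshold else "review_unassisted"
    return {"label_id": best, "confidence": confidence, "action": action}
```

The input would be the logits of a single example, e.g. `triage(logits[0].tolist())` when `logits` is a PyTorch tensor of shape `(1, num_labels)`.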
|
|
| ## 3. Bias, Risks, and Limitations |
|
|
| Because the model has so far been trained on only 1 of the 237,628 planned records (0.00%), its performance is preliminary and may degrade on unseen or specialized domains (e.g., industrial control systems, IoT logs, or languages other than English and Indonesian). |
|
|
| - Inherits any biases present in the base model (`google-bert/bert-base-cased`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary. |
| - Should not be used as the sole authority for incident decisions, only as an aid to human analysts. |
|
|
| ## 4. How to Get Started with the Model |
|
|
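| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification |
| |
| model_id = "codechrl/bert-base-cybersecurity" |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForSequenceClassification.from_pretrained(model_id) |
| model.eval()  # disable dropout for inference |
| |
| text = "The server logged an unusual outbound connection to 123.123.123.123" |
| inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
| |
| # Run a forward pass without tracking gradients |
| with torch.no_grad(): |
|     logits = model(**inputs).logits |
| |
| predicted_class = logits.argmax(dim=-1).item() |
| # Map the class index to the label name stored in the model config |
| print(predicted_class, model.config.id2label[predicted_class]) |
| ``` |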
|
|
| ## 5. Training Details |
|
|
| - **Trained records**: 1 / 237,628 (0.00%) |
| - **Learning rate**: 5e-05 |
| - **Epochs**: 3 |
| - **Batch size**: 1 |
| - **Max sequence length**: 512 |
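
The figures above can be checked with a little arithmetic. The sketch below shows why 1 of 237,628 records displays as 0.00% and roughly how many optimizer steps a run over the full planned dataset would take. It assumes no gradient accumulation, a single device, and that the 3 epochs would also apply to the full run; none of these are stated in this card.

```python
import math

planned_records = 237_628
trained_records = 1
batch_size = 1
epochs = 3

# Share of the planned data seen so far, as a percentage.
coverage_pct = 100 * trained_records / planned_records
print(f"coverage: {coverage_pct:.5f}% (displays as {coverage_pct:.2f}%)")

# Optimizer steps for a full run, assuming one step per batch
# (no gradient accumulation, single device).
steps_per_epoch = math.ceil(planned_records / batch_size)
total_steps = steps_per_epoch * epochs
print(f"full run: {steps_per_epoch:,} steps per epoch, {total_steps:,} steps in total")
```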