---
language: en
license: mit
tags:
- security
- ner
- vulnerability-detection
- codebert
- lora
library_name: transformers
pipeline_tag: token-classification
---

# Vulnerability Extractor - CodeBERT with LoRA

## Model Description

This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

**Task**: Token Classification / Named Entity Recognition

**Base Model**: microsoft/codebert-base

**Fine-tuning Method**: LoRA (~98% fewer trainable parameters than full fine-tuning)

## Extracted Entities

- **SOFTWARE**: Software/service names (e.g., Apache, nginx, OpenSSL)
- **VERSION**: Version numbers (e.g., 2.4.49, 1.1.0)
- **ERROR**: Error types (e.g., buffer overflow, authentication failure)
- **EXPLOIT**: Exploit hints (e.g., Heartbleed, path traversal)
- **IP**: IP addresses
- **PORT**: Port numbers
- **USER**: Usernames
- **PATH**: File paths
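
Because the model uses BIO tagging, each of the eight entity types above expands into a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag. The sketch below builds such a label set. The ordering and the `label2id` mapping are assumptions for illustration; check `model.config.id2label` for the mapping the checkpoint actually uses.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]

# BIO scheme: one "O" tag plus a B-/I- pair per entity type.
# The ordering here is illustrative, not necessarily the checkpoint's.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 17
print(labels[:5])   # ['O', 'B-SOFTWARE', 'I-SOFTWARE', 'B-VERSION', 'I-VERSION']
```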

## Performance

- **Entity Recognition F1**: ~0.88
- **Inference Speed**: ~60ms per log (GPU)
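
This card does not state how the F1 score was computed. A common convention for NER is entity-level F1, where a prediction counts as correct only if both its span and its type match a gold entity exactly. The sketch below shows that calculation on made-up spans; the `entity_f1` helper and the example data are hypothetical.

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall and F1 over sets of (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up token spans for one log line: (start, end, type).
gold = [(0, 1, "SOFTWARE"), (1, 2, "VERSION"), (2, 4, "EXPLOIT")]
pred = [(0, 1, "SOFTWARE"), (1, 2, "VERSION"), (2, 3, "EXPLOIT")]  # last span too short

precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
```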

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained("Swapnanil09/vulnerability-extractor")
model = AutoModelForTokenClassification.from_pretrained("Swapnanil09/vulnerability-extractor")

# Tokenize a log line (the model accepts at most 128 tokens)
log = "Apache 2.4.49 path traversal attack attempt detected"
inputs = tokenizer(log, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map token ids and predicted label ids back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Merge BIO-tagged tokens into entities
entities = []
current_entity = None

for token, label in zip(tokens, labels):
    if token in tokenizer.all_special_tokens:
        continue
    text = token.replace('Ġ', ' ')  # 'Ġ' marks a leading space in the BPE vocabulary
    if label.startswith('B-'):
        if current_entity:
            entities.append(current_entity)
        current_entity = {'text': text.strip(), 'type': label[2:]}
    elif label.startswith('I-') and current_entity and label[2:] == current_entity['type']:
        current_entity['text'] += text
    else:
        # 'O' or an inconsistent 'I-' tag ends the current entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)

print(f"Entities: {entities}")
```

## Model Details

- **Parameters**: ~125M (only ~2M trainable with LoRA)
- **Input**: Security log text (max 128 tokens)
- **Output**: Token-level entity labels (BIO tagging)
- **Entity Types**: 8 types + O (outside)
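
The parameter counts above imply that roughly 1.6% of the weights are updated during fine-tuning. The arithmetic is sketched below, together with the standard LoRA count of `rank * (d_in + d_out)` added parameters per adapted weight matrix. The rank and the adapted matrix in the example are assumptions; this card does not state the actual LoRA configuration.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

total_params = 125_000_000      # approximate, from this card
trainable_params = 2_000_000    # approximate, from this card

trainable_fraction = trainable_params / total_params
print(f"trainable: {trainable_fraction:.1%}")      # trainable: 1.6%
print(f"frozen:    {1 - trainable_fraction:.1%}")  # frozen:    98.4%

# Hypothetical example: rank 8 on one 768 x 768 attention projection.
print(lora_added_params(768, 768, rank=8))  # 12288
```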

## Use Cases

1. Automated vulnerability scanning
2. Security log analysis
3. Threat intelligence extraction
4. CVE mapping preparation
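
For the CVE mapping use case, one plausible preparation step is to pair each SOFTWARE entity with the VERSION that directly follows it, producing a product/version key for a later lookup. The sketch below assumes the `entities` list format produced in the Usage section; the `pair_software_versions` helper is hypothetical and performs no CVE lookup itself.

```python
def pair_software_versions(entities):
    """Pair each SOFTWARE entity with the next VERSION entity, if one directly follows."""
    pairs = []
    for current, nxt in zip(entities, entities[1:] + [None]):
        if current["type"] != "SOFTWARE":
            continue
        version = nxt["text"] if nxt and nxt["type"] == "VERSION" else None
        pairs.append({"software": current["text"], "version": version})
    return pairs

# Hand-written entities in the same shape as the Usage example's output.
entities = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
]
print(pair_software_versions(entities))
# [{'software': 'Apache', 'version': '2.4.49'}]
```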

## Limitations

- Extraction accuracy depends on the log format
- May miss entities with unusual formatting
- Recognizes only the eight entity types listed above

## Citation

```bibtex
@misc{vulnerability-extractor,
  author = {Your Name},
  title = {Vulnerability Extractor with CodeBERT and LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}
```

## License

MIT License