---
language: en
license: mit
tags:
- security
- ner
- vulnerability-detection
- codebert
- lora
library_name: transformers
pipeline_tag: token-classification
---

# Vulnerability Extractor - CodeBERT with LoRA

## Model Description

This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

**Task**: Token Classification / Named Entity Recognition

**Base Model**: microsoft/codebert-base

**Fine-tuning Method**: LoRA (~98% fewer trainable parameters: ~2M of ~125M)

## Extracted Entities

- **SOFTWARE**: Software/service names (e.g., Apache, nginx, OpenSSL)
- **VERSION**: Version numbers (e.g., 2.4.49, 1.1.0)
- **ERROR**: Error types (e.g., buffer overflow, authentication failure)
- **EXPLOIT**: Exploit hints (e.g., Heartbleed, path traversal)
- **IP**: IP addresses
- **PORT**: Port numbers
- **USER**: Usernames
- **PATH**: File paths

## Performance

- **Entity Recognition F1**: ~0.88
- **Inference Speed**: ~60 ms per log (GPU)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained("Swapnanil09/vulnerability-extractor")
model = AutoModelForTokenClassification.from_pretrained("Swapnanil09/vulnerability-extractor")
model.eval()  # inference mode (disables dropout)

# Run inference on one log line; inputs are truncated to the 128-token limit
log = "Apache 2.4.49 path traversal attack attempt detected"
inputs = tokenizer(log, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map token IDs and predicted class IDs back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Merge BIO-tagged tokens into entity spans
entities = []
current_entity = None
for token, label in zip(tokens, labels):
    if token in ['<s>', '</s>', '<pad>']:  # skip RoBERTa-style special tokens
        continue
    if label.startswith('B-'):
        # A B- tag starts a new entity; flush the previous one first
        if current_entity:
            entities.append(current_entity)
        current_entity = {'text': token.replace('Ġ', ' ').strip(), 'type': label[2:]}
    elif label.startswith('I-') and current_entity:
        # An I- tag continues the open entity ('Ġ' marks a leading space)
        current_entity['text'] += token.replace('Ġ', ' ')
    elif current_entity:
        # An O tag closes the open entity
        entities.append(current_entity)
        current_entity = None
if current_entity:
    entities.append(current_entity)

print(f"Entities: {entities}")
```

## Model Details

- **Parameters**: ~125M (only ~2M trainable with LoRA)
- **Input**: Security log text (max 128 tokens)
- **Output**: Token-level entity labels (BIO tagging)
- **Entity Types**: 8 types + O (outside)

## Use Cases

1. Automated vulnerability scanning
2. Security log analysis
3. Threat intelligence extraction
4. CVE mapping preparation

## Limitations

- Entity extraction accuracy depends on log format
- May miss entities with unusual formatting
- Trained on specific entity types only

## Citation

```bibtex
@misc{vulnerability-extractor,
  author = {Your Name},
  title = {Vulnerability Extractor with CodeBERT and LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}
```

## License

MIT License
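
## Example: Decoding BIO Tags Without the Model

The entity-merging step from the Usage section can be tried in isolation, with no model download. The sketch below is a minimal illustration under stated assumptions: `decode_bio` and `SPECIAL_TOKENS` are hypothetical names introduced here, the tokens and labels are hand-written rather than real model output, and the actual tokenizer may split the log line differently.

```python
from typing import Dict, List

# Assumption: RoBERTa-style special tokens, as used by CodeBERT's tokenizer
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def decode_bio(tokens: List[str], labels: List[str]) -> List[Dict[str, str]]:
    """Group (token, label) pairs into entity spans using BIO tags."""
    entities: List[Dict[str, str]] = []
    current = None
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        # "Ġ" marks a leading space in byte-level BPE vocabularies
        piece = token.replace("Ġ", " ")
        if label.startswith("B-"):
            # B- starts a new entity; flush any open one first
            if current:
                entities.append(current)
            current = {"text": piece.strip(), "type": label[2:]}
        elif label.startswith("I-") and current and label[2:] == current["type"]:
            # I- of the same type extends the open entity
            current["text"] += piece
        else:
            # O (or a mismatched I-) closes the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


# Hand-written example input (illustrative only, not real model output)
tokens = ["<s>", "Apache", "Ġ2", ".", "4", ".", "49",
          "Ġpath", "Ġtraversal", "Ġattack", "</s>"]
labels = ["O", "B-SOFTWARE", "B-VERSION", "I-VERSION", "I-VERSION",
          "I-VERSION", "I-VERSION", "B-EXPLOIT", "I-EXPLOIT", "O", "O"]

entities = decode_bio(tokens, labels)
print(entities)
```

Unlike a loop that only checks for `B-` and `I-` prefixes, this variant closes the open entity on an `O` tag or on an `I-` tag of a different type, so a stray continuation tag later in the sequence is not appended to an earlier entity.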
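
## Example: Label Space and Trainable-Parameter Arithmetic

The figures in Model Details can be checked with a few lines of arithmetic. The sketch below assumes a conventional BIO scheme with one `B-` and one `I-` tag per entity type plus a single `O` tag; the label order is illustrative and may not match the model's actual `id2label` mapping. The parameter counts are the approximate values quoted in this card, and `build_bio_labels` is a hypothetical helper.

```python
# Entity types listed in this model card
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT",
                "IP", "PORT", "USER", "PATH"]


def build_bio_labels(entity_types):
    """Build a BIO label list: 'O' plus B-/I- tags for each entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return labels


labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

# Approximate counts quoted in the card (assumed round numbers)
total_params = 125_000_000
trainable_params = 2_000_000
reduction_pct = 100 * (1 - trainable_params / total_params)

print(f"Number of labels: {len(labels)}")
print(f"Trainable-parameter reduction: {reduction_pct:.1f}%")
```

Under these assumptions the classification head has 2 × 8 + 1 = 17 output classes, and training ~2M of ~125M parameters corresponds to the roughly 98% reduction stated for LoRA.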