---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- code
- cyber
---

# Transformer

This is a Transformers model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.

## Model Details

### Model Description

- **Developed by:** Anvilogic
- **Model Type:** Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Finetuned from model:** [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Full Model Architecture

```
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
```

## Usage

### Direct Usage

First install the Transformers library:

```bash
pip install -U transformers
```

Then you can load this model and run inference.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # binary classification

# Example URLs
sentences = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Raw predictions

predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming label 1 = malicious, label 0 = benign)
```

### Downstream Usage

This model enables real-time malicious URL detection with a lightweight architecture, supporting large-scale inference for phishing prevention and cybersecurity monitoring.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs together with their labels. The dataset was filtered and converted to the Parquet format for efficient processing.

### Training Procedure

The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).

#### Training Hyperparameters

- **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Batch Size**: 32
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Warmup Steps**: 100

A sketch of a comparable fine-tuning setup appears at the end of this card.

## Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

**Binary Classification Evaluator**

```
Accuracy          : 0.9744
F1 Score          : 0.9742
Precision         : 0.9771
Recall            : 0.9712
Average Precision : 0.9962
```

These results indicate the model's high performance in identifying malicious URLs, with strong precision and recall scores that make it well-suited for cybersecurity applications.
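For reference, the sketch below shows one way to compute metrics of this kind with scikit-learn. It is not the evaluation script used to produce the numbers above: the `test_urls` / `test_labels` variables are placeholders for a real held-out labeled split, and the assumption that class index 1 corresponds to "malicious" should be checked against the model's `id2label` mapping.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    average_precision_score,
)

model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Placeholder held-out split; replace with the real labeled test set.
test_urls = ["paypal.com.secure-login.xyz", "google.com"]
test_labels = [1, 0]

all_scores, all_preds = [], []
with torch.no_grad():
    for i in range(0, len(test_urls), 32):  # batch size 32, matching the training setup
        batch = tokenizer(
            test_urls[i : i + 32], padding=True, truncation=True, return_tensors="pt"
        )
        logits = model(**batch).logits
        probs = torch.softmax(logits, dim=-1)[:, 1]  # assumed probability of the malicious class
        all_scores.extend(probs.tolist())
        all_preds.extend(logits.argmax(dim=-1).tolist())

print("Accuracy         :", accuracy_score(test_labels, all_preds))
print("F1 Score         :", f1_score(test_labels, all_preds))
print("Precision        :", precision_score(test_labels, all_preds))
print("Recall           :", recall_score(test_labels, all_preds))
print("Average Precision:", average_precision_score(test_labels, all_scores))
```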
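For completeness, here is a minimal fine-tuning sketch that mirrors the hyperparameters listed under Training Hyperparameters. It is not the original training pipeline: the `url` / `label` column names and the `train` split name are assumptions about [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), and the sketch relies on the default cross-entropy loss of `AutoModelForSequenceClassification` rather than the BCELoss mentioned above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert/distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Column names ("url", "label") and the "train" split are assumptions;
# check the dataset card for the actual schema.
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="url-guardian-finetune",
    per_device_train_batch_size=32,  # Batch Size: 32
    num_train_epochs=3,              # Epochs: 3
    learning_rate=2e-5,              # Learning Rate: 2e-5
    warmup_steps=100,                # Warmup Steps: 100
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```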