---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- code
- cyber
---

# Transformer

This is a Transformers model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.

## Model Details

### Model Description

- **Developed by:** Anvilogic
- **Model Type:** Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Finetuned from model:** [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Full Model Architecture

```
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
```

## Usage

### Direct Usage

First install the Transformers library:

```bash
pip install -U transformers
```

Then you can load this model and run inference.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # binary classification

# Example URLs
sentences = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Raw predictions

predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming label 1 = malicious, label 0 = benign)
```

### Downstream Usage

This model enables real-time malicious URL detection with a lightweight architecture, supporting large-scale inference for phishing prevention and cybersecurity monitoring.

## Training Details

### Framework Versions

- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs together with their labels. The dataset was filtered and converted to the Parquet format for efficient processing.

### Training Procedure

The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).

#### Training Hyperparameters

- **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
- **Batch Size**: 32
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Warmup Steps**: 100

A sketch of a comparable fine-tuning setup appears at the end of this card.

## Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

**Binary Classification Evaluator**

```
Accuracy          : 0.9744
F1 Score          : 0.9742
Precision         : 0.9771
Recall            : 0.9712
Average Precision : 0.9962
```

These results indicate the model's high performance in identifying malicious URLs, with strong precision and recall scores that make it well-suited for cybersecurity applications.
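For reference, the sketch below shows one way to compute metrics of this kind with scikit-learn. It is not the evaluation script used to produce the numbers above: the `test_urls` / `test_labels` variables are placeholders for a real held-out labeled split, and the assumption that class index 1 corresponds to "malicious" should be checked against the model's `id2label` mapping.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    average_precision_score,
)

model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Placeholder held-out split; replace with the real labeled test set.
test_urls = ["paypal.com.secure-login.xyz", "google.com"]
test_labels = [1, 0]

all_scores, all_preds = [], []
with torch.no_grad():
    for i in range(0, len(test_urls), 32):  # batch size 32, matching the training setup
        batch = tokenizer(
            test_urls[i : i + 32], padding=True, truncation=True, return_tensors="pt"
        )
        logits = model(**batch).logits
        probs = torch.softmax(logits, dim=-1)[:, 1]  # assumed probability of the malicious class
        all_scores.extend(probs.tolist())
        all_preds.extend(logits.argmax(dim=-1).tolist())

print("Accuracy         :", accuracy_score(test_labels, all_preds))
print("F1 Score         :", f1_score(test_labels, all_preds))
print("Precision        :", precision_score(test_labels, all_preds))
print("Recall           :", recall_score(test_labels, all_preds))
print("Average Precision:", average_precision_score(test_labels, all_scores))
```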
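For completeness, here is a minimal fine-tuning sketch that mirrors the hyperparameters listed under Training Hyperparameters. It is not the original training pipeline: the `url` / `label` column names and the `train` split name are assumptions about [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), and the sketch relies on the default cross-entropy loss of `AutoModelForSequenceClassification` rather than the BCELoss mentioned above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert/distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Column names ("url", "label") and the "train" split are assumptions;
# check the dataset card for the actual schema.
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="url-guardian-finetune",
    per_device_train_batch_size=32,  # Batch Size: 32
    num_train_epochs=3,              # Epochs: 3
    learning_rate=2e-5,              # Learning Rate: 2e-5
    warmup_steps=100,                # Warmup Steps: 100
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```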