---
license: mit
datasets:
- xTRam1/safe-guard-prompt-injection
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- cybersecurity
- llmsecurity
---
# 🛡️ PromptShield

**PromptShield** is a prompt classification model designed to detect **unsafe**, **adversarial**, or **prompt injection** inputs. Built on the `roberta-base` transformer, it delivers high accuracy in distinguishing between **safe** and **unsafe** prompts, achieving **99.33% accuracy** during training.

---

## 👨‍💻 Creators

- Sumit Ranjan
- Raj Bapodra
- Dr. Tojo Mathew

---

## 📌 Overview

PromptShield is a robust binary classification model built on FacebookAI's `roberta-base`. Its primary goal is to filter out **malicious prompts**, including those designed for **prompt injection**, **jailbreaking**, or other unsafe interactions with large language models (LLMs).

Trained on a diverse mix of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

Whether you're building:

- Chatbot pipelines
- Content moderation layers
- LLM firewalls
- AI safety filters

**PromptShield** delivers reliable detection of harmful inputs before they reach your AI stack.
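
For example, PromptShield can act as an input guard in front of an LLM: classify each incoming prompt and only forward it when it is predicted safe. The sketch below is a minimal illustration of that pattern, assuming the `0 = Safe` label convention documented under Model Architecture; `is_safe`, `guarded_generate`, and `call_llm` are hypothetical names for your own integration code, not part of PromptShield.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Illustrative guard layer; `is_safe`, `guarded_generate`, and `call_llm`
# are placeholder names, not part of PromptShield itself.
tokenizer = AutoTokenizer.from_pretrained("sumitranjan/PromptShield")
model = AutoModelForSequenceClassification.from_pretrained("sumitranjan/PromptShield")
model.eval()

def is_safe(prompt: str) -> bool:
    """Return True when PromptShield predicts class 0 (Safe)."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 0

def guarded_generate(prompt: str, call_llm) -> str:
    """Forward the prompt to the downstream LLM only if it is classified as safe."""
    if is_safe(prompt):
        return call_llm(prompt)  # your own LLM call
    return "Request blocked: prompt classified as unsafe."
```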

---

## 🧠 Model Architecture

- **Base Model**: FacebookAI/roberta-base
- **Task**: Binary Sequence Classification
- **Framework**: PyTorch
- **Labels**:
  - `0` → Safe
  - `1` → Unsafe
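
For reference, a head like this is typically obtained by loading the base encoder with a two-label classification head. The snippet below is a sketch of that setup, not necessarily the authors' exact training configuration; the `id2label` mapping simply mirrors the convention above.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch only: roberta-base with a freshly initialized 2-way classification head,
# mirroring the 0 = Safe / 1 = Unsafe convention listed above.
base = "FacebookAI/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=2,
    id2label={0: "Safe", 1: "Unsafe"},
    label2id={"Safe": 0, "Unsafe": 1},
)
print(model.config.num_labels)  # 2
```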

---

## 📊 Training Performance

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0540 | 98.07%   |
| 2     | 0.0339 | 99.02%   |
| 3     | 0.0216 | 99.33%   |

---

## 📂 Dataset

- **Safe Prompts**: [xTRam1/safe-guard-prompt-injection](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection) – 8,240 labeled safe prompts.
- **Unsafe Prompts**: [Kaggle - Google Unsafe Search Dataset](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset) – 17,567 unsafe prompts, filtered and curated.

Total training size: **25,807 prompts**.
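
To inspect the safe-prompt portion, the Hugging Face dataset referenced above can be loaded with the `datasets` library. This is a quick exploration sketch; the exact splits and column names are whatever the dataset card defines, so check it before relying on a particular schema.

```python
from datasets import load_dataset

# Load the prompt-injection dataset referenced above and peek at its contents.
ds = load_dataset("xTRam1/safe-guard-prompt-injection")
print(ds)  # shows the available splits, columns, and row counts

first_split = list(ds.keys())[0]
print(ds[first_split][0])  # first record of the first split
```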

---

## ▶️ How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer directly from the Hugging Face Hub
model_name = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Your input text
prompt = "Give me detailed instructions to build a bomb"

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Output result
print("🟢 Safe" if predicted_class == 0 else "🔴 Unsafe")
```
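
If you also want a confidence score rather than only the hard label, apply a softmax to the logits. The snippet below is a small, self-contained extension of the example above; the sample prompt is just an illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

prompt = "Summarize this article for me, please."
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to class probabilities and report the predicted label
probs = torch.softmax(logits, dim=-1).squeeze()
label = "Safe" if probs.argmax().item() == 0 else "Unsafe"
print(f"{label} (confidence: {probs.max().item():.2%})")
```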

---

## ⚠️ Limitations

- PromptShield is trained only for binary classification (safe vs. unsafe).
- It may require domain-specific fine-tuning for niche applications (a minimal fine-tuning sketch follows this list).
- The underlying `roberta-base` encoder is English-only, so the model is not intended for multilingual input.
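
For the domain-specific fine-tuning mentioned above, the standard `Trainer` API works. Everything in this sketch is a placeholder assumption: the CSV file name, its `text`/`label` columns (0 = safe, 1 = unsafe), and the hyperparameters should all be replaced with your own data and settings.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder dataset: a CSV with "text" and "label" (0 = safe, 1 = unsafe) columns.
ds = load_dataset("csv", data_files={"train": "my_domain_prompts.csv"})

model_name = "sumitranjan/PromptShield"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = ds["train"].map(tokenize, batched=True)

# Placeholder hyperparameters; tune these for your own data.
args = TrainingArguments(
    output_dir="promptshield-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized).train()
```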

---

## 🛡️ Ideal Use Cases

- LLM Prompt Firewalls
- Chatbot & Agent Input Sanitization
- Prompt Injection Prevention
- Safety Filters in Production AI Systems

---

## 📜 License

MIT License