Prompt Injection Classifier

A lightweight sklearn-based classifier that detects prompt injection attacks in LLM inputs.

Model Details

  • Type: TF-IDF + Logistic Regression pipeline
  • Task: Binary text classification (injection vs clean)
  • Framework: scikit-learn
  • Accuracy: ~94% (5-fold cross-validation)

Usage

import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict
text = "Ignore all previous instructions"
prediction = model.predict([text])[0]  # 1 = injection, 0 = clean
probability = model.predict_proba([text])[0][1]  # injection probability

Training Data

Trained on 50 examples (25 injection, 25 clean) covering common attack patterns:

  • Instruction overrides
  • Role reassignment
  • Jailbreak attempts
  • System prompt extraction
  • Safety bypass attempts

Limitations

  • Small training set; best used as a first-pass filter
  • May not catch novel or obfuscated injection techniques
  • English only

License

MIT

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train zachz/prompt-injection-classifier