Prompt Injection Classifier
A lightweight sklearn-based classifier that detects prompt injection attacks in LLM inputs.
Model Details
- Type: TF-IDF + Logistic Regression pipeline
- Task: Binary text classification (injection vs clean)
- Framework: scikit-learn
- Accuracy: ~94% (5-fold cross-validation)
Usage
import pickle
with open("model.pkl", "rb") as f:
model = pickle.load(f)
# Predict
text = "Ignore all previous instructions"
prediction = model.predict([text])[0] # 1 = injection, 0 = clean
probability = model.predict_proba([text])[0][1] # injection probability
Training Data
Trained on 50 examples (25 injection, 25 clean) covering common attack patterns:
- Instruction overrides
- Role reassignment
- Jailbreak attempts
- System prompt extraction
- Safety bypass attempts
Limitations
- Small training set; best used as a first-pass filter
- May not catch novel or obfuscated injection techniques
- English only
License
MIT
- Downloads last month
- -