jailbreak-embeddings-large-onnx

ONNX export of the multilingual-e5-large-wjb-threatfeed_v1 model — a fine-tuned sentence-transformers model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.

It maps prompts to a 1024-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.

This model achieves a +59.5% F1 improvement over the OpenAI text-embedding-3-large baseline on duplicate detection, and is the best-performing model in the series.

Model Details

Model Description

  • Model Type: Sentence Transformer (two-stage fine-tuned), exported to ONNX
  • Base Model: intfloat/multilingual-e5-large (~560M parameters)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Multilingual (XLM-RoBERTa backbone)
  • Format: ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)

Embedding Pipeline

Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding

The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).

Model Inputs

The ONNX model requires 3 inputs:

  • input_ids: Token IDs from tokenizer
  • attention_mask: 1 for real tokens, 0 for padding
  • token_type_ids: All zeros for single-sentence embeddings

ONNX Verification

The ONNX export produces numerically near-identical embeddings to the native sentence-transformers model (maximum element-wise difference of 1e-6 across all test sentences).
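A parity check of this kind can be sketched as below; `ref` here is a toy stand-in for embedding matrices produced by the two pipelines (hypothetical names, not part of this repository):

```python
import numpy as np

def embeddings_match(ref, exported, tol=1e-6):
    """Return True if two embedding matrices agree within an
    element-wise absolute tolerance."""
    return float(np.max(np.abs(ref - exported))) <= tol

# toy stand-ins for the native and ONNX pipeline outputs
ref = np.random.default_rng(0).standard_normal((2, 1024))
print(embeddings_match(ref, ref + 5e-7))  # True: within the 1e-6 bound
print(embeddings_match(ref, ref + 1e-3))  # False: exceeds the bound
```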

Intended Use

This model is designed for:

  • Duplicate detection in AI security vulnerability reports (jailbreak/prompt injection attacks)
  • Semantic similarity comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
  • Embedding generation for LSH-based similarity search in vulnerability management systems
  • Edge/server deployment via ONNX runtime without requiring PyTorch

The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.
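Concretely, once two prompts are embedded, the duplicate decision reduces to a dot product against a threshold: the embeddings are already L2-normalized, so the dot product is the cosine similarity, and 0.80 is this model's optimal threshold per the evaluation below. A minimal sketch with toy 2-d unit vectors in place of real 1024-d embeddings:

```python
import numpy as np

def is_duplicate(emb_a, emb_b, threshold=0.80):
    """Flag a pair as duplicate when the cosine similarity of two
    L2-normalized embeddings (a plain dot product) reaches the threshold."""
    return float(np.dot(emb_a, emb_b)) >= threshold

# unit-norm toy vectors standing in for real prompt embeddings
a = np.array([1.0, 0.0])
b = np.array([0.8, 0.6])   # cosine similarity 0.8
c = np.array([0.6, 0.8])   # cosine similarity 0.6
print(is_duplicate(a, b))  # True
print(is_duplicate(a, c))  # False
```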

Usage

sentence-transformers (with ONNX backend)

from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-large-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)

Python (onnxruntime)

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 1024]

# Mean pooling (mask out padding; clamp the divisor to guard against all-padding rows)
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity (a plain dot product, since the embeddings are L2-normalized)
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
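The LSH-based search mentioned under Intended Use can be sketched with random-hyperplane hashing (SimHash); the 64-bit signature width and the seeds below are illustrative assumptions, not taken from the production system:

```python
import numpy as np

D, BITS = 1024, 64          # embedding dimensionality; signature width (illustrative)
rng = np.random.default_rng(0)
planes = rng.standard_normal((D, BITS))

def lsh_signature(emb):
    """Random-hyperplane (SimHash) signature: one bit per hyperplane side."""
    return emb @ planes > 0

def hamming(sig_a, sig_b):
    """Hamming distance between signatures approximates angular distance."""
    return int(np.count_nonzero(sig_a != sig_b))

v = np.random.default_rng(1).standard_normal(D)
v /= np.linalg.norm(v)
print(hamming(lsh_signature(v), lsh_signature(v)))   # 0: identical vectors collide
print(hamming(lsh_signature(v), lsh_signature(-v)))  # 64: opposite vectors differ on every bit
```

Near-duplicate embeddings land in the same (or nearby) buckets, so candidate pairs can be generated without comparing every embedding against every other.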

Rust (tract-onnx)

use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
// (tract may require concrete input shapes via `with_input_fact`
// before `into_optimized()` when the ONNX graph uses dynamic dims)
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)

Training Details

This model was trained using a two-stage fine-tuning approach:

Stage 1: WildJailbreak Pre-training

Pre-trained on public synthetic data to learn jailbreak semantics.

  • Dataset: Allen AI WildJailbreak — vanilla-adversarial prompt pairs
  • Pairs: 161,396 positive pairs (same intent, different formulation)
  • Split: 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
  • Batch size: 16 (per device) x 2 gradient accumulation steps = 32 effective
  • Learning rate: 1e-5
  • FP16: True
  • Purpose: Teach the model to see through jailbreak wrappers and match prompts by underlying intent
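The in-batch-negatives objective can be sketched in numpy as follows; the similarity scale of 20.0 is the sentence-transformers default for this loss, an assumption not stated in this card:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss sketch: cross-entropy over the scaled
    cosine-similarity matrix with the diagonal (matched pairs) as targets;
    every other positive in the batch acts as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                      # [batch, batch]
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# aligned pairs score a much lower loss than mismatched ones
batch = np.eye(4)
print(mnr_loss(batch, batch) < mnr_loss(batch, np.roll(batch, 1, axis=0)))  # True
```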

Stage 2: Threat Feed Fine-tuning

Fine-tuned on annotated pairs from the internal 0din threat feed.

  • Pairs: 9,598 annotated pairs (7,678 train / 958 val / 962 test)
  • Label Distribution: ~34% duplicates / ~66% non-duplicates
  • Annotation: Google Gemini 2.5 Pro (single-model annotation)
  • Source Similarity Threshold: Candidate pairs generated with Thor similarity >= 0.5
  • Loss: ContrastiveLoss (cosine distance, margin=0.5)
  • Purpose: Calibrate the model for real-world duplicate detection on production vulnerability data
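The stage-2 objective can be sketched per pair; the 0.5 scaling factor follows the sentence-transformers ContrastiveLoss implementation (an assumption, not stated in this card):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, label, margin=0.5):
    """Contrastive loss on cosine distance d = 1 - cos(a, b):
    duplicates (label=1) are pulled together via d^2, non-duplicates
    (label=0) are pushed beyond the margin via max(0, margin - d)^2."""
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    d = 1.0 - cos
    return 0.5 * (label * d**2 + (1 - label) * max(0.0, margin - d) ** 2)

v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(contrastive_loss(v, v, label=1))  # 0.0: identical duplicates cost nothing
print(contrastive_loss(v, w, label=0))  # 0.0: orthogonal non-duplicates clear the margin
print(contrastive_loss(v, v, label=0))  # 0.125: identical non-duplicates are penalized
```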

Stage 2 Hyperparameters

Parameter                 Value
Epochs                    50 (early stopped)
Batch size                8 (per device) x 4 gradient accumulation = 32 effective
Learning rate             1e-5
LR scheduler              Linear
Warmup ratio              0.1
Weight decay              0.01
FP16                      True
Early stopping patience   10
Eval steps                50
Seed                      1

Evaluation Results

Duplicate Detection Performance

Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:

Model                                           Best F1   Threshold   Precision   Recall
OpenAI text-embedding-3-large (baseline)        0.462     0.80        1.000       0.300
Finetuned V1 (WildJailbreak only, e5-small)     0.500     0.50        0.333       1.000
Finetuned V2 (WJB + threat feed v1, e5-small)   0.526     0.70        0.556       0.500
Finetuned V3 (WJB + threat feed v2, e5-small)   0.556     0.75        0.625       0.500
Finetuned V4 (WJB + threat feed 10k, e5-small)  0.600     0.70        0.600       0.600
Finetuned Base V1 (e5-base)                     0.696     0.70        0.615       0.800
This model (Large V1)                           0.737     0.80        0.778       0.700

Threshold Analysis (This Model)

Threshold   Precision   Recall   F1      TP   FP   FN   TN
0.50        0.250       0.900    0.391    9   27    1   18
0.55        0.310       0.900    0.462    9   20    1   25
0.60        0.346       0.900    0.500    9   17    1   28
0.65        0.391       0.900    0.545    9   14    1   31
0.70        0.500       0.800    0.615    8    8    2   37
0.75        0.615       0.800    0.696    8    5    2   40
0.80        0.778       0.700    0.737    7    2    3   43
0.85        1.000       0.400    0.571    4    0    6   45
0.90        1.000       0.200    0.333    2    0    8   45
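The metric columns follow directly from the confusion counts; for example, re-deriving the threshold-0.80 row:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# threshold 0.80 row: TP=7, FP=2, FN=3
p, r, f1 = prf1(7, 2, 3)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.778 0.700 0.737
```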

Key Findings

  • +59.5% F1 improvement over the OpenAI text-embedding-3-large baseline (0.737 vs 0.462)
  • Best in series: continues the scaling trend from e5-small (0.600) through e5-base (0.696) to e5-large (0.737).
  • Highest precision at optimal threshold: 0.778 precision with only 2 false positives, compared to 0.615 for e5-base at its optimal threshold.
  • Precision-recall tradeoff vs e5-base: Trades a small amount of recall (0.700 vs 0.800) for a significant precision gain (0.778 vs 0.615), resulting in a better-balanced F1.
  • Higher optimal threshold (0.80): The larger model produces more confident and well-separated similarity scores, allowing a higher decision threshold while maintaining strong performance.
  • Strong recall at lower thresholds: Maintains 0.900 recall across thresholds 0.50–0.65, indicating very few true duplicates are missed at permissive thresholds.

Note: The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.

Limitations

  • Small evaluation set: Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
  • LLM annotation bias in training data: Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
  • Model size: ~560M parameters with 1024-dim embeddings. The ONNX model is ~2.1GB.
  • Domain-specific: Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}

WildJailbreak

@article{jiang2024wildteaming,
    title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author={Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal={arXiv preprint arXiv:2406.18510},
    year={2024}
}