jailbreak-embeddings-large-onnx

ONNX export of the multilingual-e5-large-wjb-threatfeed_v1 model — a fine-tuned sentence-transformers model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.

It maps prompts to a 1024-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.

This model achieves a +59.5% F1 improvement over the OpenAI text-embedding-3-large baseline on duplicate detection, and is the best-performing model in the series.

Model Details

Model Description

  • Model Type: Sentence Transformer (two-stage fine-tuned), exported to ONNX
  • Base Model: intfloat/multilingual-e5-large (~560M parameters)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Multilingual (XLM-RoBERTa backbone)
  • Format: ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)

Embedding Pipeline

Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding

The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).

Model Inputs

The ONNX model requires 3 inputs:

  • input_ids: Token IDs from tokenizer
  • attention_mask: 1 for real tokens, 0 for padding
  • token_type_ids: All zeros for single-sentence embeddings

ONNX Verification

The ONNX export produces numerically near-identical embeddings to the native sentence-transformers model (maximum element-wise difference of 1e-6 across all test sentences).
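A parity check of this kind can be sketched as below; `ref` here is a toy stand-in for embedding matrices produced by the two pipelines (hypothetical names, not part of this repository):

```python
import numpy as np

def embeddings_match(ref, exported, tol=1e-6):
    """Return True if two embedding matrices agree within an
    element-wise absolute tolerance."""
    return float(np.max(np.abs(ref - exported))) <= tol

# toy stand-ins for the native and ONNX pipeline outputs
ref = np.random.default_rng(0).standard_normal((2, 1024))
print(embeddings_match(ref, ref + 5e-7))  # True: within the 1e-6 bound
print(embeddings_match(ref, ref + 1e-3))  # False: exceeds the bound
```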

Intended Use

This model is designed for:

  • Duplicate detection in AI security vulnerability reports (jailbreak/prompt injection attacks)
  • Semantic similarity comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
  • Embedding generation for LSH-based similarity search in vulnerability management systems
  • Edge/server deployment via ONNX runtime without requiring PyTorch

The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.
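Concretely, once two prompts are embedded, the duplicate decision reduces to a dot product against a threshold: the embeddings are already L2-normalized, so the dot product is the cosine similarity, and 0.80 is this model's optimal threshold per the evaluation below. A minimal sketch with toy 2-d unit vectors in place of real 1024-d embeddings:

```python
import numpy as np

def is_duplicate(emb_a, emb_b, threshold=0.80):
    """Flag a pair as duplicate when the cosine similarity of two
    L2-normalized embeddings (a plain dot product) reaches the threshold."""
    return float(np.dot(emb_a, emb_b)) >= threshold

# unit-norm toy vectors standing in for real prompt embeddings
a = np.array([1.0, 0.0])
b = np.array([0.8, 0.6])   # cosine similarity 0.8
c = np.array([0.6, 0.8])   # cosine similarity 0.6
print(is_duplicate(a, b))  # True
print(is_duplicate(a, c))  # False
```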

Usage

sentence-transformers (with ONNX backend)

from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-large-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)

Python (onnxruntime)

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 1024]

# Mean pooling (mask out padding; clamp the divisor to guard against all-padding rows)
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity (a plain dot product, since the embeddings are L2-normalized)
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
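The LSH-based search mentioned under Intended Use can be sketched with random-hyperplane hashing (SimHash); the 64-bit signature width and the seeds below are illustrative assumptions, not taken from the production system:

```python
import numpy as np

D, BITS = 1024, 64          # embedding dimensionality; signature width (illustrative)
rng = np.random.default_rng(0)
planes = rng.standard_normal((D, BITS))

def lsh_signature(emb):
    """Random-hyperplane (SimHash) signature: one bit per hyperplane side."""
    return emb @ planes > 0

def hamming(sig_a, sig_b):
    """Hamming distance between signatures approximates angular distance."""
    return int(np.count_nonzero(sig_a != sig_b))

v = np.random.default_rng(1).standard_normal(D)
v /= np.linalg.norm(v)
print(hamming(lsh_signature(v), lsh_signature(v)))   # 0: identical vectors collide
print(hamming(lsh_signature(v), lsh_signature(-v)))  # 64: opposite vectors differ on every bit
```

Near-duplicate embeddings land in the same (or nearby) buckets, so candidate pairs can be generated without comparing every embedding against every other.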

Rust (tract-onnx)

use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
// (tract may require concrete input shapes via `with_input_fact`
// before `into_optimized()` when the ONNX graph uses dynamic dims)
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)

Training Details

This model was trained using a two-stage fine-tuning approach:

Stage 1: WildJailbreak Pre-training

Pre-trained on public synthetic data to learn jailbreak semantics.

  • Dataset: Allen AI WildJailbreak — vanilla-adversarial prompt pairs
  • Pairs: 161,396 positive pairs (same intent, different formulation)
  • Split: 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
  • Batch size: 16 (per device) x 2 gradient accumulation steps = 32 effective
  • Learning rate: 1e-5
  • FP16: True
  • Purpose: Teach the model to see through jailbreak wrappers and match prompts by underlying intent
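The in-batch-negatives objective can be sketched in numpy as follows; the similarity scale of 20.0 is the sentence-transformers default for this loss, an assumption not stated in this card:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss sketch: cross-entropy over the scaled
    cosine-similarity matrix with the diagonal (matched pairs) as targets;
    every other positive in the batch acts as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                      # [batch, batch]
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# aligned pairs score a much lower loss than mismatched ones
batch = np.eye(4)
print(mnr_loss(batch, batch) < mnr_loss(batch, np.roll(batch, 1, axis=0)))  # True
```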

Stage 2: Threat Feed Fine-tuning

Fine-tuned on annotated pairs from the internal 0din threat feed.

  • Pairs: 9,598 annotated pairs (7,678 train / 958 val / 962 test)
  • Label Distribution: ~34% duplicates / ~66% non-duplicates
  • Annotation: Google Gemini 2.5 Pro (single-model annotation)
  • Source Similarity Threshold: Candidate pairs generated with Thor similarity >= 0.5
  • Loss: ContrastiveLoss (cosine distance, margin=0.5)
  • Purpose: Calibrate the model for real-world duplicate detection on production vulnerability data
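The stage-2 objective can be sketched per pair; the 0.5 scaling factor follows the sentence-transformers ContrastiveLoss implementation (an assumption, not stated in this card):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, label, margin=0.5):
    """Contrastive loss on cosine distance d = 1 - cos(a, b):
    duplicates (label=1) are pulled together via d^2, non-duplicates
    (label=0) are pushed beyond the margin via max(0, margin - d)^2."""
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    d = 1.0 - cos
    return 0.5 * (label * d**2 + (1 - label) * max(0.0, margin - d) ** 2)

v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(contrastive_loss(v, v, label=1))  # 0.0: identical duplicates cost nothing
print(contrastive_loss(v, w, label=0))  # 0.0: orthogonal non-duplicates clear the margin
print(contrastive_loss(v, v, label=0))  # 0.125: identical non-duplicates are penalized
```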

Stage 2 Hyperparameters

Parameter                 Value
Epochs                    50 (early stopped)
Batch size                8 (per device) x 4 gradient accumulation = 32 effective
Learning rate             1e-5
LR scheduler              Linear
Warmup ratio              0.1
Weight decay              0.01
FP16                      True
Early stopping patience   10
Eval steps                50
Seed                      1

Evaluation Results

Duplicate Detection Performance

Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:

Model                                           Best F1   Threshold   Precision   Recall
OpenAI text-embedding-3-large (baseline)        0.462     0.80        1.000       0.300
Finetuned V1 (WildJailbreak only, e5-small)     0.500     0.50        0.333       1.000
Finetuned V2 (WJB + threat feed v1, e5-small)   0.526     0.70        0.556       0.500
Finetuned V3 (WJB + threat feed v2, e5-small)   0.556     0.75        0.625       0.500
Finetuned V4 (WJB + threat feed 10k, e5-small)  0.600     0.70        0.600       0.600
Finetuned Base V1 (e5-base)                     0.696     0.70        0.615       0.800
This model (Large V1)                           0.737     0.80        0.778       0.700

Threshold Analysis (This Model)

Threshold   Precision   Recall   F1      TP   FP   FN   TN
0.50        0.250       0.900    0.391    9   27    1   18
0.55        0.310       0.900    0.462    9   20    1   25
0.60        0.346       0.900    0.500    9   17    1   28
0.65        0.391       0.900    0.545    9   14    1   31
0.70        0.500       0.800    0.615    8    8    2   37
0.75        0.615       0.800    0.696    8    5    2   40
0.80        0.778       0.700    0.737    7    2    3   43
0.85        1.000       0.400    0.571    4    0    6   45
0.90        1.000       0.200    0.333    2    0    8   45
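The metric columns follow directly from the confusion counts; for example, re-deriving the threshold-0.80 row:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# threshold 0.80 row: TP=7, FP=2, FN=3
p, r, f1 = prf1(7, 2, 3)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.778 0.700 0.737
```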

Key Findings

  • +59.5% F1 improvement over the OpenAI text-embedding-3-large baseline (0.737 vs 0.462)
  • Best in series: continues the scaling trend from e5-small (0.600) through e5-base (0.696) to e5-large (0.737).
  • Highest precision at optimal threshold: 0.778 precision with only 2 false positives, compared to 0.615 for e5-base at its optimal threshold.
  • Precision-recall tradeoff vs e5-base: Trades a small amount of recall (0.700 vs 0.800) for a significant precision gain (0.778 vs 0.615), resulting in a better-balanced F1.
  • Higher optimal threshold (0.80): The larger model produces more confident and well-separated similarity scores, allowing a higher decision threshold while maintaining strong performance.
  • Strong recall at lower thresholds: Maintains 0.900 recall across thresholds 0.50–0.65, indicating very few true duplicates are missed at permissive thresholds.

Note: The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.

Limitations

  • Small evaluation set: Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
  • LLM annotation bias in training data: Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
  • Model size: ~560M parameters with 1024-dim embeddings. The ONNX model is ~2.1GB.
  • Domain-specific: Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}

WildJailbreak

@article{jiang2024wildteaming,
    title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author={Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal={arXiv preprint arXiv:2406.18510},
    year={2024}
}