Initial release: SIREN-Llama-3.1-8B (ACL 2026)
- README.md +140 -0
- siren.safetensors +3 -0
- siren_config.json +0 -0
README.md
ADDED
@@ -0,0 +1,140 @@
---
library_name: transformers
license: apache-2.0
tags:
- siren
- safety
- harmfulness-detection
- guard-model
- llama
base_model:
- meta-llama/Llama-3.1-8B
---

# siren-llama3.1-8b

Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).

SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
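The aggregation idea can be sketched in a few lines. Everything below is invented for illustration (layer indices, neuron indices, weights, and the tiny stand-in MLP); the real selections and architecture live in `siren_config.json` and this is not the actual SIREN implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 selected layers, hidden size 16, mean-pooled activations per layer
hidden_states = {layer: rng.normal(size=16) for layer in (5, 12, 27)}
neuron_idx = {5: [1, 4, 9], 12: [0, 7], 27: [3, 8, 11, 14]}  # L1-probe-selected safety neurons
layer_weight = {5: 0.2, 12: 0.5, 27: 0.3}                    # performance-based weights

# Gather each layer's safety neurons, scaled by that layer's probe performance
features = np.concatenate(
    [layer_weight[l] * hidden_states[l][neuron_idx[l]] for l in sorted(hidden_states)]
)

# Tiny stand-in MLP head mapping the aggregated neurons to a harmfulness score in [0, 1]
W1 = rng.normal(size=(8, features.size)); b1 = np.zeros(8)
W2 = rng.normal(size=8); b2 = 0.0
h = np.maximum(W1 @ features + b1, 0.0)        # ReLU
score = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid
print(features.size, float(score))
```

The head is small because it sees only the selected safety neurons, not the full hidden states.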

## Design

SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text — user prompt or model response — through its own frozen Llama-3.1-8B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.

This gives three practical advantages over generative guard models:

- **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
- **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
- **Streaming detection for free** by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.

## Installation

```bash
pip install llm-siren
```

This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). The frozen backbone weights are pulled from Hugging Face on first use.

## Quick start

```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Llama-3.1-8B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful)  # e.g. 1.0 True

# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful)  # e.g. 0.0 False

# Batch
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming — feed the growing assistant text after each generation chunk.
# `stream_from_deployed_llm` and `abort_generation` stand in for your serving stack.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom threshold
text = "Describe in detail how to commit insurance fraud."
strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
loose = guard.score(text, threshold=0.9)   # block only at 90%
```

## Deployment idiom

```python
# `guard` is a loaded SirenGuard instance; DEFAULT_REFUSAL is your
# deployment's canned refusal message.
def safe_generate(user_prompt: str, deployed_llm) -> str:
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL

    response = deployed_llm.generate(user_prompt)

    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL

    return response
```

The deployed LLM (`deployed_llm`) can be any model; SIREN never needs access to its weights or internals.

## API

`SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.

`score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).

`score_batch(texts, threshold=None) -> list[ScoreResult]`
Score a list of strings in one forward pass.

`score_streaming(response_so_far, threshold=None) -> ScoreResult`
Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.

Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.

The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
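One plausible reading of the threshold semantics, sketched as a pure function (the `apply_threshold` helper is hypothetical, not part of the `siren_guard` API): the raw score never changes, only the decision does.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    score: float
    is_harmful: bool
    threshold: float


def apply_threshold(score: float, threshold: float = 0.5) -> ScoreResult:
    # is_harmful is simply score >= threshold; tuning the threshold
    # re-interprets the same score under a different safety policy.
    return ScoreResult(score=score, is_harmful=score >= threshold, threshold=threshold)


# The same raw score yields different decisions under different policies
r_strict = apply_threshold(0.3, threshold=0.1)
r_loose = apply_threshold(0.3, threshold=0.9)
print(r_strict.is_harmful, r_loose.is_harmful)
```

This is why a single artifact can serve both strict and permissive deployments without retraining.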

## Artifact contents

| File | Purpose |
|------|---------|
| `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
| `siren.safetensors` | Trained MLP classifier weights (~55.9M params). |

The Llama-3.1-8B backbone weights are **not** redistributed here; they are pulled from `meta-llama/Llama-3.1-8B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.

## Reported performance

Macro F1 on standard safeguard benchmarks:

| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
|-----------|-----------|-------|---------|-----------|----------|-------------|------|
| 83.1 | 92.0 | 82.9 | 82.9 | 86.7 | 92.5 | 83.8 | **86.3** |
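The reported average is the unweighted mean of the seven per-benchmark scores, which can be checked directly:

```python
# Per-benchmark macro-F1 scores from the table above
scores = [83.1, 92.0, 82.9, 82.9, 86.7, 92.5, 83.8]
avg = round(sum(scores) / len(scores), 1)
print(avg)  # 86.3
```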

## Citation

```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```

siren.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6731f5faa792c08366f0701c65e58f4cc4cffd3848585bfd2df2ad503a38181a
size 223793080

siren_config.json
ADDED
The diff for this file is too large to render.
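The `siren.safetensors` entry above is a Git LFS pointer file, not the weights themselves. Its three `key value` lines can be parsed generically (this is standard LFS-pointer handling, not part of `llm-siren`):

```python
# The pointer text exactly as committed for siren.safetensors
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6731f5faa792c08366f0701c65e58f4cc4cffd3848585bfd2df2ad503a38181a
size 223793080
"""

# Each line is "key value"; split on the first space only
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())

algo, digest = fields["oid"].split(":", 1)
size_bytes = int(fields["size"])
print(algo, size_bytes)  # sha256 223793080
```

The `size` field matches the ~224 MB expected for a ~55.9M-parameter head stored in 32-bit floats.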