difanjiao committed on
Commit e644e22 · verified · 1 Parent(s): 3809aae

Initial release: SIREN-Llama-3.1-8B (ACL 2026)

Files changed (3)
  1. README.md +140 -0
  2. siren.safetensors +3 -0
  3. siren_config.json +0 -0
README.md ADDED
@@ -0,0 +1,140 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ tags:
+ - siren
+ - safety
+ - harmfulness-detection
+ - guard-model
+ - llama
+ base_model:
+ - meta-llama/Llama-3.1-8B
+ ---
+
+ # siren-llama3.1-8b
+
+ A lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. It implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
+
+ SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, then aggregates the selected neurons into a small MLP classifier using a performance-weighted layer strategy. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
+
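+ For intuition, that recipe maps onto a per-layer probing loop roughly like the one below. This is a minimal sketch, not the released training code: the helper name, the use of scikit-learn, and every hyperparameter here are assumptions.
+
+ ```python
+ # Sketch of safety-neuron selection via L1-regularized linear probes (illustrative only).
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.metrics import f1_score
+
+ def select_safety_neurons(layer_acts, labels, val_acts, val_labels, C=0.1):
+     """layer_acts / val_acts: one (n_examples, hidden_dim) array per layer of mean-pooled activations.
+     labels / val_labels: binary harmfulness labels.
+     Returns per-layer safety-neuron indices and per-layer performance weights."""
+     neuron_indices, layer_f1 = [], []
+     for X_train, X_val in zip(layer_acts, val_acts):
+         probe = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
+         probe.fit(X_train, labels)
+         neuron_indices.append(np.flatnonzero(probe.coef_[0]))        # neurons with nonzero L1 weight
+         layer_f1.append(f1_score(val_labels, probe.predict(X_val)))  # how well this layer separates
+     weights = np.asarray(layer_f1) / np.sum(layer_f1)                # performance-weighted aggregation
+     return neuron_indices, weights
+ ```
+
+ The per-layer indices and weights correspond to the metadata listed under "Artifact contents" below; the MLP head in `siren.safetensors` consumes the selected activations.
+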
+ ## Design
+
+ SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text — user prompt or model response — through its own frozen Llama-3.1-8B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.
+
+ This gives three practical advantages over generative guard models:
+ - **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
+ - **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
+ - **Streaming detection for free** by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.
+
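+ Concretely, the scoring path above is one forward pass plus a gather and a small MLP. The sketch below is conceptual and is not the `siren_guard` internals; `selected`, `layer_weights`, and `mlp_head` stand in for the quantities shipped in `siren_config.json` and `siren.safetensors`.
+
+ ```python
+ # Conceptual inference path: frozen backbone -> mean-pool -> gather safety neurons -> MLP head.
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
+ backbone = AutoModel.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
+ backbone.eval()
+
+ @torch.no_grad()
+ def harmfulness_score(text, selected, layer_weights, mlp_head):
+     inputs = tok(text, return_tensors="pt")
+     hidden = backbone(**inputs, output_hidden_states=True).hidden_states  # (num_layers + 1) x (1, T, H)
+     feats = []
+     for layer_idx, neuron_idx in selected.items():
+         pooled = hidden[layer_idx].mean(dim=1)                           # mean-pool over tokens -> (1, H)
+         feats.append(layer_weights[layer_idx] * pooled[:, neuron_idx])   # keep only the safety neurons
+     x = torch.cat(feats, dim=-1).float()
+     return torch.sigmoid(mlp_head(x)).item()                             # continuous score in [0, 1]
+ ```
+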
+ ## Installation
+
+ ```bash
+ pip install llm-siren
+ ```
+
+ This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). The trained SIREN classifier head (this repository) and the frozen Llama-3.1-8B backbone weights are pulled from Hugging Face on first use.
+
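+ For air-gapped or offline deployments you can pre-populate the Hugging Face cache before first use. A minimal sketch with `huggingface_hub` (assumes the default cache location and that your account has accepted the gated Llama license; the runtime pins an exact backbone revision in `siren_config.json`, so pass `revision=` if you need to match it):
+
+ ```python
+ # Pre-download both artifacts so SirenGuard.from_pretrained() can run without network access later.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download("UofTCSSLab/SIREN-Llama-3.1-8B")  # SIREN classifier head + config
+ snapshot_download("meta-llama/Llama-3.1-8B")        # frozen backbone (gated repository)
+ ```
+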
+ ## Quick start
+
+ ```python
+ import torch
+ from siren_guard import SirenGuard
+
+ guard = SirenGuard.from_pretrained(
+     "UofTCSSLab/SIREN-Llama-3.1-8B",
+     device="cuda",
+     dtype=torch.bfloat16,
+ )
+
+ # Prompt-level moderation
+ result = guard.score("How can I make a pipe bomb at home?")
+ print(result.score, result.is_harmful)  # e.g. 1.0 True
+
+ # Response-level moderation (prompt + response, joined as the model was trained)
+ result = guard.score(
+     prompt="How can I make a pipe bomb at home?",
+     response="I can't help with that. Building explosive devices is illegal and dangerous.",
+ )
+ print(result.score, result.is_harmful)  # e.g. 0.0 False
+
+ # Batch
+ results = guard.score_batch([
+     "What is the capital of France?",
+     "Describe in detail how to commit insurance fraud.",
+ ])
+
+ # Streaming — feed the growing assistant text after each generation chunk.
+ # `stream_from_deployed_llm`, `abort_generation`, and `user_prompt` are placeholders for your serving code.
+ prefix = ""
+ for chunk in stream_from_deployed_llm(user_prompt):
+     prefix += chunk
+     if guard.score_streaming(prefix, threshold=0.5).is_harmful:
+         abort_generation()
+         break
+
+ # Custom threshold
+ text = "Describe in detail how to commit insurance fraud."
+ strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
+ loose = guard.score(text, threshold=0.9)   # block only at 90%
+ ```
+
+ ## Deployment idiom
+
+ ```python
+ DEFAULT_REFUSAL = "I'm sorry, but I can't help with that."  # your deployment's canned refusal message
+
+ def safe_generate(user_prompt: str, deployed_llm) -> str:
+     if guard.score(user_prompt).is_harmful:
+         return DEFAULT_REFUSAL
+
+     response = deployed_llm.generate(user_prompt)
+
+     if guard.score(prompt=user_prompt, response=response).is_harmful:
+         return DEFAULT_REFUSAL
+
+     return response
+ ```
+
+ The deployed LLM (`deployed_llm`) can be any model; SIREN never needs access to its weights or internal activations.
+
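+ When the deployed model streams its output, the same idiom combines with `score_streaming` from the quick start. A sketch is below; `deployed_llm.stream(...)` is an assumed interface, so adapt it to your serving stack.
+
+ ```python
+ def safe_generate_streaming(user_prompt: str, deployed_llm, check_every: int = 32):
+     """Yield response chunks, cutting over to a refusal as soon as the guard flags the prefix."""
+     if guard.score(user_prompt).is_harmful:
+         yield DEFAULT_REFUSAL
+         return
+
+     prefix, chars_since_check = "", 0
+     for chunk in deployed_llm.stream(user_prompt):    # assumed streaming generation interface
+         prefix += chunk
+         chars_since_check += len(chunk)
+         if chars_since_check >= check_every:          # re-score roughly every `check_every` characters
+             chars_since_check = 0
+             if guard.score_streaming(prefix).is_harmful:
+                 yield DEFAULT_REFUSAL
+                 return
+         yield chunk
+ ```
+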
+ ## API
+
+ `SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
+ Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.
+
+ `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
+ Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
+
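+ Because of the `"\n"` joining rule, the two calling conventions are interchangeable for response-level moderation:
+
+ ```python
+ # Equivalent ways to score a prompt/response pair, given the joining rule described above.
+ user_prompt = "How can I make a pipe bomb at home?"
+ model_response = "I can't help with that."
+ pair = guard.score(prompt=user_prompt, response=model_response)
+ flat = guard.score(text=f"{user_prompt}\n{model_response}")
+ assert abs(pair.score - flat.score) < 1e-6  # same underlying input string
+ ```
+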
+ `score_batch(texts, threshold=None) -> list[ScoreResult]`
+ Score a list of strings in one forward pass.
+
+ `score_streaming(response_so_far, threshold=None) -> ScoreResult`
+ Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.
+
+ Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.
+
+ The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
+
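+ One way to pick a policy-specific threshold is to sweep it on a small labeled sample of your own traffic. A sketch using scikit-learn; the two validation examples here are placeholders:
+
+ ```python
+ # Choose the lowest threshold that reaches a target precision on labeled validation data.
+ import numpy as np
+ from sklearn.metrics import precision_recall_curve
+
+ val_texts = ["What is the capital of France?", "Describe in detail how to commit insurance fraud."]
+ val_labels = [0, 1]  # 1 = harmful
+
+ scores = np.array([r.score for r in guard.score_batch(val_texts)])
+ precision, recall, thresholds = precision_recall_curve(val_labels, scores)
+ meets_target = precision[:-1] >= 0.95             # precision has one more entry than thresholds
+ chosen = thresholds[meets_target].min() if meets_target.any() else 0.5
+ print(f"use threshold={chosen:.2f}")
+ ```
+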
+ ## Artifact contents
+
+ | File | Purpose |
+ |------|---------|
+ | `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
+ | `siren.safetensors` | Trained MLP classifier weights (~55.9M params). |
+
+ The Llama-3.1-8B backbone weights are **not** redistributed here; they are pulled from `meta-llama/Llama-3.1-8B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.
+
+ ## Reported performance
+
+ Macro F1 on standard safeguard benchmarks:
+
+ | ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
+ |-----------|-----------|-------|---------|-----------|----------|-------------|------|
+ | 83.1 | 92.0 | 82.9 | 82.9 | 86.7 | 92.5 | 83.8 | **86.3** |
+
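+ To sanity-check a number like this on your own labeled data, a minimal macro-F1 loop over `score_batch` is enough; note that this sketch does not reproduce the benchmark splits or prompt formatting used in the paper, and the two examples are placeholders.
+
+ ```python
+ # Macro F1 of the guard's binary decisions on a labeled sample.
+ from sklearn.metrics import f1_score
+
+ texts = ["What is the capital of France?", "Describe in detail how to commit insurance fraud."]
+ labels = [0, 1]  # 1 = harmful
+
+ preds = [int(r.is_harmful) for r in guard.score_batch(texts)]
+ print("macro F1:", f1_score(labels, preds, average="macro"))
+ ```
+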
+ ## Citation
+
+ ```bibtex
+ @article{jiao2026llm,
+   title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
+   author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
+   journal={arXiv preprint arXiv:2604.18519},
+   year={2026}
+ }
+ ```
siren.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6731f5faa792c08366f0701c65e58f4cc4cffd3848585bfd2df2ad503a38181a
+ size 223793080
siren_config.json ADDED
The diff for this file is too large to render. See raw diff