---
library_name: transformers
license: apache-2.0
tags:
- siren
- safety
- harmfulness-detection
- guard-model
- llama
base_model:
- meta-llama/Llama-3.1-8B
---

# siren-llama3.1-8b

Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).

SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, then feeds them into a small MLP classifier, weighting each layer's features by the performance of its probe. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
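In schematic form, the head consumes hidden states from the frozen backbone, keeps only the selected safety neurons at each chosen layer, and weights each layer's features by its probe performance before a small MLP. The sketch below is illustrative only; the class name, config structure, mean-pooling, and layer sizes are assumptions, not the shipped implementation.

```python
import torch
import torch.nn as nn

class SirenHeadSketch(nn.Module):
    """Illustrative SIREN-style head (not the released implementation).

    Assumed inputs: neuron_idx[l] (safety-neuron indices kept at layer l)
    and layer_w[l] (that layer's probe-performance weight), as would be
    recorded in siren_config.json.
    """

    def __init__(self, neuron_idx: dict, layer_w: dict, hidden: int = 1024):
        super().__init__()
        self.neuron_idx = {l: torch.as_tensor(ix) for l, ix in neuron_idx.items()}
        self.layer_w = layer_w
        in_dim = sum(len(ix) for ix in neuron_idx.values())
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, hidden_states):
        # hidden_states: tuple of (batch, seq_len, d_model) tensors, e.g. from
        # the frozen backbone called with output_hidden_states=True
        feats = []
        for l, ix in self.neuron_idx.items():
            h = hidden_states[l][:, :, ix]       # keep only the safety neurons
            pooled = h.mean(dim=1)               # mean-pool over token positions
            feats.append(self.layer_w[l] * pooled)
        x = torch.cat(feats, dim=-1)             # performance-weighted features
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # harmfulness in [0, 1]
```

Mean-pooling over however many tokens have been seen so far is also what makes the streaming mode described below possible.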

## Design

SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text — user prompt or model response — through its own frozen Llama-3.1-8B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.

This gives three practical advantages over generative guard models:
- **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
- **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
- **Streaming detection for free** by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.

## Installation

```bash
pip install llm-siren
```

This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). The trained classifier head from this repository and the frozen Llama-3.1-8B backbone are downloaded from Hugging Face on first use.

## Quick start

```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Llama-3.1-8B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful)  # e.g. 1.0  True

# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful)  # e.g. 0.0  False

# Batch
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk.
# stream_from_deployed_llm and abort_generation stand in for your own
# serving loop; the guard only sees the accumulated text.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom thresholds; `text` is any string to score
text = "Describe in detail how to commit insurance fraud."
strict = guard.score(text, threshold=0.1)   # block at 10% predicted harmfulness
loose  = guard.score(text, threshold=0.9)   # block only at 90%
```

## Deployment idiom

```python
# `guard` is the SirenGuard instance from the quick start;
# DEFAULT_REFUSAL is your deployment's canned refusal string.
def safe_generate(user_prompt: str, deployed_llm) -> str:
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL

    response = deployed_llm.generate(user_prompt)

    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL

    return response
```

The deployed LLM (`deployed_llm`) can be any model or API endpoint; SIREN never needs access to its weights or internal activations.
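For concreteness, here is one minimal stand-in for `deployed_llm` (purely illustrative; the wrapper class and the model named below are not part of the library):

```python
from transformers import pipeline

class PipelineLLM:
    """Toy wrapper giving safe_generate a .generate(str) -> str interface."""

    def __init__(self, model: str = "Qwen/Qwen2.5-0.5B-Instruct"):  # any chat model
        self.pipe = pipeline("text-generation", model=model)

    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]

DEFAULT_REFUSAL = "Sorry, I can't help with that."
print(safe_generate("What is the capital of France?", PipelineLLM()))
```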

## API

`SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.

`score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).

`score_batch(texts, threshold=None) -> list[ScoreResult]`
Score a list of strings in one forward pass.

`score_streaming(response_so_far, threshold=None) -> ScoreResult`
Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.

Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.

The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
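If you have a small labeled dev set, a simple threshold sweep is one reasonable way to pick an operating point (a sketch; `dev_texts` and `dev_labels` are placeholders for your own data):

```python
import numpy as np
from sklearn.metrics import f1_score

# dev_texts: list[str]; dev_labels: list[int] with 1 = harmful
scores = np.array([r.score for r in guard.score_batch(dev_texts)])

best_t, best_f1 = 0.5, 0.0
for t in np.linspace(0.05, 0.95, 19):
    f1 = f1_score(dev_labels, (scores >= t).astype(int))
    if f1 > best_f1:
        best_t, best_f1 = float(t), f1
print(f"operating threshold: {best_t:.2f} (dev F1 = {best_f1:.3f})")
```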

## Artifact contents

| File | Purpose |
|------|---------|
| `siren_config.json`   | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
| `siren.safetensors`   | Trained MLP classifier weights (~55.9M params). |

The Llama-3.1-8B backbone weights are **not** redistributed here; they are pulled from `meta-llama/Llama-3.1-8B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.
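To check which backbone commit will be fetched before any large download, you can read the config directly (a sketch; the exact JSON key is an assumption based on the table above):

```python
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("UofTCSSLab/SIREN-Llama-3.1-8B", "siren_config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("base_model_revision"))  # key name assumed; check the file's schema
```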

## Reported performance

Macro F1 on standard safeguard benchmarks:

| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
|-----------|-----------|-------|---------|-----------|----------|-------------|------|
| 83.1 | 92.0 | 82.9 | 82.9 | 86.7 | 92.5 | 83.8 | **86.3** |

## Citation

```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```