---
library_name: transformers
license: apache-2.0
tags:
- siren
- safety
- harmfulness-detection
- guard-model
- llama
base_model:
- meta-llama/Llama-3.1-8B
---
# siren-llama3.1-8b
Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
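The actual layers, neuron indices, and weights ship in `siren_config.json`; as a toy illustration only (every number, shape, and index below is invented, and this is not the released code), the performance-weighted aggregation idea can be sketched in numpy as:

```python
import numpy as np

# Assumed setup: mean-pooled hidden states per layer, probe-selected safety
# neurons per layer, and per-layer probe scores used as aggregation weights.
rng = np.random.default_rng(0)
n_layers, hidden = 4, 16
hidden_states = rng.normal(size=(n_layers, hidden))  # one pooled vector per layer

# Safety neurons found by L1-regularized linear probing (indices are made up).
neuron_idx = {0: [1, 5], 1: [2, 7, 9], 2: [0, 3], 3: [4, 8, 11]}
probe_f1 = np.array([0.70, 0.82, 0.88, 0.80])        # hypothetical per-layer probe F1

# Performance-weighted aggregation: scale each layer's selected neurons by its
# normalized probe score, then concatenate into one feature vector for the MLP.
weights = probe_f1 / probe_f1.sum()
features = np.concatenate(
    [weights[l] * hidden_states[l, neuron_idx[l]] for l in range(n_layers)]
)
print(features.shape)  # (10,): 2 + 3 + 2 + 3 selected neurons
```

The concatenated `features` vector is what a small classifier head would consume in place of the full hidden state.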
## Design
SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text — user prompt or model response — through its own frozen Llama-3.1-8B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.
This gives three practical advantages over generative guard models:
- **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
- **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
- **Streaming detection for free** by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.
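To make the third point concrete, here is a stub sketch (fake activations and an invented scoring head, not the released model) of why mean-pooling makes prefix scoring cheap: the classifier input is a pool over token activations, so any prefix can be scored at any point with no extra training.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=8)  # stand-in for the trained MLP head

def score_prefix(token_feats: np.ndarray) -> float:
    """Mean-pool token activations, then apply the (stub) head -> [0, 1]."""
    pooled = token_feats.mean(axis=0)
    return float(1.0 / (1.0 + np.exp(-pooled @ w)))  # sigmoid squashes to [0, 1]

stream = rng.normal(size=(20, 8))  # 20 tokens of fake safety-neuron activations
scores = [score_prefix(stream[:t]) for t in (5, 10, 20)]  # re-score as the prefix grows
print([round(s, 3) for s in scores])
```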
## Installation
```bash
pip install llm-siren
```
This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). Model weights — the classifier head from this repository and the frozen backbone from `meta-llama/Llama-3.1-8B` — are downloaded from Hugging Face on first use.
## Quick start
```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Llama-3.1-8B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful)  # e.g. 1.0 True

# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful)  # e.g. 0.0 False

# Batch
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom threshold
strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
loose = guard.score(text, threshold=0.9)   # block only at 90%
```
## Deployment idiom
```python
def safe_generate(user_prompt: str, deployed_llm) -> str:
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response
```
The deployed LLM (`deployed_llm`) can be any model, API-served or local; SIREN never needs access to its weights or internal activations.
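The same idiom extends to streaming generation using `score_streaming` from the quick start. A sketch, written against the `score`/`score_streaming` calls documented below but with a stub guard so it runs standalone (the stub is illustration only):

```python
from dataclasses import dataclass
from typing import Callable, Iterable

DEFAULT_REFUSAL = "I can't help with that."

def safe_stream(user_prompt: str,
                stream_llm: Callable[[str], Iterable[str]],
                guard) -> str:
    """Pre-check the prompt, then re-score the growing response after each
    chunk, refusing on the first harmful prefix."""
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    prefix = ""
    for chunk in stream_llm(user_prompt):
        prefix += chunk
        if guard.score_streaming(prefix).is_harmful:
            return DEFAULT_REFUSAL
    return prefix

# Stub guard so the sketch runs without the real model (always scores 0.0).
@dataclass
class _Result:
    score: float
    is_harmful: bool

class _StubGuard:
    def score(self, text, **kw):
        return _Result(0.0, False)
    score_streaming = score

out = safe_stream("hi", lambda p: iter(["hel", "lo"]), _StubGuard())
print(out)  # hello
```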
## API
`SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.
`score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
`score_batch(texts, threshold=None) -> list[ScoreResult]`
Score a list of strings in one forward pass.
`score_streaming(response_so_far, threshold=None) -> ScoreResult`
Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.
Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.
The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
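One way to pick a non-default threshold is to fix a tolerable false-positive rate on held-out benign traffic and read the threshold off the score distribution. A sketch with synthetic scores (in practice, obtain real scores via `guard.score_batch` on a labeled validation set from your own deployment):

```python
import numpy as np

rng = np.random.default_rng(2)
benign = rng.beta(2, 8, size=500)   # fake guard scores for benign texts
harmful = rng.beta(8, 2, size=500)  # fake guard scores for harmful texts

target_fpr = 0.01  # tolerate 1% false positives on benign traffic
threshold = float(np.quantile(benign, 1 - target_fpr))
recall = float((harmful >= threshold).mean())  # harmful texts caught at this threshold
print(f"threshold={threshold:.3f}, harmful recall={recall:.3f}")
```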
## Artifact contents
| File | Purpose |
|------|---------|
| `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
| `siren.safetensors` | Trained MLP classifier weights (~55.9M params). |
The Llama-3.1-8B backbone weights are **not** redistributed here; they are pulled from `meta-llama/Llama-3.1-8B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.
## Reported performance
Macro F1 on standard safeguard benchmarks:
| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
|-----------|-----------|-------|---------|-----------|----------|-------------|------|
| 83.1 | 92.0 | 82.9 | 82.9 | 86.7 | 92.5 | 83.8 | **86.3** |
## Citation
```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```