---
library_name: transformers
license: apache-2.0
tags:
- siren
- safety
- harmfulness-detection
- guard-model
- llama
base_model:
- meta-llama/Llama-3.1-8B
---
# SIREN-Llama-3.1-8B
Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
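The per-layer selection step can be pictured with a short sketch. This is not the released training code; the probe hyperparameters, variable names, and top-k cutoff below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_safety_neurons(layer_acts, labels, top_k=512):
    # layer_acts: (n_examples, hidden_dim) mean-pooled activations from one backbone layer
    # labels: binary harmfulness labels for the probing set
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
    probe.fit(layer_acts, labels)
    coef_magnitude = np.abs(probe.coef_[0])
    neuron_idx = np.argsort(coef_magnitude)[-top_k:]   # neurons the sparse probe relies on most
    layer_score = probe.score(layer_acts, labels)      # stand-in for held-out probe performance
    return neuron_idx, layer_score
```

In the full method, one probe is fit per backbone layer, and each probe's performance determines how strongly that layer's selected neurons are weighted in the input to the MLP classifier.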
## Design
SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text — user prompt or model response — through its own frozen Llama-3.1-8B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.
This gives three practical advantages over generative guard models:
- **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
- **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
- **Streaming detection for free** by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.
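As a rough sketch of that scoring path (a single forward pass, mean-pooled hidden states, the pre-selected safety neurons gathered per layer, then the MLP head), the snippet below shows the idea. The `cfg` structure, tensor names, and helper signature are illustrative assumptions, not the `siren_guard` internals:

```python
import torch

@torch.no_grad()
def siren_score(backbone, tokenizer, head, cfg, text):
    # cfg is assumed to hold the selected layers, per-layer safety-neuron
    # indices, and per-layer aggregation weights (illustrative structure only).
    inputs = tokenizer(text, return_tensors="pt").to(backbone.device)
    out = backbone(**inputs, output_hidden_states=True)
    feats = []
    for layer, idx, w in zip(cfg["layers"], cfg["neuron_indices"], cfg["layer_weights"]):
        pooled = out.hidden_states[layer].mean(dim=1)   # mean-pool over tokens
        feats.append(w * pooled[:, idx])                # keep only the safety neurons
    logit = head(torch.cat(feats, dim=-1))              # small MLP classifier head
    return torch.sigmoid(logit).item()                  # harmfulness score in [0, 1]
```

Because the score depends only on mean-pooled activations over whatever text is provided, the same path applies unchanged to a partial response prefix, which is what makes streaming detection essentially free.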
## Installation
```bash
pip install llm-siren
```
This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). Model weights (the trained SIREN classifier head from this repository and the frozen Llama-3.1-8B backbone) are pulled from Hugging Face on first use.
## Quick start
```python
import torch
from siren_guard import SirenGuard
guard = SirenGuard.from_pretrained(
"UofTCSSLab/SIREN-Llama-3.1-8B",
device="cuda",
dtype=torch.bfloat16,
)
# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful) # e.g. 1.0 True
# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
prompt="How can I make a pipe bomb at home?",
response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful) # e.g. 0.0 False
# Batch
results = guard.score_batch([
"What is the capital of France?",
"Describe in detail how to commit insurance fraud.",
])
# Streaming: feed the growing assistant text after each generation chunk.
# `stream_from_deployed_llm` and `abort_generation` are placeholders for your
# own streaming and cancellation hooks.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom threshold (`text` is any string you want to moderate)
strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
loose = guard.score(text, threshold=0.9)   # block only at 90%
```
## Deployment idiom
```python
DEFAULT_REFUSAL = "Sorry, I can't help with that."  # your canned refusal message

def safe_generate(user_prompt: str, deployed_llm) -> str:
    # Gate the incoming prompt before spending compute on generation.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    # Gate the model's response before it reaches the user.
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response
```
The deployed LLM (`deployed_llm`) can be any model; SIREN never needs access to its weights or internal activations.
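As an illustration of that point, any text-generation backend can be adapted to the `deployed_llm` interface used above. The wrapper class and model choice below are assumptions for the example, not part of `llm-siren`:

```python
from transformers import pipeline

class PipelineLLM:
    """Thin adapter giving a `transformers` pipeline the `.generate(prompt) -> str` shape."""
    def __init__(self, model_name: str):
        self.pipe = pipeline("text-generation", model=model_name)

    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]

deployed_llm = PipelineLLM("meta-llama/Llama-3.1-8B-Instruct")
print(safe_generate("What is the capital of France?", deployed_llm))
```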
## API
- `SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
  Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.
- `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
  Scores a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
- `score_batch(texts, threshold=None) -> list[ScoreResult]`
  Scores a list of strings in one forward pass.
- `score_streaming(response_so_far, threshold=None) -> ScoreResult`
  Scores a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.
Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.
The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
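One practical way to choose that threshold is to sweep candidate values over a small labeled validation set and pick the strictest value whose error profile fits your policy. The validation data below is a placeholder, and the `score >= threshold` comparison is an assumption about how `is_harmful` is derived:

```python
# val_texts: list[str], val_labels: list[int] with 1 = harmful (your own labeled sample)
scores = [r.score for r in guard.score_batch(val_texts)]

for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    preds = [s >= threshold for s in scores]
    false_neg = sum((not p) and y == 1 for p, y in zip(preds, val_labels))
    false_pos = sum(p and y == 0 for p, y in zip(preds, val_labels))
    print(f"threshold={threshold:.1f}  false_negatives={false_neg}  false_positives={false_pos}")
```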
## Artifact contents
| File | Purpose |
|------|---------|
| `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
| `siren.safetensors` | Trained MLP classifier weights (~55.9M params). |
The Llama-3.1-8B backbone weights are **not** redistributed here; they are pulled from `meta-llama/Llama-3.1-8B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.
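To see exactly which backbone revision, layers, and neuron indices this artifact pins, you can download and inspect the config directly. The snippet below only assumes the file name listed in the table above:

```python
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("UofTCSSLab/SIREN-Llama-3.1-8B", "siren_config.json")
with open(config_path) as f:
    siren_config = json.load(f)

print(sorted(siren_config))  # e.g. pinned revision, selected layers, neuron indices, MLP spec
```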
## Reported performance
Macro F1 (%) on standard safeguard benchmarks:
| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
|-----------|-----------|-------|---------|-----------|----------|-------------|------|
| 83.1 | 92.0 | 82.9 | 82.9 | 86.7 | 92.5 | 83.8 | **86.3** |
## Citation
```bibtex
@article{jiao2026llm,
title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
journal={arXiv preprint arXiv:2604.18519},
year={2026}
}
```