UofTCSSLab/SIREN-Llama-3.1-8B

@@ -1,19 +1,23 @@
 ---
 library_name: transformers
 license: apache-2.0
 tags:
 - siren
 - safety
 - harmfulness-detection
 - guard-model
 - llama
-base_model:
-- meta-llama/Llama-3.1-8B
 ---
 # siren-llama3.1-8b
-Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
 SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
@@ -99,7 +103,8 @@ The deployed LLM (`deployed_llm`) can be any model.
 Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.
 `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
-Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
 `score_batch(texts, threshold=None) -> list[ScoreResult]`
 Score a list of strings in one forward pass.
@@ -137,4 +142,4 @@ Macro F1 on standard safeguard benchmarks:
   journal={arXiv preprint arXiv:2604.18519},
   year={2026}
 }
-```

 ---
+base_model:
+- meta-llama/Llama-3.1-8B
 library_name: transformers
 license: apache-2.0
+pipeline_tag: text-classification
 tags:
 - siren
 - safety
 - harmfulness-detection
 - guard-model
 - llama
 ---
 # siren-llama3.1-8b
+Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://huggingface.co/papers/2604.18519), ACL 2026).
+- **Paper:** [https://huggingface.co/papers/2604.18519](https://huggingface.co/papers/2604.18519)
+- **Code:** [https://github.com/CSSLab/SIREN](https://github.com/CSSLab/SIREN)
 SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.
 Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.
 `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
+Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
 `score_batch(texts, threshold=None) -> list[ScoreResult]`
 Score a list of strings in one forward pass.
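
The input handling described for `score` can be illustrated with a minimal sketch. This is not the library's code: `build_input` is a hypothetical helper, and the validation rules (exactly one of the two forms, with both `prompt` and `response` required for the response-level form) are assumptions. Only the `"\n"` join comes from the description above.

```python
from typing import Optional


def build_input(
    text: Optional[str] = None,
    *,
    prompt: Optional[str] = None,
    response: Optional[str] = None,
) -> str:
    """Hypothetical helper: build the single string a scorer would classify."""
    if text is not None:
        # Assumption: the raw-text form and the prompt/response form are exclusive.
        if prompt is not None or response is not None:
            raise ValueError("pass either text= or prompt=/response=, not both")
        return text
    # Assumption: the response-level form needs both parts.
    if prompt is None or response is None:
        raise ValueError("response-level scoring needs both prompt= and response=")
    # Stated in the model card: prompt and response are joined with a newline.
    return prompt + "\n" + response


raw = build_input("some text to moderate")
paired = build_input(prompt="How do I bake bread?", response="Mix flour and water.")
```

Under these assumptions, the raw form passes the string through unchanged, and the paired form produces one string with a single newline between prompt and response.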
   journal={arXiv preprint arXiv:2604.18519},
   year={2026}
 }
+```