Improve model card: add pipeline tag and paper/code links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -1,19 +1,23 @@
  ---
+ base_model:
+ - meta-llama/Llama-3.1-8B
  library_name: transformers
  license: apache-2.0
+ pipeline_tag: text-classification
  tags:
  - siren
  - safety
  - harmfulness-detection
  - guard-model
  - llama
- base_model:
- - meta-llama/Llama-3.1-8B
  ---

  # siren-llama3.1-8b

- Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
+ Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.1-8B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://huggingface.co/papers/2604.18519), ACL 2026).
+
+ - **Paper:** [https://huggingface.co/papers/2604.18519](https://huggingface.co/papers/2604.18519)
+ - **Code:** [https://github.com/CSSLab/SIREN](https://github.com/CSSLab/SIREN)

  SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~55.9M parameters); the frozen Llama-3.1-8B backbone is loaded from its official Hugging Face repository on first use.

@@ -99,7 +103,8 @@ The deployed LLM (`deployed_llm`) can be any model.
  Loads the SIREN classifier head from the artifact and the frozen Llama-3.1-8B backbone from its pinned revision.

  `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
- Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
+ Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).

  `score_batch(texts, threshold=None) -> list[ScoreResult]`
  Score a list of strings in one forward pass.
@@ -137,4 +142,4 @@ Macro F1 on standard safeguard benchmarks:
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
  }
- ```
+ ```
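
For readers skimming the diff, the `score()` input contract documented above can be sketched as follows. This is an illustrative stand-in only: `build_score_input` is a hypothetical helper, not part of the SIREN library; the sketch mirrors just the documented behavior of accepting either `text=` or a `prompt=`/`response=` pair joined with `"\n"`.

```python
# Hypothetical sketch of the documented score() input forms.
# Not the actual SIREN API -- for illustration of the contract only.

def build_score_input(text=None, *, prompt=None, response=None):
    """Return the string that would be scored.

    Raw moderation form: pass text= alone.
    Response-level form: pass prompt= and response=; they are joined
    with "\n", matching the SIREN training distribution per the README.
    """
    if text is not None:
        if prompt is not None or response is not None:
            raise ValueError("pass either text= or prompt=/response=, not both")
        return text
    if prompt is None or response is None:
        raise ValueError("response-level form requires both prompt= and response=")
    return f"{prompt}\n{response}"

# Raw moderation form:
print(build_score_input(text="is this message safe?"))
# Response-level form (prompt and response joined with "\n"):
print(build_score_input(prompt="How do I pick a lock?", response="I can't help with that."))
```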