+---
+license: apache-2.0
+language:
+- en
+- de
+- fr
+- it
+- es
+- nl
+- pt
+- pl
+- cs
+- da
+- fi
+- sv
+pipeline_tag: token-classification
+tags:
+- pii
+- ner
+- privacy
+- token-classification
+- transformers
+- pytorch
+- safetensors
+---
+# Sarden1-300M: Multilingual PII Detection & Redaction Model
+## Model Description
+Sarden1-300M is a token classification model, trained from scratch, for detecting and
+redacting personally identifiable information (PII). It labels sensitive entity spans
+in text across 15 locales (12 languages plus three regional English variants), which
+makes it suitable for GDPR/HIPAA compliance pipelines, log scrubbing, and document
+redaction at production scale.
+* **Developed by:** Surpem
+* **Model Type:** Token Classifier (BIO tagging)
+* **Architecture:** Custom Decoder-style Transformer
+* **Base Model:** Trained from scratch — no pretrained base
+* **License:** Apache 2.0
+* **Languages:** en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)
+## Architecture
+| Component | Detail |
+| :--- | :--- |
+| Parameters | ~300M |
+| Layers | 18 transformer layers |
+| Hidden size | 1024 |
+| Attention | Grouped Query Attention (16Q / 4KV heads) |
+| FFN | SwiGLU (2730 intermediate) |
+| Positional encoding | RoPE (θ = 500,000) |
+| Normalisation | RMSNorm (no bias) |
+| Tokeniser | GPT-2 BPE (vocab 50,257) |
+| Precision | bfloat16 |
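
The dimensions in the table imply a few derived quantities. The sketch below works them out under one assumption the card does not state: that the per-head dimension equals the hidden size divided by the number of query heads. `ArchSketch` and its field names are hypothetical and are not part of the released config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSketch:
    """Hypothetical container for the values listed in the architecture table."""

    hidden_size: int = 1024
    num_layers: int = 18
    num_q_heads: int = 16
    num_kv_heads: int = 4
    ffn_intermediate: int = 2730
    vocab_size: int = 50_257

    @property
    def head_dim(self) -> int:
        # Assumption: the query heads split the hidden size evenly.
        return self.hidden_size // self.num_q_heads

    @property
    def q_heads_per_kv_head(self) -> int:
        # In grouped query attention, several query heads share one K/V head.
        return self.num_q_heads // self.num_kv_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of each of the K and V projections.
        return self.num_kv_heads * self.head_dim


arch = ArchSketch()
print(arch.head_dim, arch.q_heads_per_kv_head, arch.kv_proj_width)  # 64 4 256
```

Under that assumption, each K/V head serves four query heads, so the K and V projections are a quarter of the width they would have under standard multi-head attention with the same head dimension.
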
+## Entity Types
+Sarden1-300M detects 12 PII entity types using BIO span labelling:
+| Category | Entity Types |
+| :--- | :--- |
+| Identity | `PERSON`, `USERNAME`, `DATE` |
+| Contact | `EMAIL`, `PHONE`, `ADDRESS` |
+| Financial | `CREDITCARD`, `SSN` |
+| Documents | `PASSPORT`, `DRIVERSLICENSE` |
+| Technical | `IP` |
+| Organisational | `ORG` |
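
With BIO labelling, each entity type gets a `B-` (begin) tag and an `I-` (inside) tag, and all non-PII tokens share a single `O` tag. A minimal sketch of the resulting label set follows. The tag strings and their ordering are assumptions made for illustration; the authoritative mapping is the `id2label` entry in `config.json`.

```python
# The 12 entity types from the table above.
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP", "ORG",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")  # first token of an entity span
        labels.append(f"I-{ent}")  # any following token of the same span
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))  # 25: two tags per entity type, plus "O"
```
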
+## Get Started
+```python
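+import json
+import torch
+from safetensors.torch import load_file
+from transformers import AutoTokenizer
+# Load weights and label map
+sd = load_file("model.safetensors")
+with open("config.json") as f:
+    cfg = json.load(f)
+id2label = {int(k): v for k, v in cfg["id2label"].items()}
+# Load tokeniser
+tok = AutoTokenizer.from_pretrained(".")
+# (Rebuild model from architecture, then:)
+# NOTE: `model` is not defined in this snippet; construct it first.
+model.load_state_dict(sd)
+model.eval()
+# Inference
+text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
+enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
+with torch.no_grad():
+    logits = model(enc["input_ids"])["logits"]
+preds = logits.argmax(-1)[0].tolist()
+offsets = enc["offset_mapping"][0].tolist()
+# Merge per-token BIO predictions into character-level entity spans
+# (assumes labels look like "B-PERSON" / "I-PERSON" / "O")
+spans, current = [], None  # each span: [entity_type, start, end]
+for pred, (cs, ce) in zip(preds, offsets):
+    label = id2label.get(pred, "O")
+    if cs == ce:
+        continue  # skip tokens with an empty character range
+    if label == "O":
+        current = None  # a non-PII token closes any open span
+        continue
+    ent = label.split("-", 1)[-1]  # "B-PERSON" -> "PERSON"
+    if label.startswith("I-") and current is not None and current[0] == ent:
+        current[2] = ce  # extend the open span
+    else:
+        current = [ent, cs, ce]  # start a new span
+        spans.append(current)
+for ent, cs, ce in spans:
+    print(f"{ent:<20} {text[cs:ce]!r}")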
+```
+Example output:
+```
+PERSON               'Jane Smith'
+EMAIL                'jane@example.com'
+PHONE                '555-1234'
+```
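
For redaction, detected spans can be replaced with placeholder tags. The helper below is a sketch and not part of the released code: `redact` is a hypothetical function that takes `(entity_type, start, end)` character spans, and the `[TYPE]` placeholder format is an arbitrary choice. The spans are written by hand here so that the example runs without the model.

```python
def redact(text, spans):
    """Replace each (entity_type, start, end) span with a [TYPE] placeholder."""
    out = []
    cursor = 0
    for ent, start, end in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            # Skip spans that overlap an already redacted region.
            continue
        out.append(text[cursor:start])
        out.append(f"[{ent}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
# Hand-written spans for illustration; in practice they come from the model.
spans = [
    ("PERSON", text.index("Jane"), text.index("Jane") + len("Jane Smith")),
    ("EMAIL", text.index("jane@"), text.index("jane@") + len("jane@example.com")),
    ("PHONE", text.index("555"), text.index("555") + len("555-1234")),
]
print(redact(text, spans))
# Hi, I'm [PERSON]. Reach me at [EMAIL] or [PHONE].
```

Sorting by start offset and skipping overlaps keeps the output well defined even if overlapping spans are passed in.
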
+## Citation
+```bibtex
+@misc{surpem2026sarden1,
+      title  = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
+      author = {Surpem},
+      year   = {2026},
+      url    = {https://huggingface.co/surpem/sarden1-300m},
+}
+```