---
license: apache-2.0
language:
- en
- de
- fr
- it
- es
- nl
- pt
- pl
- cs
- da
- fi
- sv
pipeline_tag: token-classification
tags:
- pii
- ner
- privacy
- token-classification
- transformers
- pytorch
- safetensors
---
# Sarden1: Multilingual PII Detection & Redaction Model
## Model Description
Sarden1 is a token classification model, trained from scratch, for detecting and
redacting personally identifiable information (PII). It identifies and labels
sensitive entity spans in text across 15 locales, which makes it suitable for
GDPR/HIPAA compliance pipelines, log scrubbing, and document redaction at production scale.
* **Developed by:** Surpem
* **Model Type:** Token Classifier (BIO tagging)
* **Architecture:** Custom Decoder-style Transformer
* **Base Model:** Trained from scratch (no pretrained base)
* **License:** Apache 2.0
* **Languages:** en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)
## Architecture
| Component | Detail |
| :--- | :--- |
| Parameters | ~300M |
| Layers | 18 transformer layers |
| Hidden size | 1024 |
| Attention | Grouped Query Attention (16Q / 4KV heads) |
| FFN | SwiGLU (2730 intermediate) |
| Positional encoding | RoPE (θ = 500,000) |
| Normalisation | RMSNorm (no bias) |
| Tokeniser | GPT-2 BPE (vocab 50,257) |
| Precision | bfloat16 |
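
The table implies a few derived quantities that are useful when rebuilding the model. The sketch below collects the listed hyperparameters and derives them. It assumes the common convention that the head dimension equals the hidden size divided by the number of query heads, and the standard RoPE frequency schedule; the card states neither explicitly. `SardenConfigSketch` and `rope_inv_freq` are hypothetical names, not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SardenConfigSketch:
    """Hypothetical container for the hyperparameters listed in the table."""
    hidden_size: int = 1024
    num_layers: int = 18
    num_q_heads: int = 16
    num_kv_heads: int = 4
    ffn_size: int = 2730
    vocab_size: int = 50257
    rope_theta: float = 500_000.0

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / number of query heads.
        return self.hidden_size // self.num_q_heads

    @property
    def queries_per_kv_head(self) -> int:
        # In grouped query attention, several query heads share one KV head.
        return self.num_q_heads // self.num_kv_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of the key (or value) projection under the assumption above.
        return self.num_kv_heads * self.head_dim


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


cfg = SardenConfigSketch()
inv_freq = rope_inv_freq(cfg.head_dim, cfg.rope_theta)
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.kv_proj_width, len(inv_freq))
```

Under these assumptions each KV head serves four query heads, so the key and value projections are a quarter of the width of the query projection.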
## Entity Types
Sarden1 detects 12 PII entity types using BIO span labelling:
| Category | Entity Types |
| :--- | :--- |
| Identity | `PERSON`, `USERNAME`, `DATE` |
| Contact | `EMAIL`, `PHONE`, `ADDRESS` |
| Financial | `CREDITCARD`, `SSN` |
| Documents | `PASSPORT`, `DRIVERSLICENSE` |
| Technical | `IP` |
| Organisational | `ORG` |
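
With BIO labelling, each entity type typically contributes a `B-` (begin) and an `I-` (inside) tag, plus a single shared `O` (outside) tag. The sketch below builds such a label set for the 12 types above. The `B-`/`I-` prefix format and the ordering are assumptions for illustration; the authoritative mapping is the `id2label` entry in `config.json`.

```python
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP",
    "ORG",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")
        labels.append(f"I-{ent}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Illustrative mapping only; the real ids come from config.json.
id2label_sketch = dict(enumerate(labels))
label2id_sketch = {label: idx for idx, label in id2label_sketch.items()}
print(len(labels))
```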
## Get Started
```python
import json

import torch
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load weights and config (both files are expected in the working directory)
sd = load_file("model.safetensors")
with open("config.json") as f:
    cfg = json.load(f)
id2label = {int(k): v for k, v in cfg["id2label"].items()}

# Load tokeniser
tok = AutoTokenizer.from_pretrained(".")

# (Rebuild model from architecture, then:)
# `model` must already be an instance of the Sarden1 architecture at this point.
model.load_state_dict(sd)
model.eval()

# Inference
text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
with torch.no_grad():
    logits = model(enc["input_ids"])["logits"]
preds = logits.argmax(-1)[0].tolist()
offsets = enc["offset_mapping"][0].tolist()

# Print every non-empty token whose predicted label is not "O"
for pred, (cs, ce) in zip(preds, offsets):
    if cs != ce and id2label.get(pred, "O") != "O":
        print(f"{id2label[pred]:<20} {repr(text[cs:ce])}")
```
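Example output (illustrative; the loop above prints one line per token, shown here with consecutive tokens of the same entity merged into a single span):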
```
PERSON 'Jane Smith'
EMAIL 'jane@example.com'
PHONE '555-1234'
```
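
Token-level predictions usually need to be merged into character spans before redaction. The following is a minimal sketch of one way to do that, assuming labels of the form `B-TYPE` / `I-TYPE` / `O`. The helpers `merge_bio_spans` and `redact` are hypothetical, and the token pieces and labels are hand-written for illustration rather than actual tokeniser or model output.

```python
def merge_bio_spans(labels, offsets):
    """Merge per-token BIO labels into (start, end, entity_type) character spans."""
    spans = []
    current = None  # [start, end, entity_type] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # zero-width token (e.g. a special token)
        prefix, _, ent = label.partition("-")
        if not ent:  # "O" or any label without a type suffix
            if current:
                spans.append(tuple(current))
                current = None
            continue
        if prefix == "I" and current and current[2] == ent:
            current[1] = end  # extend the open span
        else:
            # "B-" tag, or a stray "I-" tag, starts a new span
            if current:
                spans.append(tuple(current))
            current = [start, end, ent]
    if current:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each character span with a [TYPE] placeholder."""
    out, cursor = [], 0
    for start, end, ent in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{ent}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
# Hand-written (piece, label) pairs standing in for model predictions.
pieces = [
    ("Hi", "O"), (",", "O"), ("I'm", "O"),
    ("Jane", "B-PERSON"), ("Smith", "I-PERSON"), (".", "O"),
    ("Reach", "O"), ("me", "O"), ("at", "O"),
    ("jane", "B-EMAIL"), ("@example.com", "I-EMAIL"), ("or", "O"),
    ("555", "B-PHONE"), ("-1234", "I-PHONE"), (".", "O"),
]
# Recover character offsets by scanning the text left to right.
offsets, cursor = [], 0
for piece, _ in pieces:
    start = text.index(piece, cursor)
    offsets.append((start, start + len(piece)))
    cursor = start + len(piece)

spans = merge_bio_spans([label for _, label in pieces], offsets)
redacted = redact(text, spans)
print(redacted)
```

With real model output, `labels` would be `[id2label[p] for p in preds]` and `offsets` the tokeniser's offset mapping. A stray `I-` tag is treated leniently here as the start of a new span; a stricter pipeline might discard it instead.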
## Citation
```bibtex
@misc{surpem2026sarden1,
title = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
author = {Surpem},
year = {2026},
url = {https://huggingface.co/surpem/sarden1-300m},
}
```