---
license: apache-2.0
language:
- en
- de
- fr
- it
- es
- nl
- pt
- pl
- cs
- da
- fi
- sv
pipeline_tag: token-classification
tags:
- pii
- ner
- privacy
- token-classification
- transformers
- pytorch
- safetensors
---

# Sarden1: Multilingual PII Detection & Redaction Model

## Model Description

Sarden1 is a token classification model trained from scratch for detecting and
redacting personally identifiable information (PII). It identifies and labels
sensitive entity spans in text across 15 locales (12 languages plus three regional
English variants), making it suitable for GDPR/HIPAA compliance pipelines, log
scrubbing, and document redaction at production scale.

* **Developed by:** Surpem
* **Model Type:** Token Classifier (BIO tagging)
* **Architecture:** Custom decoder-style Transformer
* **Base Model:** None (trained from scratch)
* **License:** Apache 2.0
* **Languages:** en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)

## Architecture

| Component | Detail |
| :--- | :--- |
| Parameters | ~300M |
| Layers | 18 transformer layers |
| Hidden size | 1024 |
| Attention | Grouped Query Attention (16 query heads / 4 KV heads) |
| FFN | SwiGLU (intermediate size 2730) |
| Positional encoding | RoPE (θ = 500,000) |
| Normalisation | RMSNorm (no bias) |
| Tokeniser | GPT-2 BPE (vocab 50,257) |
| Precision | bfloat16 |

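To make the attention row concrete, the sketch below derives the per-head dimension and key/value projection width implied by the table. It assumes the head dimension equals the hidden size divided by the number of query heads, which the card does not state explicitly. The class and field names are illustrative and are not taken from the released config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Illustrative container; field names are not from the released config."""

    hidden_size: int = 1024
    num_query_heads: int = 16
    num_kv_heads: int = 4

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_query_heads.
        return self.hidden_size // self.num_query_heads

    @property
    def queries_per_kv_head(self) -> int:
        # In grouped query attention, each KV head is shared by a group of query heads.
        return self.num_query_heads // self.num_kv_heads

    @property
    def kv_dim(self) -> int:
        # Output width of the key projection (and of the value projection).
        return self.num_kv_heads * self.head_dim


shape = AttentionShape()
print(shape.head_dim)             # 64
print(shape.queries_per_kv_head)  # 4
print(shape.kv_dim)               # 256
```

Under these assumptions, the key and value projections are a quarter of the width of the query projection, which is where grouped query attention saves parameters and memory compared with full multi-head attention.
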
## Entity Types

Sarden1 detects 12 PII entity types using BIO span labelling:

| Category | Entity Types |
| :--- | :--- |
| Identity | `PERSON`, `USERNAME`, `DATE` |
| Contact | `EMAIL`, `PHONE`, `ADDRESS` |
| Financial | `CREDITCARD`, `SSN` |
| Documents | `PASSPORT`, `DRIVERSLICENSE` |
| Technical | `IP` |
| Organisational | `ORG` |

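Under BIO labelling, each entity type typically gets a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag for tokens outside any entity. The sketch below builds such a label set for the 12 types listed above. The label strings and their ordering are assumptions made for illustration; the authoritative mapping is the `id2label` entry in the model's `config.json`.

```python
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP",
    "ORG",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
id2label = dict(enumerate(labels))              # hypothetical ordering
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # 25: one "O" plus two tags for each of the 12 types
```
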
## Get Started

```python
import json

import torch
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load weights and config from the current directory
sd = load_file("model.safetensors")
with open("config.json") as f:
    cfg = json.load(f)
id2label = {int(k): v for k, v in cfg["id2label"].items()}

# Load tokeniser (a fast tokeniser is required for offset mapping)
tok = AutoTokenizer.from_pretrained(".")

# Rebuild the model from the architecture described above before this point;
# `model` must be an instance of that architecture (not defined in this snippet).
model.load_state_dict(sd)
model.eval()

# Inference
text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
with torch.no_grad():
    logits = model(enc["input_ids"])["logits"]
preds = logits.argmax(-1)[0].tolist()
offsets = enc["offset_mapping"][0].tolist()

# Print every token not labelled "O", skipping zero-width (special) tokens
for pred, (cs, ce) in zip(preds, offsets):
    if cs != ce and id2label.get(pred, "O") != "O":
        print(f"{id2label[pred]:<20} {repr(text[cs:ce])}")
```

Example output (illustrative; consecutive tokens of the same entity are shown merged into a single span):

```
PERSON               'Jane Smith'
EMAIL                'jane@example.com'
PHONE                '555-1234'
```

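The loop in Get Started prints one line per token, while the example output lists whole entities. A minimal sketch of the intermediate step, merging token-level BIO tags into character spans and then replacing those spans with placeholders, could look like the following. It assumes labels of the form `B-TYPE` / `I-TYPE` / `O`. The function names and the `[TYPE]` placeholder format are illustrative and not part of the model's released code, and the labels and offsets in the usage example are written by hand rather than produced by the model.

```python
def merge_bio_spans(labels, offsets):
    """Merge per-token BIO labels into (entity_type, start, end) character spans."""
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            # Zero-width offsets mark special tokens; skip them.
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and current is not None and current[0] == entity_type:
            current[2] = end  # extend the running span
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new span.
            current = [entity_type, start, end]
    if current is not None:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each span with [TYPE], working right to left so offsets stay valid."""
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{entity_type}]" + text[end:]
    return text


# Hand-written labels and offsets for illustration (not real model output).
text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
labels = ["B-PERSON", "I-PERSON", "O", "B-EMAIL", "I-EMAIL", "B-PHONE", "I-PHONE"]
offsets = [(8, 12), (13, 18), (18, 19), (32, 36), (36, 48), (52, 55), (55, 60)]

spans = merge_bio_spans(labels, offsets)
for entity_type, start, end in spans:
    print(f"{entity_type:<20} {text[start:end]!r}")
print(redact(text, spans))
# Hi, I'm [PERSON]. Reach me at [EMAIL] or [PHONE].
```

With real model output, `labels` would be `[id2label[p] for p in preds]` and `offsets` the tokeniser's offset mapping from the snippet above.
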
## Citation

```bibtex
@misc{surpem2026sarden1,
  title  = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
  author = {Surpem},
  year   = {2026},
  url    = {https://huggingface.co/surpem/sarden1-300m},
}
```