	---
	license: apache-2.0
	language:
	- en
	- de
	- fr
	- it
	- es
	- nl
	- pt
	- pl
	- cs
	- da
	- fi
	- sv
	pipeline_tag: token-classification
	tags:
	- pii
	- ner
	- privacy
	- token-classification
	- transformers
	- pytorch
	- safetensors
	---

	# Sarden1: Multilingual PII Detection & Redaction Model

	## Model Description

	Sarden1 is a token classification model, trained from scratch, for detecting and
	redacting personally identifiable information (PII). It identifies and labels
	sensitive entity spans in text across 15 locales (12 languages plus three regional
	English variants), making it suitable for GDPR/HIPAA compliance pipelines, log
	scrubbing, and document redaction.

	* Developed by: Surpem
	* Model Type: Token Classifier (BIO tagging)
	* Architecture: Custom Decoder-style Transformer
	* Base Model: Trained from scratch — no pretrained base
	* License: Apache 2.0
	* Languages: en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)
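
Once entity spans are available as character offsets, redaction itself is a plain string operation. The following is a minimal sketch, not part of the model's API: the function name `redact`, the `(start, end, label)` tuple format, and the `[LABEL]` placeholder style are all assumptions made for illustration, and the spans are hand-written stand-ins for model output.

```python
def redact(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already redacted
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def span_of(text, surface, label):
    """Helper for the demo only: locate a known substring and return its span."""
    start = text.index(surface)
    return (start, start + len(surface), label)


text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
# Hand-written spans standing in for what a detector might return
spans = [
    span_of(text, "Jane Smith", "PERSON"),
    span_of(text, "jane@example.com", "EMAIL"),
    span_of(text, "555-1234", "PHONE"),
]
redacted = redact(text, spans)
print(redacted)
```

Building the output left to right keeps the original offsets valid for every span, which would not hold if the string were modified in place.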

	## Architecture

	| Component | Detail |
	| :--- | :--- |
	| Parameters | ~300M |
	| Layers | 18 transformer layers |
	| Hidden size | 1024 |
	| Attention | Grouped Query Attention (16 query heads / 4 KV heads) |
	| FFN | SwiGLU (intermediate size 2730) |
	| Positional encoding | RoPE (θ = 500,000) |
	| Normalisation | RMSNorm (no bias) |
	| Tokeniser | GPT-2 BPE (vocab 50,257) |
	| Precision | bfloat16 |
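
The figures in the table imply a few derived quantities, such as the per-head dimension and the number of query heads sharing each KV head. The sketch below works them out along with a rough weight count. It is a back-of-the-envelope estimate under stated assumptions: the class name `SardenConfig` is hypothetical, the head dimension is taken to be hidden size divided by query heads, projections are assumed to have no biases, and each layer is assumed to carry two RMSNorm weight vectors. It leaves out the classification head, the final norm, and anything else not listed in the table, so the total it prints need not match the reported ~300M.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SardenConfig:  # hypothetical name; values copied from the table above
    hidden_size: int = 1024
    num_layers: int = 18
    num_q_heads: int = 16
    num_kv_heads: int = 4
    ffn_size: int = 2730  # equals floor(8/3 * 1024), a common SwiGLU sizing
    vocab_size: int = 50257
    rope_theta: float = 500_000.0

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_q_heads
        return self.hidden_size // self.num_q_heads

    @property
    def q_heads_per_kv_head(self) -> int:
        # In grouped query attention, several query heads share one KV head
        return self.num_q_heads // self.num_kv_heads

    def attention_params(self) -> int:
        kv_width = self.num_kv_heads * self.head_dim
        q_and_o = 2 * self.hidden_size * self.hidden_size
        k_and_v = 2 * self.hidden_size * kv_width
        return q_and_o + k_and_v

    def ffn_params(self) -> int:
        # SwiGLU uses three matrices: gate, up, and down projections
        return 3 * self.hidden_size * self.ffn_size

    def layer_params(self) -> int:
        # Assumption: two RMSNorm weight vectors per layer, no biases
        return self.attention_params() + self.ffn_params() + 2 * self.hidden_size

    def embedding_params(self) -> int:
        return self.vocab_size * self.hidden_size

    def listed_params(self) -> int:
        # Only the components listed in the table: layers plus token embedding
        return self.num_layers * self.layer_params() + self.embedding_params()


cfg = SardenConfig()
print("head_dim:", cfg.head_dim)
print("query heads per KV head:", cfg.q_heads_per_kv_head)
print("params per layer:", cfg.layer_params())
print("listed components total:", cfg.listed_params())
```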

	## Entity Types

	Sarden1 detects 12 PII entity types using BIO span labelling:

	| Category | Entity Types |
	| :--- | :--- |
	| Identity | `PERSON`, `USERNAME`, `DATE` |
	| Contact | `EMAIL`, `PHONE`, `ADDRESS` |
	| Financial | `CREDITCARD`, `SSN` |
	| Documents | `PASSPORT`, `DRIVERSLICENSE` |
	| Technical | `IP` |
	| Organisational | `ORG` |
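
Under BIO tagging, each entity type usually contributes a `B-` (begin) and an `I-` (inside) label, plus one shared `O` (outside) label. The sketch below shows how 12 entity types would expand into a label set under that convention. The exact label strings and their id ordering here are assumptions; the authoritative mapping is the `id2label` entry in the model's `config.json`.

```python
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP",
    "ORG",
]


def build_bio_labels(entity_types):
    """Expand entity types into a BIO label list: O, then B-/I- per type."""
    labels = ["O"]
    for etype in entity_types:
        labels += [f"B-{etype}", f"I-{etype}"]
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Illustrative id assignment only; the real ids come from config.json
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels), "labels")
print(labels[:5])
```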

	## Get Started

	```python
	import json

	import torch
	from safetensors.torch import load_file
	from transformers import AutoTokenizer

	# Load weights and the label map from the downloaded model files
	sd = load_file("model.safetensors")
	with open("config.json") as f:
	    cfg = json.load(f)
	id2label = {int(k): v for k, v in cfg["id2label"].items()}

	# Load tokeniser from the current directory
	tok = AutoTokenizer.from_pretrained(".")

	# `model` is not defined in this snippet: rebuild it from the
	# architecture described above, then load the weights into it.
	model.load_state_dict(sd)
	model.eval()

	# Inference
	text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
	enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
	with torch.no_grad():
	    logits = model(enc["input_ids"])["logits"]

	preds = logits.argmax(-1)[0].tolist()
	offsets = enc["offset_mapping"][0].tolist()

	# Print every token with a non-"O" label, skipping zero-width tokens
	for pred, (cs, ce) in zip(preds, offsets):
	    if cs != ce and id2label.get(pred, "O") != "O":
	        print(f"{id2label[pred]:<20} {repr(text[cs:ce])}")
	```

	Example output:
	```
	PERSON 'Jane Smith'
	EMAIL 'jane@example.com'
	PHONE '555-1234'
	```
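
The loop in the snippet above prints one line per labelled token, whereas the example output lists whole entity spans. One way to bridge the two is to merge consecutive BIO tags into spans. The sketch below is one possible approach, not the author's post-processing: the function name `merge_bio_spans` is hypothetical, it assumes label strings of the form `B-TYPE` / `I-TYPE` / `O`, and the token labels and offsets in the demo are hand-written rather than real tokeniser or model output.

```python
def merge_bio_spans(labels, offsets, text):
    """Merge token-level BIO labels into (start, end, type, surface) spans."""
    spans = []
    current = None  # [start, end, type] of the span being built
    for label, (cs, ce) in zip(labels, offsets):
        if cs == ce:
            continue  # zero-width token: ignore without ending the span
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current is not None and current[2] == etype:
            current[1] = ce  # extend the open span
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new one
            current = [cs, ce, etype]
    if current is not None:
        spans.append(tuple(current))
    return [(s, e, t, text[s:e]) for s, e, t in spans]


text = "Jane Smith: jane@example.com"
# Hand-written token labels and character offsets for illustration
token_labels = ["B-PERSON", "I-PERSON", "O", "B-EMAIL", "I-EMAIL", "I-EMAIL"]
token_offsets = [(0, 4), (5, 10), (10, 11), (12, 16), (16, 17), (17, 28)]

entity_spans = merge_bio_spans(token_labels, token_offsets, text)
for start, end, etype, surface in entity_spans:
    print(f"{etype:<20} {surface!r}")
```

With the real model, `token_labels` would be `[id2label[p] for p in preds]` and `token_offsets` would be the `offsets` list from the snippet above.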

	## Citation

	```bibtex
	@misc{surpem2026sarden1,
	  title  = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
	  author = {Surpem},
	  year   = {2026},
	  url    = {https://huggingface.co/surpem/sarden1-300m},
	}
	```