Commit ad0a5d1 (verified), committed by Ill-Ness
1 parent: 1b9e208

Update README.md

Files changed (1)
  1. README.md +119 -3
README.md CHANGED
@@ -1,3 +1,119 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - de
+ - fr
+ - it
+ - es
+ - nl
+ - pt
+ - pl
+ - cs
+ - da
+ - fi
+ - sv
+ pipeline_tag: token-classification
+ tags:
+ - pii
+ - ner
+ - privacy
+ - token-classification
+ - transformers
+ - pytorch
+ - safetensors
+ ---
+
+ # Sarden1-300M: Multilingual PII Detection & Redaction Model
+
+ ## Model Description
+
+ Sarden1-300M is a high-performance token classification model built from scratch for
+ personally identifiable information (PII) detection and redaction. It identifies and
+ labels sensitive entity spans in text across 15 locales, making it suitable for
+ GDPR/HIPAA compliance pipelines, log scrubbing, and document redaction at production scale.
+
+ * **Developed by:** Surpem
+ * **Model Type:** Token Classifier (BIO tagging)
+ * **Architecture:** Custom Decoder-style Transformer
+ * **Base Model:** Trained from scratch — no pretrained base
+ * **License:** Apache 2.0
+ * **Languages:** en, de, fr, it, es, nl, pt, pl, cs, da, fi, sv (+ en_GB, en_CA, en_AU)
+
+ ## Architecture
+
+ | Component | Detail |
+ | :--- | :--- |
+ | Parameters | ~300M |
+ | Layers | 18 transformer layers |
+ | Hidden size | 1024 |
+ | Attention | Grouped Query Attention (16Q / 4KV heads) |
+ | FFN | SwiGLU (2730 intermediate) |
+ | Positional encoding | RoPE (θ = 500,000) |
+ | Normalisation | RMSNorm (no bias) |
+ | Tokeniser | GPT-2 BPE (vocab 50,257) |
+ | Precision | bfloat16 |
+
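To make the table's shapes concrete, here is a rough sketch of how the listed dimensions translate into per-layer weight counts. It assumes a conventional layout that the README does not confirm: head dimension equal to hidden size divided by the number of query heads, bias-free projections, and a three-matrix SwiGLU block (gate, up, down). `ArchConfig` and the helper functions are hypothetical names used only for this illustration. Components not listed in the table are not counted, so the sum need not match the headline parameter figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    # Values copied from the architecture table in the README.
    n_layers: int = 18
    hidden: int = 1024
    n_q_heads: int = 16
    n_kv_heads: int = 4
    ffn_inner: int = 2730
    vocab: int = 50257


def head_dim(cfg: ArchConfig) -> int:
    # Assumption: head_dim = hidden / number of query heads.
    return cfg.hidden // cfg.n_q_heads


def attention_params(cfg: ArchConfig) -> int:
    # Grouped Query Attention: full-width Q and output projections,
    # narrower K and V projections shared across groups of query heads.
    d = head_dim(cfg)
    q_proj = cfg.hidden * cfg.n_q_heads * d
    kv_proj = 2 * cfg.hidden * cfg.n_kv_heads * d
    out_proj = cfg.n_q_heads * d * cfg.hidden
    return q_proj + kv_proj + out_proj


def ffn_params(cfg: ArchConfig) -> int:
    # Assumption: SwiGLU with gate, up and down matrices, no biases.
    return 3 * cfg.hidden * cfg.ffn_inner


def embedding_params(cfg: ArchConfig) -> int:
    # Token embedding matrix only; whether it is tied to anything is unknown.
    return cfg.vocab * cfg.hidden


cfg = ArchConfig()
print("head dim:", head_dim(cfg))
print("K/V projection width:", cfg.n_kv_heads * head_dim(cfg))
print("attention weights per layer:", attention_params(cfg))
print("FFN weights per layer:", ffn_params(cfg))
print("embedding weights:", embedding_params(cfg))
```

Under these assumptions, sharing 4 key/value heads across 16 query heads makes the K and V projections a quarter of the width of the Q projection, which is the main saving Grouped Query Attention offers.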
+ ## Entity Types
+
+ Sarden1-300M detects 12 PII entity types using BIO span labelling:
+
+ | Category | Entity Types |
+ | :--- | :--- |
+ | Identity | `PERSON`, `USERNAME`, `DATE` |
+ | Contact | `EMAIL`, `PHONE`, `ADDRESS` |
+ | Financial | `CREDITCARD`, `SSN` |
+ | Documents | `PASSPORT`, `DRIVERSLICENSE` |
+ | Technical | `IP` |
+ | Organisational | `ORG` |
+
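As an illustration of what BIO labelling over these 12 types usually implies, the sketch below builds a label set with one `O` tag plus a `B-` and `I-` tag per entity type. The `B-`/`I-` prefix spelling and the ordering are assumptions; the authoritative mapping is the `id2label` entry in the model's `config.json`, which may differ. `build_bio_labels` is a hypothetical helper.

```python
# Entity types as listed in the README table.
ENTITY_TYPES = [
    "PERSON", "USERNAME", "DATE",
    "EMAIL", "PHONE", "ADDRESS",
    "CREDITCARD", "SSN",
    "PASSPORT", "DRIVERSLICENSE",
    "IP",
    "ORG",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]  # "outside": the token is not part of any entity
    for etype in entity_types:
        labels.append(f"B-{etype}")  # first token of a span
        labels.append(f"I-{etype}")  # continuation token of a span
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
# Assumed ordering, for illustration only; use config.json in practice.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS), "labels")
```

With this scheme, 12 entity types give 2 × 12 + 1 labels in total.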
+ ## Get Started
+
+ ```python
+ import json, torch
+ from safetensors.torch import load_file
+ from transformers import AutoTokenizer
+
+ # Load weights and config
+ sd = load_file("model.safetensors")
+ cfg = json.load(open("config.json"))
+ id2label = {int(k): v for k, v in cfg["id2label"].items()}
+
+ # Load tokeniser
+ tok = AutoTokenizer.from_pretrained(".")
+
+ # (Rebuild model from architecture, then:)
+ model.load_state_dict(sd)
+ model.eval()
+
+ # Inference
+ text = "Hi, I'm Jane Smith. Reach me at jane@example.com or 555-1234."
+ enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(enc["input_ids"])["logits"]
+
+ preds = logits.argmax(-1)[0].tolist()
+ offsets = enc["offset_mapping"][0].tolist()
+
+ for pred, (cs, ce) in zip(preds, offsets):
+     if cs != ce and id2label.get(pred, "O") != "O":
+         print(f"{id2label[pred]:<20} {repr(text[cs:ce])}")
+ ```
+
+ Example output:
+ ```
+ PERSON 'Jane Smith'
+ EMAIL 'jane@example.com'
+ PHONE '555-1234'
+ ```
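The loop in the Get Started snippet prints one line per token, so producing whole-entity spans like those in the example output requires grouping consecutive BIO tags. The following is a minimal sketch of that grouping plus a simple redaction step. It assumes labels of the form `B-TYPE` / `I-TYPE` / `O` and character offsets as returned by a fast tokenizer; `merge_bio_spans` and `redact` are hypothetical helpers, not part of the model repository.

```python
def merge_bio_spans(labels, offsets):
    """Group per-token BIO labels into (entity_type, start, end) character spans."""
    spans = []
    current = None  # mutable [entity_type, start, end] for the open span
    for label, (cs, ce) in zip(labels, offsets):
        if cs == ce:
            continue  # skip special or empty tokens
        prefix, _, etype = label.partition("-")
        if not etype:
            # "O" (or any label without a B-/I- prefix) closes the open span.
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        if prefix == "I" and current is not None and current[0] == etype:
            current[2] = ce  # extend the open span to this token's end
        else:
            # A B- tag, or an I- tag with no matching open span, starts a new one.
            if current is not None:
                spans.append(tuple(current))
            current = [etype, cs, ce]
    if current is not None:
        spans.append(tuple(current))
    return spans


def redact(text, spans):
    """Replace each span with a [TYPE] placeholder, working right to left."""
    out = text
    for etype, cs, ce in sorted(spans, key=lambda s: s[1], reverse=True):
        out = out[:cs] + f"[{etype}]" + out[ce:]
    return out
```

In the Get Started snippet this could be called as `merge_bio_spans([id2label.get(p, "O") for p in preds], offsets)`. Replacing from right to left keeps earlier character offsets valid while the string length changes. Depending on tokenizer settings, offsets may or may not include a token's leading space, so spans may need trimming.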
+
+ ## Citation
+
+ ```bibtex
+ @misc{surpem2026sarden1,
+ title = {Sarden1-300M: Multilingual PII Detection \& Redaction Model},
+ author = {Surpem},
+ year = {2026},
+ url = {https://huggingface.co/surpem/sarden1-300m},
+ }
+ ```