---
license: apache-2.0
base_model: OpenMed/privacy-filter-multilingual
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
library_name: openmed
tags:
- openmed
- mlx
- apple-silicon
- token-classification
- pii
- de-identification
- privacy-filter
- multilingual
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---
# OpenMed Privacy Filter (Multilingual) - MLX 8-bit
A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual)
for fast, on-device fine-grained PII detection across **54 categories**
and **16 languages** on Apple Silicon.
This 8-bit affine-quantized artifact reduces download size and resident memory; for the full-precision sibling see [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx).
> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch**: [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) (CPU + CUDA).
> - **MLX BF16**: [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) (Apple Silicon, full precision, ~2.6 GB).
> - **MLX 8-bit** (this repo): [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) (Apple Silicon, ~1.4 GB).
## What it does
The model is a token classifier built on the OpenAI Privacy Filter
architecture (`openai_privacy_filter`). It tags each token with a BIOES
label across **54 PII span classes**, then a Viterbi pass over the BIOES
grammar yields clean entity spans. Languages covered: Arabic, Bengali,
Chinese, Dutch, English, French, German, Hindi, Italian, Japanese,
Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese.
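For intuition, the sketch below shows how BIOES tags collapse into character spans. It is a simplified greedy walk, not the Viterbi decoder the pipeline actually uses, and the tags and offsets are invented for illustration.
```python
# Simplified sketch of BIOES span grouping (illustration only; the real
# pipeline decodes with Viterbi over the BIOES grammar). The tags and
# offsets below are hypothetical.
def bioes_to_spans(tags, offsets):
    """tags: per-token BIOES labels; offsets: (start, end) per token."""
    spans, start, open_label = [], None, None
    for (s, e), tag in zip(offsets, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                           # single-token entity
            spans.append((label, s, e))
        elif prefix == "B":                         # entity opens
            start, open_label = s, label
        elif prefix == "E" and start is not None:   # entity closes
            spans.append((open_label, start, e))
            start = None
        # "I" tokens extend the open span; the end offset comes from E
    return spans

tags = ["O", "O", "B-FIRSTNAME", "E-FIRSTNAME", "S-EMAIL"]
offsets = [(0, 5), (6, 8), (9, 14), (14, 19), (20, 43)]
print(bioes_to_spans(tags, offsets))
# [('FIRSTNAME', 9, 19), ('EMAIL', 20, 43)]
```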
<details>
<summary>Full label schema (217 BIOES labels)</summary>
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54
span classes (4 × 54 + 1 = 217). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags. The full `id2label` mapping is
shipped alongside the weights in this repo.
</details>
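The arithmetic behind the 217-label space is easy to reproduce; a minimal sketch (the two class names are examples, and the authoritative ordering lives in `id2label.json`):
```python
# Sketch of the BIOES label-space layout; the real class order is
# defined by id2label.json in this repo.
span_classes = ["FIRSTNAME", "EMAIL"]  # ... 54 classes in total
labels = ["O"] + [f"{p}-{c}" for c in span_classes for p in ("B", "I", "E", "S")]
# With all 54 classes: 1 + 4 * 54 == 217
```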
For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-multilingual).
## Architecture
| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken), vocab 200,064 |
| Output head | Linear(640 → 217) with bias |
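To verify these values against the shipped artifact, you can read `config.json` directly. A sketch, where the key names are assumptions to check against the actual file:
```python
# Sketch: inspect the shipped config. The key names here are
# assumptions; open config.json in this repo for the real field names.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("OpenMed/privacy-filter-multilingual-mlx-8bit", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
for key in ("hidden_size", "num_hidden_layers", "num_attention_heads",
            "num_key_value_heads", "rope_theta"):
    print(key, cfg.get(key))
```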
## File set
| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | ~1.4 GB | Model weights in OpenMed-MLX layout |
| `config.json` | ~19 KB | Model + MLX runtime config |
| `id2label.json` | ~5 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | ~1 KB | OpenMed MLX manifest (task, family, runtime hints) |
| `tokenizer.json`, `tokenizer_config.json` | ~28 MB | Source tokenizer files (kept for reference) |
The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.
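For reference, encoding text with the same `o200k_base` encoding takes two lines with `tiktoken` (the runtime does this for you; the snippet is purely illustrative):
```python
# Illustrative: tokenize with tiktoken's o200k_base encoding, the same
# encoding the MLX runtime uses internally.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com")
print(len(ids), ids[:5])
print(enc.decode(ids))  # round-trips to the original string
```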
## Label space (54 categories)
| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |
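If you only need one of these groups, a post-filter over the grouped entities is enough. A sketch assuming the entity-dict format documented under "Direct MLX usage" below:
```python
# Sketch: keep only financial entities, using the category rows above
# and the pipeline's entity-dict format (see "Direct MLX usage").
FINANCIAL = {"ACCOUNTNAME", "BANKACCOUNT", "IBAN", "BIC", "CREDITCARD",
             "CREDITCARDISSUER", "CVV", "PIN", "MASKEDNUMBER", "AMOUNT",
             "CURRENCY", "CURRENCYCODE", "CURRENCYNAME", "CURRENCYSYMBOL"}

def keep_financial(entities):
    return [e for e in entities if e["entity_group"] in FINANCIAL]
```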
## Quick start
### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)
OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere; the same code
runs on every host.
```bash
pip install -U "openmed[mlx]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
```
When MLX isn't available (Linux, Windows, Intel Macs, or a missing `mlx` package),
the same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) and emits a one-time warning. The fallback is
family-aware: a multilingual MLX request never substitutes an unrelated model.
### Direct MLX usage (lower-level)
```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline
model_path = snapshot_download("OpenMed/privacy-filter-multilingual-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)
print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'EMAIL',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```
The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
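Those offsets make custom redaction straightforward. A minimal sketch that masks each detected span in place (processing right-to-left so earlier offsets stay valid):
```python
# Sketch: mask detected spans using the returned character offsets.
def mask_spans(text, entities):
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "Email me at alice.smith@example.com after 5pm."
print(mask_spans(text, pipe(text)))
# Email me at [EMAIL] after 5pm.
```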
## Hardware notes
- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower. A quick host check is sketched after this list.
- Tested on macOS with `mlx>=0.18`. The MLX runtime in this repo is
independent of `mlx_lm` (token classification, not causal LM).
- Lower latency / smaller memory than the BF16 sibling.
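The `openmed` dispatcher performs runtime selection for you, but a manual host check is a few lines (a sketch; the library's internal logic may differ):
```python
# Sketch: detect whether the MLX path is usable on this host.
import platform
import sys
from importlib.util import find_spec

on_apple_silicon = sys.platform == "darwin" and platform.machine() == "arm64"
has_mlx = find_spec("mlx") is not None
print("MLX path available:", on_apple_silicon and has_mlx)
```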
## Credits & Acknowledgements
This artifact wouldn't exist without two open-source releases; sincere
thanks to both teams:
- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
(architecture, modeling code, and `opf` training/eval CLI). The MLX
port in this repo runs that same architecture under Apple's MLX
framework.
- **AI4Privacy** for releasing the multilingual PII masking datasets
used to fine-tune the source PyTorch checkpoint:
[`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
[`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).
Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **Hugging Face** team for the model-distribution ecosystem.
## License
Apache 2.0.