---
license: apache-2.0
base_model: OpenMed/privacy-filter-multilingual
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
library_name: openmed
tags:
- openmed
- mlx
- apple-silicon
- token-classification
- pii
- de-identification
- privacy-filter
- multilingual
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---

# OpenMed Privacy Filter (Multilingual) — MLX 8-bit

A native [MLX](https://github.com/ml-explore/mlx) port of [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) for fast, on-device, fine-grained PII detection across **54 categories** and **16 languages** on Apple Silicon. This 8-bit affine-quantized artifact reduces download size and resident memory; for the full-precision sibling see [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx).

> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch** — [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) — CPU + CUDA.
> - **MLX BF16** — [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) — Apple Silicon, full precision (~2.6 GB).
> - **MLX 8-bit** (this repo) — [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) — Apple Silicon, ~1.4 GB.

## What it does

The model is a token classifier built on the OpenAI Privacy Filter architecture (`openai_privacy_filter`). It tags each token with a BIOES label across **54 PII span classes**; a Viterbi pass over the BIOES grammar then yields clean entity spans.

Languages covered: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese.
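To make the tagging scheme concrete, here is a deliberately naive, illustrative sketch of how a run of BIOES token tags collapses into entity spans. It is not the pipeline's actual Viterbi decoder, and the tokens and tags are invented for the example.

```python
# Illustrative only: grouping BIOES token tags into entity spans.
# The real runtime decodes with a Viterbi pass over the BIOES grammar;
# this naive grouper just follows B-/I-/E- runs and S- singletons.
tokens = ["Call", "Sarah", "at", "alice", ".", "smith", "@", "example", ".", "com"]
tags   = ["O", "S-FIRSTNAME", "O",
          "B-EMAIL", "I-EMAIL", "I-EMAIL", "I-EMAIL", "I-EMAIL", "I-EMAIL", "E-EMAIL"]

spans, current = [], None
for token, tag in zip(tokens, tags):
    prefix, _, label = tag.partition("-")
    if prefix == "S":                        # single-token entity
        spans.append((label, [token]))
        current = None
    elif prefix == "B":                      # open a multi-token entity
        current = (label, [token])
    elif prefix in ("I", "E") and current:   # extend / close the open entity
        current[1].append(token)
        if prefix == "E":
            spans.append(current)
            current = None
    else:                                    # "O" or a malformed sequence
        current = None

print(spans)
# [('FIRSTNAME', ['Sarah']), ('EMAIL', ['alice', '.', 'smith', '@', 'example', '.', 'com'])]
```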
**Full label schema (217 BIOES labels).** The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 span classes (4 × 54 + 1 = 217). The runtime `PrivacyFilterMLXPipeline` runs Viterbi over this BIOES grammar, so the consumer sees clean grouped entities rather than raw token tags. The full `id2label` mapping is shipped alongside the weights in this repo.
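As a sanity check, the shipped `id2label.json` can be downloaded on its own and reduced back to the span classes. A small sketch, assuming the file is the plain ID → BIOES-label map described above:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only the label map from this repo and collapse it to span classes.
path = hf_hub_download("OpenMed/privacy-filter-multilingual-mlx-8bit", "id2label.json")
with open(path) as f:
    id2label = json.load(f)

labels = list(id2label.values())
classes = sorted({label.split("-", 1)[1] for label in labels if label != "O"})

print(len(labels))   # 217 BIOES labels
print(len(classes))  # 54 span classes
print(classes[:5])   # e.g. ['ACCOUNTNAME', 'AGE', 'AMOUNT', 'BANKACCOUNT', 'BIC']
```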
For per-label accuracy, training recipe, and dataset details, see the [base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-multilingual).

## Architecture

| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts — 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken) — vocab 200,064 |
| Output head | Linear(640 → 217) with bias |

## File set

| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | ~1.4 GB | Model weights in OpenMed-MLX layout |
| `config.json` | ~19 KB | Model + MLX runtime config |
| `id2label.json` | ~5 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | ~1 KB | OpenMed MLX manifest (task, family, runtime hints) |
| `tokenizer.json`, `tokenizer_config.json` | ~28 MB | Source tokenizer files (kept for reference) |

The MLX runtime uses `tiktoken`'s `o200k_base` encoding directly for tokenization; `tokenizer.json` is kept so consumers can inspect or re-tokenize via `transformers` if desired.
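Because tokenization happens with `tiktoken` rather than a `transformers` tokenizer, you can inspect exactly what the runtime feeds the model with a few lines. A minimal check using only tiktoken's public API; the example string is arbitrary:

```python
import tiktoken

# The MLX runtime tokenizes with tiktoken's o200k_base encoding directly.
enc = tiktoken.get_encoding("o200k_base")

ids = enc.encode("Email me at alice.smith@example.com after 5pm.")
print(len(ids), ids[:8])  # token count and the first few token IDs
print(enc.decode(ids))    # round-trips to the original string
print(enc.n_vocab)        # size of the o200k_base encoding
```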
## Label space (54 categories)

| Group | Labels |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |

## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

OpenMed gives you a single `extract_pii()` / `deidentify()` API that auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on every host.

```bash
pip install -U "openmed[mlx]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
```

When MLX isn't available (Linux, Windows, Intel Mac, missing `mlx` package), the exact same call automatically falls back to the PyTorch checkpoint [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) with a one-time warning. The fallback is family-aware: a Multilingual MLX request never falls back to an unrelated baseline.

### Direct MLX usage (lower-level)

```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

model_path = snapshot_download("OpenMed/privacy-filter-multilingual-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)

print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'EMAIL',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```

The pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` (character offsets into the input string).

## Hardware notes

- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`. The MLX runtime in this repo is independent of `mlx_lm` (token classification, not causal LM).
- Lower latency and smaller resident memory than the BF16 sibling.

## Credits & Acknowledgements

This artifact wouldn't exist without two open-source releases — sincere thanks to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter) (architecture, modeling code, and the `opf` training/eval CLI). The MLX port in this repo runs that same architecture under Apple's MLX framework.
- **AI4Privacy** for releasing the multilingual PII masking datasets used to fine-tune the source PyTorch checkpoint: [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k), [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k), and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).

Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx) and the **Hugging Face** team for the model-distribution ecosystem.

## License

Apache 2.0.