---
license: apache-2.0
language:
- en
- de
- fr
base_model:
- openai/privacy-filter
tags:
- GLiNER
- PII
- privacy-filter
---

# Model description

**GLiNERized** privacy-filter built from the [OpenAI privacy-filter model](https://huggingface.co/openai/privacy-filter). The model keeps the privacy-filter encoder backbone and replaces the linear token classifier with a GLiNER head, moving from classification over 8 predefined PII labels to zero-shot recognition of arbitrary labels. This gives more flexibility in production settings where the set of filtered labels can change at runtime.

## Background GLiNER

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and large for resource-constrained scenarios.

## Background Privacy-filter

OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a model that they can run on-premises and that is fast, context-aware, and tunable.

OpenAI Privacy Filter is pretrained autoregressively to arrive at a checkpoint with an architecture similar to gpt-oss, albeit smaller. That checkpoint was then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, please see the gpt-oss model card.) Instead of generating text token-by-token, this model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure.
For each input token, the model predicts a probability distribution over the label taxonomy, which consists of 8 output categories.

### Installation & Usage

Install the `gliner` package:

```bash
pip install gliner
```

Once the GLiNER library is installed, you can import the `GLiNER` class. You can then load this model using `GLiNER.from_pretrained` and predict entities with `predict_entities`.

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("latence/privacy-filter-GLiNER-v0.1")

# Sample input: a pre-tokenized fraud report containing IPs, GAIDs, usernames and user agents.
text = """ . | Timestamp ( UTC ) | Attributed Sub-Publisher | Click IP Address | Install IP Address | IP Owner / Type | Geolocation Claimed | | - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - | | 2023-10-26 14 : 22 : 01 | subpub_xyz789 | 35 . 172 . 110 . 45 | ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` | Amazon AWS | United States | | 2023-10-26 14 : 22 : 34 | subpub_xyz789 | 104 . 198 . 15 . 201 | 104 . 198 . 15 . 201 | Google Cloud | United States | | 2023-10-26 14 : 23 : 11 | subpub_abc123 | 159 . 65 . 92 . 203 | 159 . 65 . 92 . 203 | DigitalOcean | United States | | 2023-10-26 14 : 24 : 05 | subpub_xyz789 | 52 . 54 . 10 . 88 | 52 . 54 . 10 . 88 | Amazon AWS | United States | * * Conclusion : * * The use of datacenter IPs like ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` , which resolves to a known server farm , is a clear violation and proof of NHT . Legitimate users do not install and play mobile games from AWS or Google Cloud servers . The fraudster ' s attempt to spoof the ` France ` is noted but ultimately negated by the IP ownership analysis . # # # # * * 4 . 3 Inconsistent and Fraudulent Device & User Agent Data * * Further analysis of the device-level data for the fraudulent cohort reveals widespread signs of emulation and parameter spoofing . Genuine devices provide a consistent and logical set of data points . Fraudulent installs often show illogical combinations or repeated identifiers . Please review the following sample records from the fraudulent install list : | Advertising ID ( GAID ) | Device Username | Device User Agent | Notes / Red Flags | | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | | ` AC01601C-7C7B-4C30-B84A-D9F938D8DB38 ` | ` laine . achille ` | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 9 ; SM-G960F Build / PPR1 . 180610 . 011 ) | User Agent is for a Samsung S9 on Android 9 . This GAID has been seen with 15 different User Agents in the last 24 hours . Classic sign of spoofing . | | 2b8c541f-3a21-4f9e-8e3b-1d7c0a9b4f6d | generic_x86 | ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` | This user agent string for a ' Pixel 4 ' is missing several key tokens , indicating a poorly configured emulator . The ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` string is incomplete . | | c4e5a6b7-8d9c-0f1e-2a3b-4c5d6e7f8a9b | android-user | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 11 ; sdk_gphone_x86 Build / RSR1 . 201013 . 001 ) | ' sdk_gphone_x86 ' is the default device name for the standard Android Studio emulator . This is clearly not a real user device . | | e5f6a7b8-9c0d-1e2f-3a4b-5c6d7e8f9a0b | ` laine . achille ` | Dalvik / 1 . 6 . 0 ( Linux ; U ; Android 4 . 4 . 2 ; GT-I9505 Build / KOT49H ) | This represents an ancient Android 4 . 4 . 2 device . Our app ' s minimum requirement is Android 6 . 0 . The install should not have been possible"""

# Labels are free-form and can be changed between calls.
labels = ['username', 'google_gaid', 'ip_address', 'user_agent', 'country']

entities = model.predict_entities(text, labels, threshold=0.3)

# Print the span text, predicted label, and confidence score of each entity.
for entity in entities[0]:
    print(entity["text"], "=>", entity["label"], "=>", entity["score"])
```

```
35 . 172 . 110 . 45 => ip_address => 0.458984375
United States => country => 0.404296875
United States => country => 0.33203125
subpub_abc123 => google_gaid => 0.353515625
159 . 65 . 92 . 203 => ip_address => 0.3984375
159 . 65 . 92 . 203 => ip_address => 0.322265625
52 . 54 . 10 . 88 => ip_address => 0.37890625
52 . 54 . 10 . 88 => ip_address => 0.33984375
France => country => 0.75390625
```

## Training

Training started with a small "GLiNERization" warmup on a general multilingual NER dataset, followed by fine-tuning on a curated PII dataset covering English, German, and French. Besides a variety of long-tail PII labels, the dataset focuses on 78 GDPR-relevant labels.

## Input:
**Input Type(s):** Text
**Input Format:** UTF-8 string(s)
**Input Parameters:** One-Dimensional (1D)
**Other Properties Related to Input:** supports structured and unstructured text
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** One-Dimensional (1D)
**Other Properties Related to Output:** List of dictionaries with keys {text, label, start, end, score}
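
Since each prediction carries `start` and `end` positions, the output can be used for masking. The following is a minimal sketch and not part of the GLiNER library: `mask_entities` is a hypothetical helper, the `[LABEL]` placeholder format is an arbitrary choice, and the code assumes `start`/`end` are character offsets into the input string with `end` exclusive. The sample entities are hand-written for illustration and are not model output.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], min_score: float = 0.0) -> str:
    """Replace each predicted span with a [LABEL] placeholder.

    Assumes `start`/`end` are character offsets into `text` (end exclusive).
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Process spans right to left so earlier offsets stay valid after each replacement.
    kept.sort(key=lambda e: e["start"], reverse=True)
    masked = text
    last_start = len(text) + 1
    for e in kept:
        # Skip a span that overlaps one already masked further to the right.
        if e["end"] > last_start:
            continue
        placeholder = f"[{e['label'].upper()}]"
        masked = masked[: e["start"]] + placeholder + masked[e["end"]:]
        last_start = e["start"]
    return masked


# Hand-written example entities (not model output) in the documented dict format.
sample_text = "Contact laine.achille from 52.54.10.88 in France."
sample_entities = [
    {"text": "laine.achille", "label": "username", "start": 8, "end": 21, "score": 0.91},
    {"text": "52.54.10.88", "label": "ip_address", "start": 27, "end": 38, "score": 0.38},
    {"text": "France", "label": "country", "start": 42, "end": 48, "score": 0.75},
]

print(mask_entities(sample_text, sample_entities))
```

Passing a higher `min_score` (for example `0.5`) leaves low-confidence spans unmasked, which trades recall for fewer false redactions.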
## Software Integration:
**Runtime Engine(s):**

* PyTorch, GLiNER Python library
## Limitations
This is an early checkpoint. The already fine-tuned encoder backbone and the architecture itself make the model behave differently from the general pretrained DeBERTa, mT5, or ModernBERT backbones used with a GLiNER head. Benchmarks will follow with later checkpoints.