| --- |
| license: apache-2.0 |
| language: |
| - en |
| - de |
| - fr |
| base_model: |
| - openai/privacy-filter |
| tags: |
| - GLiNER |
| - PII |
| - privacy-filter |
| --- |
| # Model description |
| **GLiNERized** privacy-filter built from the [OpenAI privacy-filter model](https://huggingface.co/openai/privacy-filter). The model keeps the privacy-filter encoder backbone and replaces the linear token classifier with a GLiNER head, moving from fixed classification over 8 predefined PII labels to zero-shot entity recognition. This adds flexibility in production settings where the set of filtered labels can change at runtime. |
|
|
| ## Background GLiNER |
| GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which are flexible but too large and costly for resource-constrained scenarios. |
|
|
| ## Background Privacy-filter |
| OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a model that they can run on-premises that is fast, context-aware, and tunable. |
|
|
| OpenAI Privacy Filter is pretrained autoregressively to arrive at a checkpoint with an architecture similar to gpt-oss, albeit smaller. That checkpoint was then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, please see the gpt-oss model card.) Instead of generating text token by token, the model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, it predicts a probability distribution over the label taxonomy, which consists of the 8 output categories described in the original model card. |
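
The constrained decoding step mentioned above can be illustrated with a small sketch. It assumes a BIO tag scheme and one hard constraint (an `I-X` tag may only follow `B-X` or `I-X`). The source does not state which tag scheme or transition rules Privacy Filter uses, and `EMAIL`, `allowed`, and `constrained_viterbi` are hypothetical names chosen for illustration. This is not OpenAI's implementation.

```python
import math


def allowed(prev: str, curr: str) -> bool:
    """Assumed BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(log_probs, tags):
    """Return the highest-scoring valid tag path.

    log_probs holds one dict {tag: log-probability} per token.
    """
    # A sequence may not start inside a span, so I-* tags start at -inf.
    best = {
        t: (-math.inf if t.startswith("I-") else log_probs[0][t]) for t in tags
    }
    back = []
    for scores in log_probs[1:]:
        step, pointers = {}, {}
        for curr in tags:
            # Best predecessor among the tags allowed to precede `curr`.
            prev_score, prev_tag = max(
                (best[p], p) for p in tags if allowed(p, curr)
            )
            step[curr] = prev_score + scores[curr]
            pointers[curr] = prev_tag
        best = step
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-EMAIL", "I-EMAIL"]
# Hypothetical per-token probabilities for a three-token input.
probs = [
    {"O": 0.5, "B-EMAIL": 0.4, "I-EMAIL": 0.1},
    {"O": 0.1, "B-EMAIL": 0.3, "I-EMAIL": 0.6},
    {"O": 0.1, "B-EMAIL": 0.1, "I-EMAIL": 0.8},
]
log_probs = [{t: math.log(p) for t, p in row.items()} for row in probs]

greedy = [max(row, key=row.get) for row in probs]
decoded = constrained_viterbi(log_probs, tags)
print("greedy: ", greedy)
print("viterbi:", decoded)
```

Taking the per-token argmax would open the span with `I-EMAIL`, which is not a valid BIO sequence. The constrained search instead picks the best path that starts the span with `B-EMAIL`.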
|
|
| ## Installation & Usage |
| Install the `gliner` package: |
| ```bash |
| pip install gliner |
| ``` |
| Once you've installed the GLiNER library, import the `GLiNER` class, load this model with `GLiNER.from_pretrained`, and predict entities with `predict_entities`. |
|
|
| ```python |
| from gliner import GLiNER |
| model = GLiNER.from_pretrained("latence/privacy-filter-GLiNER-v0.1") |
| text = """ |
| . | Timestamp ( UTC ) | Attributed Sub-Publisher | Click IP Address | Install IP Address | IP Owner / Type | Geolocation Claimed | | - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - |
| | | 2023-10-26 14 : 22 : 01 | subpub_xyz789 | 35 . 172 . 110 . 45 | ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` | Amazon AWS | United States | | 2023-10-26 14 : 22 : 34 | subpub_xyz789 | 104 . 198 . 15 . 201 | 104 . 198 . 15 . 201 | Google Cloud | United States | | 2023-10-26 14 : 23 : 11 | subpub_abc123 | 159 . 65 . 92 . 203 | 159 . 65 . 92 . 203 | DigitalOcean | United States | | 2023-10-26 14 : 24 : 05 | subpub_xyz789 | 52 . 54 . 10 . 88 | 52 . 54 . 10 . 88 | Amazon AWS | United States | * * Conclusion : * * The use of datacenter IPs like ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` , |
| which resolves to a known server farm , is a clear violation and proof of NHT . Legitimate users do not install and play mobile games from AWS or Google Cloud servers . The fraudster ' s attempt to spoof the ` France ` is noted but ultimately negated by the IP ownership analysis . # # # # * * 4 . 3 Inconsistent and Fraudulent Device & User Agent Data * * Further analysis of the device-level data for the fraudulent cohort reveals widespread signs of emulation and parameter spoofing . |
| Genuine devices provide a consistent and logical set of data points . Fraudulent installs often show illogical combinations or repeated identifiers . Please review the following sample records from the fraudulent install list : | Advertising ID ( GAID ) | Device Username | Device User Agent | Notes / Red Flags | | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | | ` |
| AC01601C-7C7B-4C30-B84A-D9F938D8DB38 ` | ` laine . achille ` | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 9 ; SM-G960F Build / PPR1 . 180610 . 011 ) | User Agent is for a Samsung S9 on Android 9 . This GAID has been seen with 15 different User Agents in the last 24 hours . Classic sign of spoofing . | | 2b8c541f-3a21-4f9e-8e3b-1d7c0a9b4f6d | generic_x86 | ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` | |
| This user agent string for a ' Pixel 4 ' is missing several key tokens , indicating a poorly configured emulator . The ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` string is incomplete . | | c4e5a6b7-8d9c-0f1e-2a3b-4c5d6e7f8a9b | android-user | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 11 ; sdk_gphone_x86 Build / RSR1 . 201013 . 001 ) | ' sdk_gphone_x86 ' is the default device name for the standard Android Studio emulator . This is clearly not a real user device . | | e5f6a7b8-9c0d-1e2f-3a4b-5c6d7e8f9a0b | ` laine . achille ` | Dalvik / 1 . 6 . 0 ( Linux ; U ; Android 4 . 4 . 2 ; GT-I9505 Build / KOT49H ) | This represents an ancient Android 4 . 4 . 2 device . Our app ' s minimum requirement is Android 6 . 0 . The install should not have been possible""" |
| |
| labels = ['username', 'google_gaid', 'ip_address', 'user_agent', 'country'] |
| |
| entities = model.predict_entities(text, labels, threshold=0.3) |
| # predict_entities returns a flat list of entity dicts for a single input string |
| for entity in entities: |
|     print(entity["text"], "=>", entity["label"], "=>", entity["score"]) |
| ``` |
|
|
| ``` |
| 35 . 172 . 110 . 45 => ip_address => 0.458984375 |
| United States => country => 0.404296875 |
| United States => country => 0.33203125 |
| subpub_abc123 => google_gaid => 0.353515625 |
| 159 . 65 . 92 . 203 => ip_address => 0.3984375 |
| 159 . 65 . 92 . 203 => ip_address => 0.322265625 |
| 52 . 54 . 10 . 88 => ip_address => 0.37890625 |
| 52 . 54 . 10 . 88 => ip_address => 0.33984375 |
| France => country => 0.75390625 |
| ``` |
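
Several of the predictions above score only slightly above the `threshold=0.3` cutoff, so a stricter cutoff for individual labels may be useful. The following is a minimal sketch of such a post-filter. `filter_by_label_threshold` is a hypothetical helper and not part of the GLiNER library, and the `0.5` cutoff for `google_gaid` is an illustrative value, not a recommendation. The three entities are copied from the output above.

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the cutoff configured for their label.

    Labels without an entry in `thresholds` fall back to `default`.
    """
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


# Three predictions copied from the output above.
entities = [
    {"text": "subpub_abc123", "label": "google_gaid", "score": 0.353515625},
    {"text": "35 . 172 . 110 . 45", "label": "ip_address", "score": 0.458984375},
    {"text": "France", "label": "country", "score": 0.75390625},
]

# Assumed, illustrative cutoff: require more confidence for google_gaid.
kept = filter_by_label_threshold(entities, {"google_gaid": 0.5})
for e in kept:
    print(e["text"], "=>", e["label"], "=>", e["score"])
```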
|
|
| ## Training |
| Training started with a short "GLiNERization" warmup on a general multilingual NER dataset, followed by fine-tuning on a curated PII dataset covering English, German, and French. Besides a variety of long-tail PII labels, the dataset focuses on 78 GDPR-relevant labels. |
|
|
| ## Input: <br> |
| **Input Type(s):** Text <br> |
| **Input Format:** UTF-8 string(s) <br> |
| **Input Parameters:** One-Dimensional (1D) <br> |
| **Other Properties Related to Input:** supports structured and unstructured text <br> |
|
|
| ## Output: <br> |
| **Output Type(s):** Text <br> |
| **Output Format:** String <br> |
| **Output Parameters:** One-Dimensional (1D) <br> |
| **Other Properties Related to Output:** List of dictionaries with keys {text, label, start, end, score} <br> |
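
The `start` and `end` keys make it possible to redact detected spans in place. The sketch below assumes they are character offsets with an exclusive `end`, so that `text[start:end]` equals the entity text, and that spans do not overlap. `mask_entities` and `make_entity` are hypothetical helpers, and the sample sentence and scores are made up for illustration rather than produced by the model.

```python
def mask_entities(text, entities):
    """Replace each predicted span with a [LABEL] placeholder."""
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = "[" + e["label"].upper() + "]"
        masked = masked[: e["start"]] + placeholder + masked[e["end"]:]
    return masked


def make_entity(text, value, label, score):
    """Build an entity dict in the output format, locating `value` in `text`."""
    start = text.index(value)
    return {
        "text": value,
        "label": label,
        "start": start,
        "end": start + len(value),  # assumed exclusive end offset
        "score": score,
    }


text = "Install from 52.54.10.88 by user laine.achille"
# Hand-built entities with made-up scores; real ones come from predict_entities.
entities = [
    make_entity(text, "52.54.10.88", "ip_address", 0.9),
    make_entity(text, "laine.achille", "username", 0.8),
]

masked = mask_entities(text, entities)
print(masked)
```

If overlapping spans are possible in a given setup, they would need to be merged or resolved before masking.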
|
|
| ## Software Integration: |
| **Runtime Engine(s):** |
| * PyTorch, GLiNER Python library <br> |
|
|
| ## Limitations |
| This is an early checkpoint. The already fine-tuned encoder backbone and the architecture itself make the model behave differently from the general pretrained DeBERTa, mT5, or ModernBERT backbones typically used with a GLiNER head. |
| Benchmarks will follow with later checkpoints. |
|
|