How to use onnx-community/Shield-82M-ONNX with Transformers.js:
// npm i @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Allocate pipeline
const pipe = await pipeline('token-classification', 'onnx-community/Shield-82M-ONNX');
| datasets: | |
| - ai4privacy/pii-masking-200k | |
| base_model: | |
| - LH-Tech-AI/Shield-82M | |
| pipeline_tag: token-classification | |
| tags: | |
| - distil | |
| - pii | |
| - security | |
| - shield | |
| - small | |
| - cpu | |
| - fast | |
| - open | |
| - open-source | |
| - lh-tech | |
| - bert | |
| - roberta | |
| library_name: transformers.js | |
| # Shield-82M (ONNX) | |
| This is an ONNX version of [LH-Tech-AI/Shield-82M](https://huggingface.co/LH-Tech-AI/Shield-82M). It was automatically converted and uploaded using [this Hugging Face Space](https://huggingface.co/spaces/onnx-community/convert-to-onnx). | |
| ## Usage with Transformers.js | |
| See the pipeline documentation for `token-classification`: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TokenClassificationPipeline | |
| --- | |
| # 🛡️ Shield 82M | |
| Welcome to Shield 82M, a model designed to filter PII out of texts in any language. | |
| ## Classes | |
| This model has the following PII classes: | |
| ```plaintext | |
| ['O', 'ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT', 'BIC', 'BITCOINADDRESS', 'BUILDINGNUMBER', 'CITY', 'COMPANYNAME', 'COUNTY', 'CREDITCARDCVV', 'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'CURRENCY', 'CURRENCYCODE', 'CURRENCYNAME', 'CURRENCYSYMBOL', 'DATE', 'DOB', 'EMAIL', 'ETHEREUMADDRESS', 'EYECOLOR', 'FIRSTNAME', 'GENDER', 'HEIGHT', 'IBAN', 'IP', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE', 'JOBTYPE', 'LASTNAME', 'LITECOINADDRESS', 'MAC', 'MASKEDNUMBER', 'MIDDLENAME', 'NEARBYGPSCOORDINATE', 'ORDINALDIRECTION', 'PASSWORD', 'PHONEIMEI', 'PHONENUMBER', 'PIN', 'PREFIX', 'SECONDARYADDRESS', 'SEX', 'SSN', 'STATE', 'STREET', 'TIME', 'URL', 'USERAGENT', 'USERNAME', 'VEHICLEVIN', 'VEHICLEVRM', 'ZIPCODE'] | |
| ``` | |
| ## Base model | |
| This model is based on distilroberta-base. | |
| ## Examples | |
| The model reaches an accuracy of ~96% (0.961206) on the validation set after the third training epoch. | |
| <br> | |
| Here are a few examples: | |
| ### Test with name, email and phone | |
| ```plaintext | |
| Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678. | |
| Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE]. | |
| ``` | |
| ### Basic test | |
| ```plaintext | |
| Original: I live in Cambridge | |
| Protected: I live in [ADDRESS] | |
| ``` | |
| ### French test (multilingual) | |
| ```plaintext | |
| Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78. | |
| Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE]. | |
| ``` | |
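The examples above show coarse placeholders such as `[PERSON]`, `[ADDRESS]` and `[PHONE]`, while the class list uses finer labels such as `FIRSTNAME`, `CITY` and `PHONENUMBER`. The following is a minimal sketch of how detected spans could be turned into such placeholders. The `COARSE` mapping and the `redact` and `span` helpers are hypothetical and are not taken from `use.py`. The entity dictionaries are hand-built for illustration (they are not model output) and follow the `entity_group` / `start` / `end` shape that grouped token-classification results commonly have.

```python
# Hypothetical mapping from fine-grained classes to the coarse placeholders
# seen in the examples. The mapping actually used by use.py may differ.
COARSE = {
    "FIRSTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "LASTNAME": "PERSON",
    "EMAIL": "EMAIL",
    "PHONENUMBER": "PHONE",
    "CITY": "ADDRESS",
    "STREET": "ADDRESS",
    "ZIPCODE": "ADDRESS",
}


def redact(text, entities):
    """Replace each entity span in `text` with a bracketed placeholder."""
    # Map labels to coarse tags (unknown labels are kept) and sort by offset.
    spans = sorted(
        (e["start"], e["end"], COARSE.get(e["entity_group"], e["entity_group"]))
        for e in entities
    )
    # Merge neighbouring spans that share a tag and are separated only by
    # whitespace, so "John Doe" becomes one [PERSON] instead of two.
    merged = []
    for start, end, tag in spans:
        if merged and merged[-1][2] == tag and text[merged[-1][1]:start].strip() == "":
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]), tag)
        else:
            merged.append((start, end, tag))
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, tag in reversed(merged):
        text = text[:start] + f"[{tag}]" + text[end:]
    return text


def span(text, label, piece):
    """Build a hand-made entity dict for the first occurrence of `piece`."""
    start = text.index(piece)
    return {"entity_group": label, "start": start, "end": start + len(piece)}


sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678."
entities = [
    span(sample, "FIRSTNAME", "John"),
    span(sample, "LASTNAME", "Doe"),
    span(sample, "EMAIL", "john@example.com"),
    span(sample, "PHONENUMBER", "+49 123 45678"),
]
print(redact(sample, entities))
```

Replacing spans from right to left avoids recomputing offsets after each substitution. In real use, the entity list would come from the model's predictions rather than from `span`.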
| ## Quickstart | |
| To use this model, download `use.py` from the original [LH-Tech-AI/Shield-82M](https://huggingface.co/LH-Tech-AI/Shield-82M) repo and run it: | |
| ```bash | |
| mkdir Shield-82M | |
| cd Shield-82M | |
| wget https://huggingface.co/LH-Tech-AI/Shield-82M/resolve/main/use.py | |
| python3 use.py | |
| ``` | |
| This outputs something like: | |
| ```plaintext | |
| Loading Shield-82M from LH-Tech-AI/Shield-82M... | |
| Loading weights: 100% | |
| 103/103 [00:00<00:00, 773.65it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight] | |
| Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678. | |
| Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE]. | |
| ``` | |
| To use it with your own text, you'll have to adjust this line of code in `use.py`: | |
| ```python | |
| sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678." | |
| ``` | |
| ## Training data | |
| This model was trained on the first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main) for 3 epochs. | |
| ## Training details | |
| - Epochs: 3 | |
| - Max length: 512 | |
| - Base model: [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base) | |
| - Data: first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main) | |
| - GPU: 2x Kaggle T4 | |
| - Training time: 06:38 min | |
| - Engine: HF Transformers | |
| The following table shows the training process: | |
| | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | | |
| | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | |
| | 1 | 1.048266 | 0.250184 | 0.904065 | 0.932844 | 0.918229 | 0.949456 | | |
| | 2 | 0.257664 | 0.193614 | 0.939548 | 0.949651 | 0.944572 | 0.959521 | | |
| | 3 | 0.199425 | 0.181754 | 0.939833 | 0.952215 | 0.945983 | 0.961206 | | |
| You can find the full training code in `train.ipynb`. It runs on 2x Kaggle T4 GPUs in about 7 minutes. | |