Create README.md
README.md
ADDED
@@ -0,0 +1,78 @@
# Shield 82M
Welcome to Shield 82M, a model designed to filter PII out of texts in any language.

## Classes
This model has the following PII classes:
```plaintext
['O', 'ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT', 'BIC', 'BITCOINADDRESS', 'BUILDINGNUMBER', 'CITY', 'COMPANYNAME', 'COUNTY', 'CREDITCARDCVV', 'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'CURRENCY', 'CURRENCYCODE', 'CURRENCYNAME', 'CURRENCYSYMBOL', 'DATE', 'DOB', 'EMAIL', 'ETHEREUMADDRESS', 'EYECOLOR', 'FIRSTNAME', 'GENDER', 'HEIGHT', 'IBAN', 'IP', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE', 'JOBTYPE', 'LASTNAME', 'LITECOINADDRESS', 'MAC', 'MASKEDNUMBER', 'MIDDLENAME', 'NEARBYGPSCOORDINATE', 'ORDINALDIRECTION', 'PASSWORD', 'PHONEIMEI', 'PHONENUMBER', 'PIN', 'PREFIX', 'SECONDARYADDRESS', 'SEX', 'SSN', 'STATE', 'STREET', 'TIME', 'URL', 'USERAGENT', 'USERNAME', 'VEHICLEVIN', 'VEHICLEVRM', 'ZIPCODE']
```

## Base model
This model is based on distilroberta-base.

## Examples
The model has an accuracy score of ~96% (0.961206, measured after the third training epoch).
<br>
Here are a few examples:
### Test with name, email and phone
```plaintext
Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].
```
### Basic test
```plaintext
Original: I live in Cambridge
Protected: I live in [ADDRESS]
```
### French test (multilingual)
```plaintext
Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78.
Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE].
```
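
The placeholders in these examples (`[PERSON]`, `[EMAIL]`, `[PHONE]`, `[ADDRESS]`) are coarser than the class list, so the fine-grained classes presumably get grouped before masking. A minimal sketch of such a grouping follows. The groups and the names `COARSE_GROUPS` and `placeholder_for` are assumptions made for illustration and are not taken from `use.py`:

```python
# Hypothetical grouping of fine-grained classes into coarse placeholders.
# These groups are an assumption for illustration; use.py may group differently.
COARSE_GROUPS = {
    "PERSON": {"PREFIX", "FIRSTNAME", "MIDDLENAME", "LASTNAME"},
    "EMAIL": {"EMAIL"},
    "PHONE": {"PHONENUMBER"},
    "ADDRESS": {
        "BUILDINGNUMBER", "STREET", "SECONDARYADDRESS",
        "CITY", "COUNTY", "STATE", "ZIPCODE",
    },
}

# Invert the groups so each fine-grained class points at its placeholder name.
LABEL_TO_PLACEHOLDER = {
    label: coarse
    for coarse, labels in COARSE_GROUPS.items()
    for label in labels
}


def placeholder_for(label: str) -> str:
    """Return the bracketed placeholder for a class name.

    Classes without a group fall back to their own name, e.g. [IBAN].
    """
    return f"[{LABEL_TO_PLACEHOLDER.get(label, label)}]"


print(placeholder_for("FIRSTNAME"))
print(placeholder_for("CITY"))
print(placeholder_for("IBAN"))
```

Falling back to the class name keeps every detected entity masked even when no coarse group has been defined for it.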

## Quickstart
To use this model, just download `use.py` from this repo and launch it:
```bash
mkdir Shield-82M
cd Shield-82M
wget https://huggingface.co/LH-Tech-AI/Shield-82M/resolve/main/use.py
python3 use.py
```

This outputs something like:
```plaintext
Loading Shield-82M from LH-Tech-AI/Shield-82M...

Loading weights: 100%
103/103 [00:00<00:00, 773.65it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight]

Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].
```

To use it with your own text, adjust this line of code in `use.py`:
```python
sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678."
```
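
If you would rather call the model from your own code than edit `use.py`, something like the following may work. It is a sketch under the assumption that the repository loads with the standard Transformers token-classification pipeline. The contents of `use.py` are not shown here, and the helper names `load_tagger` and `redact` are hypothetical:

```python
from typing import Dict, List


def load_tagger(model_id: str = "LH-Tech-AI/Shield-82M"):
    """Build a token-classification pipeline (downloads the model on first use).

    Assumption: the repo works with the stock Transformers pipeline.
    """
    # Imported lazily so that redact() below can be used without transformers.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into entity spans
    # that carry "entity_group", "start" and "end" keys.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def redact(text: str, entities: List[Dict]) -> str:
    """Replace every detected span with its bracketed class name."""
    # Go from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        text = text[: ent["start"]] + f"[{label}]" + text[ent["end"]:]
    return text
```

With the model downloaded, `redact(sample, load_tagger()(sample))` would return the masked string. The placeholders here are the model's raw class names, so the result may show `[FIRSTNAME] [LASTNAME]` where `use.py` prints `[PERSON]`.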

## Training data
This model was trained on the first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main) for 3 epochs.

## Training details
- Epochs: 3
- Max length: 512
- Base model: [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base)
- Data: first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main)
- GPU: 2x Kaggle T4
- Training time: 06:38 min
- Engine: HF Transformers

The following table shows the training process:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 1.048266 | 0.250184 | 0.904065 | 0.932844 | 0.918229 | 0.949456 |
| 2 | 0.257664 | 0.193614 | 0.939548 | 0.949651 | 0.944572 | 0.959521 |
| 3 | 0.199425 | 0.181754 | 0.939833 | 0.952215 | 0.945983 | 0.961206 |
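
The F1 column in the table is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the table values. It assumes the metrics were computed in the usual way, since the README does not say which metric library produced them. The last digit can differ because the inputs are already rounded:

```python
# (epoch, precision, recall, reported F1), copied from the training table.
ROWS = [
    (1, 0.904065, 0.932844, 0.918229),
    (2, 0.939548, 0.949651, 0.944572),
    (3, 0.939833, 0.952215, 0.945983),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for epoch, p, r, reported in ROWS:
    computed = f1_score(p, r)
    print(f"epoch {epoch}: computed F1 = {computed:.6f}, reported = {reported:.6f}")
```

The gain from epoch 2 to epoch 3 is small (about 0.0014 in F1), which suggests that most of the improvement happens in the first two epochs.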

You can find the full training code in `train.ipynb`. It runs on 2x Kaggle T4 GPUs in about 7 minutes.