Commit 14f0b77 by LH-Tech-AI · verified · 1 parent: 87b860e

Create README.md

  1. README.md +78 -0
README.md ADDED
@@ -0,0 +1,78 @@
+ # Shield 82M
+ Shield 82M is a model designed to filter personally identifiable information (PII) out of text in any language.
+
+ ## Classes
+ The model predicts the following classes, where `O` marks tokens that are not PII:
+ ```plaintext
+ ['O', 'ACCOUNTNAME', 'ACCOUNTNUMBER', 'AGE', 'AMOUNT', 'BIC', 'BITCOINADDRESS', 'BUILDINGNUMBER', 'CITY', 'COMPANYNAME', 'COUNTY', 'CREDITCARDCVV', 'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'CURRENCY', 'CURRENCYCODE', 'CURRENCYNAME', 'CURRENCYSYMBOL', 'DATE', 'DOB', 'EMAIL', 'ETHEREUMADDRESS', 'EYECOLOR', 'FIRSTNAME', 'GENDER', 'HEIGHT', 'IBAN', 'IP', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE', 'JOBTYPE', 'LASTNAME', 'LITECOINADDRESS', 'MAC', 'MASKEDNUMBER', 'MIDDLENAME', 'NEARBYGPSCOORDINATE', 'ORDINALDIRECTION', 'PASSWORD', 'PHONEIMEI', 'PHONENUMBER', 'PIN', 'PREFIX', 'SECONDARYADDRESS', 'SEX', 'SSN', 'STATE', 'STREET', 'TIME', 'URL', 'USERAGENT', 'USERNAME', 'VEHICLEVIN', 'VEHICLEVRM', 'ZIPCODE']
+ ```
+
+ ## Base model
+ This model is based on [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base).
+
+ ## Examples
+ The model reaches an accuracy of ~96% (0.961206) after the third training epoch.
+
+ Here are a few examples. Note that the placeholders in the output (`[PERSON]`, `[EMAIL]`, `[PHONE]`, `[ADDRESS]`) are coarser than the classes listed above.
+ ### Name, email, and phone
+ ```plaintext
+ Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
+ Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].
+ ```
+ ### Basic test
+ ```plaintext
+ Original: I live in Cambridge
+ Protected: I live in [ADDRESS]
+ ```
+ ### French test (multilingual)
+ ```plaintext
+ Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78.
+ Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE].
+ ```
+
+ ## Quickstart
+ To try the model, download `use.py` from this repo and run it:
+ ```bash
+ mkdir Shield-82M
+ cd Shield-82M
+ wget https://huggingface.co/LH-Tech-AI/Shield-82M/resolve/main/use.py
+ python3 use.py
+ ```
+
+ The output looks something like this:
+ ```plaintext
+ Loading Shield-82M from LH-Tech-AI/Shield-82M...
+
+ Loading weights: 100% 103/103 [00:00<00:00, 773.65it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight]
+
+ Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
+ Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].
+ ```
+
+ To run it on your own text, edit the `sample` line in `use.py`:
+ ```python
+ sample = "My name is John Doe. Email: john@example.com. Phone: +49 123 45678."
+ ```
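
If you would rather call the model from your own code than edit `use.py`, a minimal sketch might look like the following. It assumes the checkpoint loads through the standard Transformers `token-classification` pipeline and that grouped entities carry the class names listed above. The `COARSE` mapping and the `redact` and `protect` helpers are hypothetical illustrations of how fine-grained classes could collapse into placeholders such as `[PERSON]`; they are not the actual contents of `use.py`.

```python
# Sketch only: COARSE, redact() and protect() are hypothetical, not taken from use.py.
COARSE = {
    "FIRSTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "LASTNAME": "PERSON",
    "PHONENUMBER": "PHONE",
    "CITY": "ADDRESS",
    "STREET": "ADDRESS",
    "ZIPCODE": "ADDRESS",
}


def redact(text, entities):
    """Replace each detected span with a [PLACEHOLDER], merging adjacent spans."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Fall back to the fine-grained class name when no coarse label is defined.
        label = COARSE.get(ent["entity_group"], ent["entity_group"])
        gap = text[merged[-1]["end"]:ent["start"]] if merged else None
        if merged and merged[-1]["label"] == label and gap.strip() == "":
            # Same placeholder separated only by whitespace: extend the previous span.
            merged[-1]["end"] = max(merged[-1]["end"], ent["end"])
        else:
            merged.append({"label": label, "start": ent["start"], "end": ent["end"]})
    # Replace from right to left so earlier character offsets stay valid.
    for span in reversed(merged):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


def protect(text, model_id="LH-Tech-AI/Shield-82M"):
    """Run the model on text and return the redacted string."""
    # Imported lazily so redact() can be used without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group sub-word tokens into entity spans
    )
    return redact(text, tagger(text))
```

Calling `protect("...")` downloads the model on first use. Replacing spans from right to left keeps the character offsets reported by the pipeline valid while the string is being edited.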
+
+ ## Training data
+ This model was trained for 3 epochs on the first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main).
+
+ ## Training details
+ - Epochs: 3
+ - Max length: 512
+ - Base model: [distilroberta-base](https://huggingface.co/distilbert/distilroberta-base)
+ - Data: first 20,000 samples of [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k/tree/main)
+ - GPU: 2x Kaggle T4
+ - Training time: 6 min 38 s
+ - Engine: Hugging Face Transformers
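
The data selection and label setup described above might be reproduced roughly as follows. This is a sketch under assumptions, not the contents of `train.ipynb`: it assumes the subset was taken with the `datasets` slice syntax on the `train` split, and that label ids follow the order of the class list. `CLASSES` is shortened here for brevity, and `first_n_split`, `build_label_maps`, and `load_subset` are hypothetical helper names.

```python
# Sketch under assumptions; the actual training code lives in train.ipynb.
CLASSES = ["O", "ACCOUNTNAME", "ACCOUNTNUMBER", "AGE"]  # shortened; use the full class list


def first_n_split(n, split="train"):
    """Build a datasets split string that selects the first n samples."""
    return f"{split}[:{n}]"


def build_label_maps(classes):
    """Return (label2id, id2label) with ids assigned in list order."""
    label2id = {label: i for i, label in enumerate(classes)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_subset(n=20_000):
    """Load the first n samples of the training data (assumed selection method)."""
    # Imported lazily; requires the `datasets` package and network access.
    from datasets import load_dataset

    return load_dataset("ai4privacy/pii-masking-200k", split=first_n_split(n))
```

Dictionaries of this shape are what `AutoModelForTokenClassification.from_pretrained` accepts as `label2id` and `id2label`.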
+
+ The following table shows the metrics after each training epoch:
+
+ | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | 1 | 1.048266 | 0.250184 | 0.904065 | 0.932844 | 0.918229 | 0.949456 |
+ | 2 | 0.257664 | 0.193614 | 0.939548 | 0.949651 | 0.944572 | 0.959521 |
+ | 3 | 0.199425 | 0.181754 | 0.939833 | 0.952215 | 0.945983 | 0.961206 |
+
+ You can find the full training code in `train.ipynb`. It runs on 2x Kaggle T4 GPUs in about 7 minutes.
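
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall. The row values are copied from the table, and `f1_score` is just an illustrative helper name.

```python
# Precision, recall, and reported F1 per epoch, copied from the table.
ROWS = {
    1: (0.904065, 0.932844, 0.918229),
    2: (0.939548, 0.949651, 0.944572),
    3: (0.939833, 0.952215, 0.945983),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, (p, r, reported) in ROWS.items():
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

The recomputed values agree with the reported F1 column to about four decimal places, so the three metric columns are internally consistent.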