Commit 479068c by gkl-arch (verified, parent ea732d8): Upload 11 files
README.md ADDED
---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---

# Arch-L3869-PageClassification

## Model Details

### Model Description

This is a **Greek text classification model** that categorizes document pages into 18 classes. The model was trained in two phases:

1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model with Supervised Contrastive Learning (SCL) to produce better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss to handle class imbalance.

- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)

### Model Architecture

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Pruned Layers:** kept layers [0, 2, 4, 6, 8, 11] of the original 12 (6 layers, for efficiency)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000

## Uses

### Direct Use

The model classifies document pages (text extracted via OCR) into one of 18 categories. The labels below are abbreviated for readability; the full label strings are in `id2label.json`:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Natural-person information forms / ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ **IMPORTANT:** The model was trained on preprocessed text, so the same preprocessing MUST be applied to every text before inference.

```python
import re
import unicodedata
from typing import Optional

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
    1. Replace special symbols with spaces
    2. Collapse runs of dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)

    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```

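The preprocessing can be sanity-checked on a toy string (the Greek inputs below are invented examples, not taken from the training data). This compact version inlines the steps of `clean_text` in the same order:

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def preprocess_text(text: str) -> str:
    # Same order as clean_text: symbols -> dots -> accents/lowercase -> whitespace
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn").lower()
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text("ΔΉΛΩΣΗ... ΦΟΡΟΛΟΓΊΑ!!"))  # δηλωση. φορολογια
```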
### Inference Example (preprocessing + dummy inputs)

```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)


# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Sigmoid matches the multi-label training objective; argmax then picks
    # the single most probable page class.
    probabilities = torch.sigmoid(logits)
    predictions = probabilities.argmax(dim=1)

# Map predictions back to labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```

### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```

## Training Details

### Training Data

- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

### Training Procedure

#### Phase 1: Contrastive Learning
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]

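Phase 1's supervised contrastive objective pulls embeddings of same-class pages together and pushes other classes apart. A minimal pure-Python sketch of the SupCon loss (Khosla et al., 2020 formulation); the 2-D toy embeddings and the temperature value are illustrative assumptions, not taken from the actual training run:

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised Contrastive Loss over a batch of embeddings."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [norm(v) for v in embeddings]
    total, anchors = 0.0, 0
    for i, zi in enumerate(z):
        positives = [p for p in range(len(z)) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without a same-class positive are skipped
        denom = sum(math.exp(dot(zi, z[a]) / temperature)
                    for a in range(len(z)) if a != i)
        for p in positives:
            total -= math.log(math.exp(dot(zi, z[p]) / temperature) / denom) / len(positives)
        anchors += 1
    return total / anchors

# Toy batch: two classes of 2-D embeddings. When same-class vectors point the
# same way, the loss is much lower than under a shuffled labeling.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(emb, [0, 0, 1, 1])
shuffled = supcon_loss(emb, [0, 1, 0, 1])
print(aligned < shuffled)  # True
```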
#### Phase 2: Classification
- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)

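Asymmetric Loss applies a much stronger focusing term to negatives than to positives, so the flood of easy negative labels in an imbalanced multi-label setup contributes almost no gradient. A pure-Python sketch of the formulation from Ridnik et al. (2021): `gamma_neg=4` matches the card above, while `gamma_pos=0` and the 0.05 probability margin are common defaults assumed here, not confirmed by this card:

```python
import math

def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=0.0, clip=0.05):
    """Asymmetric Loss over per-class sigmoid probabilities of one sample."""
    eps = 1e-8
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total -= (1 - p) ** gamma_pos * math.log(p + eps)
        else:
            p_m = max(p - clip, 0.0)  # probability shifting: easy negatives vanish
            total -= p_m ** gamma_neg * math.log(1 - p_m + eps)
    return total / len(probs)

# One-hot target over 3 classes: a confident wrong prediction costs far more
# than a confident right one, while near-zero negatives cost nothing.
good = asymmetric_loss([0.9, 0.1, 0.1], [1, 0, 0])
bad = asymmetric_loss([0.1, 0.9, 0.1], [1, 0, 0])
print(good < bad)  # True
```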
### Framework Versions

- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x

## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |

### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |

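The macro and weighted averages can be reproduced from the per-class rows above. A quick arithmetic check (the per-class F1 values are rounded to two decimals, which is why the weighted figure lands a hundredth above the reported 0.94):

```python
# Per-class F1 and support, copied from the table above (F1 rounded to 2 d.p.)
f1 = [0.89, 1.00, 0.91, 0.92, 1.00, 0.89, 1.00, 1.00, 0.89, 0.79,
      0.98, 0.92, 0.93, 0.97, 0.84, 0.94, 0.72, 0.95]
support = [9, 10, 27, 39, 190, 77, 8, 7, 8, 17,
           26, 30, 74, 147, 9, 65, 22, 571]

macro_f1 = sum(f1) / len(f1)  # every class counts equally
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2))     # 0.92, matching the reported Macro F1
print(round(weighted_f1, 2))  # 0.95 from rounded inputs (report: 0.94)
```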
### Key Performance Highlights

- ✅ **Other class:** F1=0.95 (excellent handling of the majority class)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class)
- ✅ **Perfect-F1 classes:** AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all reach F1=1.00
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79); more training data is needed

## Model Files

| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |

## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.
config.json ADDED
{
  "_name_or_path": "/home/kgz/Arch-TextClassification/scripts/../inference/contrastive_learning/l3869_page_classification/26_01_2026_15_00_12",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "AA_AADE_OTHER",
    "1": "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners",
    "10": "AA_NEW_POLICE_IDENTITY_CARD",
    "11": "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register",
    "12": "AA_Pension_Certificate",
    "13": "AA_Personal_Income_Tax_(FEP)",
    "14": "AA_SOLEMN_DECLARATION",
    "15": "AA_TELEPHONY",
    "16": "BB_Other_Documents",
    "17": "Other",
    "2": "AA_ENERGY",
    "3": "AA_Employer's_Certificate/Payroll",
    "4": "AA_ID_Card",
    "5": "AA_INCOME_TAX_RETURN_-_E1",
    "6": "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N",
    "7": "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS",
    "8": "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT",
    "9": "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "AA_AADE_OTHER": 0,
    "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners": 1,
    "AA_ENERGY": 2,
    "AA_Employer's_Certificate/Payroll": 3,
    "AA_ID_Card": 4,
    "AA_INCOME_TAX_RETURN_-_E1": 5,
    "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N": 6,
    "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS": 7,
    "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT": 8,
    "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY": 9,
    "AA_NEW_POLICE_IDENTITY_CARD": 10,
    "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register": 11,
    "AA_Pension_Certificate": 12,
    "AA_Personal_Income_Tax_(FEP)": 13,
    "AA_SOLEMN_DECLARATION": 14,
    "AA_TELEPHONY": 15,
    "BB_Other_Documents": 16,
    "Other": 17
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "multi_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.38.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 35000
}
id2label.json ADDED
{
  "0": "AA_AADE_OTHER",
  "1": "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners",
  "2": "AA_ENERGY",
  "3": "AA_Employer's_Certificate/Payroll",
  "4": "AA_ID_Card",
  "5": "AA_INCOME_TAX_RETURN_-_E1",
  "6": "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N",
  "7": "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS",
  "8": "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT",
  "9": "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY",
  "10": "AA_NEW_POLICE_IDENTITY_CARD",
  "11": "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register",
  "12": "AA_Pension_Certificate",
  "13": "AA_Personal_Income_Tax_(FEP)",
  "14": "AA_SOLEMN_DECLARATION",
  "15": "AA_TELEPHONY",
  "16": "BB_Other_Documents",
  "17": "Other"
}
label2id.json ADDED
{
  "AA_AADE_OTHER": 0,
  "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners": 1,
  "AA_ENERGY": 2,
  "AA_Employer's_Certificate/Payroll": 3,
  "AA_ID_Card": 4,
  "AA_INCOME_TAX_RETURN_-_E1": 5,
  "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N": 6,
  "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS": 7,
  "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT": 8,
  "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY": 9,
  "AA_NEW_POLICE_IDENTITY_CARD": 10,
  "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register": 11,
  "AA_Pension_Certificate": 12,
  "AA_Personal_Income_Tax_(FEP)": 13,
  "AA_SOLEMN_DECLARATION": 14,
  "AA_TELEPHONY": 15,
  "BB_Other_Documents": 16,
  "Other": 17
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d03eff45358b8a628f9fce8fd84cecaf152e898df4d2f8e5615c7ec2cd45efb3
size 281644032
special_tokens_map.json ADDED
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
test_report.txt ADDED
                                                                                        precision  recall  f1-score  support

AA_AADE_OTHER                                                                           0.89  0.89  0.89  9
AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners                     1.00  1.00  1.00  10
AA_ENERGY                                                                               0.92  0.89  0.91  27
AA_Employer's_Certificate/Payroll                                                       0.86  0.97  0.92  39
AA_ID_Card                                                                              1.00  0.99  1.00  190
AA_INCOME_TAX_RETURN_-_E1                                                               0.92  0.86  0.89  77
AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N                            1.00  1.00  1.00  8
AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS              1.00  1.00  1.00  7
AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT                                          0.80  1.00  0.89  8
AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY                0.71  0.88  0.79  17
AA_NEW_POLICE_IDENTITY_CARD                                                             0.96  1.00  0.98  26
AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register   0.90  0.93  0.92  30
AA_Pension_Certificate                                                                  0.92  0.95  0.93  74
AA_Personal_Income_Tax_(FEP)                                                            1.00  0.94  0.97  147
AA_SOLEMN_DECLARATION                                                                   0.80  0.89  0.84  9
AA_TELEPHONY                                                                            0.97  0.92  0.94  65
BB_Other_Documents                                                                      0.82  0.64  0.72  22
Other                                                                                   0.94  0.95  0.95  571

accuracy                                                                                            0.94  1336
macro avg                                                                               0.91  0.93  0.92  1336
weighted avg                                                                            0.95  0.94  0.94  1336
tokenizer.json ADDED
(contents not rendered: file too large)

tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2148e7ab014cebda456a2ef4b40f5302f53e9584be5ded3e7c70aca424295392
size 5521
vocab.txt ADDED
(contents not rendered: file too large)