Commit 479068c by gkl-arch (verified, parent ea732d8): Upload 11 files
README.md ADDED
---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---

# Arch-L3869-PageClassification

## Model Details

### Model Description

This is a **Greek text classification model** that categorizes document pages into 18 classes. The model was trained in two phases:

1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model with Supervised Contrastive Learning (SCL) to produce better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss to handle class imbalance.

- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)

### Model Architecture

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Pruned Layers:** kept layers [0, 2, 4, 6, 8, 11] of the original 12 (6 layers, for efficiency)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000

## Uses

### Direct Use

The model classifies document pages (text extracted via OCR) into one of 18 categories. The labels below are abbreviated for readability; the full label strings are in `id2label.json`:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Natural-person information forms / ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ **IMPORTANT:** The model was trained on preprocessed text, so the same preprocessing MUST be applied to every text before inference.

```python
import re
import unicodedata
from typing import Optional

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
    1. Replace special symbols with spaces
    2. Collapse runs of dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)

    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```

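The preprocessing can be sanity-checked on a toy string (the Greek inputs below are invented examples, not taken from the training data). This compact version inlines the steps of `clean_text` in the same order:

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def preprocess_text(text: str) -> str:
    # Same order as clean_text: symbols -> dots -> accents/lowercase -> whitespace
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn").lower()
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text("ΔΉΛΩΣΗ... ΦΟΡΟΛΟΓΊΑ!!"))  # δηλωση. φορολογια
```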
### Inference Example (preprocessing + dummy inputs)

```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)


# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Sigmoid matches the multi-label training objective; argmax then picks
    # the single most probable page class.
    probabilities = torch.sigmoid(logits)
    predictions = probabilities.argmax(dim=1)

# Map predictions back to labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```

### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```

## Training Details

### Training Data

- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

### Training Procedure

#### Phase 1: Contrastive Learning
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]

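Phase 1's supervised contrastive objective pulls embeddings of same-class pages together and pushes other classes apart. A minimal pure-Python sketch of the SupCon loss (Khosla et al., 2020 formulation); the 2-D toy embeddings and the temperature value are illustrative assumptions, not taken from the actual training run:

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised Contrastive Loss over a batch of embeddings."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [norm(v) for v in embeddings]
    total, anchors = 0.0, 0
    for i, zi in enumerate(z):
        positives = [p for p in range(len(z)) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without a same-class positive are skipped
        denom = sum(math.exp(dot(zi, z[a]) / temperature)
                    for a in range(len(z)) if a != i)
        for p in positives:
            total -= math.log(math.exp(dot(zi, z[p]) / temperature) / denom) / len(positives)
        anchors += 1
    return total / anchors

# Toy batch: two classes of 2-D embeddings. When same-class vectors point the
# same way, the loss is much lower than under a shuffled labeling.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(emb, [0, 0, 1, 1])
shuffled = supcon_loss(emb, [0, 1, 0, 1])
print(aligned < shuffled)  # True
```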
#### Phase 2: Classification
- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)

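Asymmetric Loss applies a much stronger focusing term to negatives than to positives, so the flood of easy negative labels in an imbalanced multi-label setup contributes almost no gradient. A pure-Python sketch of the formulation from Ridnik et al. (2021): `gamma_neg=4` matches the card above, while `gamma_pos=0` and the 0.05 probability margin are common defaults assumed here, not confirmed by this card:

```python
import math

def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=0.0, clip=0.05):
    """Asymmetric Loss over per-class sigmoid probabilities of one sample."""
    eps = 1e-8
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total -= (1 - p) ** gamma_pos * math.log(p + eps)
        else:
            p_m = max(p - clip, 0.0)  # probability shifting: easy negatives vanish
            total -= p_m ** gamma_neg * math.log(1 - p_m + eps)
    return total / len(probs)

# One-hot target over 3 classes: a confident wrong prediction costs far more
# than a confident right one, while near-zero negatives cost nothing.
good = asymmetric_loss([0.9, 0.1, 0.1], [1, 0, 0])
bad = asymmetric_loss([0.1, 0.9, 0.1], [1, 0, 0])
print(good < bad)  # True
```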
### Framework Versions

- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x

## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |

### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |

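The macro and weighted averages can be reproduced from the per-class rows above. A quick arithmetic check (the per-class F1 values are rounded to two decimals, which is why the weighted figure lands a hundredth above the reported 0.94):

```python
# Per-class F1 and support, copied from the table above (F1 rounded to 2 d.p.)
f1 = [0.89, 1.00, 0.91, 0.92, 1.00, 0.89, 1.00, 1.00, 0.89, 0.79,
      0.98, 0.92, 0.93, 0.97, 0.84, 0.94, 0.72, 0.95]
support = [9, 10, 27, 39, 190, 77, 8, 7, 8, 17,
           26, 30, 74, 147, 9, 65, 22, 571]

macro_f1 = sum(f1) / len(f1)  # every class counts equally
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2))     # 0.92, matching the reported Macro F1
print(round(weighted_f1, 2))  # 0.95 from rounded inputs (report: 0.94)
```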
### Key Performance Highlights

- ✅ **Other class:** F1=0.95 (excellent handling of the majority class)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class)
- ✅ **Perfect-F1 classes:** AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all reach F1=1.00
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79); more training data is needed

## Model Files

| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |

## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.
config.json ADDED
{
  "_name_or_path": "/home/kgz/Arch-TextClassification/scripts/../inference/contrastive_learning/l3869_page_classification/26_01_2026_15_00_12",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "AA_AADE_OTHER",
    "1": "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners",
    "10": "AA_NEW_POLICE_IDENTITY_CARD",
    "11": "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register",
    "12": "AA_Pension_Certificate",
    "13": "AA_Personal_Income_Tax_(FEP)",
    "14": "AA_SOLEMN_DECLARATION",
    "15": "AA_TELEPHONY",
    "16": "BB_Other_Documents",
    "17": "Other",
    "2": "AA_ENERGY",
    "3": "AA_Employer's_Certificate/Payroll",
    "4": "AA_ID_Card",
    "5": "AA_INCOME_TAX_RETURN_-_E1",
    "6": "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N",
    "7": "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS",
    "8": "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT",
    "9": "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "AA_AADE_OTHER": 0,
    "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners": 1,
    "AA_ENERGY": 2,
    "AA_Employer's_Certificate/Payroll": 3,
    "AA_ID_Card": 4,
    "AA_INCOME_TAX_RETURN_-_E1": 5,
    "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N": 6,
    "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS": 7,
    "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT": 8,
    "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY": 9,
    "AA_NEW_POLICE_IDENTITY_CARD": 10,
    "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register": 11,
    "AA_Pension_Certificate": 12,
    "AA_Personal_Income_Tax_(FEP)": 13,
    "AA_SOLEMN_DECLARATION": 14,
    "AA_TELEPHONY": 15,
    "BB_Other_Documents": 16,
    "Other": 17
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "multi_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.38.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 35000
}
id2label.json ADDED
{
  "0": "AA_AADE_OTHER",
  "1": "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners",
  "2": "AA_ENERGY",
  "3": "AA_Employer's_Certificate/Payroll",
  "4": "AA_ID_Card",
  "5": "AA_INCOME_TAX_RETURN_-_E1",
  "6": "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N",
  "7": "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS",
  "8": "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT",
  "9": "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY",
  "10": "AA_NEW_POLICE_IDENTITY_CARD",
  "11": "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register",
  "12": "AA_Pension_Certificate",
  "13": "AA_Personal_Income_Tax_(FEP)",
  "14": "AA_SOLEMN_DECLARATION",
  "15": "AA_TELEPHONY",
  "16": "BB_Other_Documents",
  "17": "Other"
}
label2id.json ADDED
{
  "AA_AADE_OTHER": 0,
  "AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners": 1,
  "AA_ENERGY": 2,
  "AA_Employer's_Certificate/Payroll": 3,
  "AA_ID_Card": 4,
  "AA_INCOME_TAX_RETURN_-_E1": 5,
  "AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N": 6,
  "AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS": 7,
  "AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT": 8,
  "AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY": 9,
  "AA_NEW_POLICE_IDENTITY_CARD": 10,
  "AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register": 11,
  "AA_Pension_Certificate": 12,
  "AA_Personal_Income_Tax_(FEP)": 13,
  "AA_SOLEMN_DECLARATION": 14,
  "AA_TELEPHONY": 15,
  "BB_Other_Documents": 16,
  "Other": 17
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:d03eff45358b8a628f9fce8fd84cecaf152e898df4d2f8e5615c7ec2cd45efb3
size 281644032
special_tokens_map.json ADDED
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
test_report.txt ADDED
                                                                                        precision  recall  f1-score  support

AA_AADE_OTHER                                                                           0.89  0.89  0.89  9
AA_Certificate_of_Current_Image_of_Entity/Business_Members/Partners                     1.00  1.00  1.00  10
AA_ENERGY                                                                               0.92  0.89  0.91  27
AA_Employer's_Certificate/Payroll                                                       0.86  0.97  0.92  39
AA_ID_Card                                                                              1.00  0.99  1.00  190
AA_INCOME_TAX_RETURN_-_E1                                                               0.92  0.86  0.89  77
AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS_AND_LEGAL_ENTITIES_-_N                            1.00  1.00  1.00  8
AA_LEGAL_ENTITY_MINUTES_OF_THE_GENERAL_ASSEMBLY/SPECIAL/BOARD_OF_DIRECTORS              1.00  1.00  1.00  7
AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION/AMENDMENT                                          0.80  1.00  0.89  8
AA_LEGAL_ENT_CERTIFICATE/ANNOUNCEMENT_OF_THE_GENERAL_COMMERCIAL_REGISTRY                0.71  0.88  0.79  17
AA_NEW_POLICE_IDENTITY_CARD                                                             0.96  1.00  0.98  26
AA_Natural_Person_Information_Form/CERTIFICATE_OF_OWNERSHIP/National_Contact_Register   0.90  0.93  0.92  30
AA_Pension_Certificate                                                                  0.92  0.95  0.93  74
AA_Personal_Income_Tax_(FEP)                                                            1.00  0.94  0.97  147
AA_SOLEMN_DECLARATION                                                                   0.80  0.89  0.84  9
AA_TELEPHONY                                                                            0.97  0.92  0.94  65
BB_Other_Documents                                                                      0.82  0.64  0.72  22
Other                                                                                   0.94  0.95  0.95  571

accuracy                                                                                            0.94  1336
macro avg                                                                               0.91  0.93  0.92  1336
weighted avg                                                                            0.95  0.94  0.94  1336
tokenizer.json ADDED
(contents not rendered: file too large)

tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2148e7ab014cebda456a2ef4b40f5302f53e9584be5ded3e7c70aca424295392
size 5521
vocab.txt ADDED
(contents not rendered: file too large)