---
language:
- de
license: apache-2.0
tags:
- austrian-german
- dialect
- classification
- dach
- text-classification
library_name: transformers
pipeline_tag: text-classification
base_model: bert-base-german-cased
extra_gated_prompt: >-
  This model is in validation phase. Access is granted to verified researchers
  and organizations. Please describe your intended use case.
extra_gated_fields:
  Full name: text
  Organization: text
  Intended use: text
  I agree to use this data for research purposes only:
    type: checkbox
---

# DACH Dialect Classifier

Classifies German text as Austrian (AT), German (DE), or Swiss (CH). Fine-tuned `bert-base-german-cased` on 1500 synthetic examples.

## Results

Trained for 5 epochs on an RTX 3090 Ti. Test set (150 examples, held out):

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| AT | 0.96 | 0.96 | 0.96 |
| DE | 0.96 | 1.00 | 0.98 |
| CH | 0.98 | 0.94 | 0.96 |
| **Macro avg** | **0.97** | **0.97** | **0.97** |

Accuracy: **96.7%**

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")

clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]

clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]

clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
```

## Training data

1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical markers for each variety:

**Austrian:** leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum

**German:** Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")

**Swiss:** isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi

## Limitations

This was trained on synthetic data. It catches vocabulary differences well but might miss subtler dialectal features or code-switching.
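
Since code-switched or weakly marked text may not fit cleanly into one variety, it can help to inspect the full score distribution rather than only the top label. The sketch below is not part of the released model: `flag_uncertain` and the `0.80` threshold are hypothetical, and the input is assumed to be a list of `{'label', 'score'}` dicts, one per class, such as the `transformers` text-classification pipeline should return for a single text when called with `top_k=None`.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def flag_uncertain(scores: List[Score], threshold: float = 0.80) -> str:
    """Return the top label, or "uncertain" if its score is below threshold.

    `scores` is assumed to hold one {'label': ..., 'score': ...} dict per class.
    The default threshold is illustrative, not a value tuned for this model.
    """
    if not scores:
        raise ValueError("scores must contain at least one entry")
    best = max(scores, key=lambda s: s["score"])
    return best["label"] if best["score"] >= threshold else "uncertain"


# Hand-written score lists standing in for real pipeline output.
clear = [
    {"label": "AT", "score": 0.98},
    {"label": "DE", "score": 0.01},
    {"label": "CH", "score": 0.01},
]
mixed = [
    {"label": "AT", "score": 0.55},
    {"label": "DE", "score": 0.40},
    {"label": "CH", "score": 0.05},
]

print(flag_uncertain(clear))  # AT
print(flag_uncertain(mixed))  # uncertain
```

With the real classifier, the score list would come from something like `clf(text, top_k=None)`. A suitable threshold would need to be chosen on held-out real text, since the synthetic test set may not reflect how confident the model is on mixed input.
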
Adding real text from parliament protocols, news, and forums would improve it.

## License

Apache 2.0