---
tags:
- word2vec
language: de
license: mit
datasets:
- wikipedia
---

## Description

German word embedding model trained by Andreas Müller with the following parameter configuration:

- a corpus as large and diverse as possible (without being informal)
- filtering out punctuation and stopwords
- forming bigram tokens
- using skip-gram as the training algorithm with hierarchical softmax
- a window size between 5 and 10
- a feature-vector dimensionality of 300 or more
- using negative sampling with 10 samples
- ignoring all words with a total frequency below 50
For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/).

## How to use

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model = KeyedVectors.load_word2vec_format(
    hf_hub_download(repo_id="Word2vec/german_model", filename="german.model"),
    binary=True,
    unicode_errors="ignore",
)
```

## Citation

```bibtex
@thesis{mueller2015,
  author = {{Müller}, Andreas},
  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
  school = {Technische Universität Berlin},
  year   = 2015,
  month  = jun,
  type   = {Bachelor's Thesis},
  url    = {https://devmount.github.io/GermanWordEmbeddings}
}
```