---
tags:
- word2vec
language: de
license: mit
datasets:
- wikipedia
---

## Description

German word embedding model trained by Andreas Müller with the following parameter configuration:

- a corpus as large and diverse as possible (without being informal)
- filtering out punctuation and stopwords
- forming bigram tokens
- using skip-gram as the training algorithm with hierarchical softmax
- a window size between 5 and 10
- a feature-vector dimensionality of 300 or more
- using negative sampling with 10 samples
- ignoring all words with a total frequency below 50
For more information, see [https://devmount.github.io/GermanWordEmbeddings/](https://devmount.github.io/GermanWordEmbeddings/).

## How to use

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model = KeyedVectors.load_word2vec_format(
    hf_hub_download(repo_id="Word2vec/german_model", filename="german.model"),
    binary=True,
    unicode_errors="ignore",
)
```

## Citation

```bibtex
@thesis{mueller2015,
  author = {{Müller}, Andreas},
  title  = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
  school = {Technische Universität Berlin},
  year   = 2015,
  month  = jun,
  type   = {Bachelor's Thesis},
  url    = {https://devmount.github.io/GermanWordEmbeddings}
}
```