Commit d0d8d15 (parent: fb27ada): Update README.md

README.md changed:
@@ -24,6 +24,7 @@ language:
 - ta
 - te
 - yo
+- de
 tags:
 - kenlm
 - perplexity
@@ -37,11 +38,14 @@ datasets:
 duplicated_from: edugp/kenlm
 ---
 
+# Fork of `edugp/kenlm`
+
+* adds German wikipedia model.
+
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case for these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are unlikely to appear in Wikipedia (high perplexity), or very simple, non-informative sentences that may appear repeatedly (low perplexity).
 
-At the root of this repo you will find different directories named after the datasets the models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files:
 * `{language}.arpa.bin`: The trained KenLM model binary
 * `{language}.sp.model`: The trained SentencePiece model used for tokenization
 * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model
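The perplexity-filtering use case described in the README can be sketched as follows. This is a minimal illustration, not code from this repo: the helper names, the `score_with_kenlm` wrapper, and the thresholds are assumptions, while the actual scoring would use the `kenlm` and `sentencepiece` packages together with the repo's `{language}.arpa.bin` and `{language}.sp.model` files.

```python
from typing import Iterable, Iterator, List, Tuple


def perplexity_from_log10(total_log10_prob: float, n_tokens: int) -> float:
    # KenLM reports a total log10 probability for a sentence; for N scored
    # tokens, perplexity = 10 ** (-log10_prob / N).
    return 10.0 ** (-total_log10_prob / n_tokens)


def filter_by_perplexity(scored: Iterable[Tuple[str, float]],
                         low: float = 20.0,
                         high: float = 1000.0) -> List[str]:
    # Keep samples whose perplexity falls between two (illustrative)
    # thresholds: very low values suggest repetitive, non-informative
    # boilerplate; very high values suggest text unlike the training corpus.
    return [text for text, ppl in scored if low <= ppl <= high]


def score_with_kenlm(texts: Iterable[str],
                     arpa_path: str,
                     sp_path: str) -> Iterator[Tuple[str, float]]:
    # Hypothetical wrapper around the real models (requires the `kenlm` and
    # `sentencepiece` packages plus the model files from this repo); the
    # imports are local so the pure-Python helpers above run without them.
    import kenlm
    import sentencepiece as spm

    model = kenlm.Model(arpa_path)
    sp = spm.SentencePieceProcessor(model_file=sp_path)
    for text in texts:
        # KenLM scores whitespace-separated tokens, so the text is first
        # tokenized with the matching SentencePiece model.
        tokenized = " ".join(sp.encode(text, out_type=str))
        yield text, model.perplexity(tokenized)
```

For example, `filter_by_perplexity(score_with_kenlm(docs, "wikipedia/fr.arpa.bin", "wikipedia/fr.sp.model"))` would keep only the mid-perplexity documents; the cutoffs should be tuned per dataset rather than taken from this sketch.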