Slovenian 5-gram KenLM Mixed Language Model
A versatile 5-gram language model for Slovenian trained on Wikipedia, forum discussions, and social media data using KenLM.
Model Description
This is a statistical n-gram language model designed for perplexity-based filtering and quality assessment of Slovenian text. The model was trained using the KenLM toolkit on a diverse mix of formal and informal Slovenian text sources.
Model Details
- Model Type: 5-gram statistical language model
- Language: Slovenian (sl)
- Training Data:
- Wikipedia (formal, encyclopedic): ~134k sentences from 5,000 articles
- Janes-Forum (informal healthcare forums): ~28k sentences from 5,000 posts
- Janes-Tag (social media, tweets, blogs): ~20k sentences of CMC data
- Total Training Data: ~182k sentences
- Model Size: ~50-100 MB (binary format with pruning)
- Format: KenLM binary format
- Preprocessing: Lowercased, cleaned of special characters, sentence-segmented (min 5 words)
Intended Use
This model is primarily intended for:
- Text quality filtering: Identifying well-formed Slovenian text based on perplexity scores
- Data cleaning: Filtering web-scraped corpora (e.g., OSCAR, Common Crawl)
- Language detection: Distinguishing Slovenian from other languages
- Quality assessment: Evaluating fluency and naturalness of generated text
Versatile Coverage
Unlike models trained only on Wikipedia, this mixed model handles:
- ✅ Formal/encyclopedic language (Wikipedia style)
- ✅ Forum discussions and questions
- ✅ Conversational Slovenian (everyday topics)
- ✅ Common slang and informal expressions
- ✅ Social media style text
Example Use Case
The model can be used to filter large text corpora by computing perplexity scores and keeping only texts within a reasonable range (25-5,000), effectively removing:
- Non-Slovenian or mixed-language content (perplexity typically > 10,000)
- Repetitive or boilerplate text (perplexity < 25)
- Malformed or noisy text (perplexity > 5,000)
Perplexity Benchmarks
Examples of perplexity scores on different text types:
| Text Type | Example | Perplexity |
|---|---|---|
| Formal/encyclopedic | "V Sloveniji živi približno dva milijona ljudi." | 44 |
| Forum question | "Ima kdo izkušnje s tem programom." | 74 |
| Everyday conversation | "Danes je lep sončen dan." | 318 |
| Common slang | "Ma dej no to ni resno." | 416 |
| Informal reaction | "Haha to je bilo smešno res." | 1,523 |
| English text | "the quick brown fox jumps" | 37,960 |
| Gibberish | "asdf qwer zxcv tyui hjkl" | 408,383 |
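For intuition on what these numbers mean: in KenLM's Python bindings, perplexity is the inverse geometric mean of the per-token probability, derived from the total log10 score (the token count includes the implicit `</s>` end-of-sentence marker). A pure-Python sketch of that relationship, no model required:

```python
def perplexity_from_log10(total_log10_prob, n_tokens):
    """Perplexity = 10 ** (-log10 P(sentence) / n_tokens),
    i.e. the inverse geometric mean of per-token probability."""
    return 10.0 ** (-total_log10_prob / n_tokens)

# A 5-word sentence scored at total log10 P = -10.0; KenLM counts
# the implicit </s> as a token, so n_tokens = 6.
ppl = perplexity_from_log10(-10.0, 6)
print(f"{ppl:.1f}")  # 46.4
```

Lower perplexity means each token was, on average, less surprising to the model.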
Usage
```python
import kenlm
from huggingface_hub import hf_hub_download

# Download model from HuggingFace
model_path = hf_hub_download(
    repo_id="zustmartin/slovenian-kenlm-5gram-mixed",
    filename="slovenian_5gram_mixed.binary"
)

# Load the model
model = kenlm.Model(model_path)

# Compute perplexity for a text
text = "to je primer slovenskega besedila"
perplexity = model.perplexity(text.lower())
print(f"Perplexity: {perplexity}")

# Score individual sentences
score = model.score(text.lower())
print(f"Log10 probability: {score}")

# Filter a corpus
def is_good_slovenian(text, lower=25, upper=5000):
    """Check if text is good quality Slovenian."""
    ppl = model.perplexity(text.lower().strip())
    return lower <= ppl <= upper

texts = [
    "Danes je lep sončen dan.",  # Good
    "asdf qwer zxcv",            # Bad (gibberish)
    "the quick brown fox",       # Bad (English)
]

for text in texts:
    ppl = model.perplexity(text.lower())
    status = "✅ KEEP" if is_good_slovenian(text) else "❌ FILTER"
    print(f"{status} - {text} (perplexity: {ppl:.0f})")
```
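For large corpora, the same check works as a streaming filter. The sketch below takes the scoring function as a parameter (in practice `model.perplexity` from the snippet above), so the filtering logic itself can be tested without a loaded model:

```python
def filter_corpus(lines, ppl_fn, lower=25, upper=5000):
    """Yield lowercased lines whose perplexity falls in [lower, upper].
    ppl_fn is any callable mapping text -> perplexity,
    e.g. model.perplexity for a loaded KenLM model."""
    for line in lines:
        text = line.strip().lower()
        if text and lower <= ppl_fn(text) <= upper:
            yield text

# Usage with a loaded model (hypothetical corpus file name):
# kept = list(filter_corpus(open("corpus.txt", encoding="utf-8"),
#                           model.perplexity))
```

Because it is a generator, it processes one line at a time and never holds the whole corpus in memory.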
Training Details
Training Data
Three diverse sources:
Wikipedia (formal, encyclopedic)
- Source: Wikimedia Wikipedia (`wikimedia/wikipedia` dataset, version `20231101.sl`)
- 5,000 articles, ~134k sentences
Janes-Forum (informal healthcare forums)
- Source: Janes-Forum 1.0 corpus from CLARIN.SI
- 5,000 forum posts, ~28k sentences
Janes-Tag (social media, tweets, blogs)
- Source: Janes-Tag 3.0 corpus from CLARIN.SI
- Computer-Mediated Communication (CMC) data, ~20k sentences
Processing Pipeline:
- Removed markup (Wikipedia, XML tags)
- Lowercased all text
- Kept only Slovenian characters: a-z, čšžćđ, digits, and basic punctuation
- Split into sentences (minimum 5 words per sentence)
- Normalized whitespace
- Mixed and shuffled all sources
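The pipeline steps above can be sketched as follows. This is an illustrative reconstruction, not the original cleaning script; the exact character set and punctuation kept are assumptions based on the description:

```python
import re

# Characters kept: a-z, Slovenian diacritics, digits, basic
# punctuation (assumed set; see pipeline description above)
DISALLOWED = re.compile(r"[^a-zčšžćđ0-9 .,!?]")

def preprocess(line, min_words=5):
    text = line.lower()
    text = DISALLOWED.sub(" ", text)           # strip other characters
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    # drop sentences shorter than min_words
    return text if len(text.split()) >= min_words else None

print(preprocess("Danes je LEP sončen dan!"))  # danes je lep sončen dan!
print(preprocess("prekratko besedilo"))        # None
```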
Training Procedure
The model was trained using KenLM with pruning for smaller model size:
- N-gram order: 5
- Smoothing: Modified Kneser-Ney (KenLM default)
- Pruning: `--prune 0 1 1 2 2` (removes rare n-grams)
- Output format: Binary (for efficient loading and querying)
- Training command:

```shell
lmplz -o 5 --prune 0 1 1 2 2 < slovenian_mixed_for_kenlm.txt > slovenian_5gram_mixed.arpa
build_binary slovenian_5gram_mixed.arpa slovenian_5gram_mixed.binary
```
Limitations
- Technical vocabulary: Limited coverage of IT/technical jargon (words like "procesor", "gigaherce")
- Food vocabulary: May assign high perplexity to food-related discussions
- Heavy slang: Some very informal slang words (e.g., "kul") may not be well-represented
- Case insensitive: All text is lowercased, so the model does not capture case information
- Punctuation: Heavily filtered during preprocessing
- Statistical model: Does not understand semantic meaning, only n-gram patterns
- Domain coverage: Healthcare forums are well-represented, but other specialized domains may not be
Files
- `slovenian_5gram_mixed.binary` - Binary KenLM model (recommended for production use)
Installation
To use this model, install the KenLM Python bindings:
```shell
pip install https://github.com/kpu/kenlm/archive/master.zip
```
Citation
Please cite the training data sources:
```bibtex
@misc{janesTag,
  title  = {Training corpus Janes-Tag 3.0},
  author = {Erjavec, Toma{\v z} and Fišer, Darja and Krek, Simon and Ledinek, Nina and Arhar Holdt, {\v S}pela},
  url    = {http://hdl.handle.net/11356/1732},
  note   = {Slovenian language resource repository {CLARIN}.{SI}},
  year   = {2022}
}

@misc{janesForum,
  title  = {Janes-Forum 1.0: Corpus of Slovene forum posts},
  author = {Fišer, Darja and Erjavec, Toma{\v z} and Ljubešić, Nikola},
  url    = {http://hdl.handle.net/11356/1139},
  note   = {Slovenian language resource repository {CLARIN}.{SI}},
  year   = {2017}
}
```
License
The model weights are released under CC-BY-SA 4.0, compatible with the training data licenses:
- Wikipedia content: CC-BY-SA 3.0
- Janes-Forum: CC-BY-SA 4.0
- Janes-Tag: CC-BY-SA 4.0
Acknowledgments
- KenLM by Kenneth Heafield
- CLARIN.SI for the Janes corpus collection
- Wikimedia Foundation for Wikipedia data
- HuggingFace for model hosting infrastructure
Model Performance
Tested on diverse Slovenian text types; observed perplexity scores fall into these ranges:
- 44-500: Excellent (natural formal and conversational Slovenian)
- 500-2,000: Good (informal, some slang)
- 2,000-5,000: Acceptable (heavy slang, specialized domains)
- >5,000: Problematic (likely non-Slovenian or malformed)
Recommended filtering range for web-scraped corpora: 25-5,000
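The bands above can be turned into a simple labelling helper. The boundaries are taken from the list above, with the lower cutoff of 25 from the recommended filtering range:

```python
def quality_band(ppl):
    """Map a perplexity score to the quality bands listed above."""
    if ppl < 25:
        return "repetitive/boilerplate"
    if ppl <= 500:
        return "excellent"
    if ppl <= 2000:
        return "good"
    if ppl <= 5000:
        return "acceptable"
    return "problematic"

print(quality_band(318))    # excellent
print(quality_band(37960))  # problematic
```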