Slovenian 5-gram KenLM Mixed Language Model

A versatile 5-gram language model for Slovenian trained on Wikipedia, forum discussions, and social media data using KenLM.

Model Description

This is a statistical n-gram language model designed for perplexity-based filtering and quality assessment of Slovenian text. The model was trained using the KenLM toolkit on a diverse mix of formal and informal Slovenian text sources.

Model Details

  • Model Type: 5-gram statistical language model
  • Language: Slovenian (sl)
  • Training Data:
    • Wikipedia (formal, encyclopedic): ~134k sentences from 5,000 articles
    • Janes-Forum (informal healthcare forums): ~28k sentences from 5,000 posts
    • Janes-Tag (social media, tweets, blogs): ~20k sentences of CMC data
  • Total Training Data: ~182k sentences
  • Model Size: ~50-100 MB (binary format with pruning)
  • Format: KenLM binary format
  • Preprocessing: Lowercased, cleaned of special characters, sentence-segmented (min 5 words)

Intended Use

This model is primarily intended for:

  • Text quality filtering: Identifying well-formed Slovenian text based on perplexity scores
  • Data cleaning: Filtering web-scraped corpora (e.g., OSCAR, Common Crawl)
  • Language detection: Distinguishing Slovenian from other languages
  • Quality assessment: Evaluating fluency and naturalness of generated text

Versatile Coverage

Unlike models trained only on Wikipedia, this mixed model handles:

  • Formal/encyclopedic language (Wikipedia style)
  • Forum discussions and questions
  • Conversational Slovenian (everyday topics)
  • Common slang and informal expressions
  • Social media style text

Example Use Case

The model can be used to filter large text corpora by computing perplexity scores and keeping only texts within a reasonable range (25-5,000), effectively removing:

  • Malformed or noisy text (perplexity > 5,000)
  • Non-Slovenian or mixed-language content (typically far above the threshold, often > 10,000)
  • Repetitive or boilerplate text (perplexity < 25)
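A minimal sketch of such a filter. The perplexity function is injected as a callable so the logic can be shown without loading a model; in real use you would pass a loaded `kenlm.Model`'s `.perplexity` method (the stand-in scores below are taken from the benchmark examples in this card):

```python
def filter_by_perplexity(texts, perplexity_fn, lower=25, upper=5000):
    """Keep only texts whose perplexity falls within [lower, upper].

    perplexity_fn is any callable mapping a text to a perplexity score,
    e.g. a loaded kenlm.Model's .perplexity method.
    """
    kept = []
    for text in texts:
        ppl = perplexity_fn(text.lower().strip())
        if lower <= ppl <= upper:
            kept.append(text)
    return kept


# Illustration with a stand-in scoring function (real use: model.perplexity)
fake_scores = {
    "danes je lep sončen dan.": 318,       # natural Slovenian -> keep
    "asdf qwer zxcv tyui hjkl": 408_383,   # gibberish -> filter
    "aaa aaa aaa aaa": 12,                 # repetitive -> filter
}
kept = filter_by_perplexity(list(fake_scores), fake_scores.get)
print(kept)  # only the natural Slovenian sentence survives
```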

Perplexity Benchmarks

Examples of perplexity scores on different text types:

Text Type | Example | Perplexity
--- | --- | ---
Formal/encyclopedic | "V Sloveniji živi približno dva milijona ljudi." | 44
Forum question | "Ima kdo izkušnje s tem programom." | 74
Everyday conversation | "Danes je lep sončen dan." | 318
Common slang | "Ma dej no to ni resno." | 416
Informal praise | "Haha to je bilo smešno res." | 1,523
English text | "the quick brown fox jumps" | 37,960
Gibberish | "asdf qwer zxcv tyui hjkl" | 408,383

Usage

import kenlm
from huggingface_hub import hf_hub_download

# Download model from HuggingFace
model_path = hf_hub_download(
    repo_id="zustmartin/slovenian-kenlm-5gram-mixed",
    filename="slovenian_5gram_mixed.binary"
)

# Load the model
model = kenlm.Model(model_path)

# Compute perplexity for a text
text = "to je primer slovenskega besedila"
perplexity = model.perplexity(text.lower())
print(f"Perplexity: {perplexity}")

# Score individual sentences
score = model.score(text.lower())
print(f"Log10 probability: {score}")

# Filter a corpus
def is_good_slovenian(text, lower=25, upper=5000):
    """Check if text is good quality Slovenian."""
    ppl = model.perplexity(text.lower().strip())
    return lower <= ppl <= upper

texts = [
    "Danes je lep sončen dan.",  # Good
    "asdf qwer zxcv",  # Bad (gibberish)
    "the quick brown fox",  # Bad (English)
]

for text in texts:
    ppl = model.perplexity(text.lower())
    status = "✅ KEEP" if is_good_slovenian(text) else "❌ FILTER"
    print(f"{status} - {text} (perplexity: {ppl:.0f})")
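A note on how the two numbers above relate: KenLM's `score()` returns a base-10 log probability, and perplexity follows from it as ppl = 10^(-score / N), where N counts the scored tokens (including the `</s>` end-of-sentence token that KenLM appends by default). A small helper illustrating the conversion, with made-up numbers rather than real model output:

```python
def perplexity_from_score(log10_prob, n_tokens):
    """Convert a KenLM log10 probability to perplexity.

    n_tokens should count the scored tokens, including the </s>
    end-of-sentence token that KenLM appends by default.
    """
    return 10 ** (-log10_prob / n_tokens)


# Made-up example: a log10 probability of -12 spread over 6 tokens
print(perplexity_from_score(-12.0, 6))  # 100.0
```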

Training Details

Training Data

Three diverse sources:

  1. Wikipedia (formal, encyclopedic)

    • Source: Wikimedia Wikipedia (wikimedia/wikipedia dataset, version 20231101.sl)
    • 5,000 articles, ~134k sentences
  2. Janes-Forum (informal healthcare forums)

    • Source: Janes-Forum 1.0 corpus from CLARIN.SI
    • 5,000 forum posts, ~28k sentences
  3. Janes-Tag (social media, tweets, blogs)

    • Source: Janes-Tag 3.0 corpus from CLARIN.SI
    • Computer-Mediated Communication (CMC) data, ~20k sentences

Processing Pipeline:

  1. Removed markup (Wikipedia, XML tags)
  2. Lowercased all text
  3. Kept only Slovenian characters: a-z, čšžćđ, digits, and basic punctuation
  4. Split into sentences (minimum 5 words per sentence)
  5. Normalized whitespace
  6. Mixed and shuffled all sources
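A rough sketch of steps 2-5 for a single document (the actual preprocessing scripts are not published, so the regular expressions below are illustrative assumptions; markup removal and cross-source shuffling are omitted):

```python
import re

# Assumed character whitelist: a-z, Slovenian diacritics, digits, basic punctuation
ALLOWED = re.compile(r"[^a-z0-9čšžćđ .,!?]")

def preprocess(raw_text, min_words=5):
    """Clean one document and return its usable sentences."""
    text = raw_text.lower()                  # step 2: lowercase
    text = ALLOWED.sub(" ", text)            # step 3: drop disallowed characters
    sentences = []
    for sent in re.split(r"[.!?]+", text):   # step 4: naive sentence split
        sent = " ".join(sent.split())        # step 5: normalize whitespace
        if len(sent.split()) >= min_words:   # step 4: minimum 5 words
            sentences.append(sent)
    return sentences


print(preprocess("V Sloveniji živi približno dva milijona ljudi. Kratko."))
# The short second "sentence" is dropped by the minimum-length rule.
```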

Training Procedure

The model was trained using KenLM with pruning for smaller model size:

  • N-gram order: 5
  • Smoothing: Modified Kneser-Ney (KenLM default)
  • Pruning: --prune 0 1 1 2 2 (removes rare n-grams)
  • Output format: Binary (for efficient loading and querying)
  • Training command:
    lmplz -o 5 --prune 0 1 1 2 2 < slovenian_mixed_for_kenlm.txt > slovenian_5gram_mixed.arpa
    build_binary slovenian_5gram_mixed.arpa slovenian_5gram_mixed.binary
    

Limitations

  • Technical vocabulary: Limited coverage of IT/technical jargon (words like "procesor", "gigaherce")
  • Food vocabulary: May assign high perplexity to food-related discussions
  • Heavy slang: Some very informal slang words (e.g., "kul") may not be well-represented
  • Case insensitive: All text is lowercased, so the model does not capture case information
  • Punctuation: Heavily filtered during preprocessing
  • Statistical model: Does not understand semantic meaning, only n-gram patterns
  • Domain coverage: Healthcare forums are well-represented, but other specialized domains may not be

Files

  • slovenian_5gram_mixed.binary - Binary KenLM model (recommended for production use)

Installation

To use this model, install the KenLM Python bindings:

pip install https://github.com/kpu/kenlm/archive/master.zip

Citation

Please cite the training data sources:

@misc{janesTag,
  title={Training corpus Janes-Tag 3.0},
  author={Erjavec, Toma{\v z} and Fišer, Darja and Krek, Simon and Ledinek, Nina and Arhar Holdt, {\v S}pela},
  url={http://hdl.handle.net/11356/1732},
  note={Slovenian language resource repository {CLARIN}.{SI}},
  year={2022}
}

@misc{janesForum,
  title={Janes-Forum 1.0: Corpus of Slovene forum posts},
  author={Fišer, Darja and Erjavec, Toma{\v z} and Ljubešić, Nikola},
  url={http://hdl.handle.net/11356/1139},
  note={Slovenian language resource repository {CLARIN}.{SI}},
  year={2017}
}

License

The model weights are released under CC-BY-SA 4.0, compatible with the training data licenses:

  • Wikipedia content: CC-BY-SA 3.0
  • Janes-Forum: CC-BY-SA 4.0
  • Janes-Tag: CC-BY-SA 4.0

Model Performance

Tested on diverse Slovenian text types with perplexity scores ranging from:

  • 44-500: Excellent (natural formal and conversational Slovenian)
  • 500-2,000: Good (informal, some slang)
  • 2,000-5,000: Acceptable (heavy slang, specialized domains)
  • >5,000: Problematic (likely non-Slovenian or malformed)

Recommended filtering threshold for web-scraped corpora: 25-5,000
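The bands above can be folded into a small labelling helper. The boundaries are taken from the ranges in this card (with the 25 lower bound from the recommended filtering threshold); the band names are this card's own, not part of KenLM:

```python
def quality_band(ppl):
    """Map a perplexity score to the quality bands listed above."""
    if ppl < 25:
        return "suspicious (repetitive/boilerplate)"
    if ppl <= 500:
        return "excellent"
    if ppl <= 2000:
        return "good"
    if ppl <= 5000:
        return "acceptable"
    return "problematic"


print(quality_band(318))    # excellent  (everyday conversation example)
print(quality_band(37960))  # problematic  (English text example)
```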
