---
title: README
emoji: 🐨
colorFrom: gray
colorTo: pink
sdk: static
pinned: false
---

# 📦 ViSoNorm Toolkit — Vietnamese Text Normalization & Processing

**ViSoNorm** is a toolkit for **Vietnamese text normalization and processing**, built for **NLP** pipelines and installable from **PyPI**. Its resources (datasets and models) are hosted on **Hugging Face Hub** and **GitHub Releases**.

---

## 🚀 Key Features

### 1. 🔧 **BasicNormalizer** — Basic Text Normalization

* **Case folding**: convert text to lowercase, uppercase, or capitalized form.
* **Tone normalization**: normalize Vietnamese tone marks.
* **Basic preprocessing**: remove extra whitespace and special characters, and tidy sentence formatting.
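
To make the steps above concrete, here is a minimal standard-library sketch of what basic normalization can look like. It is not ViSoNorm's implementation: `basic_normalize` and its `case` parameter are hypothetical names, and Unicode NFC composition is used here only as a stand-in for tone normalization, which in the toolkit may also cover tone-mark placement.

```python
import re
import unicodedata


def basic_normalize(text: str, case: str = "lower") -> str:
    """Toy stand-in for basic normalization (hypothetical helper)."""
    # Compose base letters and combining marks so that visually identical
    # Vietnamese syllables share one code-point sequence (Unicode NFC).
    text = unicodedata.normalize("NFC", text)

    # Case folding: the three modes named in the feature list.
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    elif case == "capitalize":
        text = text.capitalize()
    else:
        raise ValueError(f"unsupported case mode: {case!r}")

    # Collapse runs of whitespace (spaces, tabs, newlines) and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(basic_normalize("  Xin   CHÀO \n Việt Nam "))  # prints: xin chào việt nam
```

Composing to NFC first means later string comparisons and dictionary lookups do not depend on how the input was typed.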

### 2. 😀 **EmojiHandler** — Emoji Processing

* **Detect emojis**: find emojis in text.
* **Split emoji text**: separate emojis from the surrounding text.
* **Remove emojis**: strip all emojis from text.
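
The three operations above can be sketched with a single regular expression. This is an illustrative approximation, not the `EmojiHandler` API: the function names are hypothetical, and the character ranges cover the main emoji blocks only (no skin-tone or zero-width-joiner sequences).

```python
import re

# Approximate emoji coverage: the main pictograph blocks plus the older
# "Miscellaneous Symbols" and "Dingbats" ranges. Not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def _squeeze(text: str) -> str:
    """Collapse repeated whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def detect_emojis(text: str) -> list:
    """Return every matched emoji character, in order of appearance."""
    return EMOJI_RE.findall(text)


def split_emoji_text(text: str) -> str:
    """Pad each emoji with spaces so it becomes its own token."""
    return _squeeze(EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text))


def remove_emojis(text: str) -> str:
    """Delete matched emojis, then tidy the leftover whitespace."""
    return _squeeze(EMOJI_RE.sub("", text))


print(detect_emojis("Ngon quá 😋😋"))
print(split_emoji_text("vui quá😀luôn"))
print(remove_emojis("Ngon quá 😋"))
```

Splitting before tokenization keeps an emoji from being glued to a neighbouring word, which matters for whitespace-based tokenizers.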

### 3. ✏️ **Lexical Normalization** — Social Media Text Normalization

* **ViSoLexNormalizer**: Normalize text using deep learning models hosted on Hugging Face.
* **NswDetector**: Detect non-standard words (NSW).
* **detect_nsw()**: Utility function to detect NSW.
* **normalize_sentence()**: Utility function to normalize sentences.
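
For readers new to the task, the toy sketch below shows what detecting and normalizing non-standard words means. It deliberately swaps in a plain dictionary lookup for the deep learning models the toolkit uses; `TOY_LEXICON`, `toy_detect_nsw`, and `toy_normalize_sentence` are hypothetical names and do not reflect the signatures or return types of ViSoNorm's own functions.

```python
from typing import List

# Hypothetical three-entry lexicon of common Vietnamese chat abbreviations.
TOY_LEXICON = {"ko": "không", "dc": "được", "mik": "mình"}


def toy_detect_nsw(sentence: str) -> List[str]:
    """Return the whitespace tokens that appear in the toy lexicon."""
    return [tok for tok in sentence.split() if tok.lower() in TOY_LEXICON]


def toy_normalize_sentence(sentence: str) -> str:
    """Replace each known non-standard token with its standard form."""
    return " ".join(TOY_LEXICON.get(tok.lower(), tok) for tok in sentence.split())


print(toy_detect_nsw("mik ko đi dc"))          # ['mik', 'ko', 'dc']
print(toy_normalize_sentence("mik ko đi dc"))  # mình không đi được
```

A fixed lexicon cannot handle unseen spellings or context-dependent forms, which is the gap a learned model is meant to close.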

### 4. 📊 **Resource Management** — Dataset Management

* `list_datasets()` — List the available datasets.
* `load_dataset()` — Load a dataset from GitHub Releases.
* `get_dataset_info()` — View detailed information about a dataset.

### 5. 🧠 **Task Models** — Task-Specific Models

* **SpamReviewDetection** — Spam review detection.
* **HateSpeechDetection** — Hate speech detection.
* **HateSpeechSpanDetection** — Hate speech span detection.
* **EmotionRecognition** — Emotion recognition.
* **AspectSentimentAnalysis** — Aspect-based sentiment analysis.

---

## 📥 Installation

### Install from PyPI (Recommended)

```bash
pip install visonorm
```
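
After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The snippet below is a small standard-library sketch; `installed_version` is a hypothetical helper, and the distribution name `visonorm` is taken from the `pip` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("visonorm")
print(f"visonorm {version}" if version else "visonorm is not installed")
```

Checking metadata avoids importing the package itself, so the check stays fast and has no side effects.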

## 📝 Citation

ViSoLex is developed at the University of Information Technology, Vietnam National University Ho Chi Minh City (UIT, VNU-HCM). If you use ViSoLex in your research, please cite the following papers:

```bibtex
@article{nguyen_weakly_2025,
	title = {A {Weakly} {Supervised} {Data} {Labeling} {Framework} for {Machine} {Lexical} {Normalization} in {Vietnamese} {Social} {Media}},
	volume = {17},
	issn = {1866-9964},
	url = {https://doi.org/10.1007/s12559-024-10356-3},
	doi = {10.1007/s12559-024-10356-3},
	number = {1},
	journal = {Cognitive Computation},
	author = {Nguyen, Dung Ha and Nguyen, Anh Thi Hoang and Van Nguyen, Kiet},
	month = jan,
	year = {2025},
	pages = {57},
}
```

```bibtex
@inproceedings{nguyen-etal-2025-visolex,
    title = "{V}i{S}o{L}ex: An Open-Source Repository for {V}ietnamese Social Media Lexical Normalization",
    author = "Nguyen, Anh Thi-Hoang  and
      Nguyen, Dung Ha  and
      Nguyen, Kiet Van",
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven  and
      Mather, Brodie  and
      Dras, Mark",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-demos.18/",
    pages = "183--188",
}
```