---
title: README
emoji: 🐨
colorFrom: gray
colorTo: pink
sdk: static
pinned: false
---

# 📦 ViSoNorm Toolkit: Vietnamese Text Normalization & Processing

**ViSoNorm** is a specialized toolkit for **Vietnamese text normalization and processing**, built for **NLP** workflows and installable from **PyPI**.
Resources (datasets, models) are stored and managed on **Hugging Face Hub** and **GitHub Releases**.

---

## 🚀 Key Features

### 1. 🔧 **BasicNormalizer**: Basic Text Normalization

* **Case folding**: convert text to lowercase, uppercase, or capitalized form.
* **Tone normalization**: normalize Vietnamese tone marks.
* **Basic preprocessing**: remove extra whitespace and special characters, and format sentences.

### 2. 😀 **EmojiHandler**: Emoji Processing

* **Detect emojis**: find emojis in text.
* **Split emoji text**: separate emojis from the surrounding words.
* **Remove emojis**: strip all emojis from text.

### 3. ✏️ **Lexical Normalization**: Social Media Text Normalization

* **ViSoLexNormalizer**: normalize text using deep learning models hosted on Hugging Face.
* **NswDetector**: detect non-standard words (NSW).
* **detect_nsw()**: utility function to detect NSW.
* **normalize_sentence()**: utility function to normalize sentences.

### 4. 📊 **Resource Management**: Dataset Management

* `list_datasets()`: list available datasets.
* `load_dataset()`: load a dataset from GitHub Releases.
* `get_dataset_info()`: view detailed dataset information.

### 5. 🧠 **Task Models**: Task Processing Models

* **SpamReviewDetection**: spam review detection.
* **HateSpeechDetection**: hate speech detection.
* **HateSpeechSpanDetection**: hate speech span detection.
* **EmotionRecognition**: emotion recognition.
* **AspectSentimentAnalysis**: aspect-based sentiment analysis.

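
To make the basic normalization and emoji features listed above more concrete, here is a minimal sketch that uses only the Python standard library. It is **not** the ViSoNorm API: the function names `basic_normalize`, `remove_emojis`, and `split_emojis` are hypothetical, the emoji ranges are a rough approximation, and Unicode NFC composition covers only one part of what Vietnamese tone normalization may involve. Refer to the package itself for the actual classes and their behavior.

```python
import re
import unicodedata

# Rough emoji ranges, assumed for illustration only (not a complete emoji definition).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def basic_normalize(text: str, lowercase: bool = True) -> str:
    """Compose diacritics (NFC), optionally lowercase, and collapse whitespace."""
    # NFC merges a base letter and its combining marks into one code point,
    # so visually identical Vietnamese strings compare equal.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def remove_emojis(text: str) -> str:
    """Delete every character matched by the approximate emoji pattern."""
    return EMOJI_PATTERN.sub("", text)


def split_emojis(text: str) -> str:
    """Put spaces around each emoji run so emojis become separate tokens."""
    spaced = EMOJI_PATTERN.sub(lambda m: f" {m.group(0)} ", text)
    return re.sub(r"\s+", " ", spaced).strip()


if __name__ == "__main__":
    print(basic_normalize("  Tie\u0302\u0301ng   VI\u1ec6T "))
    print(split_emojis("ngon\U0001F60Bqu\u00e1"))
    print(remove_emojis("ok \U0001F600"))
```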

---

## 📥 Installation

### Install from PyPI (Recommended)

```bash
pip install visonorm
```

## 📝 Citation

ViSoLex is developed at the University of Information Technology, Vietnam National University Ho Chi Minh City (UIT, VNU-HCM).

If you use ViSoLex in your research, please **CITE**:

```
@article{nguyen_weakly_2025,
  title = {A {Weakly} {Supervised} {Data} {Labeling} {Framework} for {Machine} {Lexical} {Normalization} in {Vietnamese} {Social} {Media}},
  volume = {17},
  issn = {1866-9964},
  url = {https://doi.org/10.1007/s12559-024-10356-3},
  doi = {10.1007/s12559-024-10356-3},
  number = {1},
  journal = {Cognitive Computation},
  author = {Nguyen, Dung Ha and Nguyen, Anh Thi Hoang and Van Nguyen, Kiet},
  month = jan,
  year = {2025},
  pages = {57},
}
```

```
@inproceedings{nguyen-etal-2025-visolex,
  title = "{V}i{S}o{L}ex: An Open-Source Repository for {V}ietnamese Social Media Lexical Normalization",
  author = "Nguyen, Anh Thi-Hoang and Nguyen, Dung Ha and Nguyen, Kiet Van",
  editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven and Mather, Brodie and Dras, Mark",
  booktitle = "Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations",
  month = jan,
  year = "2025",
  address = "Abu Dhabi, UAE",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.coling-demos.18/",
  pages = "183--188",
}
```