# EN-VI Parallel Sense Tagger

A token classification model for predicting English-Vietnamese word senses, developed as a student NLP coursework project to explore cross-lingual sense tagging.
## Project Overview

**Objective:**

The goal of this project is to build a model that accurately identifies the sense of each word in English and Vietnamese sentences using token-level classification. This is useful for machine translation, semantic understanding, and other multilingual NLP applications.
**Dataset:**

- A custom dataset combining English and Vietnamese texts.
- Labels include:
  - `PAD`: padding tokens
  - `O`: tokens not associated with a sense
  - Sense labels: specific word senses in English and Vietnamese
- Number of English labels: 6,946
- Number of Vietnamese labels: 7,029
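The combined label space can be represented as a flat label-to-id mapping over the special labels and both sense inventories. A minimal sketch (the `EN_SENSE_*`/`VI_SENSE_*` names are illustrative placeholders, not the model's actual sense inventory):

```python
# Build a label-to-id mapping for token classification.
# "PAD" and "O" are the special labels described above; the sense label
# names below are placeholders standing in for the real inventories.
special_labels = ["PAD", "O"]
en_sense_labels = [f"EN_SENSE_{i}" for i in range(6946)]  # 6,946 English senses
vi_sense_labels = [f"VI_SENSE_{i}" for i in range(7029)]  # 7,029 Vietnamese senses

all_labels = special_labels + en_sense_labels + vi_sense_labels
label2id = {label: i for i, label in enumerate(all_labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(all_labels))  # total number of classes: 13977
```

The classifier head's output dimension must match this total class count.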
**Model Architecture:**

- Base model: XLM-RoBERTa (`FacebookAI/xlm-roberta-base`), a cross-lingual transformer
- Task: token classification
- Hidden size: 768
- Number of layers: 12
- Attention heads: 12
- Tokenizer: XLM-RoBERTa tokenizer
- Special tokens: `PAD`, `BOS`, `EOS`
## Training

- The model is trained with PyTorch and the Hugging Face Transformers library.
- Optimized for cross-lingual word sense tagging.
- Supports batching, GPU acceleration, and token-level predictions.
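For token-level training, word-level sense labels have to be aligned to the subword tokens that the XLM-RoBERTa tokenizer produces; a common convention is to label only the first subword of each word and mask the rest with `-100` so the loss ignores them. A minimal sketch of that alignment step, assuming the `word_ids()` output of a Transformers fast tokenizer (the helper name is ours, not part of this repo):

```python
def align_labels_to_tokens(word_labels, word_ids, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    word_labels: one label id per original word.
    word_ids: tokenizer word_ids() output - the word index for each
              subword token, or None for special/padding tokens.
    Only the first subword of each word keeps its label; continuation
    subwords and special tokens get ignore_index.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special or padding token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(ignore_index)          # continuation subword
        previous = word_id
    return aligned

# Example: 3 words, the second split into two subwords,
# with BOS/EOS special tokens at either end.
print(align_labels_to_tokens([5, 7, 9], [None, 0, 1, 1, 2, None]))
# → [-100, 5, 7, -100, 9, -100]
```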
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("kytrungchauwork/en-vi-parallel-sense-tagger")
model = AutoModelForTokenClassification.from_pretrained("kytrungchauwork/en-vi-parallel-sense-tagger")

# Example sentence
text = "Your input sentence here"
inputs = tokenizer(text, return_tensors="pt")

# Model inference (no gradients needed)
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```
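The raw prediction ids can then be mapped back to label strings via the model's `id2label` config entry. A small post-processing sketch, shown here with a made-up three-class label map so it runs without downloading the model (the helper and the `SENSE_bank.n.01` label are illustrative, not from this repo):

```python
def decode_predictions(pred_ids, tokens, id2label, skip_labels=("PAD",)):
    """Pair each token with its predicted label string,
    dropping any labels listed in skip_labels (e.g. padding)."""
    result = []
    for token, pred in zip(tokens, pred_ids):
        label = id2label[pred]
        if label not in skip_labels:
            result.append((token, label))
    return result

# Toy example; with the real model, use model.config.id2label,
# tokenizer.convert_ids_to_tokens(...), and predictions[0].tolist().
id2label = {0: "PAD", 1: "O", 2: "SENSE_bank.n.01"}
tokens = ["<s>", "bank", "</s>"]
print(decode_predictions([0, 2, 1], tokens, id2label))
# → [('bank', 'SENSE_bank.n.01'), ('</s>', 'O')]
```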